Navigation
Learn About
Developing With
Ingres Talk
Information
Toolbox
Views
Batch Statement Execution Performance
From Ingres Community Wiki
Ingres Enhancement Requests
| DBMS Feature | Description |
|---|---|
| Improved Data Load | A number of issues need to be resolved with data load. One is performance. Also to be considered is more flexibility (varying types of delimiters (the ability for example to take output of a MySQL export). Also the ability to load into a partitioned table. |
| 132388 | Load into partitioned table |
| 134125 | Sometimes the DBA knows that s/he would prefer to fail rather than suffer the cost of a non-bulk copy in. This option would ensure that if the DBMS decides to use a non-bulk copy, an error is returned immediately. |
| 110179 | The JDBC standard provided "batch" support. The idea is to be able to improve performance by preparing an INSERT or UPDATE statement and then adding/updating many sets of parameter data to the prepared statement before executing it. |
| 133598 | Ingres .NET Data Provider - Bulk loading into a database table. |
A complementary project is Bulk Loading Projects.
Design Documents
The DDS for the DBMS part of batch execution is here
The DDS for client batch processing (primarily OpenAPI) is here
Sample Test Programs
OpenAPI query literals: Apibatch0.c
OpenAPI parameterized queries: Apibatch1.c
OpenAPI prepared queries: Apibatch2.c
OpenAPI database procedures: Apibatch3.c
JDBC Statement - general, non-parameterized statements: IngBat1.java
JDBC PreparedStatement - prepared statement with parameter sets: IngBat2.java
JDBC CallableStatement - database procedure with parameter sets: IngBat3.java
JDBC performance: IngBat0.java
Does COPY TABLE Have A Performance Problem
Karl: Note: this section was written before the Bulk-Load Performance Study project was started. Refer to that page for more recent results. For instance, it developed that COPY FROM (list) was in fact hugely inefficient in the front-end file handling. This is being addressed. (16-Dec-2009)
This discussion does not directly involve COPY TABLE but recently there have been a number of questions regarding performance problems with COPY TABLE. This is a rather disturbing report.
COPY TABLE implements a streaming protocol between the client and server. In theory it should be able to approximate the transport rate of a raw FTP transfer. In theory is sets an upper bound on how fast data can be transferred between an Ingres Client and Ingres Server.
A performance problem with COPY TABLE would fall into one or more of the following categories:
- GCA Performance Problem with streaming data
- Internal DBMS Processing Problem in layers above DMF
- Internal DBMS Processing Problem in layer at or below DMF (i.e. GWF or DI)
Ingres has a construct called the SELECT LOOP that efficiently streams data from a Server to a Client. This is, in general, the fastest way to retrieve data from an Ingres Server. It should set an upper bound on how fast GCA can transport data from a Server to a Client. The COPY TABLE INTO (Copy data from an Ingres Server to an Ingres Client INTO a file) data transfer rate should be very close to to the data transfer rate of the SELECT LOOP.
One assumption that i have made (correctly or incorrectly) is that the COPY TABLE INTO and the COPY TABLE FROM data transfer rates shoudl be close to each other. This may not be true based upon some reports.
If COPY TABLE does have a performance problem then I think we need to think about how to confirm this report and identify the root cause. This fundamentally affects what we can export from GCA.
Batch Processing in JDBC
There are three forms of batch processing (batch execution actually) in JDBC:
- Present a series of independent statements to be executed as a group. The entire group of statements is defined prior to being executed. The statements may not be a select statement and may only return an optional row count and error information. The use of transaction statements may also be restricted. The batch may be executed during autocommit or within a standard transaction. Execution of the batch is not all or nothing: individual statements may fail while others succeed. The DBMS is permitted to stop processing the batch if a serious error occurs, but the results of any statements preceding the error must be returned.
- A prepared statement is designated to be executed and multiple sets of input parameters are provided. One parameter set is used per execution and the group of sets are executed as a batch. Other conditions and restrictions are similar to batch statement processing.
- A database procedure is designated to be executed and multiple sets of input parameters are provided. One parameter set is used per execution and the group of sets are executed as a batch. Output (byref) parameters are not permitted, neither are row-producing procedures. The procedure may only return a single result value (integer). Other conditions and restrictions are similar to batch statement processing.
GCA Batch Processing Protocol
GCA statement processing generally entails a request from the client and a response from the server. A single statement is executed within a request/response exchange. Individual GCA messages are terminated by an end-of-data flag. Groups of GCA messages are terminated with an end-of-group flag. The end-of-data flag is provided by the GCA application. The end-of-group flag is commonly provided by GCA itself based on standard GCA protocols and is generally only referenced within GCA. It is possible for the GCA application to override the default end-of-group indication provided by GCA.
To support batch processing, it is proposed that the statement processing protocol be extended to allow a stream of requests to be sent to the server, only the last of which will have the end-of-group indicator set. Similarily, the server response will be a stream of response messages with only the last having the end-of-group indicator. These protocol changes will be associated with a new GCA protocol level: GCA_PROTOCOL_LEVEL_68.
Features in the GCA interface are associated with an API level. Control of the end-of-group indicator is associated with API level 5. The DBMS currently utilizes GCA API level 4. DBMS code dealing with GCA will need to be enhanced to allow utilization of GCA API level 5. The most significant change associated with API level 5 is the replacement of GCA_FORMAT and GCA_INTERPRET service calls with the information being provided directly by GCA_SEND and GCA_RECEIVE.
The end-of-group flag may be set directly by the GCA application by setting the GCA_EOG flag in gca_modifiers of GCA_SD_PARMS when issuing a GCA_SEND request, while end-of-group may be suppressed with the GCA_NOT_EOG flag. If neither of these flags are set, GCA will automatically determine the end-of-group setting based on the message type and standard GCA protocol conventions.
Note: because some GCA messages play different roles in the various GCA protocol scenarios, and because GCA does not know which particular scenario is in progress, it is not always possible for GCA to accurately set the end-of-group flag. For instance, GCA sets end-of-group for GCA_ERROR messages even though these message frequently precede other messages in a group. These inconsistencies do not pose a problem for the uses of end-of-group internal to GCA and existing GCA clients use protocol conventions rather than the end-of-group flag to determine proper client/server handshaking. Fortunately, GCA does properly set end-of-group for the client request messages which will be involved in batch processing. This will ensure that the server can correctly detect standard and batch processing requests. It will be critical that the server correctly set end-of-group when responding to batch requests. Response to non-batch requests will not require changes in end-of-group handling - conformance to batch responses is acceptable but optional.
Batch processing protocol in the presence of errors
The DBMS has two options for handling errors during batch processing. Recoverable errors related to a specific statement, such as a syntax error, may be handled and processing continued with the next request. Errors whose scope extends beyond a specific statement, such as a transaction abort, may terminate the processing of the batch. In the latter case, the DBMS must continue to read the batch requests until the end of the batch (end-of-group) is reached. Requests in the batch received after an unrecoverable error may be ignored.
The DBMS must return a response group for each request in a batch which completes successfully or with a recoverable error. The response groups correspond in order to the requests received. A response group may contain optional error messages (as well as standard GCA optional messages such as trace and event messages) and end with a GCA_RESPONSE message. Only the last GCA_RESPONSE message should have end-of-group set. If an unrecoverable error occurs, the associated response group is returned with end-of-group set and no additional messages associated with the batch should be follow.
The client associates each response group received in order with the requests sent. If end-of-group is seen on a response group prior to receiving all expected response groups, the associated remaining requests are assumed to have gone unprocessed and no further responses are expected.
GCA Batch Statement Execution
A client requests a non-select statement to be executed by sending a GCA_QUERY message to the server. The GCA_QUERY message contains query text and an optional set of query parameter values. The server responds with a GCA_RESPONSE message. If an error occurs executing the statement, the GCA_RESPONSE message may be preceded by one or more GCA_ERROR messages.
GCA_QUERY (EOG)
---------------->
[GCA_ERROR]
GCA_RESPONSE (EOG)
<-------------------
A batch request to execute multiple statements would consist of a series of GCA_QUERY messages, each containing a single statement with optional parameters. Only the last GCA_QUERY message should have end-of-group set. For a batch request of N statements, 1 <= N, the server responds with M GCA_RESPONSE messages, 1 <= M <= N. Only the last GCA_RESPONSE message should have end-of-group set. If M < N, the client assumes that the Mth statement failed in such a way that processing of the batch could not continue and that statements M+1 through N were ignored. The associated error message(s) (optional) should precede the Mth GCA_RESPONSE message. If a recoverable error occurs processing the Kth statement, 1 <= K <= M, the optional error messages would follow the GCA_RESPONSE for statement K-1 (if 1 < K) and precede the GCA_RESPONSE for the Kth statement.
GCA_QUERY (1)
---------------->
...
GCA_QUERY (N,EOG)
---------------->
[GCA_ERROR]
GCA_RESPONSE (1)
<-------------------
...
[GCA_ERROR]
GCA_RESPONSE (M,EOG)
<-------------------
Optimization Options
It is possible that the requests in a batch will consist of the repeated execution of a single statement, such as an insert statement, with differences only in the associated query parameter values. The (dbms, star, gateway, etc.) server could identify this situation and optimize execution to reduce parsing and query plan generation.
The client could explicitly request this optimization by providing the query text only in the first GCA_QUERY message. subsequent GCA_QUERY messages would only include the optional set of query parameter values. It may be desirable to these requests more explicit by adding additional GCA_QUERY modifier flags to identify the defining query and susbsequent parameter sets.
JDBC Specifics
JDBC does not explicitly support the repeated execution of a given statement but an application could submit a single statement multiple times within a batch. JDBC does not provide a way to specify parameter values for statements, so it is likely that requests will differ at least in literal values.
GCA Batch Prepared Statement Execution
The GCA protocol for executing a prepared statement is identical to that for executing general statements. The GCA_QUERY message contains the query text execute <stmt_name> [using ~V [, ~V]] and an optional set of dynamic parameter values. The protocol for batch execution of prepared statements is therefore identical to that of batch statement execution given above.
Optimization Options
It is likely that the requests in a batch will reference the same prepared statement with differences only in the associated dynamic parameter values. Given the limited variations of the execute statement, it should be easy for the server to identify this situation and optimize the execution.
The client could explicitly request this optimization in the same way as proposed for batch statement execution.
JDBC Specifics
JDBC only supports batch execution of a single prepared statement. The statement name, and therefore the query text, will be the same for all requests in a batch. The requests will therefore only differ in the associated dynamic parameter values.
GCA Batch Procedure Execution
A client requests a procedure to be executed (which does not return rows nor have OUT parameters) by sending a GCA_INVPROC or GCA1_INVPROC message to the server. The GCA_INVPROC message contains the procedure name (GCA1_INVPROC also contains the procedure owner) and an optional set of procedure parameter values. The server responds with optional GCA_ERROR messages (warnings and user messages), a GCA_RETPROC message containing the procedure result value, and a GCA_RESPONSE message. If an error occurs executing the procedure, the GCA_RESPONSE message may be preceded by one or more GCA_ERROR messages and the GCA_RETPROC message is omitted.
GCA{1}_INVPROC (EOG)
---------------->
[GCA_ERROR]
[GCA_RETPROC]
GCA_RESPONSE (EOG)
<-------------------
A batch request to execute multiple procedures would consist of a series of GCA{1}_INVPROC messages, each containing a single procedure reference with optional parameters. Only the last GCA{1}_INVPROC message should have end-of-group set. For a batch request of N procedures, 1 <= N, the server responds with M GCA_RESPONSE messages, 1 <= M <= N, and associated GCA_ERROR and GCA_RETPROC messages. Only the last GCA_RESPONSE message should have end-of-group set. If M < N, the client assumes that the Mth procedure failed in such a way that processing of the batch could not continue and that procedures M+1 through N were ignored. The associated error message(s) (optional) should precede the Mth GCA_RESPONSE message. If a recoverable error occurs processing the Kth procedure, 1 <= K <= M, the optional error messages would follow the GCA_RESPONSE for procedure K-1 (if 1 < K) and precede the GCA_RESPONSE for the Kth procedure omitting the associated GCA_RETPROC message.
GCA{1}_INVPROC (1)
---------------->
...
GCA{1}_INVPROC (N,EOG)
---------------->
[GCA_ERROR]
[GCA_RETPROC]
GCA_RESPONSE (1)
<-------------------
...
[GCA_ERROR]
[GCA_RETPROC]
GCA_RESPONSE (M,EOG)
<-------------------
Optimization Options
It is likely that the requests in a batch will reference the same procedure with differences only in the associated procedure parameter values. The server could recognize this situation and optimize the exeuction of the procedure.
While the client could explicitly request this optimization in a manner similar to that proposed for batch prepared statement execution, the overhead of providing the procedure name in each request and comparing procedure names in the server should be minimal enough that communication changes are not necessary.
JDBC Specifics
JDBC only supports batch execution of a single database procedure. The procedure name, and optional owner, will be the same for all requests in a batch. The requests will therefore only differ in the associated procedure parameter values.
DBMS Handling of Batched statements
The DBMS will know that a statement is part of a batch using the GCA_EOG and GCA_NOT_EOG modifiers described above. This will be the only interface to the DBMS for batched execution, the DBMS therefore requires the relevant client connection to send messages with these modifiers, which in turn requires that the relevant interface or driver has been coded to add the modifier. Implicit in this is that the programming language being used for the interface or driver has a paradigm that allows the programmer to stipulate that queries should be executed in a batch. The implication in batched statements is that the programmer may not receive a response for any statement in the batch until the last statement for the batch has been sent.
The primary purpose of batching statements is to save on the communication overhead that exists for single statement executions. When a single statement is executed, no matter how small it is, the buffer must be sent over GCA so that the DBMS can execute the statement. Having executed the statement, a response must be sent to the client to indicate the outcome; this requires an entire round trip to the DBMS (send and receive) for every statement executed. When statements are batched, no response is expected immediately, and so the batched statements may be packed into a single communications buffer, allowing a single round trip to the DBMS for several statements. This implies significant savings on communications, which are more significant for connections with low bandwidth or high latency. It should be noted that if the communications overhead of query execution is not significant, then the amount of time saved by using batched query execution may not be significant; for example, if it takes 10 milliseconds to send a query over GCA and 100 milliseconds to execute it in the DBMS, then the saving would be 10% at best.
The GCA API is designed such that, even if queries are batched into a single communications buffer, a call to GCA_RECEIVE will only return one query from the buffer; in this way, GCA can hide most of the complications of batched query execution from the caller, the only indication that the query is part of a batch is signified with the GCA_NOT_EOG modifier. This allows the DBMS to use a simple algorithm when handling batched queries; it will read one query from the batch, and will execute it in isolation in the usual way, if the GCA_NOT_EOG modifier was set for the received query, the DBMS sets GCA_NOT_EOG when calling GCA_SEND to send the response, this, in turn causes GCA to batch the responses to the queries. With this modification in place, the typical course of events is as follows:
- The client places a number of queries into a batch and eventually indicates the end of the batch
- GCA sends all the queries in the batch to the DBMS in one buffer (this may be several buffers if the batch size is larger than the buffer size)
- The DBMS reads each query from the batch in turn and executes each one in isolation
- The DBMS places the response for each query in a batch and eventually indicates the end of the batch when the last query in the batch has be executed
- GCA sends all the responses in one buffer (or several if they will not fit)
- The client receives all the responses for the batched queries
Since the DBMS receives several queries at once, there may be some specific optimizations that it could perform based on the knowledge of several future queries. In reality, such optimizations are much more difficult than they seem. The DBMS executes queries using a simple state machine known as the sequencer, this state machine is simple in nature, with 3 basic states, read (wait for next query), execute query, and write (send results), but it contains a vast amount of code, largely dedicated to making decisions about how to execute all the several different types of query that the DBMS can receive at the highest level, and then calling the various facilities in the DBMS to deal with their part of query execution, and handling all the various errors and exceptions that might arise as part of query execution. The sequencer itself is over 20,000 lines of C code including its numerous supporting functions, and the many variables and structures it uses are designed to handle only a single query (the one it is currently executing). In its initial form, a query is simply a character string, in order to make any sense of the query, the DBMS must first run it through the SQL parser, which will verify that it is semantically correct and will build a representation of the query that the DBMS can understand (called a parse tree), the parse tree is stored in some of the many structures that are used during query execution, and, currently, there is only one set for each session, which hold the query currently being executed. It doesn't take long to work out that holding several parse trees at once for a single session becomes a very complex task, add to that the fact that the sequencer state machine now has to do things out-of-order (i.e. parse several of the next queries before executing the current one), and we can see that having any understanding of future queries in the DBMS would require a vast amount of new code and a partial re-write of the 20,000 line sequencer, this is obviously way beyond the scope of this enhancement.
Given the situation described above, any special optimizations that the DBMS performs as a result of receiving batched queries must restrict itself in scope to the current query, any information it can store about previously executed queries, and any contextual information it can gain from GCA modifiers passed by the client. To this end, two possible optimizations are proposed as follows:
Conversion of batched dynamic INSERT into COPY
Insert statements are of specific interest to this project because they are commonly used in Java based ETL tools to load data into the database (see above enhancement requests). Early benchmarks using Java against Ingres and MySQL show that MySQL is significantly faster than Ingres at loading data through a Java based insert; one of the goals of this project is to reduce or eliminate this performance gap. Benchmarking against a prototype DBMS that enables batched query execution shows a 40% improvement in performance for large batches of simple insert statements (using a local connection). While this is a very good result (especially for local connections), it does not completely address the performance gap between Ingres and MySQL, and so further optimization is required. Profiling of the DBMS when running the batched insert benchmark showed that it was very CPU heavy, and that the CPU was spread relatively evenly among the active facilities within the DBMS. The only way to improve performance is to reduce CPU usage, and since there is no easily identified CPU hot-spot in the current code that executes insert, a different code path or a complete re-write of insert is required. Other than execution of the INSERT statement, data can be added to the database using the COPY statement. The Ingres COPY statement executes a much shorter code-path in the DBMS, and as a result it uses significantly less overall CPU when loading several records. The COPY statement is optimized for loading several records at a time, and this is where the most significant benefit is. OpenAPI based interfaces (such as JDBC and ODBC) cannot use COPY directly because much of the logic that makes batching and formatting of records work is in libq; in any case, Java does not have any programming paradigms that translate easily into the COPY statement.
Given all these parameters, the best option is to transparently translate the INSERT statement to use the COPY processing internally in the DBMS when we believe that the client is likely to execute an insert several times in a row to load data into a table. The following conditions are proposed as ones which make it probable that several successive inserts will be executed:
- The statement is a dynamic insert, i.e. one which is being run using "EXECUTE <insert_statement> USING...". If the programmer has gone to the effort of preparing a statement for dynamic execution, then it is likely that they will execute it several times.
- The statement is part of a batch. If statements are being batched, then it is likely that the same statement will be executed several times.
- The parameters for the dynamic insert contain the values to be inserted. This makes it possible to pick up new values each time to perform the optimization, and it is also the way the statement would usually be coded if the programmer intends to use several successive inserts to load a table.
Sending of batched dynamic statement parameters without query text
If a dynamic prepared statement is being executed several times in a batch, it is possible for the DBMS to assume that, if no query text is passed with a set of parameters, the the client wants to re-execute the previous dynamic statement again with a different set of parameters. An additional GCA modifier can be added to avoid errors of omission being construed as repeated executions. In this case, the optimization can help performance in two ways:
- Significant decrease in the amount of GCA traffic. If the query text makes up a significant amount of the query buffer for a given query, then that buffer is significantly reduced. When running in batch, this means that several more query executions will fit into a single buffer and less buffers are required to repeatedly execute the query. This saving is most significant in low bandwidth or high latency connections.
- The ability of the DBMS to bypass the parser for repeated executions of the query. Once the DBMS has parsed the initial execution of the query, it can save all the information about the parse tree that is not related directly to the varying parameters. When it receives another query with no text, it can assume that the same query is being run again, and can simply replace the parameters with the newly provided values and avoid most of the work involved in parsing the statement.
Documentation
Batch Query Execution
Queries now can be executed in batch. Batch statement execution improves communication performance between the client and the DBMS Server. Batched statements that are run against an Ingres Star server may see less performance benefit.
A group of statements can be sent to the server, where they are executed, and then the correct number of responses are sent back to the client. This feature is restricted to queries that return either a response or a row count, but no data.
Notes:
- SELECT statements and row-producing procedures are not allowed in batched queries. Statements allowed include INSERT, DELETE, and statements that return no data, such as CREATE TABLE.
- Embedded SQL does not yet support batch query execution.
- If the DBMS does not support batch processing, the Ingres JDBC driver will detect it and automatically execute the statements individually.
To take advantage of batch query execution, OpenAPI programmers can use the new function, IIapi_batch(). For more information, see the OpenAPI User Guide.
Java programmers can use the addBatch and executeBatch methods (which are supported in Ingres JDBC).
Even for servers that support batch processing, if you discover a difference in behavior between individual statement execution and batch execution and you want to force individual statement execution, you can use the ingres.jdbc.batch.enabled system property to disable batch query execution. For more information, see the ingres.jdbc.batch.enabled property in the Connectivity Guide.
We recommend not to use autocommit with batch execution. If batch execution is used with autocommit, and you cancel the batch execution, it is impossible to tell which statements were committed.
Batch Statement Processing
Batched statements that use repeated dynamic INSERT statements (for example, through Java-based Extract Transfer and Load tools) are specially optimized to improve performance. Faster inserts can be achieved if the following conditions are met:
- Inserts must be into a base table (not a view or index).
- The table must not have any rules or integrities defined on it.
- The table must be a regular Ingres (not gateway) table.
- The inserts must be batched.
- The batched statements must be an execution of a prepared dynamic insert where the dynamic parameters exactly match the values being inserted.
- All but the first statement in the batch should be sent without query text, to achieve an additional performance boost.
Development Unit Tests
These tests are C programs that make direct calls to GCA to test batch processing in the DBMS, they do not use any supported interface to the DBMS and were simply created for unit testing of the feature. Use at your own risk.
Development Unit Tests: batch_unit_tests.tar.gz

