Bulk Loading Projects
From Ingres Community Wiki
Bulk Operations Versus Fast Database Load
Getting data into the Ingres database quickly and efficiently is an increasingly common problem. It needs to be made clear that there are two distinctly different problems that need to be addressed:
- Bulk Operations
- Fast Database Load
It is important that both issues be addressed.
Bulk Operations generally take the form of a need to support Batch Queries from a client. The generic case is the need for our client interfaces (JDBC Driver, ODBC Driver, .NET Data Provider are the key interfaces with LIBQ-based interfaces such as ESQL/C and OpenROAD being secondary) to be able to support the processing of multiple INSERTs, UPDATEs, DELETEs and possibly EXECUTE PROCEDUREs in a single request to the database.
A more common case, seen in recent POCs, is the requirement for the client interfaces to transparently handle multiple INSERT operations as a single Batch Query. This approach would improve the efficiency of loading data from a client interface.
Gordy has provided a lot of research at the Wiki page Batch Statement Execution Performance, which discusses issues specific to JDBC and ODBC.
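The batched-INSERT idea can be sketched with the generic DB-API pattern in Python. SQLite stands in for Ingres here purely for illustration: the point is the mechanism (many parameter rows submitted in a single call, rather than one statement per row), which is what the client interfaces above would expose.

```python
import sqlite3

# Illustrative only: SQLite stands in for Ingres, since the batching
# mechanism is interface-generic. Table and columns mirror the employee
# example later in this page.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (ename TEXT, eno INTEGER, sal TEXT)")

rows = [
    ("Joe Smith", 101, "$25000.00"),
    ("Shirley Scott", 102, "$30000.00"),
]

# Many INSERTs submitted as one batch request instead of one
# round trip per row.
conn.executemany(
    "INSERT INTO employee (ename, eno, sal) VALUES (?, ?, ?)", rows)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM employee").fetchone()[0]
print(count)  # 2
```

A JDBC client would express the same pattern with PreparedStatement addBatch/executeBatch; whether the batch actually travels as a single request depends on the driver.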
Fast Database Load
This topic is of primary concern to the Ingres Performance group and to Product Management. The requirement can be stated very simply:
- There is a need to load data into the database very fast. The load can be initiated from a client interface, but all processing takes place on the Ingres Server with no restrictions.
This has some profound implications:
- Once the client initiates the load operation the data should be accessed directly by the Ingres DBMS
- Formats other than BINARY COPY TABLE format (as currently required by the Ingres Fastload Utility) should be supported.
- These formats should include Comma Separated Values, Tab Delimited, and existing COPY TABLE Character formats.
This requirement also implies that the GCA Protocol will never be fast enough to load data since it requires the client to access and pass data over a communications interface. However, it also recognizes that there are circumstances in which the COPY TABLE interface will need to be the degenerate processing interface when the Ingres DBMS Server determines that it cannot access the data directly.
This also implies that enhancements will need to be made to the COPY TABLE implementation. This will need to be discussed when we meet.
Requirements Articulated by Product Management
Over the past two years, a number of Proof-of-Concepts have encountered difficulty loading data into Ingres. The nature of these issues is varied and complex. Unfortunately, they have been lumped into a rather gray category of “Bulk Load”. A complementary project is improving Batch Statement Execution Performance.
We need to clearly define what problems have been encountered, discuss possible solutions and determine the projects needed to implement changes.
The issues captured thus far include:
- Performance of a data load initiated from a program is too slow.
- Ingres does not support batch statement execution
- Ingres copyin can only be used from a few programmatic interfaces, and is difficult to use even then.
- Only Embedded SQL and the API support copy-in. (odbc? jdbc? .net?)
- API support for copy is very complex.
- copy can only go to file, not program arrays or variables
- (There IS an undocumented and very delicate copy-to-memory interface, called "copy to/from program". It is undocumented because it has no safeguards and will crash or corrupt data at the slightest usage error.)
- It would be desirable to have something like Informix dbload: a standalone C program that can be called from the command line or from any SQL interface.
- copy support is complicated because the client has to do all row formatting, and the client/server protocol is multi-phase and complex. There is zero server-side support for anything except properly formatted binary rows.
- CopyIn resorts to row-at-a-time inserts in most scenarios.
- CopyIn does row-at-a-time unless table is empty or HEAP
- CopyIn does row-at-a-time if table is journaled or has secondary index
- CopyIn does row-at-a-time if table is partitioned
- Typical data formats from external sources are difficult to load into Ingres.
- copy command has poor support for comma delimited/quoted string files
- Import Assistant is only on Windows and is not command-line driven
- Import Assistant is buggy (memory errors cause it to fail intermittently)
Bulk Loading Improvements
- The CSV and SSV delimiters allow COPY to read and write comma separated values (CSV) files.
- COPY FROM an ordinary text data file is up to two times faster.
- A related feature, Batch Query Execution, improves the efficiency of loading data from a client interface.
CSV and SSV Delimiters
The CSV and SSV delimiters allow COPY to read and write files that contain comma separated values (CSV).
The rules for a CSV delimited field are:
- The field is delimited by a comma, unless it is the last field in the COPY list; in that case, the field is delimited by a newline.
- COPY FROM: If the first non-blank character in the field is a double quote ("), the field extends until a closing double quote. Commas or newlines inside the quoted string are not delimiters and do not end the value. If a doubled double quote ("") is seen while looking for the closing quote, it is translated to one double quote and the value continues. For example, the data file value:
- "There is a double quote "" here"
- is translated to the table value:
- There is a double quote " here
- Whitespace before the opening double quote, or between the closing double quote and the delimiter (comma or newline), is not part of the value and is discarded.
- COPY INTO: If the value to be written contains a comma, newline, or double quote, it is written enclosed in double quotes using quote doubling as described in the previous bullet item. If the value does not contain a comma, newline, or double quote, it is written as is.
The SSV delimiter works exactly the same as the CSV delimiter, with semicolon in place of comma.
CSV and SSV delimiters are only allowed with BYTE(0), C0, CHAR(0), and TEXT(0). They are not allowed with the “counted” formats (VARCHAR(0) and so on); the count defines the value exactly and there is no need for quoting. (If delimiting is desired, use the comma or nl delimiters on counted formats.)
COPY FROM: Some CSV file variants use quote escaping (\") instead of quote doubling ("") to indicate a quote inside a quoted string. The C format handles \ escaping, so use the C0CSV format and delimiter to handle this type of file. (CSV with COPY INTO always writes quote doubling (never quote escaping) when needed.)
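As a sanity check, the quote-doubling rules above closely match the default dialect of Python's csv module. The sketch below is for illustration only; the module's dialect is similar to, but not identical to, COPY's CSV handling (e.g. around whitespace outside the quotes).

```python
import csv
import io

# COPY FROM direction: a record using quote doubling ("" -> ")
# as described in the rules above.
line = '"There is a double quote "" here",101,$25000.00\n'
fields = next(csv.reader(io.StringIO(line)))
print(fields[0])  # There is a double quote " here

# COPY INTO direction: quote a value only when it contains a comma,
# newline, or double quote; double any embedded quotes.
out = io.StringIO()
writer = csv.writer(out, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
writer.writerow(['plain value', 'has, comma', 'has "quote"'])
print(out.getvalue())  # plain value,"has, comma","has ""quote"""
```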
This example copies employee names, employee numbers, and salaries from the employee table into a file. All items are stored as text data to eliminate any padding. The sal column is converted from its table format (money) to text in the data file. The comma-separated-values delimiter is used.
copy table employee (ename = text(0)csv, eno = text(0)csv, sal = text(0)csv)
    into 'mfile.out';

The resulting data file contains:

Joe Smith,101,$25000.00
Shirley Scott,102,$30000.00