Login Register Actian.com  

Actian Community Wiki

Navigation
Learn About
Developing With
Ingres Talk
Information
Toolbox

Bulk-Load Performance Study

From Ingres Community Wiki

Jump to: navigation, search

Karl: This is a free-form status page for my investigation into bulk load. Since the project itself is somewhat ill defined at the moment, you may find this page equally ill defined... Comments are welcome, but I strongly urge using the Discussion page rather than editing this page directly.

Contents

Overview thoughts

Bulk load basically talks about getting data en masse into a database. For simplicity, we'll assume that a single table is being loaded. (There may be ETL going on before or after the load, which is out of scope.) The term covers a lot of ground, because there are so many things to consider:

  • Data format:
    • binary, already in row form
    • binary, not in row form (this is probably quite rare)
    • character, needs conversion to row form
      • Esql or API COPY converts in the front-end (ESQL COPY provides the front-ender, an API copy would depend on the user app to interpret the copy map and do all the work)
      • INSERT converts in the server parser token scanner
  • Load statement used:
    • COPY FROM
    • INSERT (values) repeatedly, or ODBC/JDBC equivalent forms
    • fastload (very restrictive: data file is server local, data in binary row form, table-LOAD criteria satisfied)
    • INSERT/SELECT; not a loading statement per se, but it might be part of an overall load pipeline. E.g. load into temp or work table, then INSERT/SELECT into final target.
  • Data path to server:
    • LIBQ (embedded SQL: COPY or INSERT)
    • API (hand coded: COPY or INSERT)
    • ODBC (ODBC layer then API: INSERT. COPY??)
    • JDBC/.NET (JDBC/.NET driver to DAS to DBMS server: INSERT only, no COPY?)
    • Direct (fastload only)
    • DB procedure; not really useful for loading today (other than artificially generated data), but with flat-file read capability, might become a load path.
  • GCA connection to server:
    • direct TCP sockets (IPv4) (local or remote)
    • Unix domain sockets (local only)
    • direct IPv6 sockets (local or remote)
    • Ingres-Net
      • which can be direct or heterogeneous
      • various mixes possible when DAServer is in play, but let's assume DAM from client to local DAS, with no Net server.
  • Row handling at DBMS server:
    • table-LOAD (aka bulk-load). This is a non-journal-able load with fixed logging overhead independent of data size. Rollback either replaces original data file, or truncates back to original size (the latter only works for heap). At present, LOAD is only available for COPY and for CREATE TABLE AS SELECT, with the additional rules a) empty table OR heap, b) no journaling on table, c) no secondary indexes on table, d) no partitioning. Rule (b) is a problem in most production environments, and (c) is a problem when declarative constraints are used (which are usually supported by indexing). Rule (d) is mostly an implementation restriction rather than anything architectural in QEF or DMF.
    • row-PUT. The usual row-at-a-time PUT that logs each row (and, for btree, does additional btree logging for each row as well).
    • row-PUT with nologging. Nologging is implied if loading a session temp, but then there is the problem of moving the data from the session temp to a permanent table. (See INSERT/SELECT.) Nologging can be applied to permanent tables, but since it applies to the session, and not individual tables, Ingres marks the entire database inconsistent if anything at all goes wrong. (including interrupt, system crash, bad data, etc.)
  • Table definition and server configuration issues:
    • Page size, table structure, secondary indexing, incoming data in-sort-ness, dmf_build_pages setting, DMF cache settings, disk layout, logging setup, isnert rules, etc etc etc ad nauseum.
    • Initially, work with a simple case, but at some point we need to consider tables defined via external database tools or conversions. Such tables very often have primary key and referential integrity declarations that translate to secondary indexes and insert rules. It is unsatisfactory to simply say that such tables have to drop back to the slowest type of load, ie one by one inserts.

Note that "bulk load" can mean either the DMF LOAD operation, or the general notion of getting a pile of data into a database. For this page I'll try to use "bulk load" for the generic problem, and "table-LOAD" or just "LOAD" for the specific DMF operator.

Plan of Attack

Steve is looking into batched INSERT and some related INSERT optimizations, so I'll start with COPY.

I will build an instrumented server that lets me short-circuit a large character-mode COPY at various points, to see where major bottlenecks occur. The idea is to isolate front-end COPY conversion vs front-end GCA/GC vs client-server link plus back-end GCA/GC/sequencer vs QEF vs DMF. I will try a couple different data paths (esql/libq vs API for starters, also TCP vs unix domain sockets), and try to get a sense of LOAD vs row-PUT vs non-logged row-PUT difference. Obviously the logged row-PUT will be the slowest, but by how much? I will also play around with some variations on buffer sizes throughout the pipeline.

Initially I will keep everything local to one machine, no LAN overhead, no Ingres-Net. They will have to be factored in at some point, but later. I am also only considering unix server/client; Windows I am leaving to others. Initial tests will be on linux x86_64, as time permits I'll get a Solaris (SPARC) box involved as well. All tests will use 8K page size (so v3 page type); TPC-H lineitem table data at 0.75 scale, which is small enough for the load file to easily fit in memory.

Initial tests will ignore LOBs. The LOB data path on both client and server is disgustingly convoluted, involving many many many copies. Let's leave that rats-nest alone until we understand the basic performance characteristics for non-LOB.

Once I have some starting numbers I'll decide where to focus next.

CSV (comma-separated-values) handling

Since I will be in the COPY column delimiting code anyway, it occurred to me that now would be a good time to extend COPY to be able to handle CSV formats, including double-quote handling. COPY FROM will get CSV handling first, with COPY INTO to follow at some not-too-distant future time.

A brief DDS covering the salient points of CSV handling can be found here (ODT) or here (DOC). (DDS updated 3-Dec 15:14 EST for OpenROAD and backwards compatibility note.)

Results

Two data sets: "lineitem" is about 4.5 millions rows of TPC-H lineitem table, 16 columns including a mix of int, float, char, varchar, and date. Widest column is varchar(44); all not null. "intlob" table has an int and one long varchar, using two data sets: 500,000 rows with LOB length 80, and 100,000 rows of LOB length 4096. All heaps, all nojournaling (bulk load), all 8K pages (but see below).

Lineitem text format list is char(0)'|' for FROM, text(0)'|' for INTO. intlob text format list is text(0)comma, long varchar(0). One xx=d0nl dummy in both cases.

All timings are elapsed (\t\g\t). There was typically a 3-4 second variation in elapsed times which I have not bothered to isolate. The tests are run on a 2009 Mac Pro, 2x4core 2.26 Ghz Nehalem processors, 8 Gb memory; Linux 2.6.31.1, gcc 4.3.2; database data on XFS; txn log, copy-files on ext3. (Tables/work/log/misc all on different drives.)

Test Initial time after 502024 8K etabs As-of 31-Mar 5-May (10.0 final)
lineitem text FROM 1:18 00:33 n/a 00:21 00:21
lineitem text INTO 1:10 00:50 n/a 00:30 00:29
lineitem binary FROM 00:14 00:10 n/a 00:10 00:10
lineitem binary INTO 00:10 00:15 +- 6 n/a 00:08 00:08
intlob 500Kx80 text FROM 1:18 1:15 1:05 1:05 00:16
intlob 500Kx80 text INTO 14:47 14:40 4:20 4:30 +-00:10 00:45
intlob 500Kx80 binary FROM 1:15 1:09 1:05 1:05 00:15
intlob 500Kx80 binary INTO 14:47 14:20 4:07 4:20 00:43
intlob 100Kx4K text FROM 00:24 00:20 00:17 00:16 00:09
intlob 100Kx4K text INTO 2:00 2:05 00:45 00:35 00:13
intlob 100Kx4K binary FROM 00:21 00:17 00:22 00:16 00:09
intlob 100Kx4K binary INTO 2:00 2:00 00:45 00:35 00:15

While text loading has been significantly improved, LOB handling is about the same. (If the LOB test had more non-LOB columns, we'd see improvements similar to non-LOB loading.) Non-LOB text INTO is slightly better thanks to direct coercion calls and fine tuning of some of the mainline code. (The LOB text INTO is probably dominated by LOB handling, as there's only one non-LOB column.)

The hideous INTO performance for the initial LOB testing was due to a) 2K page size for the etab, and b) no compression in etabs. The 80 column value is much shorter than the etab segment size (about 1900+ bytes), so not only was "redeem" reading a separate page for each output row, most of what it was reading was wasted. 8K pages improves the situation by dropping random reads by 4x, plus a better filesystem match; what is needed for this particular test is the ability to data-compress etabs. The no-compression structure assumes that most LOB values are (much) larger than the segment size; since only the last segment is likely to be compressible, if last segments are rare, compression is a waste. However if last-segments are relatively common (as would be true for average LOB sizes under about 8K or so), compression should be a win.

(Feb 10) Timing investigation of the client-server link shows no unusual times on the client (libq) side. Standard TCP/IP sockets might be causing some delays, and there might be room for some improvement on the server side of things.

  • Loopback inside of copy mainline: 16- seconds
  • Loopback at socket write() on client side: 16 seconds consistent
  • Loopback in sequencer mainline on server side (before buffering and send to QEF): 18.6 secs avg, times between 17 and 20
  • Loopback in sequencer, but with II_GC_PROT=unix (unix domain sockets): 17 secs consistent.
  • Loopback in sequencer, regular sockets, 16K GCA buffer (normal is 4k): 16-18 secs.
  • Loopback in sequencer, unix sockets, 16K buffer: 16-17 secs.

Unix domain sockets are a win, especially at the default buffer size. Larger GCA buffers are also a win. GCA buffer on unix raised to 12K (point of diminishing returns).

(Mar 10) Latest:

  • Further improvements in the copy mainline have knocked off another 3 seconds. Total copy time for lineitem text-from test is now 21-22 seconds. Not bad considering that we started at 1:18.
  • All O_SYNC loads (including nonempty heap) are very sensitive to filesystem. On linux XFS, mounting nobarrier vs barrier may make a very significant difference, and it's hardware / controller dependent. On my box, a nonempty heap lineitem load runs in 25 sec nobarrier, but 75 sec barrier. EXT4 is reported to show similar variation.
  • Non-bulk loads are devilish slow. Nobarrier bulk-load rates are on the order of 500k rows/sec and up. Non-bulk-load rates start at 100k rows/sec (partitioned heap), drop to 65k rows/sec (hash), then 38k rows/sec to as slow as 250 rows/sec (btree, data sensitive). These are not lineitem rates, they are for a very narrow table with no dates, so only look at the relative numbers.

As of Mar 10, things to do next:

  1. Investigate improvements to string to date coercion. Consider splitting off ansidate because it can't have a time part.
  2. See how hard it would be to tie copy-from into bulk-load for partitioned tables. This was originally omitted because I wanted to parallelize things, but an initial serial-only version would be very useful. What happens if some partitions are bulk-load-able and others arent?
  3. Chop the copy path out of the sequencer and get it moved over into scscopy.c, or at least make it better isolated.

(Mar 31) Above 3 done, latest results noted in table above. The 500kx80 LOB INTO is a little slower for unknown reasons, everything else is the same or faster. Still to do:

  1. While LOB FROM isn't bad, see if the sequencer can use "blob put optimization" to avoid storing each blob in a holding temp. (Note that the tests so far use "load", which loads the LOB segments into a sort file rather than the etab; the load path is already reasonably efficient.)
  2. Fix up things in DMF, particularly the allocation= business, and fallocates during build.
  3. LOB FROM tests occasionally (rarely) fail with out-of-resource block messages from LK. This needs to be nailed down and fixed. It has the feel of an un-initialized flag or variable somewhere.
  4. See what would be required to support a default etab structure of cbtree or chash (compression). Also, do a preliminary look to see if there is syntax available or easily created to fine tune etab page size and structure on a per-column basis.
  5. Try to determine why btree row-by-row load can be so incredibly awful. Yes, it's page splitting, but the number of leaf splits doesn't seem to justify the 200x slowdown in performance.

(1H April) The first item above, blob put optimization, has turned into an 800 pound gorilla. (It's all about what happens when. Blobs have to be couponified to etabs long before the base row hits the "gee, should we load this?" code in qef/dmf. Hilarity all around.)

(May, 10.0 code complete) The LOB work has resulted in significant COPY FROM gains, as expected. In addition, a test of loading 12 million x 4K rows finished in about 30 minutes; that load would fail with the old code. The gains running COPY INTO are a bit unexpected, though. I don't have a ready explanation for the massive COPY INTO improvement at this stage, but the numbers are repeatable and the data seems to be valid. There is still much that can be done, particularly in the area of file preallocation and non-bulk-load, but the 10.0 work is a very good start.

10.0 Project Summary

The 10.0 development cycle is essentially closed. A review of what was achieved:

  • Addition of CSV (comma-separated-values) support, including quoted-value handling. One can now load a CSV file directly, without the Import Assistant.
  • Major speed-up of COPY FROM, especially formatted COPY FROM. Depending on many factors, a 2x to 10x improvement can be expected. (Unformatted binary copy may only show a few percent improvement.) The improvements stem mostly from a re-writing of the input file handling to reduce character movement, with additional improvements coming from optimizing the type coercion functions.
  • Minor speed-up of COPY INTO, typically on the order of 20-50%. Most of the speedup comes from optimizing the type coercions.
  • Major improvement in LOB COPY FROM. By copying the LOB data directly into the extension tables, most of the "out of locks" / "out of resources" issues disappear. A bulk-load COPY FROM with LOBs is 50% to 2x faster (depends on data) and can load an indefinite amount of data. A non-bulk-load COPY with LOBs is now limited mostly by transaction log size.
  • The ability to bulk-load a partitioned table. Partitions that are bulk-load-able will be bulk-loaded, any partitions that aren't will be row-by-row loaded. There is a minor semantic change explained below (*).
  • Significant maintainability improvements in the COPY client-side code, in the sequencer, in string-to-date coercion, and in LOB handling.

The timing research shows that future bulk-load performance work should probably be aimed at DMF. There may also be some value in additional fine-tuning of type coercions (using length-specific ones instead of general converters, for instance). GCF, SCF, QEF are not significant bottlenecks.

(*) When doing a COPY FROM into a partitioned table that is potentially bulk-loadable, Ingres will now always take an X table lock on the entire table. (It used to take an IX table lock.) This lock is held until the transaction ends. An X lock is required if a table is to be bulk-loaded. A non-partitioned table only takes the X lock if the table will actually be bulk-loaded. Unlike the situation with a non-partitioned table, there is no way to know a priori whether the partitions that will actually be loaded, are bulk-loadable (because we don't know a priori which partitions will get the incoming data). The user visible effect is that a COPY FROM that may previously have been able to run concurrently with other queries, may now lock out or be locked out by those queries.


Misc random disconnected thoughts

  1. Is there any intermediate between LOAD and logged row-PUT? e.g. something that logs table structure but not data. Would it help?
    1. somewhat-related, is there value in a table specific SET NOLOGGING ON table statement? what then is the rollback strategy? (probably "none", beyond making the table physically consistent, if that.)
  2. Is there any value to batching row-PUT's into DMF? (probably not, unless the batch can be handed down to at least dm2r's level if not below.)
  3. The notion of adding a flat-file interface to the DBP language so as to make DBP inserts a real loading technique might be interesting. (This could end up being a replacement for fastload.) Along vaguely similar lines, redo fastload as a DBMS server thread that is handed either a data stream from the outside, or a locally accessible file name. (No new functionality but avoids all the shared cache / separate cache issues that a separate fastload process presents.) One challenge to the latter would be hardening the thread against attack by rogue input.
  4. Is there any value to inventing a new kind of HEAP that is easier to load and rollback? (in particular, ditch the overflow links, maybe try to invent some SMS optimizations - extent-based SMS??)
  5. Non-empty HEAP loads lose pretty much all of the existing table file, even if it's free. This is wildly stupid and needs to be fixed.
    1. example 1: create table, insert one row. Modify to add-extend. LOAD more rows. Your add-extend is dead space.
    2. example 2: create table, insert one row. Insert lots and lots of rows with LOAD, then rollback. Insert one row. The space created by the initial insert (and freed by the rollback) is dead space.
  6. Somewhat related: disk space allocation issues.
    1. The ALLOCATION= with-option does two things: sets relallocation, and allocates space. There are instances (eg create followed by later load) where one wants to divorce the two. Invent a NOALLOCATE with-option. Make sure it applies to CTAS.
    2. When LOADing, there's no point in DIalloc'ing or DIgalloc'ing space that is about to be written, OTHER than the basic DI requirement of moving the file allocated-eof pointer. (Ie we want the file to run in "virtual" allocation mode no matter what, even if the file already exists; *unless* the extend= is larger than the build-pages group size.)
      1. one runs into a similar issue with ordinary (non-LOAD) file extends, but there it's unclear how to take advantage of it because of cache writebehind. DIgalloc is bloody annoying though, there should be no need to ever hard-write zeroes given logging and the SMS. Right?
    3. When LOADing non-rebuild non-recreate (ie the nonempty heap case), why must the table be in O_SYNC? The undo of the LOAD log record ought to be able to deal with the file being in an arbitrary-size state since it's going to in effect snip it back to the start anyway.
      1. and does LOAD really log what it needs to? ie current disk eof, current last-data.
  7. Datallegro did a "fastconvert" to do the character to binary conversion to feed fastload. Do we want to supply such a thing? How fancy would it get?
Personal tools
© 2011 Actian Corporation. All Rights Reserved