Solution 1:

Good question - well explained.

how can I tune MySQL to increase the speed of the LOAD DATA INFILE call?

You've already got a high(ish) setting for the key buffer - but is it enough? I'm assuming this is a 64-bit installation (if not then the first thing you need to do is upgrade) and not running on MSNT. Have a look at the output of mysqltuner.pl after running a few tests.

In order to use the cache to best effect, you may find benefits in batching/pre-sorting the input data (most recent versions of the 'sort' command have a lot of functionality for sorting large datasets). Also if you generate the ID numbers outside of MySQL, then it may be more efficient.

would using a cluster of computers to load different csv files

Assuming (again) that you want to have the output set behave as a single table, then the only benefits you'll get are by distributing the work of sorting and generating ids - which you don't need more databases for. OTOH using a database cluster, you will get problems with contention (which you shouldn't see other than as performance problems).

If you can shard the data and handle the resulting datasets independently, then yes, you will get performance benefits - but this does not negate the the need to tune each node.

Check you've got at least 4 Gb for the sort_buffer_size.

Beyond that, the limiting factor on performance is all about disk I/O. There's lots of ways to address this - but you should probably be considering a mirrored set of striped datasets on SSDs for optimal performance.

Solution 2:

  • Consider your limiting factor. It's almost certainly single-threaded CPU processing.
  • You've already determined that load data... is faster than insert, so use that.
  • You've already determined that really large files (by row number) slow things down a lot; you want to break them up into pieces.
  • Using non-overlapping primary keys, queue up at least N*CPU sets, using no more than one million rows... probably less (benchmark).
  • Use sequential blocks of primary keys in each file.

If you want to be really spiffy, you can create a multi-threaded program to feed a single file to a collection of named pipes and manage the insert instances.

In summary, you don't tune MySQL for this so much as you tune your workload to MySQL.