What is the Fastest Method for High Performance Sequential File I/O in C++?

Assuming the following for...
Output:
The file is opened...
Data is 'streamed' to disk. The data in memory is in a large contiguous buffer. It is written to disk in its raw form directly from that buffer. The size of the buffer is configurable, but fixed for the duration of the stream. Buffers are written to the file, one after another. No seek operations are conducted.
...the file is closed.

Input:
A large file (sequentially written as above) is read from disk from beginning to end.


Are there generally accepted guidelines for achieving the fastest possible sequential file I/O in C++?

Some possible considerations:

  • Guidelines for choosing the optimal buffer size
  • Will a portable library like boost::asio be too abstracted to expose the intricacies of a specific platform, or can it be assumed to be optimal?
  • Is asynchronous I/O always preferable to synchronous? What if the application is not otherwise CPU-bound?

I realize that this will have platform-specific considerations. I welcome general guidelines as well as those for particular platforms.
(My most immediate interest is in Win x64, but I am also interested in comments on Solaris and Linux.)


Solution 1:

Are there generally accepted guidelines for achieving the fastest possible sequential file I/O in C++?

Rule 0: Measure. Use all available profiling tools and get to know them. It's almost a commandment in programming that if you didn't measure it you don't know how fast it is, and for I/O this is even more true. Make sure to test under actual work conditions if you possibly can. A process that has no competition for the I/O system can be over-optimized, fine-tuned for conditions that don't exist under real loads.
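As a starting point for Rule 0, here is a minimal timing sketch. The names seconds_taken and mib_per_second are made up for illustration and are not from any library. It only measures wall-clock time around whatever callable you hand it; the OS cache can absorb writes, so numbers from small files mostly reflect call overhead rather than disk speed.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <utility>

// Runs `work` once and returns the elapsed wall-clock time in seconds.
// steady_clock is used because it never jumps backwards.
template <class Work>
double seconds_taken(Work&& work) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<Work>(work)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Converts a byte count and a duration into MiB/s.
// Returns 0 for a zero duration instead of dividing by zero.
inline double mib_per_second(std::size_t bytes, double seconds) {
    assert(seconds >= 0.0);
    if (seconds == 0.0) return 0.0;
    return (static_cast<double>(bytes) / (1024.0 * 1024.0)) / seconds;
}
```

You would wrap each candidate I/O strategy in a lambda, time it several times under realistic load, and compare the resulting throughput figures rather than trusting a single run.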

  1. Use memory-mapped files instead of explicit write calls. This isn't always faster, but it gives the OS the opportunity to optimize the I/O in an operating system-specific but relatively portable way, by avoiding unnecessary copying and by taking advantage of the OS's knowledge of how the disk is actually being used. ("Portable" if you use a wrapper rather than an OS-specific API call.)

  2. Try to linearize your output as much as possible. Having to jump around memory to find the buffers to write can have noticeable effects under optimized conditions, because cache lines, paging and other memory subsystem issues will start to matter. If you have lots of buffers, look into support for scatter-gather I/O, which tries to do that linearizing for you.
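To illustrate the scatter-gather idea from point 2, here is a small POSIX-only sketch using writev(). The function name gather_write_file is hypothetical. It assumes the number of buffers stays below the system's IOV_MAX limit and, for brevity, does not retry after a partial write, which real code would need to do.

```cpp
#include <cassert>
#include <fcntl.h>
#include <string>
#include <sys/uio.h>
#include <unistd.h>
#include <vector>

// Writes several separate buffers to `path` with one writev() call, so the
// kernel gathers them instead of the caller copying them into one big buffer.
// Returns the number of bytes written, or -1 on error.
long gather_write_file(const char* path, const std::vector<std::string>& buffers) {
    assert(path != nullptr);
    const int fd = ::open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return -1;

    std::vector<iovec> iov;
    iov.reserve(buffers.size());
    for (const auto& b : buffers) {
        iovec v;
        v.iov_base = const_cast<char*>(b.data());  // writev only reads from it
        v.iov_len = b.size();
        iov.push_back(v);
    }

    // Note: a short count is possible; production code should loop.
    const ssize_t written = ::writev(fd, iov.data(), static_cast<int>(iov.size()));
    ::close(fd);
    return static_cast<long>(written);
}
```

Windows has its own gather API with stricter alignment requirements, so treat this as a Linux/Solaris illustration only.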

Some possible considerations:

  • Guidelines for choosing the optimal buffer size

Page size for starters, but be ready to tune from there.

  • Will a portable library like boost::asio be too abstracted to expose the intricacies of a specific platform, or can it be assumed to be optimal?

Don't assume it's optimal. It depends on how thoroughly the library gets exercised on your platform, and how much effort the developers put into making it fast. Having said that, a portable I/O library can be very fast, because fast abstractions exist on most systems, and it's usually possible to come up with a general API that covers a lot of the bases. Boost.Asio is, to the best of my limited knowledge, fairly well tuned for whichever platform it runs on: there's a whole family of OS and OS-variant specific APIs for fast async I/O (e.g. epoll, /dev/epoll, kqueue, Windows overlapped I/O), and Asio wraps them all.

  • Is asynchronous I/O always preferable to synchronous? What if the application is not otherwise CPU-bound?

Asynchronous I/O isn't faster in a raw sense than synchronous I/O. What asynchronous I/O does is ensure that your code is not wasting time waiting for the I/O to complete. It is faster in a general way than the other method of not wasting that time, namely using threads, because it will call back into your code when I/O is ready and not before. There are no false starts or concerns with idle threads needing to be terminated.
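To make the "don't waste time waiting" point concrete, here is a portable double-buffering sketch. Note the assumption: it uses std::async (that is, a helper thread) as a stand-in, not the platform's real asynchronous API such as Windows overlapped I/O. The function name write_double_buffered and the shape of the `fill` callback are made up for illustration. The idea is only that the next block is produced while the previous one is still being written.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <future>
#include <vector>

// `fill(char* dst, std::size_t capacity)` is a hypothetical producer: it puts
// up to `capacity` bytes into `dst` and returns the count, or 0 when done.
// While one buffer is being written in the background, the other is filled.
template <class Fill>
bool write_double_buffered(const char* path, std::size_t block_size, Fill fill) {
    assert(block_size > 0);
    std::FILE* f = std::fopen(path, "wb");
    if (!f) return false;

    std::vector<char> front(block_size), back(block_size);
    std::future<bool> pending;  // the write currently in flight, if any
    bool ok = true;
    for (;;) {
        const std::size_t n = fill(front.data(), front.size());
        if (pending.valid()) ok = pending.get();  // wait for the previous write
        if (!ok || n == 0) break;
        front.swap(back);  // `back` now holds the freshly produced data
        const char* data = back.data();
        pending = std::async(std::launch::async, [f, data, n] {
            return std::fwrite(data, 1, n, f) == n;
        });
    }
    return (std::fclose(f) == 0) && ok;
}
```

If producing the data is cheap compared with writing it, this buys very little, which is the usual answer to "what if the application is not otherwise CPU-bound?".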

Solution 2:

General advice is to turn off buffering and read/write in large chunks, but not too large: with oversized chunks you waste time waiting for the whole I/O to complete when you could already be munching away at the first megabyte. Finding the sweet spot is straightforward, because there is only one knob to turn: the chunk size.
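A minimal sketch of that advice, assuming "buffering" here means the C runtime's stream buffer (bypassing the OS cache is a separate, platform-specific matter). The function name write_unbuffered_chunks is hypothetical, and chunk_size is the one knob to tune.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdio>

// Writes `size` bytes from `data` to `path` in pieces of at most `chunk_size`
// bytes, with stdio buffering switched off so each fwrite goes straight to
// the OS. Returns true if every byte was written and the file closed cleanly.
bool write_unbuffered_chunks(const char* path, const char* data,
                             std::size_t size, std::size_t chunk_size) {
    assert(chunk_size > 0);
    std::FILE* f = std::fopen(path, "wb");
    if (!f) return false;
    std::setvbuf(f, nullptr, _IONBF, 0);  // must come before any other I/O on f

    bool ok = true;
    for (std::size_t off = 0; ok && off < size;) {
        const std::size_t n = std::min(chunk_size, size - off);
        ok = std::fwrite(data + off, 1, n, f) == n;
        off += n;
    }
    return (std::fclose(f) == 0) && ok;
}
```

Timing this function over a range of chunk sizes on the target machine is the quickest way to find the sweet spot.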

Beyond that, for input, mmap()ing the file shared and read-only is (if not the fastest, then) the most efficient way. Call madvise() if your platform has it to tell the kernel how you will traverse the file, so it can do readahead and discard the pages quickly afterwards.
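Roughly, the input side could look like the sketch below. It is written against Linux headers; the function name sum_bytes_mapped is made up, and summing the bytes is just a placeholder for whatever processing you actually do. madvise() is only a hint, so its return value is deliberately ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Maps `path` shared and read-only, tells the kernel the access pattern is
// sequential, then walks the mapping once from start to end.
// Returns the sum of all bytes, or -1 on failure.
long long sum_bytes_mapped(const char* path) {
    assert(path != nullptr);
    const int fd = ::open(path, O_RDONLY);
    if (fd < 0) return -1;

    struct stat st;
    if (::fstat(fd, &st) != 0) { ::close(fd); return -1; }
    if (st.st_size == 0) { ::close(fd); return 0; }  // mmap rejects length 0

    const std::size_t len = static_cast<std::size_t>(st.st_size);
    void* p = ::mmap(nullptr, len, PROT_READ, MAP_SHARED, fd, 0);
    ::close(fd);  // the mapping remains valid after the descriptor is closed
    if (p == MAP_FAILED) return -1;

    ::madvise(p, len, MADV_SEQUENTIAL);  // advisory: enables aggressive readahead

    const unsigned char* bytes = static_cast<const unsigned char*>(p);
    long long sum = 0;
    for (std::size_t i = 0; i < len; ++i) sum += bytes[i];

    ::munmap(p, len);
    return sum;
}
```

On Windows the equivalent would go through CreateFileMapping and MapViewOfFile instead; that variant is not shown here.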

For output, if you already have a buffer, consider underpinning it with a file (also with mmap()), so you don't have to copy the data in userspace.

If mmap() is not to your liking, then there's posix_fadvise(), and, for the really tough cases, async file I/O.

(All of the above is POSIX; Windows names may be different.)

Solution 3:

For Windows, you'll want to make sure you pass FILE_FLAG_SEQUENTIAL_SCAN in your CreateFile() call, if you opt to use the platform-specific Windows API. This optimizes caching for sequential I/O. As far as buffer sizes go, a buffer size that is a multiple of the disk sector size is typically advised. 8K is a nice starting point with little to be gained from going larger.

This article compares asynchronous and synchronous I/O on Windows:
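The "multiple of the sector size" rule is easy to encode in a small helper. This is a sketch; the name round_up_to_multiple is made up, and the sector sizes in the comments (512 and 4096 bytes) are assumed typical values, not something queried from the device.

```cpp
#include <cassert>
#include <cstddef>

// Rounds `size` up to the next multiple of `unit` (e.g. the sector size).
// Assumes size + unit - 1 does not overflow std::size_t.
inline std::size_t round_up_to_multiple(std::size_t size, std::size_t unit) {
    assert(unit > 0);
    return ((size + unit - 1) / unit) * unit;
}

// Examples, assuming 512-byte and 4096-byte sectors:
//   round_up_to_multiple(8000, 512)  yields 8192
//   round_up_to_multiple(8192, 4096) yields 8192 (already aligned)
```

Use the real sector size reported by the platform when it matters, for example when opening files with caching disabled.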

http://msdn.microsoft.com/en-us/library/aa365683(VS.85).aspx

Solution 4:

As you noted above it all depends on the machine / system / libraries that you are using. A fast solution on one system may be slow on another.

A general guideline, though, is to write in chunks that are as large as possible.
Writing a byte at a time is typically the slowest.

The best way to know for sure is to code a few different ways and profile them.

Solution 5:

You asked about C++, but it sounds like you're past that and ready to get a little platform-specific.

On Windows, FILE_FLAG_SEQUENTIAL_SCAN with a file mapping is probably the fastest way. In fact, your process can exit before the file actually makes it onto the disk. Without an explicitly blocking flush operation, it can take up to 5 minutes for Windows to begin writing those pages.

You need to be careful if the files are not on local devices but on a network drive. Network errors will show up as SEH exceptions, which you will need to be prepared to handle.

On *nixes, you might get a bit higher performance writing sequentially to a raw disk device. This is possible on Windows too, but not as well supported by the APIs. This will avoid a little filesystem overhead, but it may not amount to enough to be useful.

Loosely speaking, RAM is 1000 or more times faster than disks, and CPU is faster still. There are probably not a lot of logical optimizations that will help, except avoiding movements of the disk heads (seek) whenever possible. A dedicated disk just for this file can help significantly here.