Is it a good idea to read multiple files at the same time?

One of our company's servers has 32 CPUs, and we have 1000+ very large files to process. I'm not sure whether it is a good idea to read 32 files at the same time so that all cores can perform independent calculations simultaneously. Could anyone briefly explain how a hard disk works? If I read 32 files at the same time, would that slow down the reading speed? Thanks!


The hard disk is traditionally a mechanical data storage device. I'm assuming the server uses mechanical disks, and not newer solid-state drives (SSDs), which have no moving parts. I'm also assuming, with this much data and processing power, that more than one hard disk is being used simultaneously (RAID or NAS). These details can affect performance significantly, and could render much of the following inaccurate.

Hard disks, being mechanical devices, have spinning discs (platters) inside, like an old-fashioned record player or CD. Each platter is coated with a magnetic material that can record and play back tiny magnetic pulses, much like audio tape. A positionable "read-write" head flies just above each surface, usually in tandem on both sides of a platter, ready to move across the surface to locate, read, and write these magnetic pulses. Both the spinning and the head movement take time. The more "work" a disk is given to do, the longer it takes to finish, simply because it has to physically locate more microscopic areas on the surface of the platter(s).

That said, imagine your boss wants all employees to read all 29 volumes of the Encyclopaedia Britannica and give a summary. Each volume is stored on its own hard disk, so there are 29 hard disks. There are two ways in which the whole thing can be read:

  1. Pick up the 1st volume, and have employees take turns reading one page at a time until the volume is finished. Repeat until all volumes are finished. The boss collects and re-orders all of the pages as they are processed, one volume at a time.
  2. Employees pick up all 29 volumes at the same time and read pages essentially at random (the net effect) until all volumes are finished. The boss collects and re-orders pages from 29 volumes at once as they are processed...

Option #1 seems "antiquated"; however, one important thing about this method is that the other 28 disks are not being used at all. Only one is. Hard disks are far better at reading data sequentially than randomly, because sequential reading avoids the delays caused by the read-write heads seeking back and forth.

Option #2 would work, and sounds reasonable, but it isn't ideal for two reasons: (a) there is almost no sequential reading, and (b) all of the disks are in use, which consumes more power and puts a bigger demand on the server to run them all concurrently. It would end up taking much longer this way.
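If you want to see that difference for yourself, here is a minimal benchmark sketch in Python (my choice of language; the question doesn't name one). The file path, file size, and chunk size are all placeholders. One caveat: the operating system's page cache can hide the effect on a small test file, so to see real on-disk behaviour, use a file much larger than RAM (or drop the caches between runs).

```python
import os
import random
import time

PATH = "testfile.bin"            # hypothetical test file
CHUNK = 4096                     # read in 4 KiB pieces
SIZE = 256 * 1024 * 1024         # 256 MiB test file

# Create the test file once (comment this out on later runs).
with open(PATH, "wb") as f:
    for _ in range(SIZE // (1024 * 1024)):
        f.write(os.urandom(1024 * 1024))

def sequential_read():
    # Read the file front to back; the head barely has to move.
    with open(PATH, "rb") as f:
        while f.read(CHUNK):
            pass

def random_read():
    # Read the same chunks, but in shuffled order, forcing a seek
    # before every read.
    offsets = list(range(0, SIZE, CHUNK))
    random.shuffle(offsets)
    with open(PATH, "rb") as f:
        for off in offsets:
            f.seek(off)
            f.read(CHUNK)

for name, fn in [("sequential", sequential_read), ("random", random_read)]:
    start = time.time()
    fn()
    print(f"{name}: {time.time() - start:.2f} s")
```

On a mechanical disk the random version is typically many times slower, even though both read exactly the same number of bytes.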

So yes, if you try to process 32 huge files simultaneously, you will place a tremendous load on the disks, and they will probably slow to a crawl. It is more complicated, but likely a better solution, to have the 32 cores "take turns" with one huge file at a time until they are all processed. (By "take turns" I mean break each file up into smaller, more manageable chunks.) Again, the goal is to make the disks read as sequentially as possible, and avoid random seeking back and forth.

Software to accomplish this must be multi-threaded: just one program is started by the user, but it creates 31 new "worker threads" for the other CPU cores. The main program reads the data sequentially and splits the incoming stream into chunks for the other threads (cores) to process. They all then "take turns" crunching small pieces of the whole file until it is completely processed.
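Here is a minimal sketch of that pattern, again in Python as an assumption. The worker count, chunk size, file names, and the crunch() function are all placeholders for your real workload. One Python-specific note: a *process* pool stands in for the "worker threads" described above, because CPython's GIL prevents ordinary threads from running CPU-bound work in parallel.

```python
import concurrent.futures

N_WORKERS = 31                    # leave one core free for the reader
CHUNK_SIZE = 64 * 1024 * 1024     # 64 MiB per chunk; tune for your workload

def crunch(chunk: bytes) -> int:
    """Placeholder for the real per-chunk calculation."""
    return sum(chunk)

def read_chunks(path):
    """Read one file sequentially, yielding a chunk at a time."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            yield chunk

def process_file(path, pool):
    """Feed sequential chunks of one file to the worker pool."""
    results = []
    in_flight = set()
    for chunk in read_chunks(path):
        in_flight.add(pool.submit(crunch, chunk))
        # Cap the number of queued chunks so the reader doesn't pull
        # the whole file into memory ahead of the workers.
        if len(in_flight) >= N_WORKERS * 2:
            done, in_flight = concurrent.futures.wait(
                in_flight, return_when=concurrent.futures.FIRST_COMPLETED)
            results.extend(f.result() for f in done)
    for f in concurrent.futures.as_completed(in_flight):
        results.append(f.result())
    return results

if __name__ == "__main__":
    # Process the files one at a time: only the current file is being
    # read, and it is being read sequentially.
    with concurrent.futures.ProcessPoolExecutor(max_workers=N_WORKERS) as pool:
        for path in ["file0001.dat", "file0002.dat"]:  # hypothetical names
            process_file(path, pool)
```

The key design choice is that the reader stays only a bounded number of chunks ahead of the workers, so memory use stays flat while the disk keeps streaming sequentially, which is exactly the "take turns" behaviour described above.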