Does the "bs" option in "dd" really improve the speed?

What you have done is only a read speed test. If you are actually copying blocks to another device, the reads pause while the other device accepts the data you want to write. When that happens you can hit rotational latency on the read device (if it's a hard disk), so it's often significantly faster to read 1M chunks off the HDD: you run into rotational latency less often that way.

I know that when I'm copying hard disks I get a faster rate by specifying bs=1M than by using bs=4k or the default. I'm talking speed improvements of 30 to 300 percent. There's no need to tune it for the absolute best unless it's all you do every day, but picking something better than the default can cut hours off the execution time.

When you're using it for real, try a few different numbers and send the dd process a SIGUSR1 signal to get it to issue a status report, so you can see how it's going:

$ killall -SIGUSR1 dd
1811+1 records in
1811+1 records out
1899528192 bytes (1.9 GB, 1.8 GiB) copied, 468.633 s, 4.1 MB/s
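
If you want to compare a few sizes up front before committing to a long copy, a quick loop like the one below does the job. This is only a sketch: /dev/sdX and /mnt/target are placeholders for your source device and destination filesystem, every trial moves the same 256 MiB so the rates are comparable, and iflag=direct / conv=fsync are GNU dd options used so the page cache doesn't flatter the numbers.

#!/bin/sh
# Sketch: time a short real copy at several block sizes before the full run.
# /dev/sdX and /mnt/target are placeholders; every trial copies 256 MiB.
for bs in 4096 65536 1048576 4194304; do
    echo "--- bs=$bs ---"
    # iflag=direct bypasses the read-side page cache; conv=fsync forces the
    # written data out to the destination before dd reports its rate.
    dd if=/dev/sdX of=/mnt/target/bs-test bs=$bs count=$((268435456 / bs)) \
        iflag=direct conv=fsync 2>&1 | tail -n 1
    rm -f /mnt/target/bs-test
done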

With regard to the internal hard disk, at least: when you read from the device, the block layer has to retrieve a minimum of one sector, which is 512 bytes.

So when handling a 1-byte read you've only really read from the disk on the sector-aligned byte retrieval; the remaining 511 reads are served up by the cache.

You can demonstrate this as follows; in this example, sdb is the disk of interest:

# grep sdb /proc/diskstats
8      16 sdb 767 713 11834 6968 13710 6808 12970792 6846477 0 76967 6853359
...
# dd if=/dev/sdb of=/dev/null bs=1 count=512
512+0 records in
512+0 records out
512 bytes (512 B) copied, 0.0371715 s, 13.8 kB/s
# grep sdb /proc/diskstats
8      16 sdb 768 713 11834 6968 13710 6808 12970792 6846477 0 76967 6853359
...

The fourth column (which counts completed reads) indicates that only one read occurred, despite the fact that you requested 1-byte reads. This is expected behaviour, since this device (a SATA 2 disk) has to return at minimum its sector size; the kernel simply caches the entire sector.
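
One caveat if you repeat that experiment: reads satisfied from the page cache never reach the disk, so the read counter in /proc/diskstats will not move at all. Dropping the caches first (as root) keeps the numbers honest; a minimal sketch:

sync
echo 3 > /proc/sys/vm/drop_caches   # frees the page cache plus dentries and inodes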

The biggest factor at play with these request sizes is the overhead of issuing a system call for each read or write. Issuing a call for fewer than 512 bytes is simply inefficient. Very large reads require fewer system calls, at the cost of more memory being used to buffer them.
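
You can make that overhead visible by counting the syscalls themselves. The two commands below are only a sketch (/dev/sdX is a placeholder); both move the same 64 KiB, first as 1-byte reads and then as 4 KiB reads, and strace -c prints a summary of how many read() and write() calls each run issued.

# Sketch: same amount of data, wildly different numbers of syscalls.
strace -c -e trace=read,write dd if=/dev/sdX of=/dev/null bs=1 count=65536
strace -c -e trace=read,write dd if=/dev/sdX of=/dev/null bs=4096 count=16

Expect on the order of 65536 read calls for the first run and 16 for the second; that difference is where the time goes.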

4096 is typically a 'safe' number for reading because:

  • When reading with caching on (the default), a page is 4k. Filling a page with reads smaller than 4k is more complicated than keeping the read size and page size the same.
  • Most filesystem block sizes are set to 4k.
  • It's not so small that syscall overhead dominates (though for SSDs even 4k may be on the small side these days), but not so large that it consumes lots of memory. You can check these numbers on your own system with the commands just below.
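
A few stock commands will report those numbers for your own setup (sdX is a placeholder for the disk you care about):

getconf PAGESIZE            # kernel page size, typically 4096
stat -f -c %S /             # fundamental block size of the root filesystem
blockdev --getss /dev/sdX   # logical sector size of the disk, typically 512
blockdev --getbsz /dev/sdX  # block size the kernel uses for that device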