Processing large files through bash pipes, does it buffer?
I need to use a command like the following:
$ cat large.input.file | process.py > large.output.file
The problem is, won't the hard disk have a hard time jumping back and forth between reading the input file and writing the output file?
Is there a way to tell bash to use a large memory buffer when doing this kind of pipe?
Solution 1:
Don't worry. The OS will do the buffering for you, and it's usually very good at it.
That being said: if you can change process.py, you can implement your own buffering. If you can't change process.py, you could write your own buffer.py and use it like this:
$ cat large.input.file | buffer.py | process.py | buffer.py > large.output.file
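A minimal buffer.py along these lines, assuming Python 3, could be as simple as the following sketch (the 64 MB gulp size is an arbitrary figure to tune for your machine):
#!/usr/bin/env python3
# buffer.py - minimal sketch of the buffer stage suggested above.
# Read stdin in large gulps and write each gulp to stdout in one go,
# so the disk sees fewer, larger transfers.
import sys

CHUNK = 64 * 1024 * 1024  # bytes to accumulate per gulp; arbitrary, tune to taste

def main():
    stdin = sys.stdin.buffer    # binary streams, so arbitrary data passes through untouched
    stdout = sys.stdout.buffer
    while True:
        data = stdin.read(CHUNK)  # blocks until CHUNK bytes are available or EOF
        if not data:
            break
        stdout.write(data)
    stdout.flush()

if __name__ == "__main__":
    main()
Each stage then hands the kernel one big read or write at a time instead of many small ones, which is the whole point of the extra buffer stages.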
Probably much easier would be to read and write from a RAM disk.
Solution 2:
The OS will buffer the output to a certain amount, but there may still be a lot of head flipping if both the input and output files are on the same drive, unless your process.py does some buffering of its own.
You could replace cat in your example with pipe viewer (pv), which is available in most standard repositories and easily compiled if it isn't in your distribution's repo. It lets you set a larger buffer (with the -B/--buffer-bytes option) and shows a progress bar (unless you ask it not to), which can be very handy for a long operation if your process.py doesn't output its own progress information. For passing data from one place on a drive to another place on the same drive, this can make quite a difference unless the overall process is primarily CPU bound rather than I/O bound.
So for a 1 MB buffer you could do:
pv -B 1m large.input.file | process.py > large.output.file
I use pv all the time for this sort of thing, though mainly for the progress indicator rather than the tweakable buffer size.
Another option is to use dd, which is more "standard" (in the sense of being generally available by default, although its command-line format is a little different from most common commands), though it does not have the progress bar facility:
dd if=large.input.file bs=1048576 | process.py > large.output.file
Edit: p.s. Pedants may point out that cat is not needed in your example, as the following will work just as well and will be very slightly more efficient:
process.py < large.input.file > large.output.file
Some people refer to the removal of unnecessary calls to cat as "demoggification", though these people should probably not be encouraged...
Solution 3:
Isn't there an old Unix tool called "buffer"? Not that it would be needed with today's caching techniques, but it is there.