How does `wc -l` work?
I have to read a large file and before I start reading it, I need to know the total number of lines in the file (which are in millions).
I have implemented a lot of solutions and have found one. But during my search I was thinking to look at how wc -l
works. I couldn't find anything on Google.
Though I have found a solution to my problem, I would still like to know how wc -l
works since it can calculate the number of lines of a file with 92 Million lines in a few seconds!
How?
Solution 1:
It reads the entire file and counts the number of line-endings. Counting line endings is really cheap; most of the time spent is reading the file. If the file happens to be (mostly) in the buffer cache, that will be cheap too. Otherwise, it will depend on the speed of your file storage.
In other words, there is no magic.
Solution 2:
WC just reads the file in blocks of raw bytes (preferable in multiples of the natural block-size of the underlying filesystem on which the file is located).
Then it just scans through the buffer counting the end-of-line characters.
(It also counts spaces, tabs, form-feeds and other special characters, just in case you wanted other information than the -l output.)
Reading from disk is the costly part in terms of speeds. The scanning of the buffer takes neglect-able time compared to that.
Say you've got a 90 million lines with on average 100 characters per line.
That is around 9.000.000.000 characters or about 860 MB.
A decent PC with a SATA-3Gb/s drive will do that under 10 seconds. Even on a relatively slow filesystem with some other activity going on at the same time.
A fast machine with some performance tuning and a optimized filesystem can do it under 5 seconds, even without having to resort to SATA-6G and a SSD drive.