Count the number of lines in a file without reading entire file into memory?

I'm processing huge data files (millions of lines each).

Before I start processing I'd like to get a count of the number of lines in the file, so I can then indicate how far along the processing is.

Because of the size of the files, it would not be practical to read the entire file into memory, just to count how many lines there are. Does anyone have a good suggestion on how to do this?

Reading the file a line at a time:

count = File.foreach(filename).inject(0) {|c, line| c+1}

or the Perl-ish

File.foreach(filename) {}
count = $.

count = 0
File.open(filename) {|f| count = f.read.count("\n")}

Will be slower than

count = %x{wc -l #{filename}}.split.first.to_i

If you are in a Unix environment, you can just let wc -l do the work.

It will not load the whole file into memory; since it is optimized for streaming file and count word/line the performance is good enough rather then streaming the file yourself in Ruby.

SSCCE:

filename = 'a_file/somewhere.txt'
line_count = `wc -l "#{filename}"`.strip.split(' ')[0].to_i
p line_count

Or if you want a collection of files passed on the command line:

wc_output = `wc -l "#{ARGV.join('" "')}"`
line_count = wc_output.match(/^ *([0-9]+) +total$/).captures[0].to_i
p line_count

It doesn't matter what language you're using, you're going to have to read the whole file if the lines are of variable length. That's because the newlines could be anywhere and theres no way to know without reading the file (assuming it isn't cached, which generally speaking it isn't).

If you want to indicate progress, you have two realistic options. You can extrapolate progress based on assumed line length:

assumed lines in file = size of file / assumed line size
progress = lines processed / assumed lines in file * 100%

since you know the size of the file. Alternatively you can measure progress as:

progress = bytes processed / size of file * 100%

This should be sufficient.

using ruby:

file=File.open("path-to-file","r")
file.readlines.size

39 milliseconds faster then wc -l on a 325.477 lines file

Summary of the posted solutions

require 'benchmark'
require 'csv'

filename = "name.csv"

Benchmark.bm do |x|
  x.report { `wc -l < #{filename}`.to_i }
  x.report { File.open(filename).inject(0) { |c, line| c + 1 } }
  x.report { File.foreach(filename).inject(0) {|c, line| c+1} }
  x.report { File.read(filename).scan(/\n/).count }
  x.report { CSV.open(filename, "r").readlines.count }
end

File with 807802 lines:

       user     system      total        real
   0.000000   0.000000   0.010000 (  0.030606)
   0.370000   0.050000   0.420000 (  0.412472)
   0.360000   0.010000   0.370000 (  0.374642)
   0.290000   0.020000   0.310000 (  0.315488)
   3.190000   0.060000   3.250000 (  3.245171)

Count the number of lines in a file without reading entire file into memory?

Related

Recent Posts