What's the best way to parse a tab-delimited file in Ruby?

What's the best (most efficient) way to parse a tab-delimited file in Ruby?


The Ruby CSV library lets you specify the field delimiter. Ruby 1.9 uses FasterCSV. Something like this would work:

require "csv"
parsed_file = CSV.read("path-to-file.csv", col_sep: "\t")

The rules for TSV are actually a bit different from CSV. The main difference is that CSV has provisions for sticking a comma inside a field and then using quotation characters and escaping quotes inside a field. I wrote a quick example to show how the simple response fails:

require 'csv'
line = 'boogie\ttime\tis "now"'
begin
  line = CSV.parse_line(line, col_sep: "\t")
  puts "parsed correctly"
rescue CSV::MalformedCSVError
  puts "failed to parse line"
end

begin
  line = CSV.parse_line(line, col_sep: "\t", quote_char: "Ƃ")
  puts "parsed correctly with random quote char"
rescue CSV::MalformedCSVError
  puts "failed to parse line with random quote char"
end

#Output:
# failed to parse line
# parsed correctly with random quote char

If you want to use the CSV library you could used a random quote character that you don't expect to see if your file (the example shows this), but you could also use a simpler methodology like the StrictTsv class shown below to get the same effect without having to worry about field quotations.

# The main parse method is mostly borrowed from a tweet by @JEG2
class StrictTsv
  attr_reader :filepath
  def initialize(filepath)
    @filepath = filepath
  end

  def parse
    open(filepath) do |f|
      headers = f.gets.strip.split("\t")
      f.each do |line|
        fields = Hash[headers.zip(line.split("\t"))]
        yield fields
      end
    end
  end
end

# Example Usage
tsv = Vendor::StrictTsv.new("your_file.tsv")
tsv.parse do |row|
  puts row['named field']
end

The choice of using the CSV library or something more strict just depends on who is sending you the file and whether they are expecting to adhere to the strict TSV standard.

Details about the TSV standard can be found at http://en.wikipedia.org/wiki/Tab-separated_values