Parsing XLS and XLSX (MS Excel) files with Ruby?

I recently needed to parse some Excel files with Ruby. The abundance of libraries and options turned out to be confusing, so I wrote a blog post about it.

Here is a table of different Ruby libraries and what they support:

enter image description here

If you care about performance, here is how the xlsx libraries compare: enter image description here

I have sample code to read xlsx files with each supported library here

Here are some examples for reading xlsx files with some different libraries:

rubyXL

require 'rubyXL'

workbook = RubyXL::Parser.parse './sample_excel_files/xlsx_500_rows.xlsx'
worksheets = workbook.worksheets
puts "Found #{worksheets.count} worksheets"

worksheets.each do |worksheet|
  puts "Reading: #{worksheet.sheet_name}"
  num_rows = 0
  worksheet.each do |row|
    row_cells = row.cells.map{ |cell| cell.value }
    num_rows += 1
  end
  puts "Read #{num_rows} rows"
end

roo

require 'roo'

workbook = Roo::Spreadsheet.open './sample_excel_files/xlsx_500_rows.xlsx'
worksheets = workbook.sheets
puts "Found #{worksheets.count} worksheets"

worksheets.each do |worksheet|
  puts "Reading: #{worksheet}"
  num_rows = 0
  workbook.sheet(worksheet).each_row_streaming do |row|
    row_cells = row.map { |cell| cell.value }
    num_rows += 1
  end
  puts "Read #{num_rows} rows" 
end

creek

require 'creek'

workbook = Creek::Book.new './sample_excel_files/xlsx_500_rows.xlsx'
worksheets = workbook.sheets
puts "Found #{worksheets.count} worksheets"

worksheets.each do |worksheet|
  puts "Reading: #{worksheet.name}"
  num_rows = 0
  worksheet.rows.each do |row|
    row_cells = row.values
    num_rows += 1
  end
  puts "Read #{num_rows} rows"
end

simple_xlsx_reader

require 'simple_xlsx_reader'

workbook = SimpleXlsxReader.open './sample_excel_files/xlsx_500000_rows.xlsx'
worksheets = workbook.sheets
puts "Found #{worksheets.count} worksheets"

worksheets.each do |worksheet|
  puts "Reading: #{worksheet.name}"
  num_rows = 0
  worksheet.rows.each do |row|
    row_cells = row
    num_rows += 1
  end
  puts "Read #{num_rows} rows"
end

Here is an example of reading a legacy xls file using the spreadsheet library:

spreadsheet

require 'spreadsheet'

# Note: spreadsheet only supports .xls files (not .xlsx)
workbook = Spreadsheet.open './sample_excel_files/xls_500_rows.xls'
worksheets = workbook.worksheets
puts "Found #{worksheets.count} worksheets"

worksheets.each do |worksheet|
  puts "Reading: #{worksheet.name}"
  num_rows = 0
  worksheet.rows.each do |row|
    row_cells = row.to_a.map{ |v| v.methods.include?(:value) ? v.value : v }
    num_rows += 1
  end
  puts "Read #{num_rows} rows"
end

Just found roo, that might do the job - works for my requirements, reading a basic spreadsheet.


The roo gem works great for Excel (.xls and .xlsx) and it's being actively developed.

I agree the syntax is not great nor ruby-like. But that can be easily achieved with something like:

class Spreadsheet
  def initialize(file_path)
    @xls = Roo::Spreadsheet.open(file_path)
  end

  def each_sheet
    @xls.sheets.each do |sheet|
      @xls.default_sheet = sheet
      yield sheet
    end
  end

  def each_row
    0.upto(@xls.last_row) do |index|
      yield @xls.row(index)
    end
  end

  def each_column
    0.upto(@xls.last_column) do |index|
      yield @xls.column(index)
    end
  end
end

I'm using creek which uses nokogiri. It is fast. Used 8.3 seconds on a 21x11250 xlsx table on my Macbook Air. Got it to work on ruby 1.9.3+. The output format for each row is a hash of row and column name to cell content: {"A1"=>"a cell", "B1"=>"another cell"} The hash makes no guarantee that the keys will be in the original column order. https://github.com/pythonicrubyist/creek

dullard is another great one that uses nokogiri. It is super fast. Used 6.7 seconds on a 21x11250 xlsx table on my Macbook Air. Got it to work on ruby 2.0.0+. The output format for each row is an array: ["a cell", "another cell"] https://github.com/thirtyseven/dullard

simple_xlsx_reader which has been mentioned is great, a bit slow. Used 91 seconds on a 21x11250 xlsx table on my Macbook Air. Got it to work on ruby 1.9.3+. The output format for each row is an array: ["a cell", "another cell"] https://github.com/woahdae/simple_xlsx_reader

Another interesting one is oxcelix. It uses ox's SAX parser which supposedly faster than both nokogiri's DOM and SAX parser. It supposedly outputs a Matrix. I could not get it to work. Also, there were some dependency issues with rubyzip. Would not recommend it.

In conclusion, creek seems like a good choice. Other posts recommend simple_xlsx_parser as it has similar performance.

Removed dullard as recommended as it's outdated and people are getting errors/having problems with it.