Excel 2010 slow with large sets of data
Problem:
Recently I have been working with fairly large sets of data.
However, Excel seems to handle these very slowly (it takes 3-5 minutes just to open or save the file).
Spreadsheet details
- Rows: 50,000
- Columns: 90
- Worksheets: 1
- File size: 157 MB
The sheet consists of pure data. No formulae or VBA scripts have been added yet. There are no cell references (e.g. =H3) or arrays.
Specs:
- Office 2010 (32-bit)
- Processor: i7 3.4 GHz quad-core
- RAM: 6 GB
- Windows 7
Steps taken thus far
I tried saving it as an Excel binary (.xlsb) file, which helped a little (and reduced the file size somewhat as well).
However, 50,000 rows is nowhere near the limit of Excel 2010 (which is 1,048,576 rows), yet it's really slow. I also remember older versions of Excel being able to deal with large chunks of data relatively fast.
Is there any way of speeding Excel up?
Solution 1:
I decided to try to mimic your scenario as closely as possible. To do this I:
- created an MS Excel spreadsheet with 50,000 rows and 90 columns of data (thankfully I work with large data sets, so this wasn't too hard to do; a sketch for generating a similar test file follows this list)
- ensured the data was a mixture of formats: Text, Dates, Number and General
- did not use any formulas or VBA code
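For anyone who wants to reproduce a comparable test file, here is a minimal sketch, assuming Python with the openpyxl library (my choice of tool, not part of the original test); the file name and data distribution are illustrative only:

```
import random
import string
from datetime import date, timedelta

from openpyxl import Workbook  # third-party: pip install openpyxl

ROWS, COLS = 50_000, 90  # matches the test sheet described above

def random_cell(col):
    """Rotate through Text, Date, and Number so the formats are mixed."""
    kind = col % 3
    if kind == 0:
        return "".join(random.choices(string.ascii_letters, k=8))
    if kind == 1:
        return date(2010, 1, 1) + timedelta(days=random.randrange(3650))
    return random.random() * 10_000

wb = Workbook(write_only=True)  # streaming mode keeps memory use low
ws = wb.create_sheet()
for _ in range(ROWS):
    ws.append([random_cell(c) for c in range(COLS)])
wb.save("test_50000x90.xlsx")
```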
However, for me this spreadsheet was only 24MB in size, which is only about 15% of the size of yours, so I cannot account for the difference.
Unfortunately, the only context you've provided is the read/write speed for opening/saving your spreadsheet, so this is all I could test for. My results are as follows:
- 7.68 seconds to save the data to a local hard drive
- 3 mins 53 seconds to save the data to a network drive
- 10.4 seconds to open the data from a local drive
- 50.43 seconds to open the data from a network drive
Now, since my 50,000 rows and 90 columns of data only amounted to 24MB, my data is obviously going to take a lot less time to save and load than your 157MB file (which is about 6.5 times larger).
However, as you can see, the speed of opening and saving files is not dictated purely by the size of the file itself. Working with an SSD is a lot faster than a local hard drive, which in turn is a lot faster than a networked drive. Also, the type of network has a huge impact on data transfer speeds (e.g. Gigabit Ethernet is a lot faster than Fast Ethernet).
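As a sanity check, you can measure raw read throughput separately from Excel's parsing overhead. A minimal sketch, assuming Python 3.8+ (the path is a placeholder for your own file):

```
import time

PATH = r"\\server\share\bigdata.xlsb"  # placeholder: point at your own file

def read_throughput_mb_s(path, chunk_size=8 * 1024 * 1024):
    """Time a raw sequential read so Excel's parsing cost is excluded."""
    total_bytes = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            total_bytes += len(chunk)
    elapsed = time.perf_counter() - start
    return total_bytes / (1024 * 1024) / elapsed

print(f"{read_throughput_mb_s(PATH):.1f} MB/s raw read")
```

If the raw read is fast but Excel still takes minutes to open the file, the bottleneck is parsing rather than the drive or network.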
Unfortunately, the closest computer I could find to run this test had the following specs:
- Windows 7 Enterprise (64-bit)
- i5 2.6GHz
- 8GB RAM
- MS Excel 2010 (32-bit)
Summary
In a nutshell:
- there are a lot of factors which impact read/write speeds, not just the amount of data but also your hardware and/or network environment
- a better test would be to see how long it takes for Excel to do some actual number crunching with your data, as opposed to read/write transfer speeds (a timing sketch follows this list)
- another factor in the case of Excel is if you're loading any add-ins etc and where they're located
- your data is 157 MB in size, 6.5 times the size of my spreadsheet, which also contains 4.5 million cells of data (so what accounts for the difference?)
- install more RAM (I have worked with large data sets in Excel - Windows and Mac - for nearly two decades, and the biggest factor in Excel's performance is the amount of RAM you have installed)
- with 4.5 million cells of data you may need to reconsider whether a database would serve your needs better (see the SQLite sketch below)
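On the number-crunching point: one way to time a calculation rather than file I/O is to drive Excel over COM. A rough sketch, assuming Windows with Excel installed and the pywin32 package; the file path is a placeholder:

```
import time

import win32com.client  # third-party: pip install pywin32

PATH = r"C:\data\bigdata.xlsb"  # placeholder: point at your own file

excel = win32com.client.Dispatch("Excel.Application")
excel.Visible = False
wb = excel.Workbooks.Open(PATH)
try:
    start = time.perf_counter()
    excel.CalculateFull()  # force a full recalculation of open workbooks
    print(f"Full recalculation took {time.perf_counter() - start:.1f} s")
finally:
    wb.Close(SaveChanges=False)
    excel.Quit()
```

Note that with a sheet of pure data and no formulae this will return almost instantly; it only becomes a meaningful benchmark once formulae are added.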
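And on the database point: a minimal sketch of pushing the data into SQLite using only Python's standard library, assuming the sheet has first been saved as CSV (file names are placeholders):

```
import csv
import sqlite3

CSV_PATH = "bigdata.csv"  # placeholder: the sheet exported as CSV
DB_PATH = "bigdata.db"

with open(CSV_PATH, newline="", encoding="utf-8") as f:
    reader = csv.reader(f)
    header = next(reader)
    cols = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" * len(header))

    con = sqlite3.connect(DB_PATH)
    con.execute(f"CREATE TABLE IF NOT EXISTS data ({cols})")
    con.executemany(f"INSERT INTO data VALUES ({placeholders})", reader)
    con.commit()
    con.close()
```

Once the data is in SQLite, queries run without loading all 4.5 million cells into memory at once.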
Solution 2:
A spreadsheet is definitely not the right tool for handling large sets of data, because each cell is loaded as an object in memory (a very inefficient representation).
You can store your data in CSV (or TSV) format instead and use dedicated tools, such as OpenRefine or GNU awk, to manipulate it.
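For example, a minimal sketch of streaming through a large TSV with Python's standard csv module (the file and column names are made up for illustration):

```
import csv

total = 0.0
with open("bigdata.tsv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f, delimiter="\t")
    for row in reader:  # one row in memory at a time
        total += float(row["amount"])  # "amount" is an assumed column name

print(f"Sum of amount column: {total}")
```

Because the file is processed one row at a time, memory use stays flat no matter how many rows there are.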
EDIT: I recently tried VisiData on a fairly large dataset (1.7 GB, 300k+ lines) with success. It's very convenient for quickly slicing and dicing data.