How can I parse extremely large (70+ GB) .txt files?

Solution 1:

Choose your tools

It seems Excel is not a suitable tool for what you want to do.

One approach would be to use a different tool to consolidate or summarise the data. awk, sed, grep or perl might be better suited to this initial processing, creating a smaller CSV file that can then be processed in Excel or other tools.

There are other tools that may be better suited to doing the whole job. Maybe something like R or a DBMS. It depends what you want to do with the data.

For simply taking a bunch of dissimilar text files and reordering and selecting columns I'd jump immediately to perl. Others would use awk.

Since tools like awk and perl can process files line by line and don't need to store everything in memory, they can process huge files that would choke other tools. They can also be surprisingly fast.


Just for fun, an example

With this data

Apples,27,500,10.2,fruit,100,200,300
Chairs  1   501 123.78  furniture 101   201 301
Europe, 655, 502,0.0001,continent,   102, 202,302 

we can produce this output

fruit:Apples:10.2
furniture:Chairs:123.78
continent:Europe:0.0001

using this command

perl -l -n -e "print join(':',(split(/[, \t] */))[4,0,3])" *.txt

Explanation

element               what it does
-l                    add a newline after every print
-n                    process line by line but don't implicitly print
-e                    what follows is a program to execute
print                 print the result of the following expression
join(":", list)       create one string from a list, with ":" between each item
split(/expr/)         use the expression to divide the line into fields
[, \t]                either a comma, a space or a tab, followed by...
 * (space asterisk)   ...zero or more spaces
(list)[4,0,3]         select the 4th, 0th and 3rd items from the list

That one-line program is equivalent to the following, which may be easier to follow:

#!perl
use warnings;
use strict;

while (<>) {  # read line by line all input or all listed files
    chomp;                             # strip the trailing newline, as the -l switch does in the one-liner
    my @columns = split(/[, \t] */);   # split on whitespace or commas
    my @chosen = @columns[4,0,3];      # choose what to keep
    my $new_line = join(":", @chosen); # join items using colons between
    print "$new_line\n";               # print with line-separator
}

saved as data.pl and invoked as perl data.pl *.txt > newdata.txt

I like perl and am moderately familiar with a subset of it, although it is waning in popularity, partly because it is easy to write perl programs that are very hard to read. However, it was designed for exactly your use case. Anyone familiar with awk, python, ruby or any of a large number of other tools would solve this just as easily.

Solution 2:

OS agnostic answer:

Learn just a tiny bit of Python and you will have a tool to do similar conversions in any way you wish.

Type this into a file, save it as e.g. cvt.py:

import sys

exportcolumns = [3, 4, 5]
with open(sys.argv[1], 'r') as fi:
    for line in fi:
        columns = line.split(',')
        print( '\t'.join( columns[col] for col in exportcolumns) )

After installing Python (version 3, nothing else!) you should be able to run the above with
python3 cvt.py filename > newfile
where filename is one of your data files and newfile is where you want the result.

As written, the code looks for , as the column separator and outputs columns 3, 4 and 5 (in that order), with a tab \t between each pair of columns.


If you have more complex (inconsistent) column separators, you may well do

import re

... as shown here: https://stackoverflow.com/a/4998688/3720510
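
For instance, a minimal sketch of cvt.py adapted that way (assuming the separators are a comma, space or tab optionally followed by extra spaces, just like the Perl example above) could be:

import re
import sys

exportcolumns = [3, 4, 5]
with open(sys.argv[1], 'r') as fi:
    for line in fi:
        # split on a comma, space or tab, optionally followed by more spaces
        columns = re.split(r'[, \t] *', line.rstrip('\n'))
        print( '\t'.join( columns[col] for col in exportcolumns) )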


Short explanation of cvt.py above

  • The first line makes the sys module available. This allows the use of sys.argv, which exposes the command-line arguments to the script as a simple list.
  • The second line creates a list with the indices of the columns to extract from the input data.
  • The with line opens the file and makes it available during the following indented block; the file is closed automatically once the block has been executed.
  • The for loop runs once for every line that can be read from the file.
  • The next line creates a list of the line's fields by splitting it at every ,.
  • The print uses a generator expression (similar to a list comprehension) to pick the chosen columns from that list, joins them with \t (a tab character) between them, and prints them to sys.stdout (implicit with print()), which may be a file if you redirected using > on the command line.

Solution 3:

Disclaimer: I have not actually tried this with a 70 GB file, but I have done a couple of GB and above 4 million rows.

The intended workflow for huge files is not to load the whole file into a sheet, but to connect to the file.

Open the Data tab, select "From Text/CSV", and select your file. When the preview dialog appears, click on the caret beside the "Load" button and choose "Only Create Connection". That's it. Here's a more elaborate tutorial: https://excel.officetuts.net/en/examples/open-large-csv

There might be some quirks and more challenges to solve with the column transformations, but it's worth a try if you feel much more at home in Excel than with command-line tools.

Another option — if you have access to Access, you can import and handle the data there as well. That software is THE database system for power users.

That being said, I would choose awk for the particular task. But you should be at least somewhat comfortable with shell then.

Solution 4:

If your data format is well-known (CSV or other character delimited file, JSON, etc.) you can usually find a general-purpose command-line tool to help query it.

  • xsv is a popular one for comma-separated/space-separated data
  • jq is popular for JSON data

xsv can chew through hundreds of MB per second depending on your hardware and the type of query.
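
To give a feel for the command lines involved (the file names below are placeholders, and it's worth checking each tool's --help for the exact options on your version):

xsv select 5,1,4 data.csv > smaller.csv
xsv stats data.csv
jq '.price' records.json

The first keeps columns 5, 1 and 4 in that order, the second prints per-column summary statistics, and the third pulls the price field out of each top-level JSON object.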

Solution 5:

Lots of good advice elsewhere about the mechanics of data extraction; however, you are going to need some dirty coding skills to do anything useful with it.

Large data sets often contain corrupt lines, loopy data, strange characters, ohs instead of zeroes and every manner of formatting glitch. You need to validate and filter what you've got. (An example: split a file into two, then join them. There may well be the most subtle of flaws at the join. Possibly all normal lines end in CRLF, but at the join the end of line is just CR. This can go unnoticed or even cause the read-in to assume end of file!) As a minimum I would make sure that you output exactly the same number of lines as you read.

Still on line-by-line processing, it's very simple, and worthwhile, to add very basic sanity checking to the data. Even if a field isn't being output, if it's easy to check then do it, because it could indicate some more subtle trouble. Be aware that actual data may not conform to the official specs. Why does a price of -1 sometimes appear? An especially useful field to check is one that should always have something in it, or simply the last field in each row.

Log processing somewhere. That way you can set the process running and go to lunch. You have a record of what version of your program was used to create what outputs. Of course you're looking for '...lines rejected:0' all the time.

Bad source lines should be output to a failure file (but cap it at, say, 15 lines). You can then visually examine a small amount of data to see what sort of weirdness you've got.
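
As a rough illustration of the last few points, here is a hedged Python sketch (the column count, the price check and the file names are made up for the example) that counts lines in and out, siphons a sample of bad lines off to a failure file, and logs the totals at the end:

import sys

MAX_BAD_SAMPLES = 15                  # keep only a small sample of bad lines to inspect

def looks_valid(columns):
    # made-up checks: right number of fields, last field non-empty, price not negative
    if len(columns) != 8:
        return False
    if columns[-1].strip() == '':
        return False
    try:
        if float(columns[3]) < 0:     # assumed price column
            return False
    except ValueError:
        return False
    return True

lines_read = lines_written = lines_rejected = 0

with open(sys.argv[1], 'r') as fi, \
     open('good.txt', 'w') as good, \
     open('failures.txt', 'w') as bad:
    for line in fi:
        lines_read += 1
        columns = line.rstrip('\r\n').split(',')
        if looks_valid(columns):
            good.write(line)          # in real use, write the converted line here
            lines_written += 1
        else:
            lines_rejected += 1
            if lines_rejected <= MAX_BAD_SAMPLES:
                bad.write(line)

print(f'lines read: {lines_read}, written: {lines_written}, '
      f'lines rejected: {lines_rejected}', file=sys.stderr)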

It may well be that inside the loop that processes each line you have to apply filters. This may not happen at the first pass, but as downstream analysis progresses you may be asked to provide a more selective data set, e.g. to exclude lines for products with 'test' in the name or a product code starting with 9.
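
In code, such a filter might be nothing more than a small test inside that loop; a sketch, with invented column positions:

def wanted(columns):
    # downstream asked us to drop test products and anything with a code starting with 9
    name, code = columns[0], columns[2]      # assumed positions of name and product code
    return not ('test' in name.lower() or code.startswith('9'))

and then skip the line with something like if not wanted(columns): continue inside the main loop.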

An often missed validation issue is missing or duplicated data. For example somehow Friday's raw data has been added to the end of Thursday's and Friday's is from the week before. How will anybody know? The network failed from 3pm to 5pm so nothing was recorded. Monday was a bank holiday where there shouldn't be any transactions but somebody has supplied data from the previous Monday. You are in a good position to do some simple sums, for example daily turnover or maximum period of no activity etc. These are bulk sanity checks used to give a human pause for thought and prompt for checking before poisoned data is passed further down the chain. It's probably not your job to decide what to do with a loopy batch, but you can highlight it and probably tweak your code to give a better data-set.
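
Those bulk checks need not be fancy. A sketch, assuming each line starts with an ISO-style timestamp followed by an amount (both invented for the example) and that the lines are in time order:

import sys
from collections import defaultdict
from datetime import datetime

daily_total = defaultdict(float)
previous = None
longest_gap = None

with open(sys.argv[1], 'r') as fi:
    for line in fi:
        columns = line.rstrip('\n').split(',')
        when = datetime.fromisoformat(columns[0])   # assumed timestamp column
        amount = float(columns[1])                  # assumed amount column
        daily_total[when.date()] += amount
        if previous is not None:
            gap = when - previous
            if longest_gap is None or gap > longest_gap:
                longest_gap = gap
        previous = when

for day in sorted(daily_total):
    print(day, round(daily_total[day], 2))          # daily turnover
print('longest period of no activity:', longest_gap, file=sys.stderr)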

All the above is 'easy', one step at a time programming. You'll learn about automation, tidy workflows, loopy formatting and basic data anomalies. You'll also be a bit of an expert on spotting unusual data and what the fields are supposed to mean. That will be useful for...

Doing something useful with the data. You should be involved with the downstream analysis. This is not to suggest you should build analysis into your translation program, but you've got a framework ready to do it. Totals, averages, max and min, hourly, daily, weekly are all possible easy (NB automated) outputs. You might think a database is a better tool, but for fiddly things simple coding may be better. Let me give an example: smoothing a set of data points. An easy moving average is nextPoint = (lastPoint * 0.8) + (rawValue * 0.2) [adjust 0.8 and 0.2 to suit]. That's fine for continuous data, but what about the start of business each day? That's a special case where nextPoint = rawValue. Something to code, perhaps.
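
Sketched in code, with the start-of-day reset handled as the special case described (the 0.8/0.2 weights and the (day, value) layout are only illustrative):

def smooth(points, weight_old=0.8, weight_new=0.2):
    # points is a list of (day, raw_value) pairs in time order
    smoothed = []
    last_point = last_day = None
    for day, raw_value in points:
        if last_point is None or day != last_day:
            next_point = raw_value                  # start of business: nothing to smooth against
        else:
            next_point = last_point * weight_old + raw_value * weight_new
        smoothed.append((day, next_point))
        last_point, last_day = next_point, day
    return smoothed

For example, smooth([('Mon', 10.0), ('Mon', 14.0), ('Tue', 9.0)]) smooths the second Monday value to 10.0 * 0.8 + 14.0 * 0.2 = 10.8, while Tuesday starts fresh at 9.0.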

Spurious data values are a good example of the cross-over between raw-data crunching and analysis. When somebody punched in £175 when they meant £1.75, do we really want to include that in our analysis? It's a bit of an art, or fudge, but the raw data processor can easily calculate a mean and standard deviation for a couple of thousand data points, or an actual distribution for all rows of data. You /might/ want to throw out, mark, highlight or otherwise draw attention to unexpected values at the data-crunching stage, or use it to inform the analysis stage. Perhaps add another column with a blank for OK and 'H' for higher than expected, and so on.
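
A hedged sketch of that marking step (the 'H'/'L' labels and the three-standard-deviation threshold are just one possible choice, applied to a column of values gathered during the crunching pass):

import statistics

def mark_outliers(values, n_sigma=3.0):
    # '' means the value looked normal, 'H' unusually high, 'L' unusually low
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    labels = []
    for v in values:
        if stdev > 0 and v > mean + n_sigma * stdev:
            labels.append('H')
        elif stdev > 0 and v < mean - n_sigma * stdev:
            labels.append('L')
        else:
            labels.append('')
    return labels

The resulting labels can be written out as that extra column, or simply counted and reported in the log.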

You will become a skilled craftsman, able to turn a huge tree into useful planks from start to finish. You'll learn who wants what sort of planks for what purpose and be able to saw up the raw wood in the right way to avoid splits and shakes. Moreover if you spot a diseased tree you can raise the alarm.