Joining text files with 600M+ lines

I have two files, huge.txt and small.txt. huge.txt has around 600M rows and is 14 GB; each line has four space-separated words (tokens) followed by a final space-separated column with a number. small.txt has 150K rows (~3 MB); each line has a space-separated word and a number.

Both files are sorted using the sort command, with no extra options. The words in both files may include apostrophes (') and dashes (-).

The desired output would contain all columns from the huge.txt file and the second column (the number) from small.txt where the first word of huge.txt and the first word of small.txt match.

My attempts below failed miserably with the following error:

cat huge.txt|join -o 1.1 1.2 1.3 1.4 2.2 - small.txt > output.txt

join: memory exhausted  

What I suspect is that the sorting order isn't right somehow even though the files are pre-sorted using:

sort -k1 huge.unsorted.txt > huge.txt
sort -k1 small.unsorted.txt > small.txt

The problems seem to appear around words that contain apostrophes (') or dashes (-). I also tried dictionary ordering with the -d option, but ran into the same error at the end.

I tried loading the files into MySQL, creating indexes, and joining them, but it seems it would take weeks on my laptop. (I don't have a machine with more memory or a fast disk/SSD for this task.)

I see two ways out of this but don't know how to implement either of them.

  1. How do I sort the files in a way that the join command considers them to be sorted properly?

  2. I was thinking of calculating MD5 or some other hashes of the strings to get rid of the apostrophes and dashes, while leaving the numbers at the end of the lines intact. I would do the sorting and joining with the hashes instead of the strings themselves and finally "translate" the hashes back to strings. Since there would be only 150K hashes, that's not too bad. What would be a good way to calculate an individual hash for each of the strings? Some AWK magic?
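
To illustrate the idea, a rough, untested Python sketch of the hashing step (replacing the first word of each line with its MD5 digest and leaving the rest of the line alone) might look like this:

import hashlib, sys

# Replace the first field of each line with its MD5 hex digest; the rest
# of the line (including the number at the end) is passed through untouched.
for line in sys.stdin:
  word, rest = line.split(' ', 1)
  sys.stdout.write(hashlib.md5(word.encode()).hexdigest() + ' ' + rest)

Since the digests are plain fixed-length hex strings, they should sort the same way regardless of locale.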

See file samples at the end.

Sample of huge.txt

had stirred me to 46 
had stirred my corruption 57 
had stirred old emotions 55 
had stirred something in 69 
had stirred something within 40 

Sample of small.txt

caley 114881 
calf 2757974 
calfed 137861 
calfee 71143 
calflora 154624 
calfskin 148347 
calgary 9416465 
calgon's 94846 
had 987654

Desired output:

had stirred me to 46 987654
had stirred my corruption 57 987654
had stirred old emotions 55 987654
had stirred something in 69 987654
had stirred something within 40 987654

Solution 1:

IMO the best way to do this would be to use the programming/scripting language you know best and:

  1. Load small.txt into an in-memory hash/map/associative array keyed on the words.
  2. Process huge.txt line by line, adding the column looked up from the hash and writing the result to an output file.
  3. Buffer input and output so that it happens in chunks of at least 4K.
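
A minimal Python sketch of this approach (untested; the file names are taken from the question, and step 3 is left to Python's default I/O buffering):

# small.txt (~150K lines) fits comfortably in memory.
lookup = {}
with open('small.txt') as small:
  for line in small:
    word, number = line.split()
    lookup[word] = number

# Stream huge.txt and append the number wherever the first word is known.
with open('huge.txt') as huge, open('output.txt', 'w') as out:
  for line in huge:
    fields = line.split()
    number = lookup.get(fields[0])
    if number is not None:
      out.write(' '.join(fields + [number]) + '\n')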

Solution 2:

To build on Michael Borgwardt's answer: as long as both files are sorted, you can put them together by basically performing one step of a mergesort. It'll be a little different from standard mergesort because you only want to keep lines from one of the files. This will, of course, have to be implemented in your favorite programming language.

Here's a sketch of the algorithm:

line1 = read a line from file 1
line2 = read a line from file 2
start of loop:
if (first word of line1 == first word of line2) {
    write all fields of line1
      and second field of line2 to output
    line1 = read a line from file 1
    go to start of loop
}
else if (first word of line1 < first word of line2) {
    write line1 to output
    line1 = read a line from file 1
    go to start of loop
}
else (first word of line1 > first word of line2) {
    line2 = read a line from file 2
    go to start of loop
}

Here's a Python version (since Python is just what I happen to know best, not necessarily the best language for the job):

file1 = open('file1', 'r')
file2 = open('file2', 'r')
w2, n2 = file2.readline().split()
for line1 in file1:
  w11, w12, w13, w14, n15 = line1.split()
  if w11 == w2:
    print(w11, w12, w13, w14, n15, n2)
    continue
  elif w11 < w2:
    print(w11, w12, w13, w14, n15)
    continue
  else:
    while w11 > w2:
      line2 = file2.readline()
      if not line2:
        # file2 is exhausted: use a sentinel that sorts after any real word
        w2 = '\U0010ffff'
        break
      w2, n2 = line2.split()
    if w11 == w2:
      print(w11, w12, w13, w14, n15, n2)
    elif w11 < w2:
      print(w11, w12, w13, w14, n15)
and for completeness, after some digging here's what I came up with for Awk:

BEGIN {
  getline line2 <"file2";
  split(line2, a);
}
{
  # advance file2 until its word catches up with the current word from file1
  while (a[1] < $1) {
    if ((getline line2 <"file2") <= 0) break;
    split(line2, a);
  }
  if (a[1] == $1) print $0, a[2];
  else print $0;
}

Invoke as awk -f program.awk <file1.

Solution 3:

My answer is similar to Michael Borgwardt's, but you don't have to load all of either file into memory. If both files are sorted, you walk through the first file one line at a time and do a binary search on the second file to find the line in question. That means a lot of disk access, but memory consumption stays low.
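
Here's a rough Python sketch of that idea. It assumes both files were sorted in plain byte order (e.g. with LC_ALL=C) so that byte-wise comparison agrees with the file order; the helper names (first_line_at, lookup) and the choice of small.txt as the file being searched are mine, not fixed by the problem:

import os

def first_line_at(f, pos):
  # Return the first complete line that starts at byte offset >= pos.
  if pos == 0:
    f.seek(0)
  else:
    f.seek(pos - 1)
    f.readline()  # skip forward to the next line boundary
  return f.readline()

def lookup(f, size, word):
  # Binary search over byte offsets for the first line whose key is >= word;
  # return its second field if the key matches, else None.
  lo, hi = 0, size
  while lo < hi:
    mid = (lo + hi) // 2
    line = first_line_at(f, mid)
    if line and line.split()[0] < word:
      lo = mid + 1
    else:
      hi = mid
  line = first_line_at(f, lo)
  if line and line.split()[0] == word:
    return line.split()[1]
  return None

with open('small.txt', 'rb') as small, open('huge.txt', 'rb') as huge:
  size = os.path.getsize('small.txt')
  for line in huge:
    fields = line.split()
    number = lookup(small, size, fields[0])
    if number is not None:
      print((b' '.join(fields + [number])).decode())

In practice the operating system will quickly cache the ~3 MB small.txt, so after the first few lookups most of the real I/O is the sequential read of huge.txt.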