SQL like group by and sum for text files in command line?

Solution 1:

In AWK, you could do something like this:

awk '($1 == last) || (last == "") {sum += $2}
     ($1 != last) && (last != "") {print last " " sum; sum = $2}
                                  {last = $1}
     END                          {print last " " sum}' huge_text_file.txt

Solution 2:

This type of queries are most easily and flexibly done in SQL. You can use SQLet and SQLite for that purpose:

sqlet.py -d' ' -A file1.txt 'select a1, sum(a2) from A group by a1' | sqlite3

Solution 3:

You could a use a few lines of Lua to acheive this. Lua is available on a wide range of platforms including Windows and Linux.

-- Quick and dirty - no error checking, unsorted output

io.input('huge_text_file.txt')

results = {}

for line in io.lines() do
    for text, number in string.gmatch(line, '(%w+)%s+(%d+)') do
        results[text] = (results[text] or 0) + number
    end
end

for text, number in pairs(results) do
    print(text, number)
end

You can sort the output using any sort utility or a few more lines of Lua.