Merge by date multiple log files that also include un-dated lines (e.g. stack traces)

How can I merge log files, i.e. files that are sorted by time but contain multi-line entries, where only the first line of each entry has a timestamp and the remaining lines do not?

log1

01:02:03.6497,2224,0022 foo
foo1
2foo
foo3
01:04:03.6497,2224,0022 bar
1bar
bar2
3bar

log2

01:03:03.6497,2224,0022 FOO
FOO1
2FOO
FOO3

Expected result

01:02:03.6497,2224,0022 foo
foo1
2foo
foo3
01:03:03.6497,2224,0022 FOO
FOO1
2FOO
FOO3
01:04:03.6497,2224,0022 bar
1bar
bar2
3bar

If it weren't for the non-timestamp lines starting with a digit, a simple sort -nm log1 log2 would do.

Is there an easy way on a unix/linux cmd line to get the job done?

Edit: As these log files are often gigabytes in size, merging should be done without re-sorting the (already sorted) log files, and without loading the files completely into memory.


Solution 1:

Tricky. While it is possible using date and bash arrays, this really is the kind of thing that would benefit from a real programming language. In Perl for example:

$ perl -ne '$d=$1 if /(.+?),/; $k{$d}.=$_; END{print $k{$_} for sort keys(%k);}' log*
01:02:03.6497,2224,0022 foo
foo1
2foo
foo3
01:03:03.6497,2224,0022 FOO
FOO1
2FOO
FOO3
01:04:03.6497,2224,0022 bar
1bar
bar2
3bar

Here's the same thing uncondensed into a commented script:

#!/usr/bin/env perl

## Read each input line, saving it 
## as $_. This while loop is equivalent
## to perl -ne 
while (<>) {
    ## If this line has a comma
    if (/(.+?),/) {
        ## Save everything up to the 1st 
        ## comma as $date
        $date=$1;
    }
    ## Add the current line to the %k hash.
    ## The hash's keys are the dates and the 
    ## contents are the lines.
    $k{$date}.=$_;
}

## Get the sorted list of hash keys
@dates=sort(keys(%k));
## Now that we have them sorted, 
## print each set of lines.
foreach $date (@dates) {
    print "$k{$date}";
}

Note that this assumes that all date lines and only the date lines contain a comma. If that's not the case, you can use this instead:

perl -ne '$d=$1 if /^(\d+:\d+:\d+\.\d+),/; $k{$d}.=$_; END{print $k{$_} for sort keys(%k);}' log*

The approach above needs to keep the entire contents of the files in memory. If that is a problem, here's one that doesn't:

$ perl -pe 's/\n/\0/; s/^/\n/ if /^\d+:\d+:\d+\.\d+/' log* | 
    sort -n | perl -lne 's/\0/\n/g; printf'
01:02:03.6497,2224,0022 foo
foo1
2foo
foo3
01:03:03.6497,2224,0022 FOO
FOO1
2FOO
FOO3
01:04:03.6497,2224,0022 bar
1bar
bar2
3bar

This one simply puts all lines between successive timestamps onto a single line by replacing the newlines with \0 (if \0 can occur in your log files, use any sequence of characters you know will never be there). This is passed to sort -n, and then a second perl command turns the \0s back into newlines.


As the OP very correctly points out, all of the above solutions re-sort the data from scratch and don't take advantage of the fact that the files are already sorted and only need to be merged. Here's one that does, but which, unlike the others, only works on two files as written:

$ sort -m <(perl -pe 's/\n/\0/; s/^/\n/ if /^\d+:\d+:\d+\.\d+/' log1) \
            <(perl -pe 's/\n/\0/; s/^/\n/ if /^\d+:\d+:\d+\.\d+/' log2) | 
    perl -lne 's/[\0\r]/\n/g; printf'

And if you save the perl command as an alias, you can get:

$ alias a="perl -pe 's/\n/\0/; s/^/\n/ if /^\d+:\d+:\d+\.\d+/'"
$ sort -m <(a log1) <(a log2) | perl -lne 's/[\0\r]/\n/g; printf'

Solution 2:

One way to do it (thanks @terdon for the newline replace idea):

  1. Join each multi-line entry into a single line by replacing its inner newlines with e.g. NUL, in each input file
  2. Run sort -m on the transformed files
  3. Translate the NULs back to newlines

Example

As the multiline concatenation is used more than once, let's alias it away (note that the three-argument form of match() used here is a GNU awk extension):

alias a="awk '{ if (match(\$0, /^[0-9]{2}:[0-9]{2}:[0-9]{2}\\./, _))\
    { if (NR == 1) printf \"%s\", \$0; else printf \"\\n%s\", \$0 }\
    else printf \"\\0%s\", \$0 } END { print \"\" }'"

Here's the merge command, using above alias:

sort -m <(a log1) <(a log2) | tr '\0' '\n'

As shell script

In order to use it like this

merge-logs log1 log2

I put it into a shell script:

x=""
for f in "$@";
do
 x="$x <(awk '{ if (match(\$0, /^[0-9]{2}:[0-9]{2}:[0-9]{2}\\./, _)) { if (NR == 1) printf \"%s\", \$0; else printf \"\\n%s\", \$0 } else printf \"\\0%s\", \$0 } END { print \"\" }' $f)"
done

eval "sort -m $x | tr '\0' '\n'"

Not sure if I can offer a variable number of log files without resorting to evil eval.