sed: Removing \r\r before \n in a very large file

I have a corrupted disk image file (about 27 GB) in which \r\r was inserted before every \n character. I want to remove the \r\r before each \n.

I tried with awk:

awk '{ sub("\r\r$", ""); print }' mangled.raw > image.raw

But the file seems too large: "awk: run time error: out of memory"

I also tried with sed:

sed 's/\r\r$//g' mangled.raw > image.raw

But here the output file seems incomplete: it is only 20 GB in size, and while the end of mangled.raw consists of a lot of zero bytes, the end of image.raw contains the contents of a file. Somehow sed seems to stop before reaching the end of the input.

Any idea how to do this right?


Solution 1:

eldering's comment may be correct; it depends on how the corruption happened. If it did the equivalent of s/\n/\r\r\n/ then it's reversible, but if it did s/\r*\n/\r\r\n/ then it's not, because an original \n and an original \r\n both end up as \r\r\n and can no longer be told apart.

In any case I'd use perl for something like this. Unlike sed, it was designed from the beginning to work with strings that are very long and can contain NULs and other non-text characters.

perl -pe 's/\r\r\n/\n/g' mangled.raw > image.raw

That could eat a lot of memory since it's still reading the file as a series of lines, and there could be large segments of the file with no \n that will be seen as a single "line". But if you read it by blocks you have to be careful not to miss a \r\r\n sequence that straddles a block boundary. Like this:

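perl -e '
  $/=\65536;                      # read records of up to 64 KiB instead of lines
  while(<>) {
    if(/\r\z/) {                  # block ends in \r: a \r\r\n may straddle the boundary
      if(length($nextblock=<>)) {
        $_.=$nextblock;           # merge with the next block and check again
        redo;
      }
      s/\r\r\n/\n/g;              # end of input: write the final block and stop,
      print;                      # because calling <> again after EOF would
      last;                       # start reading from STDIN
    }
    s/\r\r\n/\n/g;
    print;
  }
' mangled.raw > image.raw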

Edit: I realized the above code would get stuck in an infinite loop if the last byte of the input was \r. It has been updated to handle that case correctly.

Edit 2: The perl one-liner contained an incorrect replacement character. It has been updated.