sed: Removing \r\r before \n in a very large file
I have a corrupted disk image file (about 27 GB) in which \r\r was inserted before every \n character. I want to remove the \r\r before each \n.
I tried with awk:
awk '{ sub("\r\r$", ""); print }' mangled.raw > image.raw
But the file seems too large: "awk: run time error: out of memory"
I also tried with sed:
sed 's/\r\r$//g' mangled.raw > image.raw
But here the output file seems incomplete: It is only 20 GB in size and the end of mangled.raw contains a lot of zero characters while the end of image.raw contains the contents of a file. Somehow sed seems to stop before the end.
Any idea how to do this right?
Solution 1:
eldering's comment may be correct - it depends on how the corruption happened. If it did the equivalent of s/\n/\r\r\n/ then it's reversible, but if it did s/\r*\n/\r\r\n/ then it's not, because a \n and a \r\n in the original both end up as \r\r\n and can no longer be told apart.
In any case I'd use perl for something like this. Unlike sed, it was designed from the beginning to work with strings that are very long and can contain NULs and other non-text characters.
perl -pe 's/\r\r\n/\n/g' mangled.raw > image.raw
That could eat a lot of memory since it's still reading the file as a series of lines, and there could be large segments of the file with no \n that will be seen as a single "line". But if you read it by blocks you have to be careful not to miss a \r\r\n sequence that straddles a block boundary. Like this:
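perl -e '
    $/ = \65536;                    # read fixed-size 64 KiB blocks instead of lines
    while (<>) {
        if (/\r\z/) {               # block ends in \r: a \r\r\n may straddle the boundary
            if (length($nextblock = <>)) {   # false at end of input, so no endless redo
                $_ .= $nextblock;   # glue the next block on and check again
                redo;
            }
        }
        s/\r\r\n/\n/g;
        print;
    }
' mangled.raw > image.raw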
Edit: I realized the above code would get stuck in an infinite loop if the last byte of the input was \r. It has been updated to handle that case correctly.
Edit 2: The perl one-liner contained an incorrect replacement character. It has been updated.