Best way to remove text from the beginning of a huge file
I have a huge MySQL backup file (from mysqldump) with the tables in alphabetical order. My restore failed and I want to pick up where I left off with the next table in the backup file. (I have corrected the problem; this isn't really a question about MySQL restores, etc.)
What I would like to do is take my backup file, e.g. backup.sql, and trim off the beginning of the file until I see this line:
-- Table structure for `mytable`
Then everything after that will end up in my result file, say backup-secondhalf.sql. This is somewhat complicated by the fact that the file is bzip2-compressed, but that shouldn't be too big of a deal.
I think I can do it like this:
$ bunzip2 -c backup.sql.bz2 | grep --text --byte-offset --only-matching -e '-- Table structure for table `mytable`' -m 1
This will give me the byte-offset in the file that I want to trim up to. Then:
$ bunzip2 -c backup.sql.bz2 | dd skip=[number from above] | bzip2 -c > backup-secondhalf.sql.bz2
Unfortunately, this requires me to run bunzip2 on the file twice and read through all those bytes twice.
Is there a way to do this all at once?
I'm not sure my sed-fu is strong enough to do a "delete all lines until regular expression, then let the rest of the file through" expression.
This is on Debian Linux, so I have GNU tools available.
Solution 1:
bunzip2 -c backup.sql.bz2 | \
sed -n '/-- Table structure for `mytable`/,$p'
Explanation:
-n  suppress automatic printing of pattern space.
/-- Table structure for `mytable`/,$  an address range: start at the first line matching the regex /-- Table structure for `mytable`/ and end with $, which matches the last line of input.
p  print the current pattern space, i.e. every line in that range.
Edit: depending on how you dumped the database, you may have very long lines. GNU sed can handle them, up to the amount of available memory.
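To get all the way back to a compressed result, the sed output can be recompressed in the same pipeline. This is a minimal sketch, assuming the same file names and marker line used in the question:

bunzip2 -c backup.sql.bz2 | \
sed -n '/-- Table structure for `mytable`/,$p' | \
bzip2 -c > backup-secondhalf.sql.bz2

It is a single pass over the data: bunzip2 decompresses once, sed discards everything before the marker, and bzip2 recompresses the rest.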
Solution 2:
NOTE: Not an actual answer
Since I was motivated to get this solved now, I went ahead and used grep to find the offset in the file I wanted; it worked great.
Running dd unfortunately requires that you set ibs=1, which basically means no buffering, and performance is terrible. While waiting for dd to complete, I spent time writing my own custom-built C program to skip the bytes. After having done that, I see that tail could have done it for me just as easily:
$ bunzip2 -c restore.sql.bz2 | tail -c +[offset] | bzip2 -c > restore-trimmed.sql.bz2
I say "this doesn't answer my question" because it still requires two passes through the file: one to find the offset of the thing I'm looking for and another to trim the file.
If I were to go back to my custom program, I could implement KMP (Knuth-Morris-Pratt) matching during the "read-only" phase of the program and then switch over to "read+write everything" after that.