Best way to remove text from the beginning of a huge file

I have a huge MySQL backup file (from mysqldump) with the tables in alphabetical order. My restore failed and I want to pick up where I left off with the next table in the backup file. (I have corrected the problem; this isn't really a question about MySQL restores.)

What I would like to do is take my backup file, e.g. backup.sql, and trim off the beginning of the file until I see this line:

-- Table structure for table `mytable`

Then everything after that will end up in my result file, say backup-secondhalf.sql. This is somewhat complicated by the fact that the file is bzip2-compressed, but that shouldn't be too big of a deal.

I think I can do it like this:

$ bunzip2 -c backup.sql.bz2 | grep --text --byte-offset --only-matching -e '-- Table structure for table `mytable`' -m 1

This will give me the byte-offset in the file that I want to trim up to. Then:

$ bunzip2 -c backup.sql.bz2 | dd ibs=1 skip=[number from above] | bzip2 -c > backup-secondhalf.sql.bz2

Unfortunately, this requires me to run bunzip2 on the file twice and read through all those bytes twice.

Is there a way to do this all at once?

I'm not sure my sed-fu is strong enough to write a "delete all lines until this regular expression, then let the rest of the file through" expression.

This is on Debian Linux, so I have GNU tools available.


Solution 1:

bunzip2 -c backup.sql.bz2 | \
  sed -n '/-- Table structure for table `mytable`/,$p'

Explanation:

-n  suppress automatic printing of pattern space

Address range construction: start with the regex

/-- Table structure for table `mytable`/

and end with

$  match the last line.

Command:

p  print the current pattern space.

So every line from the first match of the regex through the end of the file is printed.

Edit: depending on how you dumped the database, you may have very long lines. GNU sed can handle them, up to the limit of available memory.
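To produce the compressed second half the question asks for, the same sed can sit between bunzip2 and bzip2, so the whole job is a single pass (a sketch using the filenames from the question; adjust the pattern if your dump's header line is worded differently):

bunzip2 -c backup.sql.bz2 | \
  sed -n '/-- Table structure for table `mytable`/,$p' | \
  bzip2 -c > backup-secondhalf.sql.bz2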

Solution 2:

NOTE: Not an actual answer

Since I was motivated to get this solved now, I went ahead and used grep to find the offset in the file I wanted; it worked great.

Running dd unfortunately requires that you set ibs=1, which basically means no buffering, so performance is terrible. While waiting for dd to complete, I spent time writing my own custom C program to skip the bytes. Having done that, I see that tail could have done it for me just as easily:

$ bunzip2 -c restore.sql.bz2 | tail -c +[offset] | bzip2 -c > restore-trimmed.sql.bz2

I say "this doesn't answer my question" because it still requires two passes through the file: one to find the offset of the thing I'm looking for and another to trim the file.

If I were to go back to my custom program, I could implement KMP string matching during the "read-only" phase of the program and then switch over to a "read and write everything" phase after that.
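For what it's worth, the same "scan until the marker, then copy everything through" behaviour is also available in a single pass from awk, with no custom program (a sketch under the same assumption about the exact marker text; the /regex/,0 range prints from the first matching line to end of input):

$ bunzip2 -c restore.sql.bz2 | awk '/-- Table structure for table `mytable`/,0' | bzip2 -c > restore-trimmed.sql.bz2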