Using 'head' or 'tail' on HUGE text file - 19 GB

Solution 1:

You should use sed.

sed -n -e 45000000,45000100p -e 45000101q bigfile > savedlines

This tells sed to print lines 45000000-45000100 inclusive, and to quit at line 45000101 so it does not read the rest of the file.
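If you would rather stick with head and tail (as in the title), an equivalent pipeline is sketched below; head stops reading once it has emitted line 45000100, so the rest of the 19 GB is never scanned either:

head -n 45000100 bigfile | tail -n 101 > savedlines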

Solution 2:

Create a MySQL database with a single table which has a single field. Then import your file into the database. This will make it very easy to look up a certain line.

I don't think anything else could be faster (if head and tail already fail). In the end, any application that wants to find line n has to seek through the whole file until it has found n newlines. Without some sort of lookup (mapping line index to byte offset into the file), no better performance can be achieved.
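For illustration, such a lookup can also be built with plain shell tools. This is a sketch only, assuming plain \n line endings and hypothetical file names:

LC_ALL=C awk 'NR % 1000000 == 1 { print NR, off } { off += length($0) + 1 }' bigfile > bigfile.index
# records the byte offset of every millionth line; later, look up the checkpoint at or
# before the wanted line (say line 44000001 at byte offset OFF), jump there with
# tail -c +$((OFF + 1)) bigfile, and let sed count only the remaining ~1000000 lines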

Given how easy it is to create a MySQL database and import data into it, I feel like the database approach is a viable one.

Here is how to do it:

DROP DATABASE IF EXISTS helperDb;
CREATE DATABASE `helperDb`;
CREATE TABLE `helperDb`.`helperTable`( `lineIndex` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT, `lineContent` MEDIUMTEXT , PRIMARY KEY (`lineIndex`) );
LOAD DATA INFILE '/tmp/my_large_file' INTO TABLE helperDb.helperTable (lineContent);
SELECT lineContent FROM helperDb.helperTable WHERE ( lineIndex >= 45000000 AND lineIndex <= 45000100 );

/tmp/my_large_file would be the file you want to read.
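A sketch of running this from the shell, assuming a local MySQL server you can reach as root and a hypothetical file helper_setup.sql containing the statements above (note that LOAD DATA INFILE reads the file on the server side and is subject to the FILE privilege and the secure_file_priv setting; if the server refuses it, LOAD DATA LOCAL INFILE works from the client side instead):

mysql -u root -p < helper_setup.sql      # create the database and load the 19 GB file (the slow step)
mysql -u root -p -e 'SELECT lineContent FROM helperDb.helperTable WHERE lineIndex >= 45000000 AND lineIndex <= 45000100' > savedlines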

The correct syntax to import a file with tab-delimited values on each line (so that the tabs are not treated as column separators) is:

LOAD DATA INFILE '/tmp/my_large_file' INTO TABLE helperDb.helperTable FIELDS TERMINATED BY '\n' (lineContent);
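After the import, a quick sanity check is to compare the table's row count with the file's line count (a sketch; -N just suppresses the column header). Both commands should print the same number:

mysql -u root -p -N -e 'SELECT COUNT(*) FROM helperDb.helperTable'
wc -l < /tmp/my_large_file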

Another major advantage of this approach is that, if you later decide to extract another set of lines, you don't have to wait hours for the processing again (unless, of course, you delete the database).

Solution 3:

Two good old tools for big files are join and split. You can use split with the --lines=<number> option to cut the file into multiple files, each with a fixed number of lines.

For example, split --lines=45000000 huge_file.txt. The resulting parts would be named xaa, xab, etc. You can then head the part xab, which starts right after line 45000000 of the original file; see the sketch below. You can also join the parts back into a single big file.
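A minimal sketch of the round trip, assuming GNU split's default output names and the same target range of lines 45000000-45000100 as in Solution 1:

split --lines=45000000 huge_file.txt     # xaa holds lines 1-45000000, xab holds lines 45000001 onwards
tail -n 1 xaa > savedlines               # line 45000000 is the last line of the first part
head -n 100 xab >> savedlines            # lines 45000001-45000100 open the second part
cat x?? > rejoined.txt                   # concatenating the parts in order reproduces the original file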