How to truncate file by lines?

I have a large number of files, some of which are very long. I would like to truncate them to a certain size if they are larger, by removing the end of the file, but I only want to remove whole lines. How can I do this? It feels like the kind of thing the Linux toolchain would handle, but I don't know the right command.

For example, say I have a 120,000 byte file with 300-byte lines and I'm trying to truncate it to 10,000 bytes. The first 33 lines should stay (9900 bytes) and the remainder should be cut. I don't want to cut at 10,000 bytes exactly, since that would leave a partial line.

Of course the files are of differing lengths and the lines are not all the same length.

Ideally the resulting files would be slightly shorter rather than slightly longer (if the breakpoint falls on a long line), but that's not too important; they could be a little longer if that's easier. I would like the changes to be made directly to the files (well, possibly the new file is written elsewhere, the original deleted, and the new file moved into place, but that's the same from the user's POV). A solution that redirects data to a bunch of places and then back invites the possibility of corrupting the file, and I'd like to avoid that.


Solution 1:

Using awk avoids the sed/wc complexity of the previous answers. With the example from the OP, this prints the complete lines that fit within the first 10000 bytes:

awk '{i += length() + 1; if (i > 10000) exit; print}' myfile.txt

To also print the complete line containing the 10000th byte when that byte is not at the end of a line:

awk '{i += length() + 1; print; if (i >= 10000) exit}' myfile.txt

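The commands above assume:

  1. Each line ends in a single Unix newline (\n). For DOS/Windows text files (\r\n), awk on Linux normally keeps the \r as part of the line, so length() + 1 already counts it; change it to length() + 2 only if your awk strips the \r.
  2. The file contains only single-byte characters. If it contains multibyte characters (for example in a UTF-8 locale), length() may count characters rather than bytes; set LC_ALL=C (or LC_CTYPE=C if LC_ALL is unset) to force interpretation at the byte level.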
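
The awk commands above print to standard output rather than changing the file. As a minimal sketch of applying the same byte counting in place, the hypothetical function truncate_lines below sums whole-line lengths with awk and then cuts the file at that offset with truncate -s. It assumes GNU coreutils truncate is available (it is not mentioned in the answer) and that every line, including the last, ends in \n.

```shell
#!/bin/sh
# Hypothetical helper (not from the answer): cut FILE down to at most MAX bytes,
# keeping whole lines only. Assumes GNU coreutils "truncate" and "\n" line endings.
truncate_lines() {
    file=$1
    max=$2
    # Add up the byte length of each line plus its newline; stop before the
    # line that would push the running total past MAX.
    keep=$(LC_ALL=C awk -v max="$max" '
        { n = length($0) + 1; if (total + n > max) exit; total += n }
        END { printf "%d\n", total }' "$file")
    # Only ever shrink the file, never extend it.
    if [ "$keep" -lt "$(wc -c < "$file")" ]; then
        truncate -s "$keep" "$file"
    fi
}
```

Calling truncate_lines myfile.txt 10000 on the 300-byte-line example would leave 33 lines. Because truncate changes the file's length directly, no data is copied elsewhere and back.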

Solution 2:

The sed approach is fine, but looping over every line is not. If you know how many lines you want to keep (99 in this example), you can do it like this:

sed -i '100,$ d' myfile.txt

Explanation: sed is a stream editor. With the -i option, it edits the file in place instead of just reading it and writing the result to standard output (-i as written here is GNU sed syntax). 100,$ means "from line 100 to the end of the file", and it is followed by the command d, which, as you probably guessed, stands for "delete". In short, the command means: "delete all lines from line 100 to the end of myfile.txt". 100 is the first line to be deleted, since you want to keep 99 lines.

Edit: If, on the other hand, these are log files where you want to keep e.g. the last 100 lines:

[ "$(wc -l < myfile.txt)" -gt 100 ] && sed -i "1,$(( $(wc -l < myfile.txt) - 100 )) d" myfile.txt

What is going on here:

  • [ "$(wc -l < myfile.txt)" -gt 100 ]: do the following only if the file has more than 100 lines (reading from standard input makes wc print only the count, without the file name)
  • $(( $(wc -l < myfile.txt) - 100 )): calculate the number of lines to delete (i.e. all lines of the file except the last 100 to keep)
  • 1,$((..)) d: remove all lines from the first to the calculated line
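
As an alternative sketch for keeping the end of a file, the hypothetical function keep_last_lines below uses tail and a temporary file instead of computing a sed range. This swaps in a different technique from the sed command above; it matches the "write a new file, then move it into place" route mentioned in the question. The function name and the temporary-file naming are assumptions.

```shell
#!/bin/sh
# Hypothetical helper (not from the answer): keep only the last N lines of FILE.
keep_last_lines() {
    file=$1
    n=$2
    # Create the temporary file next to the original so that mv stays on the
    # same file system.
    tmp=$(mktemp "${file}.XXXXXX") || return 1
    if tail -n "$n" "$file" > "$tmp"; then
        # Replace the original only after tail has succeeded.
        mv "$tmp" "$file"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

For example, keep_last_lines myfile.txt 100 leaves a shorter file untouched in content, since tail prints everything when there are fewer than 100 lines. One caveat: the replaced file takes on the temporary file's permissions.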

EDIT: as the question was just edited to give more details, I will cover the additional information in my answer as well. The added facts are:

  • the file should be cut down to a specific size (10,000 bytes)
  • each line has a specific size in bytes (300 bytes in the example)

From these data, the number of lines to keep can be calculated as size_to_remain / linesize, which in the example gives 33 lines. The shell expression for the calculation is $((size_to_remain / linesize)) (at least on Linux using Bash, the result is an integer, rounded down). The adjusted command would now read:

# keep the start of the file (OP's question)
sed -i '34,$ d' myfile.txt
# keep the end of the file (my second example)
[ "$(wc -l < myfile.txt)" -gt 33 ] && sed -i "1,$(( $(wc -l < myfile.txt) - 33 )) d" myfile.txt

With the sizes known in advance, the first command no longer needs a calculation embedded in the sed command. The second still needs the line count, because the number of lines to delete depends on the length of the file. For flexibility, a shell script can use variables instead of the literal numbers.
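
A minimal sketch of the variable-based form for keeping the start of a file, assuming every line really has the same length (the names keep_head_fixed, size_to_remain and linesize are illustrative, and -i is used in its GNU sed form):

```shell
#!/bin/sh
# Hypothetical helper (not from the answer): keep the leading whole lines of
# FILE that fit in SIZE_TO_REMAIN bytes, assuming each line is exactly
# LINESIZE bytes long, newline included.
keep_head_fixed() {
    file=$1
    size_to_remain=$2
    linesize=$3
    # Shell arithmetic is integer-only: 10000 / 300 gives 33.
    keep=$((size_to_remain / linesize))
    # Delete from the line after the last one kept through the end of the file.
    sed -i "$((keep + 1)),\$d" "$file"
}
```

With the numbers from the question, keep_head_fixed myfile.txt 10000 300 runs the equivalent of sed -i '34,$d' myfile.txt. If the lines differ in length, the result is only an approximation of the byte limit.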

For conditional processing based on the file size, one can use the following "test" construct:

[ "$(ls -l "$file" | awk '{print $5}')" -gt 102400 ] &&

which means: "if the size of $file exceeds 100 kB (102400 bytes), do..." (ls -l lists the file size in bytes at position 5, hence awk is used to extract exactly this; with current GNU ls, adding -k does not change that column).
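
Since the question is about many files, a size test like this would typically run in a loop. The sketch below is one hedged way to select the candidates: the hypothetical function files_over_limit prints every named file larger than a byte limit. It uses wc -c instead of parsing ls output, which is a substitution on my part and not the test shown above.

```shell
#!/bin/sh
# Hypothetical helper (not from the answer): print each named file whose size
# exceeds LIMIT bytes.
files_over_limit() {
    limit=$1
    shift
    for f in "$@"; do
        # Reading from standard input makes wc -c print only the byte count.
        if [ "$(wc -c < "$f")" -gt "$limit" ]; then
            printf '%s\n' "$f"
        fi
    done
}
```

For example, files_over_limit 10000 ./* lists the files in the current directory that need trimming. Note that it reads names one per line, so file names containing newlines would need different handling.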