How to split a file and keep the first line in each of the pieces?
Given: One big text-data file (e.g. CSV format) with a 'special' first line (e.g., field names).
Wanted: An equivalent of the coreutils split -l command, but with the additional requirement that the header line from the original file appear at the beginning of each of the resulting pieces.
I am guessing some concoction of split and head will do the trick?
Solution 1:
This is robhruska's script cleaned up a bit:
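# Split everything after the header into 4-line pieces named split_aa, split_ab, ...
tail -n +2 file.txt | split -l 4 - split_
# Rebuild each piece with the header line on top
for file in split_*
do
    head -n 1 file.txt > tmp_file
    cat "$file" >> tmp_file
    mv -f tmp_file "$file"
done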
I removed wc, cut, ls and echo in the places where they're unnecessary. I changed some of the filenames to make them a little more meaningful. I broke it out onto multiple lines only to make it easier to read.
If you want to get fancy, you could use mktemp or tempfile to create a temporary filename instead of using a hard-coded one.
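As a minimal sketch of that idea, the loop below swaps the hard-coded tmp_file for a name generated by mktemp. The sample file.txt contents are made up for illustration, and the temporary file is created in the current directory (an assumption, so that mv stays on one filesystem).

```shell
# Demo input (hypothetical): one header line plus 10 data rows.
printf '%s\n' "id,value" 1,a 2,b 3,c 4,d 5,e 6,f 7,g 8,h 9,i 10,j > file.txt

# Split everything after the header into 4-line pieces.
tail -n +2 file.txt | split -l 4 - split_

for file in split_*
do
    # mktemp replaces the X's with a unique suffix and prints the new name.
    tmp_file=$(mktemp ./tmp.XXXXXX) || exit 1
    head -n 1 file.txt > "$tmp_file"
    cat "$file" >> "$tmp_file"
    mv -f "$tmp_file" "$file"
done
```

mktemp creates the file readable and writable only by its owner, so the pieces keep those permissions after mv; add a chmod if that matters.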
Edit
Using GNU split, it's possible to do this:
split_filter () { { head -n 1 file.txt; cat; } > "$FILE"; }; export -f split_filter; tail -n +2 file.txt | split --lines=4 --filter=split_filter - split_
Broken out for readability:
split_filter () { { head -n 1 file.txt; cat; } > "$FILE"; }
export -f split_filter
tail -n +2 file.txt | split --lines=4 --filter=split_filter - split_
When --filter is specified, split runs the command (a function in this case, which must be exported) for each output file and sets the variable FILE, in the command's environment, to the filename.
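To see that mechanism in isolation, here is a small sketch, assuming GNU split. The filter is an inline command string rather than an exported function, and the input lines and the part_ prefix are made up for the demo. Each invocation gets one chunk on standard input and the would-be filename in FILE.

```shell
# Hypothetical demo: five input lines, two lines per chunk.
# The filter only reports what it was given; it writes no files.
summary=$(printf '%s\n' a b c d e |
    split --lines=2 --filter='echo "$FILE: $(wc -l) line(s)"' - part_)
printf '%s\n' "$summary"
```

Because this filter never writes to "$FILE", no part_* files are created: with --filter, producing the output file is the filter's job.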
A filter script or function could do any manipulation it wanted to the output contents or even the filename. An example of the latter might be to output to a fixed filename in a variable directory: > "$FILE/data.dat"
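A hedged sketch of that directory variant, assuming GNU split and made-up sample data (orders.csv and the chunk_ prefix are hypothetical names). Since split does not create anything named "$FILE" when a filter is given, the filter has to create the directory itself before writing into it.

```shell
# Demo input (hypothetical): one header line plus 6 data rows.
printf '%s\n' "name,qty" a,1 b,2 c,3 d,4 e,5 f,6 > orders.csv

# Treat "$FILE" as a directory name and write each chunk, header first,
# to a fixed filename inside it.
tail -n +2 orders.csv |
    split --lines=4 \
        --filter='mkdir -p "$FILE" && { head -n 1 orders.csv; cat; } > "$FILE/data.dat"' \
        - chunk_
```

With 6 data rows and 4 lines per chunk, this should leave chunk_aa/data.dat and chunk_ab/data.dat, each starting with the header line.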
Solution 2:
This one-liner will split the big CSV into pieces of 999 records, preserving the header row at the top of each one (so 999 records + 1 header = 1000 rows):
cat bigFile.csv | parallel --header : --pipe -N999 'cat >file_{#}.csv'
Based on Ole Tange's answer. (Re Ole's answer: you can't use a line count with --pipepart.)
See comments for some tips on installing parallel