Split a text file into multiple smaller text files using the command line
I have multiple text files with about 100,000 lines each, and I want to split them into smaller text files of 5,000 lines each.
I used:
split -l 5000 filename.txt
That creates files:
xaa
xab
xac
xad
xae
xaf
These files have no extension. I would like to name them something like:
file01.txt
file02.txt
file03.txt
file04.txt
or, if that is not possible, I just want them to have the ".txt" extension.
Solution 1:
I know the question was asked a long time ago, but I am surprised that nobody has given the most straightforward Unix answer:
split -l 5000 -d --additional-suffix=.txt $FileName file
- -l 5000: split the input into files of 5,000 lines each.
- -d: use numeric suffixes. The suffix goes from 00 to 99 by default instead of from aa to zz.
- --additional-suffix: lets you specify an extra suffix, here the extension.
- $FileName: name of the file to be split.
- file: prefix for the names of the resulting files.
As always, check out man split for more details.
On a Mac, the default version of split is not the GNU one and may not support the options used above. You can install the GNU version with Homebrew using the following command (see this question for more GNU utils):
brew install coreutils
and then you can run the above command by replacing split with gsplit. Check out man gsplit for details.
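As a quick way to check the behaviour, here is a minimal sketch that runs the command on generated data in a scratch directory. It assumes GNU split (gsplit on a Mac). The scratch directory and the seq-generated input are assumptions for illustration only. The --numeric-suffixes=1 form is not part of the answer above; it is the GNU long form of -d with a starting value, used here so that numbering begins at 01 as in the question.

```shell
# Scratch directory so existing files are not touched (illustrative setup).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample input: 100,000 numbered lines.
seq 1 100000 > filename.txt

# GNU split: 5,000 lines per piece, numeric suffixes starting at 1,
# ".txt" appended to every output name.
split -l 5000 --numeric-suffixes=1 --additional-suffix=.txt filename.txt file

# The pattern uses digits so that filename.txt itself is not matched.
ls file[0-9][0-9].txt | head -n 3
wc -l < file20.txt
```

With this input the run should leave file01.txt through file20.txt, each holding 5,000 lines.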
Solution 2:
Here's an example in C# (because that's what I was searching for). I needed to split a 23 GB CSV file with around 175 million lines to be able to look at its contents. I split it into files of one million rows each. This code did it in about 5 minutes on my machine:
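using System.Collections.Generic;
using System.IO;

var list = new List<string>();
var fileSuffix = 0;
using (var file = File.OpenRead(@"D:\Temp\file.csv"))
using (var reader = new StreamReader(file))
{
    while (!reader.EndOfStream)
    {
        list.Add(reader.ReadLine());
        if (list.Count >= 1000000)
        {
            // Flush a full chunk of one million lines to its own file.
            File.WriteAllLines(@"D:\Temp\split" + (++fileSuffix) + ".csv", list);
            list = new List<string>();
        }
    }
}
// Write the remaining lines, if any, so that no empty trailing file is created.
if (list.Count > 0)
{
    File.WriteAllLines(@"D:\Temp\split" + (++fileSuffix) + ".csv", list);
}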
Solution 3:
@ECHO OFF
SETLOCAL
SET "sourcedir=U:\sourcedir"
SET /a fcount=100
SET /a llimit=5000
SET /a lcount=%llimit%
FOR /f "usebackqdelims=" %%a IN ("%sourcedir%\q25249516.txt") DO (
CALL :select
FOR /f "tokens=1*delims==" %%b IN ('set dfile') DO IF /i "%%b"=="dfile" >>"%%c" ECHO(%%a
)
GOTO :EOF
:select
SET /a lcount+=1
IF %lcount% lss %llimit% GOTO :EOF
SET /a lcount=0
SET /a fcount+=1
SET "dfile=%sourcedir%\file%fcount:~-2%.txt"
GOTO :EOF
Here's a native Windows batch file that should accomplish the task.
I won't claim that it will be fast (less than 2 minutes for each 5K-line output file) or that it will be immune to batch character sensitivities. It really depends on the characteristics of your target data.
I used a file named q25249516.txt containing 100K lines of data for my testing.
Revised quicker version
REM
@ECHO OFF
SETLOCAL
SET "sourcedir=U:\sourcedir"
SET /a fcount=199
SET /a llimit=5000
SET /a lcount=%llimit%
FOR /f "usebackqdelims=" %%a IN ("%sourcedir%\q25249516.txt") DO (
CALL :select
>>"%sourcedir%\file$$.txt" ECHO(%%a
)
SET /a lcount=%llimit%
:select
SET /a lcount+=1
IF %lcount% lss %llimit% GOTO :EOF
SET /a lcount=0
SET /a fcount+=1
MOVE /y "%sourcedir%\file$$.txt" "%sourcedir%\file%fcount:~-2%.txt" >NUL 2>nul
GOTO :EOF
Note that I used an llimit of 50000 for testing. The script will overwrite the early file numbers if the file contains more than llimit*100 lines, because only the last two digits of fcount are used in the name (cure by setting fcount to 1999 and using ~-3 in place of ~-2 in the file-renaming line).
Solution 4:
You can maybe do something like this with awk:
awk '{outfile=sprintf("file%02d.txt",int((NR-1)/5000)+1);print > outfile}' yourfile
Basically, it calculates the name of the output file by taking the record number (NR), subtracting 1, dividing by 5000, taking the integer part of that, adding 1, and zero-padding the result to 2 places. Subtracting 1 first keeps line 5000 in file01.txt, so every file except possibly the last gets exactly 5000 lines.
By default, awk prints the entire input record when you don't specify anything else. So, print > outfile writes the entire input record to the output file.
As you are running on Windows, you can't use single quotes because cmd.exe does not treat them as quoting characters. I think you have to put the script in a file and then tell awk to use the file, something like this:
awk -f script.awk yourfile
and script.awk will contain the script like this:
{outfile=sprintf("file%02d.txt",int((NR-1)/5000)+1);print > outfile}
Or, it may work if you do this:
awk "{outfile=sprintf(\"file%02d.txt\",int((NR-1)/5000)+1);print > outfile}" yourfile
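For larger inputs, some awk implementations limit how many output files can be open at once. The following is a hedged sketch of the same idea that closes each output file before starting the next one. It computes the file number from NR - 1 so that each piece gets exactly n lines. The chunk size variable n, the scratch directory, and the 12,000-line sample input are assumptions for illustration.

```shell
# Scratch directory and sample input (illustrative, not from the answer).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
seq 1 12000 > yourfile

awk -v n=5000 '
  # On the first line of every chunk, close the previous file and pick a new name.
  (NR - 1) % n == 0 {
      if (outfile != "") close(outfile)
      outfile = sprintf("file%02d.txt", int((NR - 1) / n) + 1)
  }
  # Every line goes to the current output file.
  { print > outfile }
' yourfile

ls file[0-9][0-9].txt
```

With 12,000 input lines this should produce file01.txt and file02.txt with 5,000 lines each and file03.txt with the remaining 2,000.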
Solution 5:
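The syntax looks like:
$ split [OPTION] [INPUT [PREFIX]]
where the output files are named PREFIXaa, PREFIXab, ...
Just pick a suitable prefix and you're done, or use mv to rename the files afterwards.
Note that
$ mv * *.txt
does not work, because the shell expands both patterns before mv runs and mv cannot rename by pattern. I think a loop such as
$ for f in x??; do mv "$f" "$f.txt"; done
should work (assuming the default x prefix), but test it first on a smaller scale.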
:)
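To get the file01.txt, file02.txt, ... names from the question without GNU-specific options, one possible approach is to split with a prefix and then rename the pieces with a counter. This is only a sketch: the part_ prefix, the scratch directory, and the generated input are assumptions for illustration, and it relies on the shell expanding part_?? in alphabetical order.

```shell
# Scratch directory and sample input (illustrative setup).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
seq 1 100000 > filename.txt

# Plain split: pieces are named part_aa, part_ab, ... ("part_" is an arbitrary prefix).
split -l 5000 filename.txt part_

# Rename the pieces in glob order to file01.txt, file02.txt, ...
i=1
for f in part_??; do
    mv "$f" "$(printf 'file%02d.txt' "$i")"
    i=$((i + 1))
done

ls file[0-9][0-9].txt
```

For the 100,000-line sample this should leave 20 files, file01.txt through file20.txt. The two-digit format only covers up to 99 pieces, so a wider format would be needed for more.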