How to split a huge CSV file based on the content of the first column?
- I have a huge (250 MB+) CSV file to upload.
- The file format is:

  group_id, application_id, reading

  and the data could look like:

  1, a1, 0.1
  1, a1, 0.2
  1, a1, 0.4
  1, a1, 0.3
  1, a1, 0.0
  1, a1, 0.9
  2, b1, 0.1
  2, b1, 0.2
  2, b1, 0.4
  2, b1, 0.3
  2, b1, 0.0
  2, b1, 0.9
  .....
  n, x, 0.3 (let's say)
- I want to split the file based on group_id, so the output should be n files, where n is the number of distinct group_id values.
Output:

File 1:

  1, a1, 0.1
  1, a1, 0.2
  1, a1, 0.4
  1, a1, 0.3
  1, a1, 0.0
  1, a1, 0.9

File 2:

  2, b1, 0.1
  2, b1, 0.2
  2, b1, 0.4
  2, b1, 0.3
  2, b1, 0.0
  2, b1, 0.9
  .....

File n:

  n, x, 0.3 (let's say)
How can I do this efficiently?
Solution 1:

awk is capable of this in a single pass:

  awk -F "," '{print $0 >> ("FILE" $1)}' HUGE.csv

Each line is appended to a file named FILE<group_id>, taken from its first comma-separated field. (Note that awk keeps one file handle open per group, so a very large number of distinct group_id values could run into the per-process open-file limit; close() can be used to release handles if that happens.)
Solution 2:

If the file is already sorted by group_id, you can do something like:
  import csv
  from itertools import groupby

  for key, rows in groupby(csv.reader(open("foo.csv")),
                           lambda row: row[0]):
      with open("%s.txt" % key, "w") as output:
          for row in rows:
              output.write(",".join(row) + "\n")
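If the file is not sorted by group_id, one alternative is to keep one open handle per group so rows can arrive in any order. A minimal sketch, assuming the same "foo.csv" input name and "<group_id>.txt" output naming as above (the sample data written at the top is purely illustrative):

```python
import csv

# Tiny unsorted sample, for illustration only.
with open("foo.csv", "w") as sample:
    sample.write("1, a1, 0.1\n2, b1, 0.1\n1, a1, 0.2\n")

# One open file handle per group_id; rows may arrive in any order.
handles = {}
try:
    with open("foo.csv", newline="") as source:
        for row in csv.reader(source):
            key = row[0]
            if key not in handles:
                handles[key] = open("%s.txt" % key, "w")
            handles[key].write(",".join(row) + "\n")
finally:
    for handle in handles.values():
        handle.close()
```

Like the awk approach, this holds one handle open per distinct group_id, so a very large number of groups could hit the open-file limit.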