Sorting based on text file
I'm trying to sort a text file based on the second field in the columns, so basically I have a file like
"Year","District Code","District Name","Neighborhood Code","Neighborhood Name","Gender","Number"
"2017","1","Ciutat Vella","1","el Raval","Boys",283
"2017","1","Ciutat Vella","2","el Barri Gòtic","Boys",56
"2017","1","Ciutat Vella","3","la Barceloneta","Boys",51
"2017","1","Ciutat Vella","4","Sant Pere, Santa Caterina i la Ribera","Boys",90
"2017","2","Eixample","5","el Fort Pienc","Boys",117
"2017","2","Eixample","6","la Sagrada Família","Boys",207
"2017","2","Eixample","7","la Dreta de l'Eixample","Boys",185
"2017","2","Eixample","8","l'Antiga Esquerra de l'Eixample","Boys",176
"2017","2","Eixample","9","la Nova Esquerra de l'Eixample","Boys",218
"2017","2","Eixample","10","Sant Antoni","Boys",172
"2017","3","Sants-Montjuïc","11","el Poble Sec","Boys",156
"2017","3","Sants-Montjuïc","12","la Marina del Prat Vermell","Boys",4
"2017","3","Sants-Montjuïc","13","la Marina de Port","Boys",119
"2017","3","Sants-Montjuïc","14","la Font de la Guatlla","Boys",42
"2017","3","Sants-Montjuïc","15","Hostafrancs","Boys",78
"2017","3","Sants-Montjuïc","16","la Bordeta","Boys",81
"2017","3","Sants-Montjuïc","17","Sants - Badal","Boys",108
"2017","3","Sants-Montjuïc","18","Sants","Boys",155
"2017","4","Les Corts","19","les Corts","Boys",199
"2017","4","Les Corts","20","la Maternitat i Sant Ramon","Boys",101
"2017","4","Les Corts","21","Pedralbes","Boys",32
"2017","5","Sarrià-Sant Gervasi","22","Vallvidrera, el Tibidabo i les Planes","Boys",17
"2017","5","Sarrià-Sant Gervasi","23","Sarrià","Boys",122
"2017","5","Sarrià-Sant Gervasi","24","les Tres Torres","Boys",74
"2017","5","Sarrià-Sant Gervasi","25","Sant Gervasi - la Bonanova","Boys",104
"2017","5","Sarrià-Sant Gervasi","26","Sant Gervasi - Galvany","Boys",206
"2017","5","Sarrià-Sant Gervasi","27","el Putxet i el Farró","Boys",127
"2017","6","Gràcia","28","Vallcarca i els Penitents","Boys",66
"2017","6","Gràcia","29","el Coll","Boys",26
"2017","6","Gràcia","30","la Salut","Boys",62
"2017","6","Gràcia","31","la Vila de Gràcia","Boys",241
"2017","6","Gràcia","32","el Camp d'en Grassot i Gràcia Nova","Boys",149
"2017","7","Horta-Guinardó","33","el Baix Guinardó","Boys",97
"2017","7","Horta-Guinardó","34","Can Baró","Boys",41
"2017","7","Horta-Guinardó","35","el Guinardó","Boys",148
"2017","7","Horta-Guinardó","36","la Font d'en Fargues","Boys",26
"2017","7","Horta-Guinardó","37","el Carmel","Boys",123
"2017","7","Horta-Guinardó","38","la Teixonera","Boys",48
"2017","7","Horta-Guinardó","39","Sant Genís dels Agudells","Boys",28
"2017","7","Horta-Guinardó","40","Montbau","Boys",26
"2017","7","Horta-Guinardó","41","la Vall d'Hebron","Boys",12
"2017","7","Horta-Guinardó","42","la Clota","Boys",8
"2017","7","Horta-Guinardó","43","Horta","Boys",9
I want to sort the whole file based on the District code and then print out the results, I just copied a couple of lines here because the file has more than 1000 lines so. I tried several commands:
cat $1 | sort -n -k 2
cat $1 | sort -t \" -n -k 2
But it seems to do absolutely nothing?
There are also some lines that are not sorted like:
"2017","10","Sant Martí","64","el Camp de l'Arpa del Clot","Boys",148
"2017","10","Sant Martí","65","el Clot","Boys",125
"2017","10","Sant Martí","66","el Parc i la Llacuna del Poblenou","Boys",79
"2017","10","Sant Martí","67","la Vila Olímpica del Poblenou","Boys",34
"2017","10","Sant Martí","68","el Poblenou","Boys",160
"2017","10","Sant Martí","69","Diagonal Mar i el Front Marítim del Poblenou","Boys",103
"2017","10","Sant Martí","70","el Besòs i el Maresme","Boys",119
"2017","10","Sant Martí","71","Provençals del Poblenou","Boys",106
"2017","10","Sant Martí","72","Sant Martí de Provençals","Boys",100
"2017","10","Sant Martí","73","la Verneda i la Pau","Boys",122
"2017","1","Ciutat Vella","1","el Raval","Girls",219
"2017","1","Ciutat Vella","2","el Barri Gòtic","Girls",53
"2017","1","Ciutat Vella","3","la Barceloneta","Girls",50
"2017","1","Ciutat Vella","4","Sant Pere, Santa Caterina i la Ribera","Girls",90
"2017","2","Eixample","5","el Fort Pienc","Girls",114
"2017","2","Eixample","6","la Sagrada Família","Girls",183
"2017","2","Eixample","7","la Dreta de l'Eixample","Girls",159
So just in case you were wondering. Thanks...
Solution 1:
Try this code:
sort -t'"' -n -k4 "input.txt" > "output.txt"
What is going on?
It uses double quotes "
as the delimiter. This lets us get to the second quoted column of numbers as "numbers" and not strings. But it messes up the counting of columns. Because each line starts with a "
sort considers an empty column before the "Year". Similarly, the comma separator between the "Year" and the "District Code" is also considered a column. Thus the District Code is -k4, and not -k2.
If we use comma ,
as the delimiter, the District Codes are quoted in double quotes, and are treated as strings by sort.
The input.txt
is the name of the input file. And the output.txt
is the name of the output file.
Reference: https://stackoverflow.com/questions/24688548/bash-sort-quoted-csv-files-by-numeric-key
Hope this helps
Solution 2:
The issue is that the quotes make your District Code
field non-numeric even if you use the correct ,
delimiter.
For CSV input, I'd recommend using a CSV-aware tool - as well as handling the quoting, it will handle the header (which will need to be removed and re-inserted if you use plain sort
).
For example you can use csvsort
from the Python-based csvkit
package. Unfortunately it strips unnecessary field quoting by default - but you can put that back with csvformat
from the same package:
csvsort -c "District Code" file.csv | csvformat -U1
Since csvsort performs type inference by default, you get a numeric sort for free - turn inference off using -I
or --no-inference
to see the difference it makes.
Alternatively you can use Miller (available from the Ubuntu universe
repository):
mlr --csv --quote-all sort -n "District Code" file.csv
You can try --csvlite
in place of --csv
if your files fit the simpler format.
If neither csvkit
nor mlr
is available, then you could do something like this using GNU awk
. The basic idea is to create two arrays, both indexed by the record number, with one containing the whole lines and the other containing the (numeric converted) second column values. Then sort the latter numerically and use its sorted indices to print the lines:
gawk -F, '
{line[NR] = $0; gsub(/^"|"$/,"",$2); key[NR] = $2}
END {PROCINFO["sorted_in"] = "@val_num_asc"; for(k in key) print line[k]}
' file.csv
or (more elegantly) using a single array with a custom comparison function
gawk -F, '
function mycmp (i1, v1, i2, v2) {
split(v1,a); gsub(/^"|"$/,"",a[2]);
split(v2,b); gsub(/^"|"$/,"",b[2]);
return a[2] - b[2]
}
{
line[NR] = $0
}
END {
PROCINFO["sorted_in"] = "mycmp";
for (i in line) print line[i]
}
' file.csv
Note that in both cases the header line remains in place because the column name District Code
evaluates numerically as 0
, and none of the column values is less that "1"
. A more formally correct solution would treat the header line with its own NR==1 {print; next}
rule.
See The GNU Awk Users' Guide: 12.2.1 Controlling Array Traversal