Sorting based on text file

I'm trying to sort a text file based on the second field in the columns, so basically I have a file like

"Year","District Code","District Name","Neighborhood Code","Neighborhood Name","Gender","Number"
"2017","1","Ciutat Vella","1","el Raval","Boys",283
"2017","1","Ciutat Vella","2","el Barri Gòtic","Boys",56
"2017","1","Ciutat Vella","3","la Barceloneta","Boys",51
"2017","1","Ciutat Vella","4","Sant Pere, Santa Caterina i la Ribera","Boys",90
"2017","2","Eixample","5","el Fort Pienc","Boys",117
"2017","2","Eixample","6","la Sagrada Família","Boys",207
"2017","2","Eixample","7","la Dreta de l'Eixample","Boys",185
"2017","2","Eixample","8","l'Antiga Esquerra de l'Eixample","Boys",176
"2017","2","Eixample","9","la Nova Esquerra de l'Eixample","Boys",218
"2017","2","Eixample","10","Sant Antoni","Boys",172
"2017","3","Sants-Montjuïc","11","el Poble Sec","Boys",156
"2017","3","Sants-Montjuïc","12","la Marina del Prat Vermell","Boys",4
"2017","3","Sants-Montjuïc","13","la Marina de Port","Boys",119
"2017","3","Sants-Montjuïc","14","la Font de la Guatlla","Boys",42
"2017","3","Sants-Montjuïc","15","Hostafrancs","Boys",78
"2017","3","Sants-Montjuïc","16","la Bordeta","Boys",81
"2017","3","Sants-Montjuïc","17","Sants - Badal","Boys",108
"2017","3","Sants-Montjuïc","18","Sants","Boys",155
"2017","4","Les Corts","19","les Corts","Boys",199
"2017","4","Les Corts","20","la Maternitat i Sant Ramon","Boys",101
"2017","4","Les Corts","21","Pedralbes","Boys",32
"2017","5","Sarrià-Sant Gervasi","22","Vallvidrera, el Tibidabo i les Planes","Boys",17
"2017","5","Sarrià-Sant Gervasi","23","Sarrià","Boys",122
"2017","5","Sarrià-Sant Gervasi","24","les Tres Torres","Boys",74
"2017","5","Sarrià-Sant Gervasi","25","Sant Gervasi - la Bonanova","Boys",104
"2017","5","Sarrià-Sant Gervasi","26","Sant Gervasi - Galvany","Boys",206
"2017","5","Sarrià-Sant Gervasi","27","el Putxet i el Farró","Boys",127
"2017","6","Gràcia","28","Vallcarca i els Penitents","Boys",66
"2017","6","Gràcia","29","el Coll","Boys",26
"2017","6","Gràcia","30","la Salut","Boys",62
"2017","6","Gràcia","31","la Vila de Gràcia","Boys",241
"2017","6","Gràcia","32","el Camp d'en Grassot i Gràcia Nova","Boys",149
"2017","7","Horta-Guinardó","33","el Baix Guinardó","Boys",97
"2017","7","Horta-Guinardó","34","Can Baró","Boys",41
"2017","7","Horta-Guinardó","35","el Guinardó","Boys",148
"2017","7","Horta-Guinardó","36","la Font d'en Fargues","Boys",26
"2017","7","Horta-Guinardó","37","el Carmel","Boys",123
"2017","7","Horta-Guinardó","38","la Teixonera","Boys",48
"2017","7","Horta-Guinardó","39","Sant Genís dels Agudells","Boys",28
"2017","7","Horta-Guinardó","40","Montbau","Boys",26
"2017","7","Horta-Guinardó","41","la Vall d'Hebron","Boys",12
"2017","7","Horta-Guinardó","42","la Clota","Boys",8
"2017","7","Horta-Guinardó","43","Horta","Boys",9

I want to sort the whole file based on the District code and then print out the results, I just copied a couple of lines here because the file has more than 1000 lines so. I tried several commands:

cat $1 | sort -n -k 2
cat $1 | sort -t \" -n -k 2

But it seems to do absolutely nothing?

There are also some lines that are not sorted like:

"2017","10","Sant Martí","64","el Camp de l'Arpa del Clot","Boys",148
"2017","10","Sant Martí","65","el Clot","Boys",125
"2017","10","Sant Martí","66","el Parc i la Llacuna del Poblenou","Boys",79
"2017","10","Sant Martí","67","la Vila Olímpica del Poblenou","Boys",34
"2017","10","Sant Martí","68","el Poblenou","Boys",160
"2017","10","Sant Martí","69","Diagonal Mar i el Front Marítim del Poblenou","Boys",103
"2017","10","Sant Martí","70","el Besòs i el Maresme","Boys",119
"2017","10","Sant Martí","71","Provençals del Poblenou","Boys",106
"2017","10","Sant Martí","72","Sant Martí de Provençals","Boys",100
"2017","10","Sant Martí","73","la Verneda i la Pau","Boys",122
"2017","1","Ciutat Vella","1","el Raval","Girls",219
"2017","1","Ciutat Vella","2","el Barri Gòtic","Girls",53
"2017","1","Ciutat Vella","3","la Barceloneta","Girls",50
"2017","1","Ciutat Vella","4","Sant Pere, Santa Caterina i la Ribera","Girls",90
"2017","2","Eixample","5","el Fort Pienc","Girls",114
"2017","2","Eixample","6","la Sagrada Família","Girls",183
"2017","2","Eixample","7","la Dreta de l'Eixample","Girls",159

So just in case you were wondering. Thanks...


Solution 1:

Try this code:

sort -t'"' -n -k4 "input.txt" > "output.txt"

What is going on?

It uses double quotes " as the delimiter. This lets us get to the second quoted column of numbers as "numbers" and not strings. But it messes up the counting of columns. Because each line starts with a " sort considers an empty column before the "Year". Similarly, the comma separator between the "Year" and the "District Code" is also considered a column. Thus the District Code is -k4, and not -k2.

If we use comma , as the delimiter, the District Codes are quoted in double quotes, and are treated as strings by sort.

The input.txt is the name of the input file. And the output.txt is the name of the output file.

Reference: https://stackoverflow.com/questions/24688548/bash-sort-quoted-csv-files-by-numeric-key

Hope this helps

Solution 2:

The issue is that the quotes make your District Code field non-numeric even if you use the correct , delimiter.

For CSV input, I'd recommend using a CSV-aware tool - as well as handling the quoting, it will handle the header (which will need to be removed and re-inserted if you use plain sort).

For example you can use csvsort from the Python-based csvkit package. Unfortunately it strips unnecessary field quoting by default - but you can put that back with csvformat from the same package:

csvsort -c "District Code" file.csv | csvformat -U1

Since csvsort performs type inference by default, you get a numeric sort for free - turn inference off using -I or --no-inference to see the difference it makes.

Alternatively you can use Miller (available from the Ubuntu universe repository):

mlr --csv --quote-all sort -n "District Code" file.csv

You can try --csvlite in place of --csv if your files fit the simpler format.


If neither csvkit nor mlr is available, then you could do something like this using GNU awk. The basic idea is to create two arrays, both indexed by the record number, with one containing the whole lines and the other containing the (numeric converted) second column values. Then sort the latter numerically and use its sorted indices to print the lines:

gawk -F, '
  {line[NR] = $0; gsub(/^"|"$/,"",$2); key[NR] = $2} 
  END {PROCINFO["sorted_in"] = "@val_num_asc"; for(k in key) print line[k]}
' file.csv

or (more elegantly) using a single array with a custom comparison function

gawk -F, '
  function mycmp (i1, v1, i2, v2) {
    split(v1,a); gsub(/^"|"$/,"",a[2]); 
    split(v2,b); gsub(/^"|"$/,"",b[2]); 
    return a[2] - b[2]
  }
  {
    line[NR] = $0
  } 
  END {
    PROCINFO["sorted_in"] = "mycmp"; 
    for (i in line) print line[i]
  }
' file.csv

Note that in both cases the header line remains in place because the column name District Code evaluates numerically as 0, and none of the column values is less that "1". A more formally correct solution would treat the header line with its own NR==1 {print; next} rule.

See The GNU Awk Users' Guide: 12.2.1 Controlling Array Traversal