Sort a file according to a field starting with string

Try the following bash command:

sort -t- -d -k2 -o output.txt input.txt

The command takes four options followed by the name of the input file, input.txt. If that file is not in the current directory, give its path instead, for example path/to/the/folder/input.txt. The options and their arguments are as follows:

  • -t sets the field separator. Here it is -, so the text before and after each - is treated as a separate column.
  • -d selects dictionary order: only blanks and alphanumeric characters are considered when comparing, so punctuation is ignored. For example, Apple sorts before Berry.
  • -k2 sets the sort key to start at the second column; since no end position is given, the key runs to the end of the line. Note the first column is everything before the first -, for example /home/zz/BOOKS/Author. The second column is the text between the first and the second -, that is, Artemis.
  • -o output.txt writes the sorted output to a file rather than to the terminal.
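To see the command in action, here is a minimal sketch on made-up data. The three file names and the temporary directory are assumptions for illustration; only the Author-<name>-<title> pattern comes from the question. LC_ALL=C is added so the ordering does not depend on your locale.

```shell
# Hypothetical sample input, written to a scratch directory.
tmpdir=$(mktemp -d)
cat > "$tmpdir/input.txt" <<'EOF'
/home/zz/BOOKS/Author-Zola-Germinal.pdf
/home/zz/BOOKS/Author-Artemis-Title.pdf
/home/zz/BOOKS/Author-Melville-Moby.pdf
EOF

# Sort on everything from the second "-"-separated column onwards.
LC_ALL=C sort -t- -d -k2 -o "$tmpdir/output.txt" "$tmpdir/input.txt"

# The Artemis line should now come first and the Zola line last.
cat "$tmpdir/output.txt"
```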

Hope this helps


Although it's overkill for the present example because of the solution proposed in user68186's answer, you could more generally do something like this in GNU awk:

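gawk -F/ '
  # Compare two stored lines (v1, v2) by their last FS-separated field.
  function mycmp(i1,v1,i2,v2) {
    m = split(v1,a);
    n = split(v2,b);
    # Appending "" forces a string comparison.
    return a[m]"" > b[n]"" ? 1 : a[m]"" < b[n]"" ? -1 : 0
  }
  {
    # Keep every input line in memory.
    lines[NR] = $0
  }
  END {
    # Make the for-in loop visit the elements in mycmp order.
    PROCINFO["sorted_in"] = "mycmp";
    for(i in lines) print lines[i]
  }
' file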

Note that it sorts according to the lexical value of everything after the last /, so if the format is Author-<author name>-<title>.<extension> the sort key will be

  • the fixed string Author- (which has no effect, since it has the same weight for all lines); then
  • <author name>-; then
  • <title>.; then
  • <extension>

This is similar to how GNU sort's simple KEYDEF -t- -k2 works, i.e. the effective sort key starts at the <author name> and continues to the end of the line.
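The difference between an open-ended key and one that stops at the second field can be checked with a small sketch. The three sample lines are made up, and -s (stable sort) is added in the second command only so that ties are left in input order instead of being broken by sort's last-resort whole-line comparison.

```shell
# Hypothetical data: two lines share the same second field "b".
data=$(printf '%s\n' 'x-b-2' 'x-a-9' 'x-b-1')

# -k2: the key runs from field 2 to the end of the line,
# so the third field decides between the two "b" lines.
open_key=$(printf '%s\n' "$data" | LC_ALL=C sort -t- -k2)

# -k2,2 with -s: the key is field 2 only, and lines with
# equal keys keep their input order.
closed_key=$(printf '%s\n' "$data" | LC_ALL=C sort -s -t- -k2,2)

printf 'open-ended key:\n%s\n' "$open_key"
printf 'field 2 only:\n%s\n' "$closed_key"
```

With -k2 the two b lines come out as x-b-1, x-b-2; with -s -k2,2 they stay as x-b-2, x-b-1.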

An explicit delimiter is omitted from the split calls so that they inherit the value of FS, making it easy to change for systems that use a different path separator. The appended empty strings "" in the mycmp function force lexical comparison even if the filenames are numerical; see for example How awk Converts Between Strings and Numbers in the gawk manual.
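A quick way to see what the appended "" changes is to compare two numeric-looking fields both ways. This is only an illustration with made-up input (10 and 9); it uses plain awk syntax, which gawk accepts as well.

```shell
# Fields that look like numbers are compared numerically,
# unless concatenating "" turns them into plain strings.
result=$(printf '10 9\n' | awk '{
  num = ($1 < $2)          # numeric: 10 < 9 is false
  str = ($1 "" < $2 "")    # string: "10" < "9" is true ("1" sorts before "9")
  print num, str
}')

echo "$result"   # prints: 0 1
```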


If you'd rather stick with the sort command, you could leverage GNU awk's Two-Way Communications with Another Process to:

  • duplicate the last /-separated field at the start of the string
  • pass the result to a sort command
  • read back the sorted result, remove the duplicated prefix and print

i.e.

gawk -F/ '
  BEGIN {OFS=FS; cmd = "sort -d"}
  {print $NF $0 |& cmd}
  END {
    close(cmd,"to");
    while((cmd |& getline) > 0){$1 = ""; print};
    close(cmd,"from")
  }
' file

There's a bit of a cheat here in that the absolute paths (lines start with /) imply an initial empty field; to handle relative paths you'd need to change print $NF $0 to print $NF,$0 to insert the "missing" separator, and then perhaps use a regex sub() instead of the simpler $1 = "" to remove the leading element.
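The relative-path variant described above might look something like the sketch below. The function name sort_by_basename, the sample file and its three paths are all hypothetical; the sub() pattern is one possible way to strip the duplicated prefix, not the only one. LC_ALL=C is set purely to keep the ordering reproducible.

```shell
# Hypothetical helper: sort lines by their last "/"-separated field.
sort_by_basename() {
  LC_ALL=C gawk -F/ '
    BEGIN { OFS = FS; cmd = "sort -d" }
    # Prefix each line with its last field plus an explicit separator.
    { print $NF, $0 |& cmd }
    END {
      close(cmd, "to")
      while ((cmd |& getline line) > 0) {
        # Remove the duplicated prefix up to and including the first "/".
        sub("^[^/]*/", "", line)
        print line
      }
      close(cmd)
    }
  ' "$@"
}

# Made-up input mixing relative, absolute and bare file names.
sample=$(mktemp)
cat > "$sample" <<'EOF'
books/Author-Zola-Germinal.pdf
/home/zz/BOOKS/Author-Artemis-Title.pdf
Author-Melville-Moby.pdf
EOF

sort_by_basename "$sample"
```

Because the separator is always inserted and always removed, the same code handles absolute paths too, at the cost of a slightly more expensive sub() per line.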

As well as potentially being faster and more memory-efficient than the pure gawk solution, this allows other sort options to be added straightforwardly, e.g. cmd = "sort -d -t " FS " -k1,1r".