How can I add a line break after the header of a sequence and before the actual sequence?

I have a file with multiple sequences. The problem is that the ID is followed by a space and then the actual sequence on the same line. I want to add a line break between the ID and the actual sequence.

This is what I have:

UniRef90_Q8YC41 Putative binding protein BMEII0691 MNRFIAFFRSVFLIGLVATAFGRACA

This is what I want it to look like:

UniRef90_Q8YC41 Putative binding protein BMEII0691
MNRFIAFFRSVFLIGLVATAFGRACA

If it's possible, I would rather it look like this:

UniRef90_Q8YC41
MNRFIAFFRSVFLIGLVATAFGRACA

  • Using awk, printing first and last field with \n as delimiter:

    awk '{printf "%s\n%s\n", $1, $NF}' file.txt
    
  • Using sed, capturing the first and last fields in the match and using them in the replacement (GNU sed interprets the \n in the replacement as a newline):

    sed -E 's/([^[:blank:]]+).*[[:blank:]]([^[:blank:]]+)$/\1\n\2/' file.txt
    
  • With perl, similar logic to sed:

    perl -pe 's/^([^\s]+).*\s([^\s]+)/$1\n$2/' file.txt
    
  • Using bash, a slower approach: read each line into an array, then print the first and last elements of the array separated by \n:

    while read -ra line; do printf '%s\n%s\n' "${line[0]}" \
           "${line[$((${#line[@]}-1))]}"; done <file.txt
    
  • With python, creating a list containing whitespace separated elements from each line, then printing the first and last element from the list, separating by \n:

    #!/usr/bin/env python3
    with open("file.txt") as f:
        for line in f:
            line = line.split()
            print(line[0]+'\n'+line[-1])
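
One caveat with the Python version above: a blank line in file.txt makes line.split() return an empty list, so line[0] raises IndexError. A minimal sketch that skips such lines; the names first_and_last and reformat are made up for illustration, and the sample list stands in for the open file:

```python
def first_and_last(line):
    """Return (id, sequence) for one record line, or None if the line is blank."""
    fields = line.split()  # split on any run of whitespace
    if not fields:
        return None  # blank line: nothing to print
    return fields[0], fields[-1]


def reformat(lines):
    """Yield the ID and then the sequence for every non-blank record."""
    for line in lines:
        record = first_and_last(line)
        if record is not None:
            yield from record


# Hypothetical sample input; in practice pass the open file object instead.
sample = [
    "UniRef90_Q8YC41 Putative binding protein BMEII0691 MNRFIAFFRSVFLIGLVATAFGRACA\n",
    "\n",  # a blank line like this breaks the plain line[0] version
]
print("\n".join(reformat(sample)))
```

With a real file, reformat(f) inside the with open("file.txt") as f: block should behave the same way.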
    

Example:

$ cat file.txt                               
UniRef90_Q8YC41 Putative binding protein BMEII0691 MNRFIAFFRSVFLIGLVATAFGRACA
UniRef90_Q8YC41 Putative binding protein BMEII0691 MNRFIAFFRSVFLIGLVATAFGRACA

$ awk '{printf "%s\n%s\n", $1, $NF}' file.txt                             
UniRef90_Q8YC41
MNRFIAFFRSVFLIGLVATAFGRACA
UniRef90_Q8YC41
MNRFIAFFRSVFLIGLVATAFGRACA

$ sed -E 's/([^[:blank:]]+).*[[:blank:]]([^[:blank:]]+)$/\1\n\2/' file.txt
UniRef90_Q8YC41
MNRFIAFFRSVFLIGLVATAFGRACA
UniRef90_Q8YC41
MNRFIAFFRSVFLIGLVATAFGRACA

$ perl -pe 's/^([^\s]+).*\s([^\s]+)/$1\n$2/' file.txt
UniRef90_Q8YC41
MNRFIAFFRSVFLIGLVATAFGRACA
UniRef90_Q8YC41
MNRFIAFFRSVFLIGLVATAFGRACA


$ while read -ra line; do printf '%s\n%s\n' "${line[0]}" "${line[$((${#line[@]}-1))]}"; done <file.txt
UniRef90_Q8YC41
MNRFIAFFRSVFLIGLVATAFGRACA
UniRef90_Q8YC41
MNRFIAFFRSVFLIGLVATAFGRACA

>>> with open("file.txt") as f:
...     for line in f:
...         line = line.split()
...         print(line[0]+'\n'+line[-1])
... 
UniRef90_Q8YC41
MNRFIAFFRSVFLIGLVATAFGRACA
UniRef90_Q8YC41
MNRFIAFFRSVFLIGLVATAFGRACA
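
The commands above all produce the ID-only layout. For the first layout in the question (full header kept, sequence on the next line), here is a rough Python sketch. It assumes the sequence is always the last whitespace-separated field; split_header_and_sequence is a hypothetical helper name:

```python
def split_header_and_sequence(line):
    """Split one record at its last run of whitespace.

    Returns (header, sequence), or None if the line has fewer than two fields.
    """
    parts = line.strip().rsplit(maxsplit=1)  # split only once, from the right
    if len(parts) < 2:
        return None
    return parts[0], parts[1]


# Sample line taken from the question.
line = "UniRef90_Q8YC41 Putative binding protein BMEII0691 MNRFIAFFRSVFLIGLVATAFGRACA\n"
header, sequence = split_header_and_sequence(line)
print(header)
print(sequence)
```

Splitting from the right means the description can contain any number of words without affecting where the break goes.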

Ruby Version

File.open(ARGV[0]) do |f|
  f.each do |line|
    puts "#{line.partition(' ')[0] + "\n" + line.rpartition(' ')[-1]}"
  end
end

Save it under any name, say line_breaker.rb, and run it with ruby line_breaker.rb file.txt, where file.txt is the file in which the sequences are stored.
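
The same partition/rpartition idea also works with Python's str.partition and str.rpartition. This is only a sketch, assuming fields are separated by single spaces and lines have no trailing spaces (these methods look for one literal space, so a tab would not count as a separator); first_and_last_by_partition is a name invented here:

```python
def first_and_last_by_partition(line):
    """Return the text before the first space and the text after the last space."""
    line = line.rstrip("\n")          # drop the trailing newline
    first = line.partition(" ")[0]    # everything before the first space
    last = line.rpartition(" ")[-1]   # everything after the last space
    return first, last


# Sample line taken from the question.
record = "UniRef90_Q8YC41 Putative binding protein BMEII0691 MNRFIAFFRSVFLIGLVATAFGRACA\n"
first, last = first_and_last_by_partition(record)
print(first)
print(last)
```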


In this answer:

  1. bash + xargs one-liner
  2. Python one-liner
  3. Ruby one-liner

1. bash + xargs version.

$> cat input_file.txt  | xargs -L 1 bash -c 'for i; do : ; done ; echo $1;echo $i' bash 

This essentially passes each line to bash as command-line arguments, loops until it reaches the last one, and then echoes the first and the last argument.

Demo:

$> cat input_file.txt                                                                     
UniRef90_Q8YC41 Putative binding protein BMEII0691 MNRFIAFFRSVFLIGLVATAFGRACA
UniRef90_Q8YC41 Putative binding protein BMEII0691 MNRFIAFFRSVFLIGLVATAFGRACA
$> cat input_file.txt  | xargs -L 1 bash -c 'for i; do : ; done ; echo $1;echo $i' bash   
UniRef90_Q8YC41
MNRFIAFFRSVFLIGLVATAFGRACA
UniRef90_Q8YC41
MNRFIAFFRSVFLIGLVATAFGRACA

Even shorter version:

$> cat input_file.txt  | xargs -L 1 bash -c 'echo $1;echo ${@: -1}' bash                  
UniRef90_Q8YC41
MNRFIAFFRSVFLIGLVATAFGRACA
UniRef90_Q8YC41
MNRFIAFFRSVFLIGLVATAFGRACA

2. Python one-liner

This one-liner assembles a list of strings that are basically first word + newline + last word. Finally, it prints all list items as one string joined with newline.

python3 -c 'import sys; print("\n".join([l.split()[0] + "\n" + l.split()[-1] for l in sys.stdin]))' < input_file.txt

Usage demo:

$ python3 -c 'import sys; print("\n".join([l.split()[0] + "\n" + l.split()[-1] for l in sys.stdin]))' < input_file.txt
UniRef90_Q8YC41
MNRFIAFFRSVFLIGLVATAFGRACA
UniRef90_Q8YC41
MNRFIAFFRSVFLIGLVATAFGRACA

3. Ruby one-liner

In this one-liner, the -n flag works like a while gets ... end loop. $_ holds each line as it is read, so for each line we split it into an array of words and then print the first and the last one.

$ ruby -ne 'words=$_.split(); puts words[0],words[-1]' < input_file.txt                   
UniRef90_Q8YC41
MNRFIAFFRSVFLIGLVATAFGRACA
UniRef90_Q8YC41
MNRFIAFFRSVFLIGLVATAFGRACA