Remove all letters after the first space in lines that start with a specific character
I have a big FASTA file. I want to remove everything after the first space in each header line, i.e. each line that starts with a specific character (>).
Here is an example input file:
>AB3446 human helix ACGTGAGATGGATAGA
GATAGATAGATAGACACA
>AH4567 human beta sheet
ACGTGATAGATGAGACGATGCCC
CACGGGTATATAGCCCAA
Given
$ cat file.fasta
>AB3446 human helix ACGTGAGATGGATAGA
GATAGATAGATAGACACA
>AH4567 human beta sheet
ACGTGATAGATGAGACGATGCCC
CACGGGTATATAGCCCAA
then
$ sed '/^>/ s/ .*//' file.fasta
>AB3446
GATAGATAGATAGACACA
>AH4567
ACGTGATAGATGAGACGATGCCC
CACGGGTATATAGCCCAA
deletes everything from the first space (inclusive) onward on every line that starts with >.
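To illustrate the role of the /^>/ address, here is a small sketch with made-up input (not from the question) in which a non-header line also contains a space. Only the header line should be shortened:

```shell
# Hypothetical two-line input: a header and a sequence line that contains a space
result=$(printf '>id1 some description\nACGT ACGT\n' |
  sed '/^>/ s/ .*//')   # the substitution runs only on lines matching /^>/

printf '%s\n' "$result"
```

The second line is passed through unchanged because it does not start with >.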
Alternatively, with awk:
$ awk '/^>/ {$0=$1} 1' file.fasta
>AB3446
GATAGATAGATAGACACA
>AH4567
ACGTGATAGATGAGACGATGCCC
CACGGGTATATAGCCCAA
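For readers less familiar with awk, the one-liner can be spelled out roughly as follows. This is only a commented sketch of the same idea, run on a shortened, made-up sample:

```shell
result=$(printf '>AB3446 human helix\nGATAGATAGA\n' |
  awk '
    /^>/ { $0 = $1 }   # header line: replace the whole record with its first field
    { print }          # print every line (the trailing "1" in the one-liner does the same)
  ')

printf '%s\n' "$result"
```

Because awk splits fields on any run of blanks or tabs by default, this also handles headers whose ID is followed by a tab.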
Assuming the example data from your question is stored in file.txt, you could use sed to process the text and remove everything after (and including) the first whitespace character in each line starting with a >:
$ sed -r 's/^(>\S+)\s.*/\1/' file.txt
>AB3446
GATAGATAGATAGACACA
>AH4567
ACGTGATAGATGAGACGATGCCC
CACGGGTATATAGCCCAA
If the command sed -r 's/^(>\S+)\s.*/\1/' file.txt produces the right output for you, you can tell it to modify the given file in place, instead of just showing the output, by adding the -i option to this sed command:
sed -r -i 's/^(>\S+)\s.*/\1/' file.txt
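Since in-place editing overwrites the file, it may be worth keeping a backup. A minimal sketch, assuming GNU sed and using a throwaway directory with made-up content so that nothing real is touched:

```shell
# Work in a temporary directory (the file content here is illustrative only)
tmpdir=$(mktemp -d)
printf '>AB3446 human helix\nGATAGATAGA\n' > "$tmpdir/file.txt"

# -i.bak edits file.txt in place and keeps the original as file.txt.bak
sed -r -i.bak 's/^(>\S+)\s.*/\1/' "$tmpdir/file.txt"

edited=$(cat "$tmpdir/file.txt")
backup=$(cat "$tmpdir/file.txt.bak")
printf 'edited:\n%s\nbackup:\n%s\n' "$edited" "$backup"
```

If the result looks wrong, the untouched original is still available under the .bak name.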
What this does is simple. -r enables extended regular expressions, so that the capture group can be written as (...) without escaping the parentheses. The command itself has the form s/PATTERN/REPLACEMENT/.
PATTERN is the regular expression ^(>\S+)\s.* which matches a > character at the beginning of a line (^) followed by at least one non-whitespace character (\S+), a whitespace character (\s, which could be a normal blank, a tab, etc.) and then the whole rest of the line (.* is any number of any characters).
REPLACEMENT is \1 which tells sed to use the content of the first capture group (whatever was matched by the pattern inside the leftmost pair of round parentheses (...)) from the matched line as the replacement. In our case, this is everything up to, but not including, the first whitespace.
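The \s and \S shorthands are, to my knowledge, GNU extensions. A sketch of the same substitution written with -E and POSIX bracket classes, which should be accepted by both GNU and BSD sed, might look like this (the sample input is made up and includes a tab-separated header):

```shell
result=$(printf '>AB3446 human helix\nGATAGATAGA\n>AH4567\tbeta sheet\nACGT\n' |
  sed -E 's/^(>[^[:space:]]+)[[:space:]].*/\1/')
# [^[:space:]]+ plays the role of \S+, and [[:space:]] the role of \s

printf '%s\n' "$result"
```

The pattern and replacement are otherwise structured exactly as described above: one capture group for the header ID, followed by the part that is thrown away.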
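Portable shell way
Using word splitting (note that this truncates every line at the first whitespace, not only the header lines, which is fine as long as the sequence lines contain no spaces):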
$ while read -r one two;do echo "$one" ;done < input.txt
>AB3446
GATAGATAGATAGACACA
>AH4567
ACGTGATAGATGAGACGATGCCC
CACGGGTATATAGCCCAA
With use of case and parameter substitution:
$ while IFS= read -r line;do case "$line" in ">"*) printf "%s\n" "${line%% *}";;*)printf "%s\n" "$line";;esac ;done < input.txt
>AB3446
GATAGATAGATAGACACA
>AH4567
ACGTGATAGATGAGACGATGCCC
CACGGGTATATAGCCCAA
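The case-based loop is easier to read when laid out over several lines. Below is a sketch of the same approach wrapped in a function; the name strip_fasta_headers and the sample input are hypothetical, and the extra [ -n "$line" ] check is an addition so that a final line without a trailing newline is not lost:

```shell
# strip_fasta_headers is a hypothetical helper name, not part of the original answer
strip_fasta_headers() {
  while IFS= read -r line || [ -n "$line" ]; do
    case $line in
      '>'*) printf '%s\n' "${line%% *}" ;;  # header: drop everything from the first space
      *)    printf '%s\n' "$line" ;;        # any other line: print unchanged
    esac
  done
}

# Made-up input; the last line deliberately has no trailing newline
result=$(printf '>AB3446 human helix\nGATA GATA\n>AH4567 beta' | strip_fasta_headers)
printf '%s\n' "$result"
```

Unlike the word-splitting loop, this leaves non-header lines intact even if they contain spaces.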
Perl
$ perl -lane '$_=$F[0] if $F[0] =~ /^>/;print' input.txt
>AB3446
GATAGATAGATAGACACA
>AH4567
ACGTGATAGATGAGACGATGCCC
CACGGGTATATAGCCCAA
Non-portable bash way
$ bash -c 'for((i=0;;i++)); do IFS= read -r line || break; [[ $line =~ ^\> ]] && line=${line/ */} ;echo "$line" ;done' < input.txt
>AB3446
GATAGATAGATAGACACA
>AH4567
ACGTGATAGATGAGACGATGCCC
CACGGGTATATAGCCCAA