Remove everything after the first space in lines that start with a specific character

I have a big FASTA file. I want to remove everything after the first space in each header line, i.e. each line that starts with a specific character (>).

Here is an example input file:

>AB3446 human helix ACGTGAGATGGATAGA 
GATAGATAGATAGACACA 
>AH4567 human beta sheet 
ACGTGATAGATGAGACGATGCCC 
CACGGGTATATAGCCCAA

Given

$ cat file.fasta 
>AB3446 human helix ACGTGAGATGGATAGA 
GATAGATAGATAGACACA 
>AH4567 human beta sheet 
ACGTGATAGATGAGACGATGCCC 
CACGGGTATATAGCCCAA

then

$ sed '/^>/ s/ .*//' file.fasta 
>AB3446
GATAGATAGATAGACACA 
>AH4567
ACGTGATAGATGAGACGATGCCC 
CACGGGTATATAGCCCAA

This deletes everything from the first space (inclusive) to the end of every line that starts with >.
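To make the role of the /^>/ address visible, here is a minimal sketch on made-up data (the sample lines and the temporary file are assumptions for illustration, not part of the question). It includes a sequence line that contains a space, to show that only header lines are touched:

```shell
# Build a tiny sample file in a temporary location (hypothetical data).
sample=$(mktemp)
printf '%s\n' '>AB3446 human helix' 'GATA GATA' '>AH4567 human beta sheet' 'ACGT' > "$sample"

# The address /^>/ restricts the substitution to header lines;
# the sequence line "GATA GATA" keeps its space.
result=$(sed '/^>/ s/ .*//' "$sample")
printf '%s\n' "$result"

rm -f "$sample"
```

Without the address, the same substitution would also cut "GATA GATA" down to "GATA".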


Alternatively, with awk:

$ awk '/^>/ {$0=$1} 1' file.fasta 
>AB3446
GATAGATAGATAGACACA 
>AH4567
ACGTGATAGATGAGACGATGCCC 
CACGGGTATATAGCCCAA
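If the result should replace the original file, one possible approach with awk is to write to a temporary file and rename it afterwards. This is only a sketch: the sample file and its contents are made up, and the file names come from mktemp.

```shell
# Hypothetical sample file created only for this sketch.
fasta=$(mktemp)
printf '%s\n' '>AB3446 human helix' 'GATAGATA' > "$fasta"

# Write the result to a temporary file first and replace the original
# only if awk succeeded, so a failure cannot truncate the input.
# Note: the replaced file takes on the temporary file's permissions.
tmp=$(mktemp)
awk '/^>/ {$0=$1} 1' "$fasta" > "$tmp" && mv "$tmp" "$fasta"

result=$(cat "$fasta")
printf '%s\n' "$result"

rm -f "$fasta"
```

Because awk splits fields on any run of blanks, this variant also cuts at a tab, whereas the sed command above only cuts at a space character.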

Assuming the example data from your question is stored in file.txt, you could use sed to process the text and remove everything after (and including) the first whitespace character in each line starting with a >:

$ sed -r 's/^(>\S+)\s.*/\1/' file.txt
>AB3446
GATAGATAGATAGACACA 
>AH4567
ACGTGATAGATGAGACGATGCCC 
CACGGGTATATAGCCCAA

If the command sed -r 's/^(>\S+)\s.*/\1/' file.txt produces the right output for you, you can tell it to modify the given file in-place, instead of just showing the output, by adding the -i option to this sed command:

sed -r -i 's/^(>\S+)\s.*/\1/' file.txt

How it works: -r enables extended regular expressions, so the capture group and the + quantifier can be written without backslashes (GNU sed also accepts -E for this, which BSD sed understands as well). The command itself has the form s/PATTERN/REPLACEMENT/.

PATTERN is the regular expression ^(>\S+)\s.* which matches a > character at the beginning of a line (^) followed by at least one non-whitespace character (\S+), a whitespace character (\s, could be a normal blank, tab, etc.) and then the whole rest of the line (.* is any number of any characters).

REPLACEMENT is \1, which tells sed to use the content of the first capture group (whatever was matched by the pattern inside the leftmost pair of parentheses) as the replacement. In our case, this is everything up to, but not including, the first whitespace character.
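The \s and \S shorthands are GNU extensions, so a sed without them may need the POSIX character classes instead. The following sketch uses that spelling on made-up header lines (the input data is an assumption for illustration) and shows that a tab after the identifier is handled as well:

```shell
# Hypothetical input: one header separated by a tab, one by a space,
# plus a sequence line that must stay unchanged.
input=$(printf '>AB3446\thuman helix\n>AH4567 human beta sheet\nACGT ACGT\n')

# -E selects extended regular expressions; [[:space:]] and [^[:space:]]
# are the POSIX bracket-expression spellings of \s and \S.
output=$(printf '%s\n' "$input" | sed -E 's/^(>[^[:space:]]+)[[:space:]].*/\1/')
printf '%s\n' "$output"
```

The sequence line "ACGT ACGT" does not start with >, so the pattern does not match and the line is printed unchanged.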


Portable shell way

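With use of word splitting (note that this cuts every line at the first whitespace, not only the header lines, which is harmless as long as sequence lines contain no blanks):

$ while read -r one two;do echo "$one" ;done < input.txt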
>AB3446
GATAGATAGATAGACACA
>AH4567
ACGTGATAGATGAGACGATGCCC
CACGGGTATATAGCCCAA

With use of case and parameter substitution:

$ while IFS= read -r line; do case "$line" in ">"*) printf "%s\n" "${line%% *}";; *) printf "%s\n" "$line";; esac; done < input.txt
>AB3446
GATAGATAGATAGACACA 
>AH4567
ACGTGATAGATGAGACGATGCCC 
CACGGGTATATAGCCCAA
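The one-liner above is easier to follow when spread over several lines. Here is a sketch of the same case plus parameter-expansion idea as a function; the name strip_fasta_headers and the sample input are made up for illustration, and the extra test for a final line without a trailing newline is an addition, not part of the original command:

```shell
# strip_fasta_headers is a hypothetical helper name.
strip_fasta_headers() {
    # IFS= and -r keep blanks and backslashes in each line intact.
    # "|| [ -n "$line" ]" also processes a last line lacking a newline.
    while IFS= read -r line || [ -n "$line" ]; do
        case $line in
            '>'*) printf '%s\n' "${line%% *}" ;;  # header: drop from first space on
            *)    printf '%s\n' "$line" ;;        # other lines: print unchanged
        esac
    done
}

# Usage on a real file would be: strip_fasta_headers < input.txt
result=$(printf '%s\n' '>AB3446 human helix' 'GATAGATA ' '>AH4567 human beta sheet' 'ACGT' | strip_fasta_headers)
printf '%s\n' "$result"
```

${line%% *} removes the longest suffix matching a space followed by anything, which is why the cut happens at the first space rather than the last. For a large FASTA file, the sed or awk versions will typically be much faster than a shell read loop.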

Perl

$ perl -lane '$_=$F[0] if $F[0] =~ /^>/;print' input.txt
>AB3446
GATAGATAGATAGACACA 
>AH4567
ACGTGATAGATGAGACGATGCCC 
CACGGGTATATAGCCCAA

Non-portable bash way

$ bash -c 'while IFS= read -r line; do [[ $line =~ ^\> ]] && line=${line/ */}; echo "$line"; done' < input.txt
>AB3446
GATAGATAGATAGACACA 
>AH4567
ACGTGATAGATGAGACGATGCCC 
CACGGGTATATAGCCCAA