awk, sed, or other text processing suggestions, please
Solution 1:
This bash script
#!/bin/bash
PART1=$(echo "$1" | sed 's/\(.*\)\s(.*/\1/')
PART3=$(echo "$1" | sed 's/.*)\(.*\)/\1/')
PART2=$(echo "$1" | sed 's/.*(\s*\(.*\)).*/\1/')
START=$(echo "$PART2" | sed 's/\s*-.*//')
END=$(echo "$PART2" | sed 's/.*-\s*//')
STARTNUM=$(echo "$START" | sed 's/^\(.\).*/\1/')
ENDNUM=$(echo "$END" | sed 's/^\(.\).*/\1/')
if test "$STARTNUM" '!=' "$ENDNUM"; then
echo "Error: Numeral is different"
exit 1
fi
STARTLETTER=$(echo "$START" | sed 's/^.\(.\).*/\1/')
ENDLETTER=$(echo "$END" | sed 's/^.\(.\).*/\1/')
OUTPUT=''
for LETTER in A B C D E F G H I J K L M N O P Q R S T U V W X Y Z ; do
test "$LETTER" '==' "$STARTLETTER" && OUTPUT='yes'
test -n "$OUTPUT" && echo "$PART1, $STARTNUM$LETTER,$PART3"
test "$LETTER" '==' "$ENDLETTER" && OUTPUT=''
done
will do what you need when called with the original text as $1, although it is not very efficient.
EDIT
As requested, a few words about the sed expressions:
- I isolate PART1 by taking everything before the whitespace and the opening (
- I isolate PART3 by taking everything after the closing )
- I isolate PART2 by taking what is between ( and ), ignoring whitespace
- START and END are separated at the dash, again ignoring whitespace
- Number and letter are isolated as the first and second character of START and END
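To make the steps above concrete, here is a small sketch that runs the same sed expressions on the sample line used in the other answers and prints each intermediate value in brackets. The variable name LINE is only for illustration (the script itself reads "$1"), and GNU sed is assumed because of \s.

```bash
#!/bin/bash
# Sample input borrowed from the other answers; LINE stands in for "$1".
LINE='Gene Code (1A - 1F) D2 fragment, D74F'

# Same expressions as in the script above (GNU sed assumed for \s).
PART1=$(echo "$LINE" | sed 's/\(.*\)\s(.*/\1/')     # text before " ("
PART2=$(echo "$LINE" | sed 's/.*(\s*\(.*\)).*/\1/') # text between ( and )
PART3=$(echo "$LINE" | sed 's/.*)\(.*\)/\1/')       # text after )
START=$(echo "$PART2" | sed 's/\s*-.*//')           # left of the dash
END=$(echo "$PART2" | sed 's/.*-\s*//')             # right of the dash

# Brackets make leading whitespace visible.
printf '[%s]\n' "$PART1" "$PART2" "$PART3" "$START" "$END"
```

Note that PART3 keeps its leading space, which is why the echo in the loop has no space after the second comma.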
Solution 2:
If GNU sed is available
sed -r 's/([^(]+) \((.)(.) - .(.)\)(.*)/printf \x27\1, \2%s,\5\\n\x27 {\3..\4}/e' <<<'Gene Code (1A - 1F) D2 fragment, D74F'
Gene Code, 1A, D2 fragment, D74F
Gene Code, 1B, D2 fragment, D74F
Gene Code, 1C, D2 fragment, D74F
Gene Code, 1D, D2 fragment, D74F
Gene Code, 1E, D2 fragment, D74F
Gene Code, 1F, D2 fragment, D74F
If the e flag is not available, drop it and pipe the generated command to the shell:
sed -r 's/([^(]+) \((.)(.) - .(.)\)(.*)/printf \x27\1, \2%s,\5\\n\x27 {\3..\4}/' <<<'Gene Code (1A - 1F) D2 fragment, D74F'|bash
Gene Code, 1A, D2 fragment, D74F
Gene Code, 1B, D2 fragment, D74F
Gene Code, 1C, D2 fragment, D74F
Gene Code, 1D, D2 fragment, D74F
Gene Code, 1E, D2 fragment, D74F
Gene Code, 1F, D2 fragment, D74F
(with sh and ksh the output is the same)
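In case it is not obvious what gets executed: for the sample line, the substitution rewrites the text into a printf command, and the shell does the rest. A minimal sketch of that generated command, assuming a shell with brace-expansion ranges such as bash (the variable name expanded is only for illustration):

```bash
#!/bin/bash
# This is the command text the substitution builds for the sample line.
# The shell expands {A..F} to: A B C D E F
# printf then reuses its format string once for each of those arguments.
expanded=$(printf 'Gene Code, 1%s, D2 fragment, D74F\n' {A..F})
echo "$expanded"
```

A shell without brace expansion would pass {A..F} to printf literally, so the result depends on which shell ends up running the command.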
Solution 3:
A Perl way:
#!/usr/bin/perl
use feature 'say';
my $str = '"Gene Code (3D - 3H) D2 fragment, D74F"';
# get begin number, begin letter, end number, end letter
my ($bn,$bl,$en,$el) = $str =~ /\((.)(.) - (.)(.)\)/;
# loop from begin letter to end letter
for my $i ($bl .. $el) {
# do the substitution and print
($_ = $str) =~ s/ \(.. - ..\)/, $bn$i,/ && say;
}
Output:
"Gene Code, 3D, D2 fragment, D74F"
"Gene Code, 3E, D2 fragment, D74F"
"Gene Code, 3F, D2 fragment, D74F"
"Gene Code, 3G, D2 fragment, D74F"
"Gene Code, 3H, D2 fragment, D74F"