awk, sed, or other text processing suggestions, please

Solution 1:

This bash script

#!/bin/bash

PART1=$(echo "$1" | sed 's/\(.*\)\s(.*/\1/')
PART3=$(echo "$1" | sed 's/.*)\(.*\)/\1/')
PART2=$(echo "$1" | sed 's/.*(\s*\(.*\)).*/\1/')

START=$(echo "$PART2" | sed 's/\s*-.*//')
END=$(echo "$PART2" | sed 's/.*-\s*//')

STARTNUM=$(echo "$START" | sed 's/^\(.\).*/\1/')
ENDNUM=$(echo "$END" | sed 's/^\(.\).*/\1/')
# The range must stay within one numeral, e.g. 1A - 1F but not 1A - 2F.
if [ "$STARTNUM" != "$ENDNUM" ]; then
    echo "Error: start and end numerals differ" >&2
    exit 1
fi

STARTLETTER=$(echo "$START" | sed 's/^.\(.\).*/\1/')
ENDLETTER=$(echo "$END" | sed 's/^.\(.\).*/\1/')

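# Walk the alphabet and print one line per letter from STARTLETTER to ENDLETTER.
OUTPUT=''
for LETTER in A B C D E F G H I J K L M N O P Q R S T U V W X Y Z ; do
    # Switch printing on at the start letter ...
    [ "$LETTER" = "$STARTLETTER" ] && OUTPUT='yes'
    [ -n "$OUTPUT" ] && echo "$PART1, $STARTNUM$LETTER,$PART3"
    # ... and off again once the end letter has been printed.
    [ "$LETTER" = "$ENDLETTER" ] && OUTPUT=''
done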

It does what you need when called with the original text as $1, though not very efficiently, since every extraction spawns a separate sed process.

EDIT

As requested, a few words about the sed expressions:

  • PART1 is everything before the whitespace and the opening (
  • PART3 is everything after the closing ), including the leading space
  • PART2 is what lies between ( and ), ignoring leading whitespace
  • START and END are the two halves of PART2, split at the dash, again ignoring whitespace
  • The numeral and the letter are the first and second characters of START and END
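
To make the steps above concrete, here is a minimal sketch that prints each intermediate value for the sample line used in the other answers. The show_parts function is a hypothetical helper, not part of the script above. It writes [[:space:]] where the script uses \s (a GNU extension), so it should also work with other sed implementations. The brackets in the output are only there to make the leading space in PART3 visible.

```shell
#!/bin/bash
# show_parts is a hypothetical helper for illustration only.
# It applies the same sed expressions as the script, one step at a time.
show_parts() {
    input=$1
    # Everything before the whitespace and the opening parenthesis
    part1=$(echo "$input" | sed 's/\(.*\)[[:space:]](.*/\1/')
    # Everything after the closing parenthesis (keeps the leading space)
    part3=$(echo "$input" | sed 's/.*)\(.*\)/\1/')
    # The range between the parentheses
    part2=$(echo "$input" | sed 's/.*([[:space:]]*\(.*\)).*/\1/')
    # Split the range at the dash
    start=$(echo "$part2" | sed 's/[[:space:]]*-.*//')
    end=$(echo "$part2" | sed 's/.*-[[:space:]]*//')
    # First character is the numeral, second is the letter
    num=$(echo "$start" | sed 's/^\(.\).*/\1/')
    letter=$(echo "$start" | sed 's/^.\(.\).*/\1/')
    printf 'PART1=[%s]\nPART2=[%s]\nPART3=[%s]\n' "$part1" "$part2" "$part3"
    printf 'START=[%s] END=[%s]\n' "$start" "$end"
    printf 'NUM=[%s] LETTER=[%s]\n' "$num" "$letter"
}

show_parts 'Gene Code (1A - 1F) D2 fragment, D74F'
```

For that input it prints:

PART1=[Gene Code]
PART2=[1A - 1F]
PART3=[ D2 fragment, D74F]
START=[1A] END=[1F]
NUM=[1] LETTER=[A]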

Solution 2:

If GNU sed is available, its e flag runs the result of the substitution as a shell command:

sed -r 's/([^(]+) \((.)(.) - .(.)\)(.*)/printf \x27\1, \2%s,\5\\n\x27 {\3..\4}/e' <<<'Gene Code (1A - 1F) D2 fragment, D74F'
Gene Code, 1A, D2 fragment, D74F
Gene Code, 1B, D2 fragment, D74F
Gene Code, 1C, D2 fragment, D74F
Gene Code, 1D, D2 fragment, D74F
Gene Code, 1E, D2 fragment, D74F
Gene Code, 1F, D2 fragment, D74F

If the e flag is not available, drop it and pipe the generated command to a shell instead:

sed -r 's/([^(]+) \((.)(.) - .(.)\)(.*)/printf \x27\1, \2%s,\5\\n\x27 {\3..\4}/' <<<'Gene Code (1A - 1F) D2 fragment, D74F'|bash
Gene Code, 1A, D2 fragment, D74F
Gene Code, 1B, D2 fragment, D74F
Gene Code, 1C, D2 fragment, D74F
Gene Code, 1D, D2 fragment, D74F
Gene Code, 1E, D2 fragment, D74F
Gene Code, 1F, D2 fragment, D74F

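(With sh or ksh in place of bash the output is the same, provided that shell supports brace expansion such as {A..F}; dash, for example, does not.)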
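
Both variants work the same way: sed only rewrites the line into a printf command, and the shell does the actual expansion. The sketch below shows that intermediate command before running it. It assumes GNU sed (for -r and \x27) and an installed bash; build_command is a hypothetical name used only for this illustration.

```shell
#!/bin/bash
# build_command is a hypothetical helper: it prints the shell command that
# the sed substitution generates, without executing it.
# Assumes GNU sed (-r and the \x27 escape for a single quote).
build_command() {
    printf '%s\n' "$1" |
        sed -r 's/([^(]+) \((.)(.) - .(.)\)(.*)/printf \x27\1, \2%s,\5\\n\x27 {\3..\4}/'
}

cmd=$(build_command 'Gene Code (1A - 1F) D2 fragment, D74F')

# Show the generated command as plain text.
printf '%s\n' "$cmd"

# Hand it to bash, which expands {A..F}; printf then repeats its format
# string once per resulting argument.
bash -c "$cmd"
```

The first line printed is the generated command:

printf 'Gene Code, 1%s, D2 fragment, D74F\n' {A..F}

The six expanded lines shown above follow it. Because the input text becomes part of a shell command, this approach is only safe on input you trust.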

Solution 3:

A Perl way:

#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';

my $str = '"Gene Code (3D - 3H) D2 fragment, D74F"';
# get begin number, begin letter, end number, end letter
my ($bn,$bl,$en,$el) = $str =~ /\((.)(.) - (.)(.)\)/;
# loop from begin letter to end letter
for my $i ($bl .. $el) {
    # do the substitution and print
    ($_ = $str) =~ s/ \(.. - ..\)/, $bn$i,/ && say;
}

Output:

"Gene Code, 3D, D2 fragment, D74F"
"Gene Code, 3E, D2 fragment, D74F"
"Gene Code, 3F, D2 fragment, D74F"
"Gene Code, 3G, D2 fragment, D74F"
"Gene Code, 3H, D2 fragment, D74F"