Simplest way to extract a substring in a Unix shell?

What's the simplest way to extract a substring in a Unix shell (with regex)?

Simple means:

  • fewer features
  • fewer options
  • less to learn

Update

I realized that regex itself conflicts with simplicity, so I accepted the simplest suggestion, cut, as the answer. I apologize for the vague question. I changed the title to represent the current state of this Q&A more precisely.


Solution 1:

cut might be useful:

$ echo hello | cut -c1,3
hl
$ echo hello | cut -c1-3
hel
$ echo hello | cut -c1-4
hell
$ echo hello | cut -c4-5
lo
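
The ranges above can also be left open at either end. A minimal sketch, assuming a POSIX cut; the variable names are only for illustration:

```shell
#!/bin/sh
# Open-ended character ranges with cut.
word=hello

from_third=$(printf '%s\n' "$word" | cut -c3-)   # character 3 to end of line
first_three=$(printf '%s\n' "$word" | cut -c-3)  # start of line to character 3

echo "$from_third"
echo "$first_three"
```

This should print llo and then hel.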

Shell builtins are good for this too. Here is a sample script:

#!/bin/bash
# Demonstrates the shell's built-in ability to split strings. Saves on
# using sed and awk in shell scripts, which can help performance.

# Treat unset variables as errors ("shopt -o nounset" alone only reports the setting).
set -o nounset
declare -rx       FILENAME=payroll_2007-06-12.txt

# Splits
declare -rx   NAME_PORTION=${FILENAME%.*}     # Left of last .
declare -rx      EXTENSION=${FILENAME#*.}     # Right of first .
declare -rx           NAME=${NAME_PORTION%_*} # Left of last _
declare -rx           DATE=${NAME_PORTION#*_} # Right of first _
declare -rx     YEAR_MONTH=${DATE%-*}         # Left of last -
declare -rx           YEAR=${YEAR_MONTH%-*}   # Left of last -
declare -rx          MONTH=${YEAR_MONTH#*-}   # Right of first -
declare -rx            DAY=${DATE##*-}        # Right of last -

clear

echo "  Variable: (${FILENAME})"
echo "  Filename: (${NAME_PORTION})"
echo " Extension: (${EXTENSION})"
echo "      Name: (${NAME})"
echo "      Date: (${DATE})"
echo "Year/Month: (${YEAR_MONTH})"
echo "      Year: (${YEAR})"
echo "     Month: (${MONTH})"
echo "       Day: (${DAY})"

That outputs:

  Variable: (payroll_2007-06-12.txt)
  Filename: (payroll_2007-06-12)
 Extension: (txt)
      Name: (payroll)
      Date: (2007-06-12)
Year/Month: (2007-06)
      Year: (2007)
     Month: (06)
       Day: (12)
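
When the positions are fixed rather than delimited, bash can also slice by offset and length. This is a sketch that assumes bash (the ${var:offset:length} form is not in POSIX sh), and the variable names are made up for illustration:

```shell
#!/bin/bash
# Offset/length substring expansion (bash; not POSIX sh).
filename=payroll_2007-06-12.txt   # same sample value as the script above

prefix=${filename:0:7}     # 7 characters starting at offset 0
year=${filename:8:4}       # 4 characters starting at offset 8
extension=${filename: -3}  # last 3 characters; the space before - is required

echo "$prefix $year $extension"
```

With this sample value it should print: payroll 2007 txt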

And as per Gnudif above, there are always sed/awk/perl for when the going gets really tough.

Solution 2:

Unix shells do not traditionally have regex support built in. Bash and Zsh both do, so if you use the =~ operator inside [[ ]] to compare a string to a regex, then:

You can get the substrings from the $BASH_REMATCH array in bash.

In Zsh, if the BASH_REMATCH shell option is set, the value is in the $BASH_REMATCH array, else it's in the $MATCH/$match tied pair of variables (one scalar, the other an array). If the RE_MATCH_PCRE option is set, then the PCRE engine is used, else the system regexp libraries, for an extended regexp syntax match, as per bash.

So, most simply: if you're using bash:

if [[ "$variable" =~ unquoted(.*)regex ]]; then
  matched_portion="${BASH_REMATCH[0]}"
  first_substring="${BASH_REMATCH[1]}"
fi
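
Note that BASH_REMATCH[1] and higher are only filled in when the regex contains parenthesized groups. A slightly fuller sketch, assuming bash 3.2 or later; the sample input and variable names are hypothetical:

```shell
#!/bin/bash
# Capture groups with [[ =~ ]]; the pattern lives in a variable so it stays unquoted.
filename=payroll_2007-06-12.txt              # hypothetical sample input
pattern='_([0-9]{4})-([0-9]{2})-([0-9]{2})'  # three parenthesized groups

if [[ $filename =~ $pattern ]]; then
  whole=${BASH_REMATCH[0]}   # entire matched text
  year=${BASH_REMATCH[1]}    # first group
  month=${BASH_REMATCH[2]}   # second group
  day=${BASH_REMATCH[3]}     # third group
  echo "$whole -> $year $month $day"
fi
```

Keeping the pattern in a variable avoids quoting surprises, since a quoted right-hand side of =~ is matched as a literal string.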

If you're not using Bash or Zsh, it gets more complicated as you need to use external commands.

Solution 3:

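Consider also /usr/bin/expr. The substr and match keywords shown here are GNU extensions; POSIX only specifies the STRING : REGEX form.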

$ expr substr hello 2 3
ell

You can also match patterns against the beginning of strings. The result is the number of characters matched, or 0 if there is no match.

$ expr match hello h
1

$ expr match hello hell
4

$ expr match hello e
0

$ expr match hello 'h.*o'
5

$ expr match hello 'h.*l'
4

$ expr match hello 'h.*e'
2
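
To get the matched text itself rather than its length, the portable STRING : REGEX form of expr prints whatever a \( \) group captured. A minimal sketch with a hypothetical sample input:

```shell
#!/bin/sh
# POSIX expr: STRING : REGEX, anchored at the start, basic regular expression.
filename=payroll_2007-06-12.txt   # hypothetical sample input

# With a \( \) group, expr prints the captured text instead of the match length.
year=$(expr "$filename" : '[^_]*_\([0-9]*\)-')

# Without a group, it prints the number of characters matched.
length=$(expr "$filename" : '[^_]*_')

echo "$year $length"
```

Here the first call should give 2007 and the second 8, the length of "payroll_".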

Solution 4:

grep and sed are probably the tools you want, depending on the structure of the text.

sed should do the trick if you do not know what the substring is but know some pattern around it.

For example, if you want to find a substring of digits that follows a "#" sign, you could write something like:

sed -n 's/^.*#\([0-9][0-9]*\).*$/\1/p' yourfile

The trailing .* discards whatever follows the digits, and -n with the p flag prints only the lines that matched.

grep could do something similar, but the question is what you need to do with the substring and whether we are talking about normal newline-delimited text or not.
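
As a rough sketch of the grep route: the -o option prints only the matched text. It is supported by GNU and BSD grep but is not required by POSIX, and the sample line here is made up:

```shell
#!/bin/sh
# grep -o prints only the part of each line that matched.
line='ticket #4521 closed'   # hypothetical sample input

with_hash=$(printf '%s\n' "$line" | grep -o '#[0-9][0-9]*')
digits=$(printf '%s\n' "$with_hash" | cut -c2-)   # drop the leading "#"

echo "$digits"
```

This should print 4521. Since grep has no capture groups, the "#" is stripped in a second step with cut.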