Simplest way of extracting a substring in a Unix shell?
What's the simplest way to extract a substring in a Unix shell (with regex)?
Simple means:
- fewer features
- fewer options
- less to learn
Update
I realized that regex itself conflicts with simplicity, so I picked the simplest one, cut, as the accepted answer. I am sorry for the vague question. I changed the title to represent the current state of this Q&A more precisely.
Solution 1:
cut might be useful:
$ echo hello | cut -c1,3
hl
$ echo hello | cut -c1-3
hel
$ echo hello | cut -c1-4
hell
$ echo hello | cut -c4-5
lo
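The ranges above can also be open-ended, and cut can split on a delimiter instead of counting characters. A minimal sketch, assuming illustrative inputs (the variable names and the sample filename are not part of the question):

```shell
# Illustrative inputs; the filename is only a sample.
word=hello
file=payroll_2007-06-12.txt

prefix=$(printf '%s\n' "$word" | cut -c-3)      # characters 1 through 3
suffix=$(printf '%s\n' "$word" | cut -c3-)      # character 3 through the end
name=$(printf '%s\n' "$file" | cut -d_ -f1)     # first "_"-separated field
date_part=$(printf '%s\n' "$file" | cut -d_ -f2 | cut -d. -f1)  # field 2, then text before the first "."

printf '%s\n' "$prefix" "$suffix" "$name" "$date_part"
```

This should print hel, llo, payroll and 2007-06-12, one per line.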
Shell builtins are good for this too. Here is a sample script:
#!/bin/bash
# Demonstrates the shell's built-in ability to split strings. Saves on
# using sed and awk in shell scripts, which can help performance.
shopt -s -o nounset # Treat unset variables as errors
declare -rx FILENAME=payroll_2007-06-12.txt
# Splits
declare -rx NAME_PORTION=${FILENAME%.*} # Left of last .
declare -rx EXTENSION=${FILENAME#*.} # Right of first .
declare -rx NAME=${NAME_PORTION%_*} # Left of last _
declare -rx DATE=${NAME_PORTION#*_} # Right of first _
declare -rx YEAR_MONTH=${DATE%-*} # Left of last -
declare -rx YEAR=${YEAR_MONTH%-*} # Left of -
declare -rx MONTH=${YEAR_MONTH#*-} # Right of -
declare -rx DAY=${DATE##*-} # Right of last -
clear
echo " Variable: (${FILENAME})"
echo " Filename: (${NAME_PORTION})"
echo " Extension: (${EXTENSION})"
echo " Name: (${NAME})"
echo " Date: (${DATE})"
echo "Year/Month: (${YEAR_MONTH})"
echo " Year: (${YEAR})"
echo " Month: (${MONTH})"
echo " Day: (${DAY})"
That outputs:
Variable: (payroll_2007-06-12.txt)
Filename: (payroll_2007-06-12)
Extension: (txt)
Name: (payroll)
Date: (2007-06-12)
Year/Month: (2007-06)
Year: (2007)
Month: (06)
Day: (12)
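The script uses four expansion operators: % and # strip the shortest matching suffix or prefix, while %% and ## strip the longest. The difference only shows when the separator occurs more than once, so here is a small sketch with a hypothetical filename containing two dots (the filename and variable names are assumptions for illustration):

```shell
file=archive.tar.gz          # hypothetical name with two dots

base_short=${file%.*}        # strip shortest suffix matching ".*"  -> archive.tar
base_long=${file%%.*}        # strip longest suffix matching ".*"   -> archive
ext_long=${file#*.}          # strip shortest prefix matching "*."  -> tar.gz
ext_short=${file##*.}        # strip longest prefix matching "*."   -> gz

printf '%s\n' "$base_short" "$base_long" "$ext_long" "$ext_short"
```

These four forms are part of POSIX sh, so they should not need Bash specifically.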
And as per Gnudif above, there are always sed/awk/perl for when the going gets really tough.
Solution 2:
Unix shells do not traditionally have regex support built in. Bash and Zsh both do, so if you use the =~ operator to compare a string to a regex, then:
- In Bash, you can get the substrings from the $BASH_REMATCH array.
- In Zsh, if the BASH_REMATCH shell option is set, the value is in the $BASH_REMATCH array; otherwise it's in the $MATCH/$match tied pair of variables (one scalar, the other an array). If the RE_MATCH_PCRE option is set, the PCRE engine is used; otherwise the system regexp libraries are used, for an extended regexp syntax match, as per Bash.
So, most simply, if you're using Bash:
if [[ "$variable" =~ unquoted(.*)regex ]]; then
    matched_portion="${BASH_REMATCH[0]}"   # text matched by the whole regex
    first_substring="${BASH_REMATCH[1]}"   # text matched by the first (...) group
fi
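As a concrete illustration of the pattern above, here is a hedged sketch that pulls the parts out of a sample filename. It requires Bash; the filename, the regex, and the variable names are assumptions chosen for the example:

```shell
# Bash only: [[ ... =~ ... ]] and BASH_REMATCH are not available in plain sh.
filename=payroll_2007-06-12.txt   # sample input

# Keeping the regex in a variable avoids quoting surprises; it must be
# expanded unquoted on the right-hand side of =~ to be treated as a regex.
pattern='^([a-z]+)_([0-9]{4})-([0-9]{2})-([0-9]{2})\.txt$'

if [[ "$filename" =~ $pattern ]]; then
    name="${BASH_REMATCH[1]}"    # first parenthesized group
    year="${BASH_REMATCH[2]}"
    month="${BASH_REMATCH[3]}"
    day="${BASH_REMATCH[4]}"
    echo "$name $year $month $day"
fi
```

With the sample input this should print "payroll 2007 06 12". Index 0 of the array holds the whole match; groups are numbered from 1 in the order their opening parentheses appear.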
If you're not using Bash or Zsh, it gets more complicated as you need to use external commands.
Solution 3:
Consider also /usr/bin/expr.
$ expr substr hello 2 3
ell
You can also match patterns against the beginning of strings.
$ expr match hello h
1
$ expr match hello hell
4
$ expr match hello e
0
$ expr match hello 'h.*o'
5
$ expr match hello 'h.*l'
4
$ expr match hello 'h.*e'
2
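Note that the substr and match keywords are GNU extensions; the form specified by POSIX is expr STRING : REGEX, which is likewise anchored at the start of the string. If the regex contains a \( \) group, expr prints the captured text instead of the match length. A minimal sketch, with an illustrative filename and variable names that are assumptions:

```shell
s=hello
file=payroll_2007-06-12.txt   # sample filename, for illustration only

match_len=$(expr "$s" : 'h.*l')                       # length of the match
middle=$(expr "$s" : 'h\(.*\)o')                      # text captured by \( \)
date_part=$(expr "$file" : '.*_\([0-9-]*\)\.txt$')    # text between the last "_" and ".txt"

printf '%s\n' "$match_len" "$middle" "$date_part"
```

This should print 4, ell and 2007-06-12. The regex is a basic regular expression, so groups are written \( \) rather than ( ).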
Solution 4:
grep and sed are probably the tools you want, depending on the structure of the text.
sed should do the trick if you do not know what the substring is but do know some pattern that surrounds it.
For example, if you want to find a run of digits preceded by a "#" sign, you could write something like:
sed 's/^.*#\([0-9]\+\)/\1/g' yourfile
grep could do something similar, but the question is what you need to do with the substring and whether we are talking about normal newline-delimited text or not.
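One caveat with the sed command above: it leaves any text after the digits in place and prints non-matching lines unchanged. A hedged variant that prints only the digits, plus a grep equivalent, might look like the sketch below. The sample line and variable names are assumptions, and grep -o is a widely available extension rather than a POSIX option:

```shell
# Sample input, for illustration only.
line='ticket #1234 closed'

# sed: -n suppresses default output; the trailing .* consumes the rest of the
# line and the p flag prints only lines where the substitution succeeded.
# [0-9][0-9]* is the portable spelling of the GNU-specific [0-9]\+.
with_sed=$(printf '%s\n' "$line" | sed -n 's/^.*#\([0-9][0-9]*\).*$/\1/p')

# grep: -o prints only the matched part ("#1234"); cut then drops the "#".
with_grep=$(printf '%s\n' "$line" | grep -o '#[0-9][0-9]*' | cut -c2-)

printf '%s\n' "$with_sed" "$with_grep"
```

Both pipelines should print 1234 for the sample line. If a line contains several "#" signs, the sed version keeps the last one that is followed by digits, whereas grep -o prints every match on its own line.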