Extract string from string using RegEx in the Terminal [duplicate]
I have a string like 'first url, second url, third url' and would like to extract only the url after the word 'second' in the OS X Terminal (only the first occurrence). How can I do it?
In my favorite editor I used the regex /second (url)/ and used $1 to extract it; I just don't know how to do it in the Terminal.
Keep in mind that url is an actual URL; I'll be using one of these expressions to match it: Regex to match URL
echo 'first url, second url, third url' | sed 's/.*second//'
Edit: I misunderstood. Better:
echo 'first url, second url, third url' | sed 's/.*second \([^, ]*\).*/\1/'
or:
echo 'first url, second url, third url' | perl -nle 'print $1 if /second ([^, ]*)/'
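One thing worth knowing about the sed form: when the pattern does not match, sed prints the input line unchanged, so a caller cannot tell a miss from a hit. A minimal sketch of one way around that, using the standard -n option together with the p flag of the s command. The function name url_after_second is hypothetical, and excluding commas from the captured token is an assumption about the input format.

```shell
# url_after_second is a hypothetical helper name used only for this sketch.
url_after_second() {
    # -n turns off automatic printing; the trailing p prints the line
    # only when the substitution actually happened.
    # Assumption: the wanted token contains neither spaces nor commas.
    printf '%s\n' "$1" | sed -n 's/.*second \([^ ,]*\).*/\1/p'
}

url_after_second 'first url, second url, third url'
url_after_second 'no keyword here'
```

The first call prints the token after "second"; the second call prints nothing, which makes the no-match case easy to test for.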
Piping to another process (like the sed and perl commands suggested above) spawns an external program on every call, which can get expensive when you need to run this operation many times. Bash supports regular expressions natively:
[[ "string" =~ regex ]]
Similar to the way you extract matches in your favourite editor with $1, $2, etc., Bash fills the BASH_REMATCH array: element 0 holds the whole match, and elements 1, 2, ... hold the parenthesized groups.
In your particular example:
str="first url1, second url2, third url3"
if [[ $str =~ (second )([^,]*) ]]; then
    echo "match: '${BASH_REMATCH[2]}'"
else
    echo "no match found"
fi
Output:
match: 'url2'
Specifically, =~ supports extended regular expressions as defined by POSIX, but with platform-specific extensions (which vary in extent and can be incompatible).
On Linux platforms (GNU userland), see man grep; on macOS/BSD platforms, see man re_format.
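For repeated use, the same idea can be wrapped in a function. This is only a sketch: the name after_second is hypothetical, and it assumes the wanted text ends at the next comma. The pattern is stored in a variable, which avoids having to escape the space inside [[ ... ]].

```shell
#!/usr/bin/env bash
# after_second is a hypothetical helper name, not part of Bash.
after_second() {
    # Keeping the pattern in a variable means the space needs no escaping.
    local re='second ([^,]*)'
    if [[ $1 =~ $re ]]; then
        # Index 0 would be the whole match; index 1 is the first group.
        printf '%s\n' "${BASH_REMATCH[1]}"
    else
        return 1
    fi
}

after_second 'first url1, second url2, third url3'
```

The function returns a non-zero status when nothing matches, so it can be used directly in an if condition.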
With the other answer you are still left with everything after the desired URL, so I propose the following solution.
echo 'first url, second url, third url' | sed 's/.*second \(url\).*/\1/'
In sed's default basic regular expressions, you group an expression by escaping the parentheses around it (POSIX standard).
While trying this, what you probably forgot was the -E argument for sed.
From sed --help (GNU sed):
-E, -r, --regexp-extended
use extended regular expressions in the script
(for portability use POSIX -E).
You don't have to change your regex significantly, but you do need to add .* on both sides of it, so that the rest of the string is matched and removed by the substitution.
This works fine for me:
echo "first url, second url, third url" | sed -E 's/.*second (url).*/\1/'
Output:
url
Here the output "url" is the second instance in the string. If you already know that each URL is delimited by a comma and a space, and commas never appear in your URLs, then the regex [^,]* should be fine.
Optionally:
echo "first http://test.url/1, second ://test.url/with spaces/2, third ftp://test.url/3" \
| sed -E 's/.*second ([a-zA-Z]*:\/\/[^,]*).*/\1/'
Which correctly outputs:
://test.url/with spaces/2
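A caveat on "only the first occurrence": the leading .* in these sed patterns is greedy, so if the word "second" appears more than once, sed keeps the text after the last one. The sketch below shows that behaviour on a made-up input and, as a swapped-in alternative rather than anything from the answers above, uses plain POSIX parameter expansion to get the first occurrence. The variable names and the sample string are assumptions for illustration.

```shell
# Sample input with two occurrences of the keyword (made up for this sketch).
s='first a, second b, second c'

# The leading .* is greedy, so sed keeps the text after the LAST "second".
last=$(printf '%s\n' "$s" | sed -E 's/.*second ([^,]*).*/\1/')

# Parameter expansion: strip the shortest prefix ending in "second ",
# then strip everything from the first comma onward.
rest=${s#*second }
first=${rest%%,*}

printf 'sed: %s\nexpansion: %s\n' "$last" "$first"
```

Note that the expansion leaves the string unchanged when the keyword is absent, so a real script would check for the keyword first, for example with a case pattern.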