Non greedy (reluctant) regex matching in sed?

I'm trying to use sed to clean up lines of URLs to extract just the domain.

So from:

http://www.suepearson.co.uk/product/174/71/3816/

I want:

http://www.suepearson.co.uk/

(either with or without the trailing slash, it doesn't matter)

I have tried:

 sed 's|\(http:\/\/.*?\/\).*|\1|'

and (escaping the non-greedy quantifier)

sed 's|\(http:\/\/.*\?\/\).*|\1|'

but I cannot seem to get the non-greedy quantifier (?) to work, so it always ends up matching the whole string.


Solution 1:

Neither basic nor extended POSIX/GNU regex recognizes the non-greedy quantifier; you need a more recent regex flavor. Fortunately, Perl regex for this context is pretty easy to get:

perl -pe 's|(http://.*?/).*|\1|'
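
As a quick check, the one-liner can be run on the URL from the question. This is only a sketch: the function name domain_perl is made up for illustration, and it assumes perl is installed and on the PATH.

```shell
#!/bin/sh
# domain_perl is a hypothetical helper name, not part of Perl or sed.
# .*? is Perl's lazy quantifier: it stops at the first "/" after "http://".
domain_perl() {
    perl -pe 's|(http://.*?/).*|\1|'
}

printf '%s\n' 'http://www.suepearson.co.uk/product/174/71/3816/' | domain_perl
# prints: http://www.suepearson.co.uk/
```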

Solution 2:

In this specific case, you can get the job done without using a non-greedy regex.

Try the negated character class [^/]* instead of .*?. It is still greedy, but it cannot match past the next /:

sed 's|\(http://[^/]*/\).*|\1|g'
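
To see that the character class really stops at the first slash, here is a small sketch that runs the same substitution over two sample lines. The helper name strip_path and the second URL are invented for the example; the second line contains another http:// further along, which a greedy .* would have run into.

```shell
#!/bin/sh
# strip_path is a hypothetical helper name used only for this illustration.
# [^/]* matches up to, but not including, the next "/", so the match
# cannot run past the first slash that follows the host name.
strip_path() {
    sed 's|\(http://[^/]*/\).*|\1|'
}

printf '%s\n' \
    'http://www.suepearson.co.uk/product/174/71/3816/' \
    'http://example.com/a/b?next=http://other.example/x' |
    strip_path
# prints:
# http://www.suepearson.co.uk/
# http://example.com/
```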

Solution 3:

With sed, I usually implement non-greedy search by matching anything except the separator, up to the separator:

echo "http://www.suon.co.uk/product/1/7/3/" | sed -n 's;\(http://[^/]*\)/.*;\1;p'

Output:

http://www.suon.co.uk

How it works:

  • -n: don't print lines by default
  • s/<pattern>/<replace>/p: search for the pattern, replace the match, and print the result
  • use ; as the delimiter of the s command instead of /, so the slashes in the URL need no escaping: s;<pattern>;<replace>;p
  • \( ... \) remembers the match between the parentheses, later accessible as \1, \2, ...
  • match http://
  • followed by a bracket expression []; [ab/] would mean either a or b or /
  • a leading ^ in [] negates it, so it matches anything but the characters listed in the []
  • so [^/] means any character except /
  • * repeats the previous item zero or more times, so [^/]* means any run of characters other than /
  • so far, sed -n 's;\(http://[^/]*\) means: search for http:// followed by any characters except /, and remember what was found
  • we want to search until the end of the domain, so stop at the next / by adding another / at the end: sed -n 's;\(http://[^/]*\)/'. But we also want to match the rest of the line after the domain, so add .*
  • now the match remembered in group 1 (\1) is the domain, so replace the matched line with the contents of \1 and print: sed -n 's;\(http://[^/]*\)/.*;\1;p'

If you want to include the slash after the domain as well, move that slash inside the remembered group:

echo "http://www.suon.co.uk/product/1/7/3/" | sed -n 's;\(http://[^/]*/\).*;\1;p'

output:

http://www.suon.co.uk/
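
The same "anything except the separator" pattern extends to https URLs. The sketch below is one way to do it, assuming a POSIX sed; the helper name domain_only and the example.org URL are invented for the illustration.

```shell
#!/bin/sh
# domain_only is a hypothetical helper name.
# s\{0,1\} makes the "s" optional in POSIX basic regex syntax, so the same
# negated-class pattern covers both http:// and https:// URLs.
domain_only() {
    sed -n 's;^\(https\{0,1\}://[^/]*\)/.*;\1;p'
}

printf '%s\n' \
    'http://www.suon.co.uk/product/1/7/3/' \
    'https://example.org/docs/index.html' |
    domain_only
# prints:
# http://www.suon.co.uk
# https://example.org
```

Because of -n and the p flag, a line with no slash after the host name (for example http://example.org) does not match and is not printed at all.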

Solution 4:

Simulating lazy (un-greedy) quantifier in sed

And all other regex flavors!

  1. Finding first occurrence of an expression:

    • POSIX ERE (using -r option)

      Regex:

        (EXPRESSION).*|.
      

      Sed:

        sed -r 's/(EXPRESSION).*|./\1/g' # Global `g` modifier should be on
      

      Example (finding first sequence of digits):

        $ sed -r 's/([0-9]+).*|./\1/g' <<< 'foo 12 bar 34'
      
        12
      

      How does it work?

      This regex relies on the alternation |. At each position the engine picks the longest match (this is required by POSIX and followed by a couple of other engines as well), which means it goes with . until a match is found for ([0-9]+).*. In engines that try the alternatives in order instead (Perl-style), the order matters too: the expression has to come first.


      Since the global flag is set, the engine keeps matching character by character until it reaches either the end of the input string or our target. As soon as the first and only capturing group on the left side of the alternation, (EXPRESSION), is matched, the rest of the line is consumed immediately by .* as well. We now hold our value in the first capturing group.
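
      The role of the g flag is easiest to see by running the same substitution with and without it. This is a small sketch; the function names are made up, and -E is used as the more portable spelling of GNU sed's -r.

```shell
#!/bin/sh
# Both function names are hypothetical and exist only for this comparison.

# Without g, only the first match is replaced: "." matches the first
# character and is replaced by the empty group \1, so just that one
# character is removed.
first_digits_no_g() {
    sed -E 's/([0-9]+).*|./\1/'
}

# With g, each leading non-digit character is removed in turn; then the
# left alternative matches from the first digit to the end of the line
# and leaves only the captured digits.
first_digits() {
    sed -E 's/([0-9]+).*|./\1/g'
}

printf '%s\n' 'foo 12 bar 34' | first_digits_no_g
# prints: oo 12 bar 34
printf '%s\n' 'foo 12 bar 34' | first_digits
# prints: 12
```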

    • POSIX BRE

      Regex:

        \(\(\(EXPRESSION\).*\)*.\)*
      

      Sed:

        sed 's/\(\(\(EXPRESSION\).*\)*.\)*/\3/'
      

      Example (finding first sequence of digits):

        $ sed 's/\(\(\([0-9]\{1,\}\).*\)*.\)*/\3/' <<< 'foo 12 bar 34'
      
        12
      

      This one is like the ERE version but with no alternation involved. At each position the engine tries to match a digit.


      If one is found, the following digits are consumed and captured, and the rest of the line is matched immediately. Otherwise, since * means zero or more, the engine skips the second capturing group \(\([0-9]\{1,\}\).*\)* and arrives at the dot . to match a single character, and the process continues.

  2. Finding first occurrence of a delimited expression:

    This approach will match the very first occurrence of a string that is delimited. We can call it a block of string.

    sed 's/\(END-DELIMITER-EXPRESSION\).*/\1/
         s/\(\(START-DELIMITER-EXPRESSION.*\)*.\)*/\1/g'
    

    Input string:

    foobar start block #1 end barfoo start block #2 end
    

    -EDE: end

    -SDE: start

    $ sed 's/\(end\).*/\1/; s/\(\(start.*\)*.\)*/\1/g'
    

    Output:

    start block #1 end
    

    The first regex \(end\).* matches and captures the first end delimiter, end, and substitutes the whole match with the captured characters, that is, with the end delimiter itself. At this stage our output is: foobar start block #1 end.
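
    This intermediate result can be checked by running the first substitution on its own. A minimal sketch, with cut_after_end as an invented helper name:

```shell
#!/bin/sh
# cut_after_end is a hypothetical helper name.
# The leftmost "end" is matched and captured; ".*" then swallows
# everything after it, and the whole match is replaced by "end".
cut_after_end() {
    sed 's/\(end\).*/\1/'
}

printf '%s\n' 'foobar start block #1 end barfoo start block #2 end' | cut_after_end
# prints: foobar start block #1 end
```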


    Then the result is passed to the second regex, \(\(start.*\)*.\)*, which works like the POSIX BRE version above. It matches a single character if the start delimiter start is not matched; otherwise it matches and captures the start delimiter and matches the rest of the characters.



Directly answering your question

Using approach #2 (delimited expression) you should select two appropriate expressions:

  • EDE: [^:/]\/

  • SDE: http:

Usage:

$ sed 's/\([^:/]\/\).*/\1/g; s/\(\(http:.*\)*.\)*/\1/' <<< 'http://www.suepearson.co.uk/product/174/71/3816/'

Output:

http://www.suepearson.co.uk/

Note: this will not work with identical delimiters.