How to extract email addresses from HTML code [closed]

I want to build a small bash script that detects email addresses in HTML code. Currently, I am not sure how to build the right regex to extract the addresses from the HTML.

I tried this regex on the output of curl:

egrep -o "\S*@.*\.\S*" 

But the match also includes all the surrounding non-alphabetic characters up to the first space.

For a little example:

</span></p><p class="footertext"><span style="color: rgb(255, 255, 255);">Email&nbsp;</span><br><a href="mailto:[email protected]" style="color: rgb(255, 255, 255);"

Now I want to auto-detect only this part: [email protected]

Does anybody have an idea?

cheers


If you just want whatever comes between "mailto: and the closing ", this will do the trick:

grep -oP '(?<="mailto:)[^"]+(?=")'

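It uses a positive lookbehind and a positive lookahead, which are supported by the Perl regex syntax (the -P flag of GNU grep). The lookarounds must match, but they are not part of the output, so -o prints only the address.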
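
A minimal sketch for trying this out. The sample HTML line and the address user@example.com are placeholders made up for illustration (the address in the question is obscured), and -P assumes GNU grep built with PCRE support:

```shell
# Hypothetical sample input; user@example.com is a placeholder address.
html='<br><a href="mailto:user@example.com" style="color: rgb(255, 255, 255);"'

# Print only the text between "mailto: and the next double quote.
result=$(printf '%s\n' "$html" | grep -oP '(?<="mailto:)[^"]+(?=")')
printf '%s\n' "$result"
```

In a real script the printf would presumably be replaced by the curl call that fetches the page.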

If you need additional validation of the address, you might want to look into expressions like the ones discussed here: https://stackoverflow.com/questions/201323/how-to-validate-an-email-address-using-a-regular-expression

UPDATE:

If you don't want to fall back to overly complex expressions, this should do the job:

grep -oP $'[^\'",<>:\\s]+@[^\'",<>:\\s]+'

You can easily add additional delimiting characters within the square brackets.
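
As a rough illustration of adding delimiters, the sketch below treats parentheses as extra delimiting characters. The sample text and the address are made-up placeholders, and the pattern is written in double quotes here only so that the example does not depend on bash's $'...' quoting; the regex that reaches grep is the same:

```shell
# Hypothetical sample text; user@example.com is a placeholder address.
text='Write to us (user@example.com) for details.'

# Pattern from above: parentheses are not delimiters, so they end up in the match.
loose=$(printf '%s\n' "$text" | grep -oP "[^'\",<>:\\s]+@[^'\",<>:\\s]+")

# Same pattern with ( and ) added inside both bracket expressions.
strict=$(printf '%s\n' "$text" | grep -oP "[^'\",<>:()\\s]+@[^'\",<>:()\\s]+")

printf '%s\n' "$loose" "$strict"
```

The first command prints the address with the surrounding parentheses, the second one prints the bare address.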

UPDATE 2:

If you also want to match something like regex @ example.com, allow optional whitespace around the @:

grep -oP $'[^\'",<>:\\s]+\\s*@\\s*[^\'",<>:\\s]+'
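
A small sketch of how the whitespace-tolerant matches might be normalized afterwards. The sample text and the sed step are assumptions added for illustration, not part of the commands above, and the pattern is again written in double quotes so the example does not rely on $'...' quoting:

```shell
# Hypothetical sample text containing one spaced and one regular address.
text='Contact: regex @ example.com or plain@example.org'

# Match addresses with optional whitespace around the @,
# then remove that whitespace so every result has the usual form.
result=$(printf '%s\n' "$text" \
  | grep -oP "[^'\",<>:\\s]+\\s*@\\s*[^'\",<>:\\s]+" \
  | sed 's/[[:space:]]*@[[:space:]]*/@/')
printf '%s\n' "$result"
```

Note that allowing whitespace also makes false positives more likely: a phrase such as "me @ noon" would match as well, so this variant probably needs some validation afterwards.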