How to parse email addresses from HTML code [closed]
I want to build a small bash script that detects email addresses in HTML code. Currently, I am not sure how to build the right regex to auto-detect emails in HTML.
I tried this regex with curl:
egrep -o "\S*@.*\.\S*"
But the match also includes all the surrounding non-alphabetic characters up to the nearest space.
A small example:
</span></p><p class="footertext"><span style="color: rgb(255, 255, 255);">Email </span><br><a href="mailto:[email protected]" style="color: rgb(255, 255, 255);"
Now I want to auto-detect only this part: [email protected]
Does somebody have any idea?
cheers
If you just want whatever comes between the "mailto: and the next ", this would do the trick:
grep -oP '(?<="mailto:)[^"]+(?=")'
It uses a positive lookbehind and a positive lookahead, which are supported by the Perl regex syntax (-P flag of GNU grep).
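As a minimal sketch of how this behaves, the snippet below runs the expression on a shortened version of the sample line. The address user@example.com and the variable names are placeholders made up for illustration, and GNU grep with -P support is assumed.

```shell
# Hypothetical sample line; user@example.com stands in for the real address.
html='<a href="mailto:user@example.com" style="color: rgb(255, 255, 255);"'

# -o prints only the matched text, -P enables Perl-compatible regexes.
# The lookbehind requires "mailto: before the match, the lookahead requires
# a closing " after it; neither is included in the output.
result=$(printf '%s\n' "$html" | grep -oP '(?<="mailto:)[^"]+(?=")')

echo "$result"   # prints: user@example.com
```

In the actual script, the printf could be replaced by the curl call that fetches the page, for example curl -s followed by the page URL.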
If you need additional validation of the address, you might want to look into expressions like the ones discussed here: https://stackoverflow.com/questions/201323/how-to-validate-an-email-address-using-a-regular-expression
UPDATE:
If you don't want to fall back to overly complex expressions, this should do the job:
grep -oP $'[^\'",<>:\\s]+@[^\'",<>:\\s]+'
You can easily add more delimiter characters inside both sets of square brackets.
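One possible way to keep the delimiter list in a single place is to build the pattern from a variable. This is only a sketch: the names delims and pattern, the sample text, and the extra characters ( ) ; are assumptions chosen for illustration, and GNU grep with -P support is assumed.

```shell
# Characters that may not appear inside an address. The first part mirrors
# the expression above; ( ) and ; are added here as an example extension.
delims="'\",<>:\\s();"

# Same structure as before: non-delimiters, an @, non-delimiters.
pattern="[^${delims}]+@[^${delims}]+"

# Hypothetical sample text with two addresses that are not in a mailto: link.
html='<p>Contact (info@example.org); or "sales@example.org"</p>'

result=$(printf '%s\n' "$html" | grep -oP "$pattern")

echo "$result"
# prints:
# info@example.org
# sales@example.org
```

Without ( ) and ; in the list, the first match would come out as (info@example.org); because those characters are not treated as delimiters by the original expression.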
UPDATE 2:
If you also want to match something like this: regex @ example.com, allow optional whitespace around the @:
grep -oP $'[^\'",<>:\\s]+\\s*@\\s*[^\'",<>:\\s]+'
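Note that with this variant the spaces around the @ are part of the match. If a compact address is needed afterwards, the output could be piped through sed, as in this rough sketch. The sample sentence and variable names are made up, and GNU grep with -P support is assumed.

```shell
# Same expression as above, written with double quotes instead of $'...'.
pattern="[^'\",<>:\\s]+\\s*@\\s*[^'\",<>:\\s]+"

# Hypothetical sample text with a spaced-out address.
text='write to regex @ example.com today'

# grep extracts "regex @ example.com"; sed then removes the whitespace
# directly around the @ on each output line.
result=$(printf '%s\n' "$text" \
  | grep -oP "$pattern" \
  | sed 's/[[:space:]]*@[[:space:]]*/@/')

echo "$result"   # prints: regex@example.com
```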