Challenging RegEx not working for fellow noob
I want to capture all URLs in a document, except those from google, bscscan, github, etc.
So far I have this regex, which only partly works:
(www|http:|https:)+[\W]+(?!bscscan|google|binance|t\.me)[\w]+
When applied to this paragraph
https://bscscan.com testing123
website: https://www.yahoo.com
another one www.bing.com is great
www.binance.org
http://bob.bscscan.com
https://twitter.google.com
https://google.twitter.com
https://t.me/rawr omg
It matches only
1) https://www
2) www.bing
3) http://bob
4) https://twitter
But I want it to match
https://yahoo.com
www.bing.com
Fixes desired
#1) Match the entire URL, not just the first part of it.
#2) Omit any URL that contains one of the negative lookahead words ANYWHERE within the link.
Solution 1:
Use
\b(?:www\.|https?:)(?!\S*\b(?:bscscan|google|binance|t\.me)\b)\S+
EXPLANATION
--------------------------------------------------------------------------------
\b                  the boundary between a word char (\w) and
                    something that is not a word char
(?:                 group, but do not capture:
  www               'www'
  \.                '.'
 |                  OR
  http              'http'
  s?                's' (optional, matching as many as possible)
  :                 ':'
)                   end of grouping
(?!                 look ahead to see if there is not:
  \S*               non-whitespace (all but \n, \r, \t, \f, and " "),
                    0 or more times, matching as many as possible
  \b                the boundary between a word char (\w) and
                    something that is not a word char
  (?:               group, but do not capture:
    bscscan         'bscscan'
   |                OR
    google          'google'
   |                OR
    binance         'binance'
   |                OR
    t               't'
    \.              '.'
    me              'me'
  )                 end of grouping
  \b                the boundary between a word char (\w) and
                    something that is not a word char
)                   end of look-ahead
\S+                 non-whitespace (all but \n, \r, \t, \f, and " "),
                    1 or more times, matching as many as possible
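To check the pattern against the paragraph from the question, here is a minimal sketch in Python. The question does not say which regex engine is in use, so Python's `re` module and the names `URL_PATTERN` and `SAMPLE` are assumptions for illustration only.

```python
import re

# Pattern from Solution 1, copied unchanged.
URL_PATTERN = re.compile(
    r"\b(?:www\.|https?:)(?!\S*\b(?:bscscan|google|binance|t\.me)\b)\S+"
)

# The sample paragraph from the question.
SAMPLE = """https://bscscan.com testing123
website: https://www.yahoo.com
another one www.bing.com is great
www.binance.org
http://bob.bscscan.com
https://twitter.google.com
https://google.twitter.com
https://t.me/rawr omg"""

# The pattern has no capturing groups, so findall returns whole matches.
matches = URL_PATTERN.findall(SAMPLE)
print(matches)  # ['https://www.yahoo.com', 'www.bing.com']
```

Because the match ends with `\S+`, punctuation glued to the end of a URL (a trailing comma or period, say) would be captured too, so you may want to strip that afterwards.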
Solution 2:
Try this one. It lists the allowed domains explicitly, and it has enough alternatives in it that you can adjust them to suit your implementation:
/(|www\.|http\:\/\/|https\:\/\/)(?!(bscscan|google|binance|t\.me|twitter|bob))(yahoo\.com|bing\.com)/g
This will match any of the following variations:
https://yahoo.com <- your required one
www.bing.com <- your required one
www.yahoo.com
https://bing.com
http://bing.com
bing.com <- remove the "|" before "www" if you don't want this one
yahoo.com <- remove the "|" before "www" if you don't want this one
If you add (https\:\/\/www\.)|(http\:\/\/www\.) as further alternatives in the first group,
then it will also match URLs that start with https://www.
and http://www.
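A rough way to try this out is sketched below in Python (an assumption; the `/.../g` literal above is JavaScript-style, and the escapes on `:` and `/` are dropped here because Python does not need them). The names `ALLOW_LIST`, `EXTENDED`, `full_matches` and `SAMPLE` are made up for the example.

```python
import re

# The sample paragraph from the question.
SAMPLE = """https://bscscan.com testing123
website: https://www.yahoo.com
another one www.bing.com is great
www.binance.org
http://bob.bscscan.com
https://twitter.google.com
https://google.twitter.com
https://t.me/rawr omg"""

# Solution 2 as written: optional prefix, lookahead, allowed domains.
ALLOW_LIST = re.compile(
    r"(|www\.|http://|https://)"
    r"(?!(bscscan|google|binance|t\.me|twitter|bob))"
    r"(yahoo\.com|bing\.com)"
)

# Same pattern with the two extra "scheme + www." prefixes added.
EXTENDED = re.compile(
    r"(|www\.|http://|https://|https://www\.|http://www\.)"
    r"(?!(bscscan|google|binance|t\.me|twitter|bob))"
    r"(yahoo\.com|bing\.com)"
)


def full_matches(pattern, text):
    """Return whole matches; group(0) is used because the pattern captures."""
    return [m.group(0) for m in pattern.finditer(text)]


variations = [
    "https://yahoo.com", "www.bing.com", "www.yahoo.com",
    "https://bing.com", "http://bing.com", "bing.com", "yahoo.com",
]
all_listed_match = all(ALLOW_LIST.fullmatch(v) for v in variations)

print(all_listed_match)                # True
print(full_matches(ALLOW_LIST, SAMPLE))  # ['www.yahoo.com', 'www.bing.com']
print(full_matches(EXTENDED, SAMPLE))    # ['https://www.yahoo.com', 'www.bing.com']
```

Note that on the sample paragraph the pattern as written picks up only www.yahoo.com out of https://www.yahoo.com; the extended prefix list is what keeps the scheme in the match.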