Regular expression to extract URL from an HTML link
I’m a newbie in Python. I’m learning regexes, but I need help here.
Here is the HTML source:
<a href="http://www.ptop.se" target="_blank">http://www.ptop.se</a>
I’m trying to code a tool that only prints out http://ptop.se. Can you help me, please?
Solution 1:
If you're only looking for one:
import re
match = re.search(r'href=[\'"]?([^\'" >]+)', s)
if match:
    print(match.group(1))
If you have a long string, and want every instance of the pattern in it:
import re
urls = re.findall(r'href=[\'"]?([^\'" >]+)', s)
print(', '.join(urls))
Here, s is the string that you're looking for matches in.
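For example, running the one-match version against the snippet from the question prints the href value:
import re
s = '<a href="http://www.ptop.se" target="_blank">http://www.ptop.se</a>'
match = re.search(r'href=[\'"]?([^\'" >]+)', s)
if match:
    print(match.group(1))  # prints: http://www.ptop.se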
Quick explanation of the regexp bits:
- r'...' is a "raw" string. It stops you having to worry about escaping characters quite as much as you normally would. (\ especially -- in a raw string a \ is just a \. In a regular string you'd have to write \\ every time, and that gets old in regexps.)
- href=[\'"]? says to match "href=", possibly followed by a ' or ". "Possibly" because it's hard to say how horrible the HTML you're looking at is, and the quotes aren't strictly required (there's a short example after this list).
- Enclosing the next bit in () says to make it a "group", which means to split it out and return it separately to us. It's just a way to say "this is the part of the pattern I'm interested in."
- [^\'" >]+ says to match any characters that aren't ', ", >, or a space. Essentially this is a list of characters that mark the end of the URL. It lets us avoid trying to write a regexp that reliably matches a full URL, which can be a bit complicated.
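As a rough illustration of those last two points (the URLs here are made up), the same pattern copes with both quoted and unquoted href attributes, because the quote is optional and the character class stops at the first quote, space, or >:
import re
pattern = r'href=[\'"]?([^\'" >]+)'
quoted = '<a href="http://example.com/a">a</a>'   # hypothetical URL, quoted attribute
unquoted = '<a href=http://example.com/b>b</a>'   # hypothetical URL, no quotes at all
print(re.findall(pattern, quoted))    # ['http://example.com/a']
print(re.findall(pattern, unquoted))  # ['http://example.com/b']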
The suggestion in another answer to use BeautifulSoup isn't bad, but it does introduce a higher level of external requirements. Plus it doesn't help you with your stated goal of learning regexps, which I'd assume this specific HTML-parsing project is just a part of.
It's pretty easy to do:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_to_parse, 'html.parser')  # html.parser ships with Python
for tag in soup.find_all('a', href=True):
    print(tag['href'])
Once you've installed BeautifulSoup (the beautifulsoup4 package on PyPI), anyway.
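Applied to the markup from the question, that looks like this (a minimal sketch, assuming the bs4 package is installed):
from bs4 import BeautifulSoup
html_to_parse = '<a href="http://www.ptop.se" target="_blank">http://www.ptop.se</a>'
soup = BeautifulSoup(html_to_parse, 'html.parser')
for tag in soup.find_all('a', href=True):
    print(tag['href'])  # http://www.ptop.se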
Solution 2:
Don't use regexes, use BeautifulSoup. That, or be so crufty as to spawn it out to, say, w3m/lynx and pull back in what w3m/lynx renders. The first is probably more elegant; the second just worked a heck of a lot faster on some unoptimized code I wrote a while back.
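If you do go the crufty route, here is a minimal sketch of the idea; it assumes lynx is installed, on your PATH, and supports the -dump and -listonly flags:
import subprocess

# Render the page with lynx and capture its list of links.
# Assumes lynx is available and accepts -dump -listonly.
result = subprocess.run(
    ['lynx', '-dump', '-listonly', 'http://www.ptop.se'],
    capture_output=True, text=True, check=True,
)
print(result.stdout)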