Extract src attribute from img tag using BeautifulSoup
Solution 1:
You can use BeautifulSoup
to extract src
attribute of an html img
tag. In my example, the htmlText
contains the img
tag itself but this can be used for a URL too along with urllib2
.
For URLs
from BeautifulSoup import BeautifulSoup as BSHTML
import urllib2
page = urllib2.urlopen('http://www.youtube.com/')
soup = BSHTML(page)
images = soup.findAll('img')
for image in images:
#print image source
print image['src']
#print alternate text
print image['alt']
For Texts with img tag
from BeautifulSoup import BeautifulSoup as BSHTML
htmlText = """<img src="https://src1.com/" <img src="https://src2.com/" /> """
soup = BSHTML(htmlText)
images = soup.findAll('img')
for image in images:
print image['src']
Python 3 : Updated on 2022-02-02
from bs4 import BeautifulSoup as BSHTML
import urllib
page = urllib.request.urlopen('https://github.com/abushoeb/emotag')
soup = BSHTML(page)
images = soup.findAll('img')
for image in images:
#print image source
print(image['src'])
#print alternate text
print(image['alt'])
Install modules if needed
# python 3
pip install beautifulsoup4
pip install urllib3
Solution 2:
A link doesn't have attribute src
you have to target actual img
tag.
import bs4
html = """<div class="someClass">
<a href="href">
<img alt="some" src="some"/>
</a>
</div>"""
soup = bs4.BeautifulSoup(html, "html.parser")
# this will return src attrib from img tag that is inside 'a' tag
soup.a.img['src']
>>> 'some'
# if you have more then one 'a' tag
for a in soup.find_all('a'):
if a.img:
print(a.img['src'])
>>> 'some'
Solution 3:
here is a solution that will not trigger a KeyError in case the img tag does not have a src attribute:
from urllib.request import urlopen
from bs4 import BeautifulSoup
site = "[insert name of the site]"
html = urlopen(site)
bs = BeautifulSoup(html, 'html.parser')
images = bs.find_all('img')
for img in images:
if img.has_attr('src'):
print(img['src'])
Solution 4:
You can use BeautifulSoup to extract src attribute of an html img tag. In my example, the htmlText contains the img tag itself but this can be used for a URL too along with urllib2.
The solution provided by the most rated answer is not working any more with python3. This is the correct implementation:
For URLs
from bs4 import BeautifulSoup as BSHTML
import urllib3
http = urllib3.PoolManager()
url = 'your_url'
response = http.request('GET', url)
soup = BSHTML(response.data, "html.parser")
images = soup.findAll('img')
for image in images:
#print image source
print(image['src'])
#print alternate text
print(image['alt'])
For Texts with img tag
from bs4 import BeautifulSoup as BSHTML
htmlText = """<img src="https://src1.com/" <img src="https://src2.com/" /> """
soup = BSHTML(htmlText)
images = soup.findAll('img')
for image in images:
print(image['src'])