How to parse a data URI in Python?

HTML image elements have this simplified format:

<img src='something'>

That something can be a data URI, for example:

...

Is there a standard way of parsing this with python, so that I get content_type and base64 data separated, or should I create my own parser for this?


Split the data URI on the first comma to separate the header from the base64-encoded data. Call base64.b64decode to decode that data to bytes. Finally, write the bytes to a file.

from base64 import b64decode

data_uri = "..."

# Python 2 and Python < 3.4 (also works on newer versions)
header, encoded = data_uri.split(",", 1)
data = b64decode(encoded)

# Python 3.4+
# from urllib import request
# with request.urlopen(data_uri) as response:
#     data = response.read()

with open("image.png", "wb") as f:
    f.write(data)
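
The header half of that split carries the content type the question asks about. A minimal sketch of pulling it apart with the standard library only, assuming Python 3; split_data_uri is a hypothetical helper name, not a standard function, and it does not aim to cover every corner of RFC 2397:

```python
from base64 import b64decode
from urllib.parse import unquote_to_bytes


def split_data_uri(data_uri):
    """Return (media_type, params, data_bytes) for a data URI.

    Hypothetical helper for illustration only.
    """
    if data_uri[:5].lower() != "data:":
        raise ValueError("not a data URI")
    # Everything before the first comma is the header, the rest is the payload
    header, encoded = data_uri[5:].split(",", 1)
    parts = header.split(";")
    is_base64 = parts[-1].strip().lower() == "base64"
    if is_base64:
        parts = parts[:-1]
    # RFC 2397: an omitted media type defaults to text/plain
    media_type = parts[0].strip().lower() or "text/plain"
    # Remaining parts are parameters such as charset=utf-8
    params = dict(p.split("=", 1) for p in parts[1:] if "=" in p)
    if is_base64:
        data = b64decode(encoded)
    else:
        # Non-base64 payloads are percent-encoded
        data = unquote_to_bytes(encoded)
    return media_type, params, data


media_type, params, data = split_data_uri("data:text/plain;charset=utf-8,Hello%20World")
print(media_type, params, data)
```

The sample URI in the last lines is made up for the demonstration; substitute the real src value.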

Since Python 3.4, urllib.request supports data URIs directly (under the hood it uses urllib.request.DataHandler), so urlopen can read them like any other URL.

from urllib.request import urlopen

with urlopen(data_uri) as response:
    data = response.read()
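
The same response object should also expose the content type through its headers, which covers the other half of the question. A small sketch, assuming Python 3.4+ and a made-up sample URI (the text "Hello" encoded as base64):

```python
from urllib.request import urlopen

# Hypothetical sample input: the text "Hello" as a base64 data URI
data_uri = "data:text/plain;base64,SGVsbG8="

with urlopen(data_uri) as response:
    # The headers are built from the data URI's own header part
    content_type = response.headers.get_content_type()
    data = response.read()

print(content_type, data)
```

No network access is involved here, because the payload is taken from the URI itself.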

w3lib (a library used by Scrapy) has a function to parse data URIs:

>>> from w3lib.url import parse_data_uri
>>> parse_data_uri('...')
ParseDataURIResult(media_type='image/png', media_type_parameters={}, data=b'\x89PNG\r\n\x1a')

This may help:

import re
from base64 import b64decode

from lxml import html

BASE_NAME = "image_"

source_code = """<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAUA
AAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO
9TXL0Y4OHwAAAABJRU5ErkJggg==" alt="Red dot" />
<img src="data:image/gif;base64,R0lGODlhAQABAIAAAAUEBAAAACwAAAAAAQABAAACAkQBADs=" alt="Black dot" />"""

tree = html.fromstring(source_code)

# Select the src attribute of every <img> whose source is an image data URI
for i, image in enumerate(tree.xpath('//img[contains(@src, "data:image")]/@src')):
    image_type, image_content = image.split(',', 1)
    image_type = re.findall(r'data:image/(\w+);base64', image_type)[0]
    with open("{}{}.{}".format(BASE_NAME, i, image_type), "wb") as f:
        # b64decode replaces the Python 2-only str.decode('base64')
        f.write(b64decode(image_content))
    print("[*] '{}' image found with content: {}\n".format(image_type, image_content))

Output:

[*] 'png' image found with content: iVBORw0KGgoAAAANSUhEUgAAAAUA
AAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO
9TXL0Y4OHwAAAABJRU5ErkJggg==

[*] 'gif' image found with content: R0lGODlhAQABAIAAAAUEBAAAACwAAAAAAQABAAACAkQBADs=

It saves every base64-encoded image found in <img> tags, using the image type as the file extension.

Each file name is BASE_NAME + the index provided by enumerate + the image extension, so the example above writes image_0.png and image_1.gif.
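
If lxml is not available, roughly the same extraction can be sketched with the standard library's html.parser. This is an alternative technique, not what the code above does; DataImageCollector is a hypothetical class name, the sketch assumes Python 3, and it only collects the decoded images in memory instead of writing files:

```python
from base64 import b64decode
from html.parser import HTMLParser


class DataImageCollector(HTMLParser):
    """Collect (extension, bytes) pairs from <img> tags with base64 data URIs."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src") or ""
        if not src.startswith("data:image/"):
            return
        header, _, encoded = src.partition(",")
        if not header.endswith(";base64"):
            return  # this sketch skips percent-encoded payloads
        # "data:image/gif;base64" -> "gif"
        extension = header[len("data:image/"):].split(";")[0]
        self.images.append((extension, b64decode(encoded)))


collector = DataImageCollector()
collector.feed('<img src="data:image/gif;base64,'
               'R0lGODlhAQABAIAAAAUEBAAAACwAAAAAAQABAAACAkQBADs=" alt="Black dot" />')
for i, (extension, content) in enumerate(collector.images):
    print("image_{}.{}: {} bytes".format(i, extension, len(content)))
```

Each collected pair can then be written to disk with the same naming scheme as above.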



Correcting JRodDynamite's post:

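from base64 import b64decode

png_arr = "..."
# Keep only the base64 payload after the first comma
png_arr = png_arr.split(",", 1)[1]

# base64.decodestring was removed in Python 3.9; b64decode works on Python 2 and 3
with open("imageToSave.png", "wb") as fh:
    fh.write(b64decode(png_arr))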