Parsing a compressed XML feed into ElementTree
You can pass the value returned by urlopen() directly to GzipFile(), and in turn you can pass it to ElementTree methods such as iterparse():

    #!/usr/bin/env python3
    import xml.etree.ElementTree as etree
    from gzip import GzipFile
    from urllib.request import urlopen, Request

    # Request a gzip-encoded response and decompress it while reading
    with urlopen(Request("http://smarkets.s3.amazonaws.com/oddsfeed.xml",
                         headers={"Accept-Encoding": "gzip"})) as response, \
         GzipFile(fileobj=response) as xml_file:
        for elem in getelements(xml_file, 'interesting_tag'):
            process(elem)  # process() stands for your own handler

where getelements() lets you parse files that do not fit in memory:

    def getelements(filename_or_file, tag):
        """Yield *tag* elements from *filename_or_file* xml incrementally."""
        context = iter(etree.iterparse(filename_or_file, events=('start', 'end')))
        _, root = next(context)  # get root element
        for event, elem in context:
            if event == 'end' and elem.tag == tag:
                yield elem
                root.clear()  # free memory

To save memory, the tree built so far is cleared each time a matching tag element has been processed.
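
For trying this out without a network connection, here is a minimal sketch that feeds getelements() from an in-memory gzip stream. The sample XML, the 'event' tag and the variable names are made up for illustration; only getelements() comes from the answer above.

```python
import gzip
import io
import xml.etree.ElementTree as etree


def getelements(filename_or_file, tag):
    """Yield *tag* elements from *filename_or_file* xml incrementally."""
    context = iter(etree.iterparse(filename_or_file, events=('start', 'end')))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == 'end' and elem.tag == tag:
            yield elem
            root.clear()  # drop children parsed so far to free memory


# Hypothetical sample feed, compressed in memory to stand in for the HTTP response
sample_xml = b"<feed><event id='1'>a</event><other/><event id='2'>b</event></feed>"
compressed = io.BytesIO(gzip.compress(sample_xml))

with gzip.GzipFile(fileobj=compressed) as xml_file:
    # Read what you need from each element before asking for the next one
    ids = [elem.get('id') for elem in getelements(xml_file, 'event')]

print(ids)  # prints ['1', '2']
```

Note that each element is used before the generator resumes; after root.clear() runs, the element is no longer attached to the tree.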
The ET.parse function takes "a filename or file object containing XML data". You're giving it a string full of XML. It's going to try to open a file whose name is that big chunk of XML. There is probably no such file.
You want the fromstring function, or the XML constructor.
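
A small sketch of the difference, using a made-up XML string: parse() treats the string as a file name and fails when trying to open it, while fromstring() and XML() parse the text itself. The exact exception subclass may vary by platform, so the sketch only catches OSError.

```python
import xml.etree.ElementTree as ET

xml_text = "<data><country name='Liechtenstein'/></data>"  # hypothetical sample

error = None
try:
    ET.parse(xml_text)  # the string is interpreted as a path to open
except OSError as exc:
    error = exc
    print("parse() failed:", type(exc).__name__)

root = ET.fromstring(xml_text)  # parses the string contents
same_root = ET.XML(xml_text)    # XML() does the same job
print(root.tag, same_root.tag)
```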
Or, if you prefer, you've already got a file object, gzipper; you could just pass that to parse instead of reading it into a string.
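
As a rough sketch of that second option, assuming gzipper is a GzipFile wrapped around the compressed data (here an in-memory buffer with made-up content instead of a real response):

```python
import gzip
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the gzip-compressed response body
raw = io.BytesIO(gzip.compress(b"<data><country name='Panama'/></data>"))

with gzip.GzipFile(fileobj=raw) as gzipper:
    tree = ET.parse(gzipper)  # parse() reads from the file object directly

root = tree.getroot()
names = [country.get('name') for country in root.iter('country')]
print(names)  # prints ['Panama']
```

This avoids holding the whole decompressed document in a separate string before parsing.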
This is all covered by the short Tutorial in the docs:
We can import this data by reading from a file:

    import xml.etree.ElementTree as ET
    tree = ET.parse('country_data.xml')
    root = tree.getroot()

Or directly from a string:

    root = ET.fromstring(country_data_as_string)