Parsing XML file gets UnicodeEncodeError (ElementTree) / ValueError (lxml)
I send a GET request to the CareerBuilder API:
import requests
url = "http://api.careerbuilder.com/v1/jobsearch"
payload = {'DeveloperKey': 'MY_DEVLOPER_KEY',
'JobTitle': 'Biologist'}
r = requests.get(url, params=payload)
xml = r.text
I get back an XML document, but I have trouble parsing it.
Using either lxml:
>>> from lxml import etree
>>> print etree.fromstring(xml)
Traceback (most recent call last):
File "<pyshell#4>", line 1, in <module>
print etree.fromstring(xml)
File "lxml.etree.pyx", line 2992, in lxml.etree.fromstring (src\lxml\lxml.etree.c:62311)
File "parser.pxi", line 1585, in lxml.etree._parseMemoryDocument (src\lxml\lxml.etree.c:91625)
ValueError: Unicode strings with encoding declaration are not supported.
or ElementTree:
Traceback (most recent call last):
File "<pyshell#3>", line 1, in <module>
print ET.fromstring(xml)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1301, in XML
parser.feed(text)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1641, in feed
self._parser.Parse(data, 0)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 3717: ordinal not in range(128)
So, even though the XML file starts with
<?xml version="1.0" encoding="UTF-8"?>
I have the impression that it contains characters that are not allowed. How do I parse this file with either lxml or ElementTree?
Solution 1:
You are using the decoded Unicode value. Use the r.raw raw response data instead:
r = requests.get(url, params=payload, stream=True)
r.raw.decode_content = True
etree.parse(r.raw)
This reads the data directly from the response; note the stream=True option to .get().
Setting the r.raw.decode_content = True flag ensures that the raw socket gives you the decompressed content even if the response is gzip or deflate compressed.
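As a minimal sketch of why passing r.raw to parse() works: parse() accepts any file-like object that yields bytes. The example below uses io.BytesIO as a stand-in for r.raw and the standard-library ElementTree instead of lxml, so it runs without a network call; the sample document is made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

# io.BytesIO stands in for r.raw here (assumption): both are file-like
# objects that yield bytes. The XML content is a made-up sample.
fake_raw = io.BytesIO(
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<Results><Count>2</Count></Results>'
)

# parse() reads the bytes from the file-like object and decodes them
# according to the encoding declaration in the document itself.
tree = ET.parse(fake_raw)
root_tag = tree.getroot().tag
count = tree.getroot().findtext('Count')
```

lxml's etree.parse() takes a file-like object in the same way, which is what the r.raw snippet above relies on.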
You don't have to stream the response; for smaller XML documents it is fine to use the response.content attribute, which is the undecoded response body:
r = requests.get(url, params=payload)
xml = etree.fromstring(r.content)
XML parsers always expect bytes as input, because the XML format itself dictates how the parser is to decode those bytes to Unicode text.
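A small sketch of parsing bytes directly, assuming a made-up document that contains the same non-breaking space (U+00A0) that tripped up ElementTree in the question. It uses the standard-library parser so it runs on its own; the byte string stands in for r.content.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample standing in for r.content (bytes, not text).
# The non-breaking space U+00A0 is encoded in UTF-8 as b"\xc2\xa0".
xml_bytes = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<Results><JobTitle>Research\xc2\xa0Biologist</JobTitle></Results>'
)

# The parser reads the encoding declaration and decodes the bytes itself.
root = ET.fromstring(xml_bytes)
title = root.find('JobTitle').text
```

The parsed text comes back as a Unicode string with the non-breaking space intact, so no manual decoding step is needed before parsing.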
Solution 2:
Correction!
See below how I got it all wrong. Basically, when we use the .text attribute, the result is a decoded Unicode string. Passing it to lxml raises the following exception:
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
This means that @martijn-pieters was right: we must use the raw response bytes as returned by .content
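If only the decoded text is at hand, one possible workaround (not part of the answers above, so treat it as an assumption) is to encode it back to the encoding named in the XML declaration before parsing. The sketch below uses a made-up string in place of r.text and the standard-library parser so that it is self-contained.

```python
import xml.etree.ElementTree as ET

# Hypothetical decoded text, standing in for r.text.
xml_text = (
    u'<?xml version="1.0" encoding="UTF-8"?>'
    u'<Results><Company>Acme\xa0Labs</Company></Results>'
)

# Re-encode to the encoding named in the declaration (UTF-8 here),
# so the parser receives bytes rather than a Unicode string.
xml_bytes = xml_text.encode('utf-8')
root = ET.fromstring(xml_bytes)
company = root.findtext('Company')
```

Using .content directly avoids this decode-then-encode round trip, which is why it remains the simpler choice.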
Incorrect answer (but might be interesting to someone)
For whoever is interested: I believe the reason this error occurs is probably an incorrect encoding guess made by requests, as explained in the Response.text documentation:
Content of the response, in unicode.
If Response.encoding is None, encoding will be guessed using chardet.
The encoding of the response content is determined based solely on HTTP headers, following RFC 2616 to the letter. If you can take advantage of non-HTTP knowledge to make a better guess at the encoding, you should set r.encoding appropriately before accessing this property.
So, following this, one could also make sure that requests' r.text decodes the response content correctly by explicitly setting the encoding with r.encoding = 'UTF-8'.
This approach adds another check that the received response is indeed in the correct encoding before it is parsed with lxml.