How to use Python XML findall to find '<v:imagedata r:id="rId7" o:title="1-REN"/>'
With ElementTree in Python 3.8 or later, you can simply use a wildcard ({*}) for the namespace:
results = ET.fromstring(xml).findall(".//{*}imagedata")
Note the .// part, which means that the whole document (all descendants) is searched.
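As a rough, self-contained sketch of the wildcard search: the sample document below is made up for illustration (only the imagedata element comes from the question), and the o: and r: namespace URIs are assumed to be the usual Office ones.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the wrapper elements are invented for this sketch.
xml = """
<root xmlns:v="urn:schemas-microsoft-com:vml"
      xmlns:o="urn:schemas-microsoft-com:office:office"
      xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">
  <pict>
    <v:shape>
      <v:imagedata r:id="rId7" o:title="1-REN"/>
    </v:shape>
  </pict>
</root>
"""

root = ET.fromstring(xml)

# "{*}" matches any namespace (Python 3.8+); ".//" searches all descendants.
results = root.findall(".//{*}imagedata")

# Without ".//", only direct children of root are considered.
direct_only = root.findall("{*}imagedata")

for el in results:
    print(el.tag)
```

The element's tag comes back in the fully qualified "{uri}local" form, which is how ElementTree stores namespaced names internally.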
ET.findall() vs BS4.find_all():
- ElementTree's findall() is not recursive by default*. It only finds direct children of the node provided, so in your case it only searches for imagedata nodes directly under the root element.
  * As per mzjn's comment below, prefixing the match argument (tag or path) with ".//" will search for that node anywhere in the tree, since findall() supports a subset of XPath.
- BeautifulSoup's find_all() searches all descendants, so it searches for 'imagedata' nodes anywhere in the tree.
- However, ElementTree.iter() does search all descendants. Using the 'working with namespaces' example in the docs:
    >>> for char in root.iter('{http://characters.example.com}character'):
    ...     print(' |-->', char.text)
    ...
     |--> Lancelot
     |--> Archie Leach
     |--> Sir Robin
     |--> Gunther
     |--> Commander Clement
- Sadly, ET.iterfind(), which works with namespaces as a dict (like ET.findall), also does not search descendants, only direct children, by default*. Just like ET.findall. Apart from how empty strings '' in the tags are treated wrt the namespace, and that one returns a list while the other returns an iterator, I can't say there's a meaningful difference between ET.findall and ET.iterfind.
  * As above for ET.findall(), prefixing ".//" makes it search the entire tree (matches with any node).
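The differences in the list above can be checked with a small sketch. The document and the id attributes here are hypothetical, chosen so that one imagedata element sits directly under the root and another is nested deeper.

```python
import xml.etree.ElementTree as ET

ns = {"v": "urn:schemas-microsoft-com:vml"}

# Hypothetical document: one imagedata directly under the root, one nested deeper.
xml = """
<root xmlns:v="urn:schemas-microsoft-com:vml">
  <v:imagedata id="top"/>
  <v:shape>
    <v:imagedata id="nested"/>
  </v:shape>
</root>
"""
root = ET.fromstring(xml)


def ids(elements):
    """Return the id attribute of each element, in the order given."""
    return [el.get("id") for el in elements]


# Plain tag: direct children of root only.
children_only = ids(root.findall("v:imagedata", ns))

# ".//" prefix: every descendant, for both findall() and iterfind().
whole_tree = ids(root.findall(".//v:imagedata", ns))
lazy_whole_tree = ids(root.iterfind(".//v:imagedata", ns))

# iter() always walks all descendants, but needs the full "{uri}local" tag.
via_iter = ids(root.iter("{urn:schemas-microsoft-com:vml}imagedata"))

print(children_only, whole_tree, lazy_whole_tree, via_iter)
```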
When you pass a namespace mapping to ET, you still need to put the namespace prefix on the tag. The results line should be:
namespace = {'v': "urn:schemas-microsoft-com:vml"}
results = ET.fromstring(xml).findall("v:imagedata", namespace) # note the 'v:'
Also, the 'v' doesn't need to be a 'v'; you could change it to something more meaningful if needed:
namespace = {'image': "urn:schemas-microsoft-com:vml"}
results = ET.fromstring(xml).findall("image:imagedata", namespace)
Of course, this still won't necessarily get you all the imagedata elements if they aren't direct children of the root. One way around that is a recursive function that walks the tree; see this answer on SO for how. Note that while that answer does a recursive search, you are likely to hit Python's recursion limit if the descendants are nested too deeply.
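If you do want to walk the tree by hand, an explicit stack sidesteps the recursion limit. This is only a sketch, not the approach from the linked answer; find_descendants and the artificially deep test tree are hypothetical names and data made up for illustration.

```python
import xml.etree.ElementTree as ET

VML_IMAGEDATA = "{urn:schemas-microsoft-com:vml}imagedata"


def find_descendants(root, tag):
    """Collect descendants of root whose tag equals tag, in document order.

    Uses an explicit stack instead of recursion, so tree depth is not
    bounded by Python's recursion limit.
    """
    found = []
    # Push children in reverse so they are popped in document order.
    stack = list(reversed(list(root)))
    while stack:
        node = stack.pop()
        if node.tag == tag:
            found.append(node)
        stack.extend(reversed(list(node)))
    return found


# Build a tree nested deeper than the default recursion limit of 1000.
deep_root = ET.Element("root")
current = deep_root
for _ in range(2000):
    current = ET.SubElement(current, "level")
ET.SubElement(current, VML_IMAGEDATA)

matches = find_descendants(deep_root, VML_IMAGEDATA)
print(len(matches))
```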
To get all the imagedata elements anywhere in the tree, use the ".//" prefix:
results = ET.fromstring(xml).findall(".//v:imagedata", namespace)
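Putting the pieces together, here is a minimal end-to-end sketch. The surrounding document is hypothetical, and the o: and r: URIs are assumed to be the standard Office namespaces; adjust them to whatever your file actually declares.

```python
import xml.etree.ElementTree as ET

# Prefixes are local to this dict; only the URIs have to match the document.
namespace = {
    "v": "urn:schemas-microsoft-com:vml",
    "o": "urn:schemas-microsoft-com:office:office",  # assumed URI
    "r": "http://schemas.openxmlformats.org/officeDocument/2006/relationships",  # assumed URI
}

# Hypothetical sample document wrapping the element from the question.
xml = """
<root xmlns:v="urn:schemas-microsoft-com:vml"
      xmlns:o="urn:schemas-microsoft-com:office:office"
      xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">
  <pict>
    <v:shape>
      <v:imagedata r:id="rId7" o:title="1-REN"/>
    </v:shape>
  </pict>
</root>
"""

root = ET.fromstring(xml)
results = root.findall(".//v:imagedata", namespace)

# A differently named prefix bound to the same URI finds the same elements.
renamed = root.findall(".//image:imagedata", {"image": namespace["v"]})

# Attribute names are stored in "{uri}local" form, so build the keys from the mapping.
rid_key = "{%s}id" % namespace["r"]
title_key = "{%s}title" % namespace["o"]
pairs = [(el.get(rid_key), el.get(title_key)) for el in results]

print(pairs)
```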
I'm going to leave the question open, but the workaround I'm currently using is BeautifulSoup, which happily accepts the v: syntax.
soup = BeautifulSoup(xml, "lxml")
results = soup.find_all("v:imagedata")