How to remove elements from XML using Python
Solution 1:
Using lxml:
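import lxml.etree as le

with open('doc.xml', 'r') as f:
    doc = le.parse(f)
    # xpath() returns a list, so removing elements inside the loop is safe
    for elem in doc.xpath('//*[attribute::lang]'):
        if elem.attrib['lang'] == 'en':
            # keep the element, drop only the attribute
            elem.attrib.pop('lang')
        else:
            # remove the element together with everything it contains
            parent = elem.getparent()
            parent.remove(elem)
    print(le.tostring(doc))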
yields
<root>
    <elm>Common content</elm>
    <elm>
        <elm>Content EN</elm>
    </elm>
    <elm>Common content</elm>
    <elm>Content EN</elm>
    <elm>
        <elm>Content EN</elm>
        <elm>Content EN</elm>
    </elm>
</root>
Solution 2:
I'm not sure how best to remove the lang attribute, but here's some code that does the other changes (Python 2.7; for 2.5 or 2.6, use getiterator instead of iter), assuming that when you remove an element you also always want to remove everything contained in that element.
This code just prints the result to standard output (you could redirect it as you wish, of course, or directly write it to some new file, and so on):
import sys
from xml.etree import cElementTree as et

def picklang(path, lang='en'):
    tr = et.parse(path)
    for element in tr.iter():
        for subelement in element:
            la = subelement.get('lang')
            if la is not None and la != lang:
                element.remove(subelement)
    return tr

if __name__ == '__main__':
    tr = picklang('la.xml')
    tr.write(sys.stdout)
    print
With la.xml being your example, this writes
<root>
    <elm>Common content</elm>
    <elm>
        <elm lang="en">Content EN</elm>
    </elm>
    <elm>Common content</elm>
    <elm lang="en">Content EN</elm>
    <elm lang="en">
        <elm>Content EN</elm>
        <elm>Content EN</elm>
    </elm>
</root>
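On the open point above about removing the lang attribute: an element's attrib is a dict-like mapping, so the attribute can be popped much as in the lxml solution. A minimal sketch, assuming Python 3 and the standard xml.etree.ElementTree module; SAMPLE and strip_lang are hypothetical names made up for illustration, and the input is not the original la.xml:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample input (an assumption, not the original la.xml).
SAMPLE = '<root><elm lang="en">Content EN</elm><elm>Common content</elm></root>'

def strip_lang(root, lang='en'):
    """Drop the lang attribute from every element whose lang equals `lang`."""
    for element in root.iter():
        if element.get('lang') == lang:
            # Only the attribute mapping changes, not the tree structure,
            # so mutating it while iterating over the tree is safe.
            element.attrib.pop('lang')
    return root

root = strip_lang(ET.fromstring(SAMPLE))
result = ET.tostring(root, encoding='unicode')
print(result)
```

Running this after the removal pass would leave the kept elements without their lang attribute, matching the output of Solution 1.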
Solution 3:
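This updates @Alex Martelli's code to fix a bug: it removes children from an element while iterating over that same element. The solution above gives the wrong answer when the input is a little more complex, for example when two consecutive children both have to be removed.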
import sys
from xml.etree import cElementTree as et

def picklang(path, lang='en'):
    tr = et.parse(path)
    for element in tr.iter():
        for subelement in element[:]:
            la = subelement.get('lang')
            if la is not None and la != lang:
                element.remove(subelement)
    return tr

if __name__ == '__main__':
    tr = picklang('la.xml')
    tr.write(sys.stdout)
    print
The inner loop, for subelement in element:, is changed to for subelement in element[:]:, because it is incorrect to modify a list in place while iterating over it.
The new code iterates over a copy of the element's child list and removes from the original element every child whose lang is set and is not "en".
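To see the difference, here is a small sketch that runs both loop variants on the same input. It assumes Python 3 and the standard xml.etree.ElementTree module; SAMPLE and remaining_langs are hypothetical names used only for this illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical input: two consecutive non-English children, then an English one.
SAMPLE = (
    '<root>'
    '<elm lang="de">DE</elm>'
    '<elm lang="fr">FR</elm>'
    '<elm lang="en">EN</elm>'
    '</root>'
)

def remaining_langs(copy_children):
    """Remove non-'en' children of the root and return the langs that are left."""
    root = ET.fromstring(SAMPLE)
    # Either iterate over a snapshot of the children or over the live element.
    children = root[:] if copy_children else root
    for child in children:
        if child.get('lang') != 'en':
            root.remove(child)
    return [child.get('lang') for child in root]

buggy = remaining_langs(copy_children=False)  # iterates the live element
fixed = remaining_langs(copy_children=True)   # iterates a copy, as in Solution 3
print(buggy, fixed)
```

With the live element, removing the first child shifts the second one into the slot the loop has already visited, so the "fr" element is never examined and survives. Iterating over the copy visits every original child.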