Get all text inside a tag in lxml

I'd like to write a code snippet that would grab all of the text inside the <content> tag, in lxml, in all three instances below, including the code tags. I've tried tostring(getchildren()) but that would miss the text in between the tags. I didn't have very much luck searching the API for a relevant function. Could you help me out?

<div>Text inside tag</div>
#should return "<div>Text inside tag</div>"

Text with no tag
#should return "Text with no tag"

Text outside tag <div>Text inside tag</div>
#should return "Text outside tag <div>Text inside tag</div>"

Solution 1:

Does text_content() do what you need?
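
For reference, here is a minimal sketch of what text_content() gives you. It assumes the markup is parsed with lxml.html (the method is not available on plain lxml.etree elements), and the sample fragment is made up for illustration. Note that it returns the text only, so it will not keep the inner tags the question asks for.

```python
from lxml import html

# Made-up sample markup, used only for illustration.
fragment = html.fromstring(
    '<div id="content">Text outside tag <span>Text inside tag</span></div>'
)

# text_content() concatenates all text in the subtree and strips every tag.
text = fragment.text_content()
print(text)
```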

Solution 2:

Just use the node.itertext() method, as in ''.join(node.itertext()). Note that this yields only the text, so the tags themselves are dropped.
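
A small sketch of the itertext() approach, assuming the third case from the question is wrapped in a <content> element. It shows that the text on both sides of the inner tag is kept while the <div> markup itself is lost.

```python
from lxml import etree

# Assumed input: the question's third example wrapped in <content>.
node = etree.fromstring(
    "<content>Text outside tag <div>Text inside tag</div></content>"
)

# itertext() walks the subtree in document order and yields each text
# and tail string; joining them gives the plain text without any tags.
text_only = "".join(node.itertext())
print(text_only)
```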


Solution 3:


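def stringify_children(node):
    from lxml.etree import tostring
    from itertools import chain
    # tostring(c) already contains c.text; with_tail=False keeps c.tail from
    # being emitted twice, and encoding="unicode" returns str instead of bytes.
    parts = ([node.text] +
             list(chain(*([tostring(c, with_tail=False, encoding="unicode"), c.tail]
                          for c in node))))
    # node.tail is not included: it is text after the node, not inside it.
    # filter removes possible Nones in texts and tails
    return ''.join(filter(None, parts))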


from lxml import etree
node = etree.fromstring("""<content>
Text outside tag <div>Text <em>inside</em> tag</div>
</content>""")
stringify_children(node)

Produces: '\nText outside tag <div>Text <em>inside</em> tag</div>\n'

Solution 4:

A version of albertov's stringify_children that fixes the bugs reported by hoju:

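def stringify_children(node):
    from lxml.etree import tostring
    from itertools import chain
    # node.tail is left out on purpose: it is text after the node, not inside it.
    return ''.join(
        chunk for chunk in chain(
            (node.text,),  # text before the first child
            # encoding="unicode" makes tostring return str rather than bytes
            chain(*((tostring(child, with_tail=False, encoding="unicode"), child.tail)
                    for child in node)),
        ) if chunk)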