lxml etree HTML parser changes order of nodes (<center> inside <p>)

I'm running into etree behaviour that I can't explain. The following code demonstrates it. I want to parse an HTML string as shown below, change an attribute of an element, and print the HTML again when done.

from lxml import etree
from io import StringIO, BytesIO

string = "<p><center><code>git clone https://github.com/AlexeyAB/darknet.git</code></center></p>"
parser = etree.HTMLParser()
test = etree.fromstring(string, parser)
print(etree.tostring(test, pretty_print=True, method="html"))

I get this output:

<html><body>
<p></p>
<center><code>git clone https://github.com/AlexeyAB/darknet.git</code></center>
</body></html>

As you can see (let's ignore the <html> and <body> tags etree adds), the order of the nodes has been changed! The <p> tag that used to wrap the <center> tag now loses its content, and that content gets added after the </p> tag closes. Eh?

When I omit the <center> tag, all of a sudden the parsing is done right:

from lxml import etree
from io import StringIO, BytesIO

string = "<p><code>git clone https://github.com/AlexeyAB/darknet.git</code></p>"
parser = etree.HTMLParser()
test = etree.fromstring(string, parser)
print(etree.tostring(test, pretty_print=True, method="html"))

With correct output:

<html><body><p><code>git clone https://github.com/AlexeyAB/darknet.git</code></p></body></html>

Am I doing something wrong here? I have to use the HTML parser because I get a lot of parsing errors when not using it. I also can't change the order of the <p> and <center> tags, as I read them this way.


<center> is a block level element.

<p> cannot legally contain block level elements.

Therefore the parser closes the <p> when it encounters <center>.
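
To see this in the parsed tree rather than in the serialized output, one can list the direct children of <body>. This is a minimal sketch that reuses the string from the question; the variable names (html, root, body, children) are illustrative only.

```python
from lxml import etree

html = "<p><center><code>git clone https://github.com/AlexeyAB/darknet.git</code></center></p>"
root = etree.fromstring(html, etree.HTMLParser())

# The HTML parser wraps the fragment in <html><body>.
body = root.find("body")

# Record each direct child of <body> together with its number of child elements.
children = [(child.tag, len(child)) for child in body]
print(children)
```

Going by the output in the question, <p> and <center> should show up as siblings under <body>, with <p> empty and <center> holding the <code> element.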

Use valid HTML - or an XML parser, which does not care about HTML rules (but in exchange can't deal with some HTML specifics, such as most named entities like &nbsp;, or void tags like <br> that are left unclosed).
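
The trade-off can be sketched with an XML parser. The example below uses the standard library's xml.etree.ElementTree as a stand-in so that it runs without extra dependencies; this is an assumption on my part, and lxml's own XML mode is expected to behave similarly for these inputs. The helper parses_as_xml is hypothetical and exists only for this illustration.

```python
import xml.etree.ElementTree as ET

nested = "<p><center><code>git clone https://github.com/AlexeyAB/darknet.git</code></center></p>"
root = ET.fromstring(nested)

# An XML parser applies no HTML content rules, so <center> stays inside <p>.
nesting = [el.tag for el in root.iter()]
print(nesting)


def parses_as_xml(markup):
    """Return True if markup is well-formed XML, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# HTML specifics that an XML parser rejects or accepts.
entity_ok = parses_as_xml("<p>a&nbsp;b</p>")      # HTML named entity, undefined in XML
unclosed_ok = parses_as_xml("<p>line<br></p>")    # <br> never closed
closed_ok = parses_as_xml("<p>line<br/></p>")     # <br/> explicitly closed
print(entity_ok, unclosed_ok, closed_ok)
```

The nesting is kept as written, but the &nbsp; entity and the unclosed <br> are rejected as malformed, which is the price of leaving the HTML parser behind.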

Centering content has been done with CSS for ages now, so there is no reason to use <center> anymore (in fact, it's deprecated). But it still works, and if you insist on using it, switch the nesting.

<center><p><code>git clone https://github.com/AlexeyAB/darknet.git</code></p></center>