Why does a stray </p> end tag generate an empty paragraph?

Apparently, if you have a </p> end tag with no matching start tag within the body element, most if not all browsers will generate an empty paragraph in its place:

<!DOCTYPE html>
<title></title>
<body>
</p>
</body>

Even if any text exists around the end tag, none of it is made part of this p element — it will always be empty and the text nodes will always exist on their own:

<!DOCTYPE html>
<title></title>
<body>
some text</p>more text
</body>

If the above contents of body are wrapped in <p> and </p> tags... I'll leave you to guess what happens:

<!DOCTYPE html>
<title></title>
<body>
<p>some text</p>more text</p>
</body>

Interestingly, if the </p> tag is not preceded by a <body> or </body> tag, no browser except IE9 and older generates an empty paragraph (IE ≤ 9 always creates one, while IE10 and later behave the same as all other browsers):

<!DOCTYPE html>
<title></title>
</p>

<!DOCTYPE html>
<title></title>
</p><body>

<!DOCTYPE html>
<title></title>
</p></body>

I can't find any references stipulating that an end tag with no corresponding start tag should generate an empty element, but that shouldn't come across as surprising considering that it's not even valid HTML in the first place. Indeed, I've only found browsers to do this with the p element (and to some extent the br element as well!), but not any explanation as to why.

It is rather consistent across browsers using both traditional HTML parsers and HTML5 parsers, though, applying both in quirks mode and in standards mode. So, it's probably fair to deduce that this is for backward compatibility with early specifications or legacy behavior.

In fact, I did find this comment on an answer to a somewhat related question, which basically confirms it:

The reason why <p> tags are valid unclosed is that originally <p> was defined as a "new paragraph" marker, rather than p being a container element. Equivalent to <br> being a "new line" marker. You can see so defined in this document from 1992: http://www.w3.org/History/19921103-hypertext/hypertext/WWW/MarkUp/Tags.html and this one from 1993: http://www.w3.org/MarkUp/draft-ietf-iiir-html-01.txt Because there were web pages pre-dating the change and browser parsers have always been as backward compatible as possible with existing web content, it's always stayed possible to use <p> that way.

But it doesn't quite explain why parsers treat an explicit </p> end tag (with the slash) as simply... a tag, and generate an empty element in the DOM. Is this part of some parser error handling convention from way back, when the syntax wasn't as strictly defined as it has been more recently? If so, is it documented anywhere at all?


That this behaviour is required is documented in HTML5. See http://w3c.github.io/html/syntax.html#the-in-body-insertion-mode and search for 'An end tag whose tag name is "p"', where it says:

If the stack of open elements does not have an element in button scope with the same tag name as that of the token, then this is a parse error; act as if a start tag with the tag name "p" had been seen, then reprocess the current token.

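Translated into plain English, this means: if the </p> tag can't be matched with an existing <p> tag, create a p element, then process the </p> again so that it immediately closes the new element. That is why the resulting paragraph is always empty.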
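To make the rule concrete, here is a minimal Python sketch of just this one step. It is not a real HTML parser: the token format, the `Node` class, and the `build_body` and `serialize` helpers are all made up for illustration, and "button scope" is simplified to "any open p on the stack". It only shows how "act as if a start tag had been seen, then reprocess the token" leads to an empty p element.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A toy DOM node: an element, or text when tag is '#text' (hypothetical type)."""
    tag: str
    text: str = ""
    children: list = field(default_factory=list)


def build_body(tokens):
    """Build a toy <body> tree from (kind, value) tokens.

    kind is "start", "end" or "text". Only <p> and </p> are modelled;
    everything else in the real algorithm is deliberately left out.
    """
    body = Node("body")
    stack = [body]  # simplified "stack of open elements"

    def open_p():
        p = Node("p")
        stack[-1].children.append(p)
        stack.append(p)

    def close_p():
        # Pop elements until a p element has been popped.
        while stack.pop().tag != "p":
            pass

    def p_in_scope():
        # Assumption: stands in for "has a p element in button scope".
        return any(node.tag == "p" for node in stack)

    for kind, value in tokens:
        if kind == "text":
            stack[-1].children.append(Node("#text", value))
        elif kind == "start" and value == "p":
            if p_in_scope():
                close_p()  # a new <p> implicitly closes an open one
            open_p()
        elif kind == "end" and value == "p":
            if not p_in_scope():
                # Parse error: act as if a <p> start tag had been seen...
                open_p()
            # ...then process the </p> as usual, closing the p element.
            close_p()
    return body


def serialize(node):
    """Render the toy tree back to markup so the result is easy to read."""
    if node.tag == "#text":
        return node.text
    inner = "".join(serialize(child) for child in node.children)
    return f"<{node.tag}>{inner}</{node.tag}>"


stray = [("text", "some text"), ("end", "p"), ("text", "more text")]
print(serialize(build_body(stray)))
# <body>some text<p></p>more text</body>
```

Feeding it the tokens for `<p>some text</p>more text</p>` gives `<body><p>some text</p>more text<p></p></body>`: the first </p> finds its match, the second one does not and produces the extra empty paragraph.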

Why this is so is harder to ascertain. Usually, it is because some browser in the past caused this to happen as a bug, and web pages came to rely on the behaviour, so other browsers had to implement it too.


The HTML4 DTD states that the end tag is optional for the paragraph element, but the start tag is required.

The SGML declaration for HTML4 states that omittag is 'yes', which means that the start tag can be implied.

The end tag follows SGML rules:

an end tag closes, back to the matching start tag, all unclosed intervening start tags with omitted end tags
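As a rough illustration of that rule, the following Python sketch treats the open elements as a plain list of tag names. The `close_to_matching` helper is hypothetical, and leaving the stack untouched when there is no matching start tag is an assumption of this sketch, since the quoted rule does not cover that case.

```python
def close_to_matching(open_elements, end_tag):
    """Close back to the nearest matching start tag (hypothetical helper).

    open_elements lists the currently open tag names, outermost first.
    Returns (remaining_open, closed), with closed in closing order.
    """
    if end_tag not in open_elements:
        # Assumption: with no matching start tag, nothing is closed here.
        return list(open_elements), []
    # Index of the innermost (last) occurrence of end_tag.
    idx = len(open_elements) - 1 - open_elements[::-1].index(end_tag)
    return open_elements[:idx], open_elements[idx:][::-1]


remaining, closed = close_to_matching(["body", "ul", "li", "p"], "ul")
print(remaining, closed)
# ['body'] ['p', 'li', 'ul']
```

Here </ul> closes the intervening p and li elements, whose end tags were omitted, before closing ul itself.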

Anonymous block boxes are generated for inline content such as text nodes, so they need not be wrapped by the paragraph element.

There's a thread in the Mozilla bug database which explains this behaviour:

  • Mozilla parses "half-tags" gullibly, leading to XSS security problems

Here's a relevant comment by Boris Zbarsky:

Actually, as I understand it, proper parsing of SGML/HTML requires that we behave this way. That is, the '<' of the next tag is a valid way to close out the markup of a previous tag...

And summarized by Ian Hickson:

The basic principle at work here, it appears, is that the markup is fixed up by delaying any closing tags until after all other open elements have been closed, and no attempt is made to make the DOM follow the HTML DTD.

References

  • SGML Productions

  • HTML 2.0 Specification

  • Arguments against SGML

  • Tag Soup: How UAs handle

  • Tag Soup: How Mac IE 5 and Safari handle

  • Web SGML and HTML 4.0 Explained

  • Testing SGML SHORTTAG support across browsers

  • Mozilla Bug 226495

  • Shorttag and Omittag

  • Jotting on parsers for SGML-family document languages: SGML, HTML, XML

  • A brief, opinionated history of XML - bobdc.blog