Is strip_tags() vulnerable to scripting attacks?
Is there a known XSS or other attack that makes it past a
$content = "some HTML code";
$content = strip_tags($content);
echo $content;
?
The manual has a warning:
This function does not modify any attributes on the tags that you allow using allowable_tags, including the style and onmouseover attributes that a mischievous user may abuse when posting text that will be shown to other users.
but that is related to using the allowable_tags parameter only.
With no allowed tags set, is strip_tags() vulnerable to any attack?
Chris Shiflett seems to say it's safe:
Use Mature Solutions
When possible, use mature, existing solutions instead of trying to create your own. Functions like strip_tags() and htmlentities() are good choices.
Is this correct? If possible, please quote sources.
I know about HTML purifier, htmlspecialchars() etc.- I am not looking for the best method to sanitize HTML. I just want to know about this specific issue. This is a theoretical question that came up here.
Reference: strip_tags() implementation in the PHP source code
As its name may suggest, strip_tags should remove all HTML tags. The only way we can prove it is by analyzing the source code. The following analysis applies to a strip_tags('...') call, without a second argument for whitelisted tags.
First of all, some theory about HTML tags: a tag starts with a < followed by non-whitespace characters. If this string starts with a ?, it should not be parsed. If this string starts with !--, it's considered a comment and the following text should not be parsed either. A comment is terminated with -->; inside such a comment, characters like < and > are allowed. Attributes can occur in tags; their values may optionally be surrounded by a quote character (' or "). If such a quote exists, it must be closed; otherwise, if a > is encountered, the tag is not closed.
The code <a href="example>xxx</a><a href="second">text</a> is interpreted in Firefox as:
<a href="http://example.com%3Exxx%3C/a%3E%3Ca%20href=" second"="">text</a>
The PHP function strip_tags is referenced in line 4036 of ext/standard/string.c. That function calls the internal function php_strip_tags_ex.
Two buffers exist, one for the output, the other for "inside HTML tags". A counter named depth holds the number of open angle brackets (<).
The variable in_q contains the quote character (' or ") if any, and 0 otherwise. The last character is stored in the variable lc.
The function has five states; three are mentioned in the description above the function. Based on this information and the function body, the following states can be derived:
- State 0 is the output state (not in any tag)
- State 1 means we are inside a normal HTML tag (the tag buffer contains <)
- State 2 means we are inside a PHP tag
- State 3: we came from the output state and encountered the < and ! characters (the tag buffer contains <!)
- State 4: inside an HTML comment
We just need to make sure that no tag can be inserted, that is, a < followed by a non-whitespace character. Line 4326 handles the < character, with the following cases:
- If inside quotes (e.g. <a href="inside quotes">), the < character is ignored (removed from the output).
- If the next character is a whitespace character, < is added to the output buffer.
- If outside an HTML tag, the state becomes 1 ("inside HTML tag") and the last character lc is set to <.
- Otherwise, if inside an HTML tag, the counter named depth is incremented and the character is ignored.
If > is met while the tag is open (state == 1), in_q becomes 0 ("not in a quote") and state becomes 0 ("not in a tag"). The tag buffer is discarded.
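The behaviour derived above can be checked with a few calls. This is only a small sketch: the input strings and variable names are made up for illustration, and the results in the comments are what the state machine described above should produce.

```php
<?php
// Tags are dropped; the text between them survives.
$plain = strip_tags('<b>bold</b> and <script>alert(1)</script>');
// $plain is 'bold and alert(1)'

// A "<" followed by whitespace is not treated as the start of a tag,
// and a ">" seen in the output state is copied as-is.
$math = strip_tags('1 < 2 and 3 > 2');
// $math is '1 < 2 and 3 > 2'

// A ">" inside a quoted attribute value does not close the tag,
// because in_q is still set when it is encountered.
$quoted = strip_tags('<a href="x>y">link</a>');
// $quoted is 'link'
```

Note that the script body `alert(1)` is still present as plain text; only the tags themselves are removed.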
Attribute checks (for characters like ' and ") are done on the tag buffer, which is discarded. So the conclusion is:
strip_tags without a tag whitelist is safe for inclusion outside tags, no tag will be allowed.
By "outside tags", I mean not in tags as in <a href="in tag">outside tag</a>. Text may still contain < and >, as in >< a>>. The result is not valid HTML, however: <, > and & still need to be escaped, especially the &. That can be done with htmlspecialchars().
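A minimal sketch of that combination, assuming the output goes into element content (not into an attribute) and that the page is served as UTF-8. The sample input and variable names are illustrative only.

```php
<?php
// Illustrative input: contains a tag, a bare "&" and a stray "<".
$input = 'Tom & Jerry <b>rock</b> when 1 < 2';

// Step 1: remove the tags. The "&" and the stray "<" are left untouched.
$text = strip_tags($input);
// $text is 'Tom & Jerry rock when 1 < 2'

// Step 2: escape what remains so the result is valid HTML text.
$safe = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
// $safe is 'Tom &amp; Jerry rock when 1 &lt; 2'

$html = '<div>' . $safe . '</div>';
```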
The description for strip_tags without a whitelist argument would be:
Makes sure that no HTML tag exists in the returned string.
I cannot predict future exploits, especially since I haven't looked at the PHP source code for this. However, there have been exploits in the past due to browsers accepting seemingly invalid tags (like <s\0cript>). So it's possible that in the future someone might be able to exploit odd browser behavior.
That aside, sending the output directly to the browser as a full block of HTML should never be insecure:
echo '<div>'.strip_tags($foo).'</div>';
However, this is not safe:
echo '<input value="'.strip_tags($foo).'" />';
because one could easily end the quote via " and insert a script handler.
I think it's much safer to always convert stray < into &lt; (and the same with quotes).
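To make the attribute case concrete, here is a small sketch. The payload is a made-up example, and htmlspecialchars() with ENT_QUOTES is one way to do the conversion, not the only one.

```php
<?php
// Hypothetical attacker input: no tags at all, just a quote and a handler.
$foo = '" onmouseover="alert(1)';

// strip_tags() finds no tag, so the string passes through unchanged.
$stripped = strip_tags($foo);

// The quote ends the value attribute and the handler becomes a new attribute.
$unsafe = '<input value="' . $stripped . '" />';
// $unsafe is '<input value="" onmouseover="alert(1)" />'

// Escaping the quotes keeps the whole payload inside the attribute value.
$safe = '<input value="' . htmlspecialchars($foo, ENT_QUOTES, 'UTF-8') . '" />';
// $safe is '<input value="&quot; onmouseover=&quot;alert(1)" />'
```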