Catch-All HTML Encoding Entities
I keep thinking I have come up with the perfect function to tackle my problem, but I eventually find something that breaks it for no apparent reason. I don't fully understand how htmlentities / htmlspecialchars work, or what exactly they convert, so I suppose understanding that would help...
I have a mixture of old and new databases, and user input:
1. Databases
- Old databases: characters are sometimes encoded with htmlentities() inside the data
- Old databases: content occasionally contains HTML (needs stripping)
- New databases: characters are not encoded before insertion
2. User input
- User input could include nasty <script> or &lt;script/&gt;
I am trying to create a catch-all function that will make each case (#1 and #2) both safe and visually appealing:
function html_enc($text){
    while($text!==html_entity_decode($text,ENT_HTML5,'UTF-8')){
        $text=html_entity_decode($text,ENT_HTML5,'UTF-8');
    }
    $text=strip_tags($text);
    $text=htmlentities($text,ENT_HTML5,'UTF-8');
    return $text;
}
I thought I had nailed point #1 with this function, but when I used it on a page title that contained double quotes, the page spat out &quot; instead of ", while the rest of the page displays " correctly. I don't understand why the <title> element would be different from the normal body. Does anyone know how to solve this small issue, or can anyone suggest a better function or an improvement?
For point #2 this also seems to be the best solution: I haven't yet managed to break the function with user input, whether it is displayed normally on a page or inside a textarea.
On a side note, but in the interest of security: my code assumes that user input posted from HTML forms is UTF-8. All of my pages specify:
<head>
<meta charset="UTF-8"/>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
Is it possible for a user to submit a different encoding? I would imagine it is, and how would this affect my functions? Is it possible to catch this?
By specifying ENT_HTML5 you've lost the default flags ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML401, so quotes are not being decoded.
You'll need ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML5 or ENT_QUOTES | ENT_HTML5.
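A minimal sketch of the question's function with that flag set applied to both calls. The name html_enc_quotes and the HTML_ENC_FLAGS constant are made up for illustration; the logic is otherwise the same as html_enc(). Passing identical flags to html_entity_decode() and htmlentities() keeps decoding and encoding symmetric, so a &quot; stored in old data should be decoded once and re-encoded once:

```php
<?php
// Hypothetical constant: the flag combination suggested above, reused for every call.
const HTML_ENC_FLAGS = ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML5;

// Hypothetical name: html_enc() from the question with quote handling enabled.
function html_enc_quotes(string $text): string
{
    // Decode repeatedly until the text stops changing (covers double-encoded data).
    do {
        $previous = $text;
        $text = html_entity_decode($text, HTML_ENC_FLAGS, 'UTF-8');
    } while ($text !== $previous);

    // Remove any markup, then encode exactly once for output.
    $text = strip_tags($text);
    return htmlentities($text, HTML_ENC_FLAGS, 'UTF-8');
}

echo html_enc_quotes('My &amp;quot;quoted&amp;quot; title'), "\n"; // My &quot;quoted&quot; title
echo html_enc_quotes('<b>Bold</b> &amp; "plain"'), "\n";           // Bold &amp; &quot;plain&quot;
```

ENT_SUBSTITUTE is the optional part: with it, invalid UTF-8 byte sequences are replaced by U+FFFD instead of making the function return an empty string.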