I keep thinking I have come up with the perfect function to tackle my problem, but I eventually find something that breaks it for no apparent reason. I don't fully understand how htmlentities / htmlspecialchars work, or what exactly they convert, so an explanation of that would help...

I have a mixture of old and new databases, and user input:

    • Old databases: characters are sometimes already encoded with htmlentities() inside the data
    • Old databases: content occasionally contains HTML (which needs stripping)
    • New databases: characters are not encoded before insertion
    • User input could include nasty <script> or &lt;script&gt; &amp;lt;script/&amp;gt;

I am trying to create a catch-all function that will make each case (#1 and #2) both safe and visually appealing:

function html_enc($text){
  while($text!==html_entity_decode($text,ENT_HTML5,'UTF-8')){
    $text=html_entity_decode($text,ENT_HTML5,'UTF-8');
  }
  $text=strip_tags($text);
  $text=htmlentities($text,ENT_HTML5,'UTF-8');
  return $text;
}

I thought I had nailed point #1 with this function, but when I used it on a page title that contained double quotes, the page spat out &quot; instead of ", while the rest of the page displays " correctly. I don't understand why the <title> element would behave differently from the normal body. Does anyone know how to solve this small issue, or can anyone suggest a better function or an improvement?

For point #2 this also seems to be the best solution. I haven't yet managed to break this function with user input, whether it is displayed normally on a page or inside a textarea.

Also, on a side note but in the interest of security: my code assumes that user input posted from HTML forms is UTF-8. All of my pages specify:

<head>
<meta charset="UTF-8"/>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

Is it possible for a user to submit a different encoding? I would imagine it is; if so, how would this affect my functions? Is it possible to catch this?


By specifying ENT_HTML5 on its own you've replaced the default flags (ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML401 since PHP 8.1), so no quote flag is set and quotes are not being decoded.
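
To make the effect visible, here is a small sketch. The sample title string is made up for illustration; only the flag combinations come from the question and this answer.

```php
<?php
// Hypothetical title as it might be stored in an old, entity-encoded database.
$stored = 'Say &quot;hello&quot;';

// ENT_HTML5 alone carries no quote flag, so &quot; is left untouched...
$decoded = html_entity_decode($stored, ENT_HTML5, 'UTF-8');
// $decoded is still 'Say &quot;hello&quot;'

// ...and re-encoding then escapes the ampersand of the surviving entity,
// which is why the browser shows a literal &quot; on the page.
$broken = htmlentities($decoded, ENT_HTML5, 'UTF-8');

// With ENT_QUOTES added, the quotes are decoded and re-encoded exactly once.
$flags     = ENT_QUOTES | ENT_HTML5;
$decodedOk = html_entity_decode($stored, $flags, 'UTF-8'); // 'Say "hello"'
$fixed     = htmlentities($decodedOk, $flags, 'UTF-8');    // 'Say &quot;hello&quot;'
```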

You'll need to pass ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML5, or at least ENT_QUOTES | ENT_HTML5, to both html_entity_decode() and htmlentities().
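
Applied to the function from the question, that could look roughly like the following. This is only a sketch: the do/while loop and the $flags variable are my own restructuring, and the behaviour is otherwise meant to match the original html_enc().

```php
<?php
// Sketch of the question's html_enc() with the quote flag restored.
function html_enc(string $text): string
{
    // Same flag set for decoding and encoding, so quotes round-trip cleanly.
    $flags = ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML5;

    // Decode repeatedly until nothing changes, to undo double-encoded data.
    do {
        $previous = $text;
        $text = html_entity_decode($text, $flags, 'UTF-8');
    } while ($text !== $previous);

    // Remove any tags that were stored in, or decoded out of, the content.
    $text = strip_tags($text);

    // Encode exactly once for output.
    return htmlentities($text, $flags, 'UTF-8');
}
```

For example, html_enc('&amp;quot;Title&amp;quot;') should come back as &quot;Title&quot;, which the browser renders as "Title" in both the <title> element and the body.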