HTML encoding issues - "Â" character showing up instead of " "

I've got a legacy app just starting to misbehave, for whatever reason I'm not sure. It generates a bunch of HTML that gets turned into PDF reports by ActivePDF.

The process works like this:

  1. Pull an HTML template from a DB with tokens in it to be replaced (e.g. "~CompanyName~", "~CustomerName~", etc.)
  2. Replace the tokens with real data
  3. Tidy the HTML with a simple regex function that property formats HTML tag attribute values (ensures quotation marks, etc, since ActivePDF's rendering engine hates anything but single quotes around attribute values)
  4. Send off the HTML to a web service that creates the PDF.

Somewhere in that mess, the non-breaking spaces from the HTML template (the  s) are encoding as ISO-8859-1 so that they show up incorrectly as an "Â" character when viewing the document in a browser (FireFox). ActivePDF pukes on these non-UTF8 characters.

My question: since I don't know where the problem stems from and don't have time to investigate it, is there an easy way to re-encode or find-and-replace the bad characters? I've tried sending it through this little function I threw together, but it turns it all into gobbledegook doesn't change anything.

Private Shared Function ConvertToUTF8(ByVal html As String) As String
    Dim isoEncoding As Encoding = Encoding.GetEncoding("iso-8859-1")
    Dim source As Byte() = isoEncoding.GetBytes(html)
    Return Encoding.UTF8.GetString(Encoding.Convert(isoEncoding, Encoding.UTF8, source))
End Function

Any ideas?

EDIT:

I'm getting by with this for now, though it hardly seems like a good solution:

Private Shared Function ReplaceNonASCIIChars(ByVal html As String) As String
    Return Regex.Replace(html, "[^\u0000-\u007F]", " ")
End Function

Solution 1:

Somewhere in that mess, the non-breaking spaces from the HTML template (the  s) are encoding as ISO-8859-1 so that they show up incorrectly as an "Â" character

That'd be encoding to UTF-8 then, not ISO-8859-1. The non-breaking space character is byte 0xA0 in ISO-8859-1; when encoded to UTF-8 it'd be 0xC2,0xA0, which, if you (incorrectly) view it as ISO-8859-1 comes out as " ". That includes a trailing nbsp which you might not be noticing; if that byte isn't there, then something else has mauled your document and we need to see further up to find out what.

What's the regexp, how does the templating work? There would seem to be a proper HTML parser involved somewhere if your   strings are (correctly) being turned into U+00A0 NON-BREAKING SPACE characters. If so, you could just process your template natively in the DOM, and ask it to serialise using the ASCII encoding to keep non-ASCII characters as character references. That would also stop you having to do regex post-processing on the HTML itself, which is always a highly dodgy business.

Well anyway, for now you can add one of the following to your document's <head> and see if that makes it look right in the browser:

  • for HTML4: <meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
  • for HTML5: <meta charset="utf-8">

If you've done that, then any remaining problem is ActivePDF's fault.

Solution 2:

If any one had the same problem as me and the charset was already correct, simply do this:

  1. Copy all the code inside the .html file.
  2. Open notepad (or any basic text editor) and paste the code.
  3. Go "File -> Save As"
  4. Enter you file name "example.html" (Select "Save as type: All Files (.)")
  5. Select Encoding as UTF-8
  6. Hit Save and you can now delete your old .html file and the encoding should be fixed

Solution 3:

Problem: Even I was facing the problem where we were sending '£' with some string in POST request to CRM System, but when we were doing the GET call from CRM , it was returning '£' with some string content. So what we have analysed is that '£' was getting converted to '£'.

Analysis: The glitch which we have found after doing research is that in POST call we have set HttpWebRequest ContentType as "text/xml" while in GET Call it was "text/xml; charset:utf-8".

Solution: So as the part of solution we have included the charset:utf-8 in POST request and it works.