How do I detect whether I have to apply UTF-8 decode or encode to a string?
I have a feed taken from third-party sites, and sometimes I have to apply utf8_decode and other times utf8_encode to get the desired visible output.
If the same function is applied twice by mistake, or the wrong one is used, the output gets even uglier. This is what I want to fix.
How can I detect which of the two has to be applied to a given string?
The content is nominally UTF-8, but there are parts inside it that are not.
I can't say I can rely on mb_detect_encoding(). I had some freaky false positives a while back.
The most universal check I found, one that worked well in every case I tried, was:
if (preg_match('!!u', $string))
{
    // This is UTF-8
}
else
{
    // Definitely not UTF-8
}
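Building on that check, here is a minimal sketch of converting a string only when it is not already valid UTF-8, so that running it twice does no further damage. The function name ensure_utf8 is hypothetical, and the fallback encoding of ISO-8859-1 is an assumption (it is the encoding utf8_encode assumes); your feed may use something else. It needs the mbstring extension.

```php
<?php
// Hypothetical helper: return $string unchanged if it is already valid UTF-8,
// otherwise convert it, ASSUMING the source encoding is $fallback.
function ensure_utf8(string $string, string $fallback = 'ISO-8859-1'): string
{
    // preg_match() with the /u modifier returns false on invalid UTF-8.
    if (preg_match('!!u', $string) === 1) {
        return $string;
    }
    return mb_convert_encoding($string, 'UTF-8', $fallback);
}

$latin1 = "Chr\xE9tiens";     // "Chrétiens" encoded as ISO-8859-1
$utf8   = "Chr\xC3\xA9tiens"; // the same word encoded as UTF-8

var_dump(ensure_utf8($latin1) === $utf8);              // bool(true)
var_dump(ensure_utf8($utf8) === $utf8);                // bool(true)
var_dump(ensure_utf8(ensure_utf8($latin1)) === $utf8); // bool(true)
```

Note that this only catches byte sequences that are invalid as UTF-8. A string that was already encoded twice is still valid UTF-8, so this check will leave it as it is.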
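// Undo one level of UTF-8 encoding, but only if the result is still valid UTF-8.
// Note: utf8_decode() is deprecated as of PHP 8.2.
function str_to_utf8($str) {
    $decoded = utf8_decode($str);
    // If decoding produced invalid UTF-8, $str was not double-encoded: keep it.
    if (mb_detect_encoding($decoded, 'UTF-8', true) === false) {
        return $str;
    }
    return $decoded;
}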
var_dump(str_to_utf8("« Chrétiens d'Orient » : la RATP fait marche arrière"));
//string '« Chrétiens d'Orient » : la RATP fait marche arrière' (length=56)
var_dump(str_to_utf8("« Chrétiens d'Orient » : la RATP fait marche arrière"));
//string '« Chrétiens d'Orient » : la RATP fait marche arrière' (length=56)
You can use:
- mb_detect_encoding: Detect character encoding
The character set might also be available in the HTTP response headers or in the response data itself.
Example:
var_dump(
    mb_detect_encoding(
        file_get_contents('http://stackoverflow.com/questions/4407854')
    ),
    $http_response_header
);
Output (codepad):
string(5) "UTF-8"
array(9) {
[0]=>
string(15) "HTTP/1.1 200 OK"
[1]=>
string(33) "Cache-Control: public, max-age=11"
[2]=>
string(38) "Content-Type: text/html; charset=utf-8"
[3]=>
string(38) "Expires: Fri, 10 Dec 2010 10:40:07 GMT"
[4]=>
string(44) "Last-Modified: Fri, 10 Dec 2010 10:39:07 GMT"
[5]=>
string(7) "Vary: *"
[6]=>
string(35) "Date: Fri, 10 Dec 2010 10:39:55 GMT"
[7]=>
string(17) "Connection: close"
[8]=>
string(21) "Content-Length: 34119"
}
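To act on the charset shown in the Content-Type header above, something like the following sketch could pull it out of the header lines. The function name charset_from_headers is hypothetical, and the regular expression is a simplification that assumes a conventional "Content-Type: ...; charset=..." line.

```php
<?php
// Hypothetical helper: extract the charset from an array of raw header lines
// such as $http_response_header. Returns null if no charset is declared.
function charset_from_headers(array $headers): ?string
{
    $charset = null;
    foreach ($headers as $header) {
        if (preg_match('/^Content-Type:.*?;\s*charset=["\']?([\w.:-]+)/i', $header, $m)) {
            // Keep the last match: after redirects, the array holds the
            // headers of every response, with the final response last.
            $charset = $m[1];
        }
    }
    return $charset;
}

$headers = [
    'HTTP/1.1 200 OK',
    'Content-Type: text/html; charset=utf-8',
];
var_dump(charset_from_headers($headers)); // string(5) "utf-8"
```

If a charset other than UTF-8 is returned, the body could then be passed through mb_convert_encoding($body, 'UTF-8', $charset). Keep in mind that the declared charset can be wrong or missing, which is why the function may return null.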