How to handle user input of invalid UTF-8 characters?

Solution 1:

The accept-charset="UTF-8" attribute is only a guideline for browsers to follow, and they are not forced to submit that in that way. Crappy form submission bots are a good example...

I usually ignore bad characters, either via iconv() or with the less reliable utf8_encode() / utf8_decode() functions. If you use iconv, you also have the option to transliterate bad characters.

Here is an example using iconv():

$str_ignore = iconv('UTF-8', 'UTF-8//IGNORE', $str);
$str_translit = iconv('UTF-8', 'UTF-8//TRANSLIT', $str);

If you want to display an error message to your users I'd probably do this in a global way instead of a per value received basis. Something like this would probably do just fine:

function utf8_clean($str)
{
    return iconv('UTF-8', 'UTF-8//IGNORE', $str);
}

$clean_GET = array_map('utf8_clean', $_GET);

if (serialize($_GET) != serialize($clean_GET))
{
    $_GET = $clean_GET;
    $error_msg = 'Your data is not valid UTF-8 and has been stripped.';
}

// $_GET is clean!

You may also want to normalize new lines and strip (non-)visible control chars, like this:

function Clean($string, $control = true)
{
    $string = iconv('UTF-8', 'UTF-8//IGNORE', $string);

    if ($control === true)
    {
            return preg_replace('~\p{C}+~u', '', $string);
    }

    return preg_replace(array('~\r\n?~', '~[^\P{C}\t\n]+~u'), array("\n", ''), $string);
}

Code to convert from UTF-8 to Unicode code points:

function Codepoint($char)
{
    $result = null;
    $codepoint = unpack('N', iconv('UTF-8', 'UCS-4BE', $char));

    if (is_array($codepoint) && array_key_exists(1, $codepoint))
    {
        $result = sprintf('U+%04X', $codepoint[1]);
    }

    return $result;
}

echo Codepoint('à'); // U+00E0
echo Codepoint('ひ'); // U+3072

It is probably faster than any other alternative, but I haven't tested it extensively though.


Example:

$string = 'hello world�';

// U+FFFEhello worldU+FFFD
echo preg_replace_callback('/[\p{So}\p{Cf}\p{Co}\p{Cs}\p{Cn}]/u', 'Bad_Codepoint', $string);

function Bad_Codepoint($string)
{
    $result = array();

    foreach ((array) $string as $char)
    {
        $codepoint = unpack('N', iconv('UTF-8', 'UCS-4BE', $char));

        if (is_array($codepoint) && array_key_exists(1, $codepoint))
        {
            $result[] = sprintf('U+%04X', $codepoint[1]);
        }
    }

    return implode('', $result);
}

This may be what you were looking for.

Solution 2:

Receiving invalid characters from your web application might have to do with the character sets assumed for HTML forms. You can specify which character set to use for forms with the accept-charset attribute:

<form action="..." accept-charset="UTF-8">

You also might want to take a look at similar questions on Stack Overflow for pointers on how to handle invalid characters, e.g., those in the column to the right, but I think that signaling an error to the user is better than trying to clean up those invalid characters which cause unexpected loss of significant data or unexpected change of your user's inputs.

Solution 3:

I put together a fairly simple class to check if input is in UTF-8 and to run through utf8_encode() as needs be:

class utf8
{

    /**
     * @param array $data
     * @param int $options
     * @return array
     */
    public static function encode(array $data)
    {
        foreach ($data as $key=>$val) {
            if (is_array($val)) {
                $data[$key] = self::encode($val, $options);
            } else {
                if (false === self::check($val)) {
                    $data[$key] = utf8_encode($val);
                }
            }
        }

        return $data;
    }

    /**
     * Regular expression to test a string is UTF8 encoded
     * 
     * RFC3629
     * 
     * @param string $string The string to be tested
     * @return bool
     * 
     * @link http://www.w3.org/International/questions/qa-forms-utf-8.en.php
     */
    public static function check($string)
    {
        return preg_match('%^(?:
            [\x09\x0A\x0D\x20-\x7E]              # ASCII
            | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
            |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
            | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
            |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
            |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
            | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
            |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
            )*$%xs',
            $string);
    }
}

// For example
$data = utf8::encode($_POST);

Solution 4:

For completeness to this question (not necessarily the best answer)...

function as_utf8($s) {
    return mb_convert_encoding($s, "UTF-8", mb_detect_encoding($s));
}

Solution 5:

There is a multibyte extension for PHP. See Multibyte String

You should try the mb_check_encoding() function.