How to iterate a UTF-8 string in PHP?
How can I iterate over a UTF-8 string character by character using indexing?
When you access a UTF-8 string with the bracket operator, e.g. $str[0], you get single bytes, so a multibyte UTF-8 character spans 2 or more elements.
For example:
$str = "Kąt";
$str[0] = "K";
$str[1] = "�";
$str[2] = "�";
$str[3] = "t";
but I would like to have:
$str[0] = "K";
$str[1] = "ą";
$str[2] = "t";
This is possible with mb_substr, but it is extremely slow, e.g.:
mb_substr($str, 0, 1) = "K"
mb_substr($str, 1, 1) = "ą"
mb_substr($str, 2, 1) = "t"
Is there another way to iterate over the string character by character without using mb_substr?
Solution 1:
Use preg_split. With the "u" modifier, the pattern and the subject string are treated as UTF-8.
$chrArray = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
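A minimal sketch of how the resulting array could be used, assuming the string is valid UTF-8. The helper name utf8_chars is hypothetical and not part of the original answer; the fallback to an empty array on failure is also an assumption about how you might want to handle bad input:

```php
<?php
// Hypothetical helper: split a UTF-8 string into an array of characters
// using preg_split with the "u" modifier.
function utf8_chars(string $str): array
{
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    // preg_split returns false on failure, e.g. when $str is not valid UTF-8.
    return $chars === false ? [] : $chars;
}

// Index-based access now works per character instead of per byte.
$chars = utf8_chars("Kąt");
foreach ($chars as $i => $chr) {
    echo $i, ' => ', $chr, PHP_EOL;
}
```

Note that this builds the whole array up front, so memory use grows with the length of the string.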
Solution 2:
preg_split will fail on very large strings with a memory error, and mb_substr is indeed slow, so here is a simple and effective function that you could use instead:
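function nextchar($string, &$pointer){
    // $pointer is a byte offset into $string; stop at the end of the string.
    if(!isset($string[$pointer])) return false;
    // The lead byte determines how many bytes the character occupies.
    $char = ord($string[$pointer]);
    if($char < 128){
        // Single-byte (ASCII) character.
        return $string[$pointer++];
    }else{
        if($char < 224){
            $bytes = 2;
        }elseif($char < 240){
            $bytes = 3;
        }else{
            $bytes = 4;
        }
        $str = substr($string, $pointer, $bytes);
        $pointer += $bytes;
        return $str;
    }
}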
I used this nextchar function to loop through a multibyte string character by character. If I change it to the mb_substr version below, the performance difference is huge:
function nextchar($string, &$pointer){
    if(!isset($string[$pointer])) return false;
    return mb_substr($string, $pointer++, 1, 'UTF-8');
}
Looping over a string 10000 times with the code below took about 3 seconds with the first version and 13 seconds with the second:
function microtime_float(){
    list($usec, $sec) = explode(' ', microtime());
    return ((float)$usec + (float)$sec);
}
$source = 'árvíztűrő tükörfúrógépárvíztűrő tükörfúrógépárvíztűrő tükörfúrógépárvíztűrő tükörfúrógépárvíztűrő tükörfúrógép';
$t = Array(
    0 => microtime_float()
);
for($i = 0; $i < 10000; $i++){
    $pointer = 0;
    while(($chr = nextchar($source, $pointer)) !== false){
        //echo $chr;
    }
}
$t[] = microtime_float();
// Parentheses make the subtraction explicit before the concatenation.
echo ($t[1] - $t[0]).PHP_EOL.PHP_EOL;
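The byte-length lookup used in nextchar can also be wrapped in a generator, which allows a plain foreach loop. This is only a sketch: the name utf8_iterate is hypothetical, it assumes a valid UTF-8 input, and it has not been benchmarked against the two versions above.

```php
<?php
// Hypothetical generator wrapper around the same lead-byte logic as nextchar.
// Assumes $string is valid UTF-8; invalid input is not detected.
function utf8_iterate(string $string): Generator
{
    $length = strlen($string); // length in bytes
    $pointer = 0;              // current byte offset
    while ($pointer < $length) {
        $byte = ord($string[$pointer]);
        if ($byte < 128) {
            $bytes = 1; // ASCII
        } elseif ($byte < 224) {
            $bytes = 2;
        } elseif ($byte < 240) {
            $bytes = 3;
        } else {
            $bytes = 4;
        }
        // Yield one character at a time instead of building an array.
        yield substr($string, $pointer, $bytes);
        $pointer += $bytes;
    }
}

foreach (utf8_iterate("Kąt") as $chr) {
    echo $chr, PHP_EOL;
}
```

Because characters are yielded one at a time, this avoids holding a full character array in memory, which was the concern raised about preg_split for very large strings.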