Case insensitive string comparison

I would like to compare two variables to see if they are the same, but I want this comparison to be case-insensitive.

For example, this would be case sensitive:

if($var1 == $var2){
   ...
}

But I want this to be case insensitive. How would I approach this?

This is fairly simple; you just need to call strtolower() on both variables before comparing them.

If you need to deal with Unicode or international character sets, you can use mb_strtolower().
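As a minimal sketch of that approach (assuming the mbstring extension is loaded and the strings are UTF-8; the sample values are made up for illustration):

```php
<?php
// Sample values (illustrative only); \u{...} escapes keep the example
// independent of the source file's encoding.
$var1 = "\u{00C4}PFEL"; // "ÄPFEL"
$var2 = "\u{00E4}pfel"; // "äpfel"

// Passing the encoding explicitly avoids relying on the configured default.
$equal = mb_strtolower($var1, 'UTF-8') === mb_strtolower($var2, 'UTF-8');

var_dump($equal); // bool(true)
```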

Please note that other answers suggest using strcasecmp(). That function does not handle multibyte characters, so results for UTF-8 strings that contain non-ASCII characters will be wrong.

strcasecmp() returns 0 if the strings are the same (apart from case variations) so you can use:

if (strcasecmp($var1, $var2) === 0) {
    // the strings are equal, ignoring (ASCII) case
}
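To see where strcasecmp() stops being reliable, here is a small sketch (sample strings are illustrative; the second result assumes the input is UTF-8):

```php
<?php
// ASCII-only input: the case difference is ignored.
var_dump(strcasecmp('Hello', 'hELLO') === 0); // bool(true)

// "Äpfel" vs "äpfel" in UTF-8: "Ä" and "ä" are different byte sequences
// that strcasecmp() does not fold, so the strings compare as unequal.
var_dump(strcasecmp("\u{00C4}pfel", "\u{00E4}pfel") === 0); // bool(false)
```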

If your string is in a single-byte encoding, it's simple:

if (strtolower($var1) === strtolower($var2)) {
}

If your string is UTF-8, you have to consider the complexity of Unicode: to-lower-case and to-upper-case are not bijective functions, i.e. if you have a lower case character, transform it to upper case, and transform it back to lower case, you may not end up with the same code point (and the same holds true if you start with an upper case character).

E.g.

"İ" (Latin Capital Letter I with Dot Above, U+0130) is an upper case character, with "i" (Latin Small Letter I, U+0069) as its lower case variant – and "i"'s upper case variant is "I" (Latin Capital Letter I, U+0049).
"ı" (Latin Small Letter Dotless I, U+0131) is a lower case character, with "I" (Latin Capital Letter I, U+0049) as its upper case variant – and "I"'s lower case variant is "i" (Latin Small Letter I, U+0069).

So mb_strtolower('ı') === mb_strtolower('i') returns false, even though both characters have the same upper case character. If you really want a case-insensitive string comparison function, you have to compare both the lower case AND the upper case versions:

if (mb_strtolower($string1) === mb_strtolower($string2)
  || mb_strtoupper($string1) === mb_strtoupper($string2)) {
    // equal in at least one case mapping
}
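The dotless-i case can be checked directly. This is only a sketch, assuming mbstring and UTF-8 input; the variable names are made up:

```php
<?php
$dotless = "\u{0131}"; // "ı", Latin Small Letter Dotless I
$plain   = 'i';        // Latin Small Letter I

// Lower-casing leaves both characters unchanged, so they still differ.
var_dump(mb_strtolower($dotless, 'UTF-8') === mb_strtolower($plain, 'UTF-8')); // bool(false)

// Upper-casing maps both to "I" (U+0049), so this comparison matches.
var_dump(mb_strtoupper($dotless, 'UTF-8') === mb_strtoupper($plain, 'UTF-8')); // bool(true)
```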

I've run a query against the Unicode database from https://codepoints.net (https://dumps.codepoints.net) and found 180 code points for which taking a lower case character's upper case and then its lower case yields a different character, and 8 code points for which taking an upper case character's lower case and then its upper case yields a different character.

But it gets worse: the same grapheme cluster, as seen by the user, may be encoded in more than one way: "ä" may be represented as Latin Small Letter a with Diaeresis (U+00E4) or as Latin Small Letter A (U+0061) followed by Combining Diaeresis (U+0308) – and if you compare the two at a byte level, they won't be equal!

But there is a solution for this in Unicode: Normalization! There are four different forms: NFC, NFD, NFKC, NFKD. For string comparison, NFC and NFD are equivalent and NFKC and NFKD are equivalent. I'd take NFKC as its output is shorter than that of NFKD, and "ﬀ" (Latin Small Ligature ff, U+FB00) will be transformed to two regular "f" characters (but 2⁵ will also be expanded to 25…).
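A short sketch of what normalization does to the examples mentioned here (it assumes the intl extension is installed; the variable names are illustrative):

```php
<?php
$composed   = "\u{00E4}";  // "ä" as a single code point
$decomposed = "a\u{0308}"; // "a" followed by Combining Diaeresis

// Byte-level comparison sees two different strings.
var_dump($composed === $decomposed); // bool(false)

// After NFC normalization both are the single code point U+00E4.
$nfc1 = Normalizer::normalize($composed, Normalizer::FORM_C);
$nfc2 = Normalizer::normalize($decomposed, Normalizer::FORM_C);
var_dump($nfc1 === $nfc2); // bool(true)

// NFKC additionally folds compatibility characters:
// the "ﬀ" ligature becomes "ff", and "2⁵" becomes "25".
var_dump(Normalizer::normalize("\u{FB00}", Normalizer::FORM_KC) === 'ff');  // bool(true)
var_dump(Normalizer::normalize("2\u{2075}", Normalizer::FORM_KC) === '25'); // bool(true)
```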

The resulting function becomes:

function mb_is_string_equal_ci($string1, $string2) {
    // Normalize first, so that equivalent encodings of the same text match.
    $string1_normalized = Normalizer::normalize($string1, Normalizer::FORM_KC);
    $string2_normalized = Normalizer::normalize($string2, Normalizer::FORM_KC);
    // Compare in both directions, because case mapping is not bijective.
    return mb_strtolower($string1_normalized) === mb_strtolower($string2_normalized)
            || mb_strtoupper($string1_normalized) === mb_strtoupper($string2_normalized);
}

Please note:

you need the intl extension for the Normalizer class
you should optimize this function by first checking whether the two strings are already identical
you may want to use NFC instead of NFKC if NFKC removes too many formatting distinctions for your taste
you have to decide for yourself whether you really need all this complexity or prefer a simpler variant of this function
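
One way those notes could be folded into the function is sketched below. The name is_equal_ci_sketch and the $form parameter are hypothetical, chosen only for this illustration; it assumes the intl and mbstring extensions and UTF-8 input, and treats strings that cannot be normalized as unequal, which is a design choice rather than a requirement.

```php
<?php
// Hypothetical helper: the name and the $form parameter are illustrative.
function is_equal_ci_sketch(string $a, string $b, $form = Normalizer::FORM_KC): bool
{
    // Cheap early exit: byte-identical strings are trivially equal.
    if ($a === $b) {
        return true;
    }

    $na = Normalizer::normalize($a, $form);
    $nb = Normalizer::normalize($b, $form);
    // normalize() returns false on failure (e.g. invalid UTF-8);
    // this sketch simply reports such input as unequal.
    if ($na === false || $nb === false) {
        return false;
    }

    // Compare both case mappings, as case conversion is not bijective.
    return mb_strtolower($na, 'UTF-8') === mb_strtolower($nb, 'UTF-8')
        || mb_strtoupper($na, 'UTF-8') === mb_strtoupper($nb, 'UTF-8');
}

// "Äpfel" (composed) vs "ÄPFEL" (decomposed "A" + Combining Diaeresis)
var_dump(is_equal_ci_sketch("\u{00C4}pfel", "A\u{0308}PFEL")); // bool(true)
// Dotless "ı" vs "i": equal only through the upper case comparison
var_dump(is_equal_ci_sketch("\u{0131}", 'i'));                 // bool(true)
var_dump(is_equal_ci_sketch('abc', 'abd'));                    // bool(false)
```

Passing Normalizer::FORM_C as the third argument switches the sketch to NFC for callers who want to keep compatibility distinctions.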
