Ignoring accented letters in string comparison
I need to compare 2 strings in C# and treat accented letters the same as non-accented letters. For example:
string s1 = "hello";
string s2 = "héllo";
s1.Equals(s2, StringComparison.InvariantCultureIgnoreCase);
s1.Equals(s2, StringComparison.OrdinalIgnoreCase);
These 2 strings need to be the same (as far as my application is concerned), but both of these statements evaluate to false. Is there a way in C# to do this?
EDIT 2012-01-20: Oh boy! The solution was so much simpler and has been in the framework nearly forever. As pointed out by knightpfhor:
string.Compare(s1, s2, CultureInfo.CurrentCulture, CompareOptions.IgnoreNonSpace);
Here's a function that strips diacritics from a string:
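// Requires: using System.Globalization; using System.Text;
static string RemoveDiacritics(string text)
{
    // Decompose each accented letter into a base letter plus combining mark(s).
    string formD = text.Normalize(NormalizationForm.FormD);
    StringBuilder sb = new StringBuilder(formD.Length);
    foreach (char ch in formD)
    {
        // Keep every character except the combining (non-spacing) marks.
        UnicodeCategory uc = CharUnicodeInfo.GetUnicodeCategory(ch);
        if (uc != UnicodeCategory.NonSpacingMark)
        {
            sb.Append(ch);
        }
    }
    // Recompose whatever is left.
    return sb.ToString().Normalize(NormalizationForm.FormC);
}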
More details on MichKap's blog (RIP...).
The principle is that it turns 'é' into 2 successive chars: 'e' followed by a combining acute accent. It then iterates through the chars and skips the diacritics.
"héllo" becomes "he<acute>llo", which in turn becomes "hello".
Debug.Assert("hello"==RemoveDiacritics("héllo"));
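To see the decomposition step for yourself, here is a small sketch that dumps the UTF-16 code units after normalizing to FormD. The class name DecompositionDemo and the helper Dump are made up for illustration, and the string uses the escape \u00E9 so the result does not depend on how the source file is encoded.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text;

static class DecompositionDemo
{
    // Hypothetical helper: formats each UTF-16 code unit as U+XXXX(UnicodeCategory).
    public static string Dump(string s)
    {
        return string.Join(" ", s.Select(ch =>
            string.Format("U+{0:X4}({1})", (int)ch, CharUnicodeInfo.GetUnicodeCategory(ch))));
    }

    static void Main()
    {
        string composed = "h\u00E9llo";                                   // precomposed é
        string decomposed = composed.Normalize(NormalizationForm.FormD); // e + combining acute

        Console.WriteLine(composed.Length);    // 5
        Console.WriteLine(decomposed.Length);  // 6
        Console.WriteLine(Dump(decomposed));
    }
}
```

The third entry of the dump is U+0301(NonSpacingMark), which is exactly the character that RemoveDiacritics filters out.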
Note: Here's a more compact .NET4+ friendly version of the same function:
// Requires: using System.Globalization; using System.Linq; using System.Text;
static string RemoveDiacritics(string text)
{
    return string.Concat(
        text.Normalize(NormalizationForm.FormD)
            .Where(ch => CharUnicodeInfo.GetUnicodeCategory(ch) !=
                         UnicodeCategory.NonSpacingMark)
    ).Normalize(NormalizationForm.FormC);
}
If you don't need to convert the string and just want to check for equality, you can use:
string s1 = "hello";
string s2 = "héllo";
if (String.Compare(s1, s2, CultureInfo.CurrentCulture, CompareOptions.IgnoreNonSpace) == 0)
{
// both strings are equal
}
Or, if you want the comparison to be case-insensitive as well:
string s1 = "HEllO";
string s2 = "héLLo";
if (String.Compare(s1, s2, CultureInfo.CurrentCulture, CompareOptions.IgnoreNonSpace | CompareOptions.IgnoreCase) == 0)
{
// both strings are equal
}
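The same options are also accepted by other CompareInfo members, so a "contains" check can be done without stripping anything. Below is a rough sketch; the class AccentInsensitive and both helper names are invented for this example, and the choice to ignore case as well as accents is an assumption you may want to change.

```csharp
using System;
using System.Globalization;

static class AccentInsensitive
{
    // Assumption: ignore both diacritics and case; adjust the flags as needed.
    const CompareOptions Options = CompareOptions.IgnoreNonSpace | CompareOptions.IgnoreCase;

    // Hypothetical helper: true if the two strings compare equal under Options.
    public static bool EqualsIgnoringAccents(string a, string b, CultureInfo culture)
    {
        return culture.CompareInfo.Compare(a, b, Options) == 0;
    }

    // Hypothetical helper: true if 'value' occurs anywhere in 'source' under Options.
    public static bool ContainsIgnoringAccents(string source, string value, CultureInfo culture)
    {
        return culture.CompareInfo.IndexOf(source, value, Options) >= 0;
    }

    static void Main()
    {
        CultureInfo culture = CultureInfo.InvariantCulture;
        Console.WriteLine(EqualsIgnoringAccents("HEllO", "h\u00E9LLo", culture));
        Console.WriteLine(ContainsIgnoringAccents("Caf\u00E9 au lait", "cafe", culture));
    }
}
```

Passing the culture explicitly keeps the result from silently changing with the current thread's culture.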
I had to do something similar but with a StartsWith method. Here is a simple solution derived from @Serge - appTranslator.
Here is an extension method:
public static bool StartsWith(this string str, string value, CultureInfo culture, CompareOptions options)
{
    if (str.Length >= value.Length)
        return string.Compare(str.Substring(0, value.Length), value, culture, options) == 0;
    else
        return false;
}
And for one-liner freaks ;)
public static bool StartsWith(this string str, string value, CultureInfo culture, CompareOptions options)
{
    return str.Length >= value.Length && string.Compare(str.Substring(0, value.Length), value, culture, options) == 0;
}
An accent-insensitive and case-insensitive StartsWith can be called like this:
value.ToString().StartsWith(str, CultureInfo.InvariantCulture, CompareOptions.IgnoreNonSpace | CompareOptions.IgnoreCase)
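One caveat with the Substring-based versions above: they assume the matching prefix has exactly value.Length chars, which may not hold when one string stores an accent as a separate combining character. As a possible alternative, the framework's CompareInfo.IsPrefix takes the same CompareOptions. The sketch below wraps it; PrefixDemo and StartsWithIgnoringAccents are hypothetical names.

```csharp
using System;
using System.Globalization;

static class PrefixDemo
{
    // Hypothetical wrapper around the framework's CompareInfo.IsPrefix.
    public static bool StartsWithIgnoringAccents(string source, string prefix, CultureInfo culture)
    {
        return culture.CompareInfo.IsPrefix(
            source, prefix, CompareOptions.IgnoreNonSpace | CompareOptions.IgnoreCase);
    }

    static void Main()
    {
        CultureInfo culture = CultureInfo.InvariantCulture;
        // "He\u0301llo" stores the accent as a separate combining character,
        // so the accented word is 6 chars long while "hello" is 5.
        Console.WriteLine(StartsWithIgnoringAccents("He\u0301llo world", "hello", culture));
        Console.WriteLine(StartsWithIgnoringAccents("world", "hello", culture));
    }
}
```

With the decomposed input, Substring(0, 5) would cut the word off after "ll", which is why letting the framework find the prefix boundary is the safer choice here.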
The following method, CompareIgnoreAccents(...), works on your example data. Here is the article where I got my background information: http://www.codeproject.com/KB/cs/EncodingAccents.aspx
private static bool CompareIgnoreAccents(string s1, string s2)
{
    return string.Compare(
        RemoveAccents(s1), RemoveAccents(s2), StringComparison.InvariantCultureIgnoreCase) == 0;
}
private static string RemoveAccents(string s)
{
    // ISO-8859-8 has no accented Latin letters; this relies on the conversion
    // falling back to the closest plain letter.
    Encoding destEncoding = Encoding.GetEncoding("iso-8859-8");
    return destEncoding.GetString(
        Encoding.Convert(Encoding.UTF8, destEncoding, Encoding.UTF8.GetBytes(s)));
}
I think an extension method would be better:
public static string RemoveAccents(this string s)
{
    Encoding destEncoding = Encoding.GetEncoding("iso-8859-8");
    return destEncoding.GetString(
        Encoding.Convert(Encoding.UTF8, destEncoding, Encoding.UTF8.GetBytes(s)));
}
Then the use would be this:
if(string.Compare(s1.RemoveAccents(), s2.RemoveAccents(), true) == 0) {
...
A simpler way to remove accents (VB.NET):
Dim source As String = "áéíóúç"
Dim result As String
Dim bytes As Byte() = Encoding.GetEncoding("Cyrillic").GetBytes(source)
result = Encoding.ASCII.GetString(bytes)