How to count unique characters in string [duplicate]

Solution 1:

You can use LINQ:

var count = myString.Distinct().Count();

This works because string implements IEnumerable<char>.

Without LINQ, you can do what Distinct does internally and use a HashSet<char>:

var count = (new HashSet<char>(myString)).Count;

Solution 2:

If you only handle English ANSI text (or characters from the BMP), then 80% of the time you can write:

myString.Distinct().Count()

You will live happily and never have any trouble. I'm posting this answer only for those who really need to handle this properly. I'd say everyone should, but I know that's not the case (quote from Wikipedia):

Because the most commonly used characters are all in the Basic Multilingual Plane, handling of surrogate pairs is often not thoroughly tested. This leads to persistent bugs and potential security holes, even in popular and well-reviewed application software (e.g. CVE-2008-2938, CVE-2012-2135)

The problem with our first naïve solution is that it doesn't handle Unicode properly, and it doesn't consider what the user perceives as a character. Try "𠀑".Distinct().Count(): the code wrongly returns 2, because the UTF-16 representation of that character is 0xD840 0xDC11 (by the way, neither of them is a valid Unicode character on its own, because they are a high and a low surrogate, respectively).
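To see the problem concretely, here is a minimal sketch. The class and method names (SurrogateDemo, NaiveCount, CodeUnits) are made up for illustration; only Distinct(), Count() and the char conversions are real .NET APIs.

```csharp
using System;
using System.Linq;

// Hypothetical helper class, for illustration only
public static class SurrogateDemo
{
    // Counts distinct UTF-16 code units, exactly like myString.Distinct().Count()
    public static int NaiveCount(string text)
    {
        return text.Distinct().Count();
    }

    // Lists the UTF-16 code units of the string in hexadecimal
    public static string CodeUnits(string text)
    {
        return string.Join(" ", text.Select(c => "0x" + ((int)c).ToString("X4")));
    }
}
```

With string s = "\U00020011" (that is, 𠀑), SurrogateDemo.NaiveCount(s) returns 2 and SurrogateDemo.CodeUnits(s) returns "0xD840 0xDC11", even though the user sees a single character.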

I won't be very strict about terms and definitions here, so please use www.unicode.org as the reference. For a (much) broader discussion, read "How can I perform a Unicode aware character by character comparison?"; encoding isn't the only issue you have to consider.

1) It doesn't take into account that a .NET System.Char doesn't represent a character (or, more specifically, a grapheme) but a code unit of UTF-16 encoded text, so one character may need more than one Char (this happens, for example, with some ideographic characters). Often they coincide, but not always.

2) If you're counting what the user thinks of (or perceives) as a character, then this fails again, because it doesn't handle combined characters like ا́ (there are many examples of this in Arabic). There are also duplicates that exist for historical reasons: for example, é is both a single Unicode code point and a combination of two (e followed by a combining acute accent), so that code will treat the two forms as different characters.

3) We're talking about a western/American definition of character. If you're counting characters for end users, you may need to change your definition to match what they expect (for example, in Korean the definition of a character may not be so obvious; another example is the Czech ch, which is always counted as a single character). Finally, don't forget that strange things happen when you convert characters to upper or lower case (for example, in German ß becomes SS in upper case; see also this post).
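Point 2 is easy to reproduce. The sketch below uses made-up names (CombiningDemo and its members) and only shows the symptom; it does not fix anything.

```csharp
using System;
using System.Linq;

// Hypothetical helper class, for illustration only
public static class CombiningDemo
{
    public const string Precomposed = "\u00E9"; // é as a single code point
    public const string Decomposed = "e\u0301"; // e followed by COMBINING ACUTE ACCENT

    // The naive approach from the first solution
    public static int NaiveCount(string text)
    {
        return text.Distinct().Count();
    }

    // Compares the raw UTF-16 code units, with no linguistic rules applied
    public static bool SameCodeUnits(string a, string b)
    {
        return string.Equals(a, b, StringComparison.Ordinal);
    }
}
```

Both constants are displayed as é, yet CombiningDemo.SameCodeUnits(Precomposed, Decomposed) is false, and CombiningDemo.NaiveCount(Precomposed + Decomposed) returns 3 (é, e and the combining accent) where the user would expect 1.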

Encoding

C# strings are encoded as UTF-16 (a char is two bytes), but UTF-16 isn't a fixed-size encoding, and char should properly be called a code unit. What does that mean? You may have a string whose Length is 2 but where the user sees (and there actually is) just one character, so the count should be 1.

If you need to handle this properly, things get much more complicated (and slower). Fortunately, the Char class has some helpful methods for handling surrogates.

The following code is untested (and written for illustration, so it is absolutely not optimized; I'm sure it can be done much better), so take it only as a starting point for further investigation:

using System;
using System.Collections.Generic;

int CountCharacters(string text)
{
    // Each entry is one code point, stored as a string of one or two code units
    HashSet<string> characters = new HashSet<string>();

    for (int i = 0; i < text.Length; ++i)
    {
        if (Char.IsSurrogatePair(text, i))
        {
            // High surrogate followed by a low surrogate: our "character"
            // is encoded as this code unit plus the next one
            characters.Add(text.Substring(i, 2));
            ++i; // The low surrogate has been consumed, skip it
        }
        else
        {
            // BMP character (or an unpaired surrogate in malformed text)
            characters.Add(text[i].ToString());
        }
    }

    return characters.Count;
}

Note that this example doesn't handle duplicates (when the same character has different codes, or can be either a single code point or a combined character).

Combined Characters

If you have to handle combined characters (and, of course, encoding), then the best way is to use the StringInfo class. You'll enumerate (and then count) both combined and encoded characters:

StringInfo.GetTextElementEnumerator(text).Walk()
    .Distinct().Count();

Walk() is a trivial-to-implement extension method that simply walks through all the elements of an IEnumerator (we need it because GetTextElementEnumerator() returns a TextElementEnumerator, which implements IEnumerator but not IEnumerable).
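One possible way to write Walk() is sketched below. The containing class name and the CountDistinctTextElements helper are made up for this example, and I typed the extension on TextElementEnumerator rather than on the plain IEnumerator so it can yield strings directly.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

// Hypothetical extension class, for illustration only
public static class TextElementEnumeratorExtensions
{
    // Turns the non-generic enumerator into a lazy sequence of text elements
    public static IEnumerable<string> Walk(this TextElementEnumerator enumerator)
    {
        while (enumerator.MoveNext())
            yield return enumerator.GetTextElement();
    }

    // The one-liner from above, wrapped in a method
    public static int CountDistinctTextElements(string text)
    {
        return StringInfo.GetTextElementEnumerator(text).Walk().Distinct().Count();
    }
}
```

With this in place, CountDistinctTextElements("\U00020011") returns 1 and CountDistinctTextElements("e\u0301e\u0301") returns 1, while CountDistinctTextElements("\u00E9e\u0301") still returns 2 because the two forms of é are different strings.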

Please note that once the text has been properly split, it can be counted with our first solution (the point is that the building block isn't a char but a sequence of char, returned here as a string for simplicity). Again, this code doesn't handle duplicates.
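One way to reduce (not eliminate) those duplicates is to normalize the text before splitting it. This is my own addition rather than something the code above does: the sketch assumes that Unicode normalization form C is acceptable for your data, and NormalizedCounter is a made-up name.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Text;

// Hypothetical helper class, for illustration only
public static class NormalizedCounter
{
    public static int CountDistinct(string text)
    {
        // Form C composes a base letter and its combining marks into the
        // precomposed code point whenever one exists
        string normalized = text.Normalize(NormalizationForm.FormC);

        var elements = new HashSet<string>();
        TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(normalized);
        while (enumerator.MoveNext())
            elements.Add(enumerator.GetTextElement());

        return elements.Count;
    }
}
```

NormalizedCounter.CountDistinct("\u00E9e\u0301") returns 1. Canonical normalization only covers canonically equivalent sequences, so it still knows nothing about the culture-specific rules discussed next.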

Culture

There is not much you can do about the issues listed in point 3. Each language has its own rules, and supporting them all can be a pain. There are more examples of culture issues in this longer, more specific post.

It's important to be aware of them (so you have to know a little about the languages you're targeting), and don't forget that Unicode and a few translated resx files won't make your application global.

If text processing is important in your application, you can solve many issues by using specialized DLLs for each locale you support (to count characters, to count words and so on), as word processors do. For example, the issues I listed can be solved simply by using dictionaries. What I usually do is avoid the standard .NET string functions (also because of some bugs): I create a Unicode class with static methods for everything I need (character counting, conversions, comparison) and many specialized derived classes, one for each supported language. At run time, those static methods use the current thread's culture name to pick the proper implementation from a dictionary and delegate the work to it. A skeleton may look something like this:

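using System.Collections.Generic;
using System.Globalization;
using System.Threading;

abstract class Unicode
{
    public static int CountCharacters(string text)
    {
        return GetConcreteClass().CountCharactersCore(text);
    }

    protected virtual int CountCharactersCore(string text)
    {
        // Default implementation, overridden in derived classes if needed.
        // TextElementEnumerator is not an IEnumerable, so walk it by hand.
        var elements = new HashSet<string>();
        TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(text);
        while (enumerator.MoveNext())
            elements.Add(enumerator.GetTextElement());

        return elements.Count;
    }

    // Maps culture names to their language-specific implementations
    private static readonly Dictionary<string, Unicode> _implementations =
        new Dictionary<string, Unicode>();

    private static Unicode GetConcreteClass()
    {
        string cultureName = Thread.CurrentThread.CurrentCulture.Name;

        // Check if concrete class has been loaded and put in dictionary
        ...

        return _implementations[cultureName];
    }
}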
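To make the idea more tangible, here is a small self-contained sketch of the same dispatch, with one specialized implementation for the Czech ch mentioned earlier. Everything in it is hypothetical: the class names, the hard-coded dictionary (a real application would probably load implementations on demand), the fallback to a default counter, and the very simplistic rule that c followed by h is always one character. The culture name is passed in explicitly; in real code it could come from Thread.CurrentThread.CurrentCulture.Name.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical names throughout; none of these types exist in .NET
public abstract class CharacterCounter
{
    private static readonly Dictionary<string, CharacterCounter> Implementations =
        new Dictionary<string, CharacterCounter>(StringComparer.OrdinalIgnoreCase)
        {
            { "cs-CZ", new CzechCharacterCounter() },
        };

    private static readonly CharacterCounter Default = new DefaultCharacterCounter();

    public static int CountCharacters(string text, string cultureName)
    {
        // Assumption: unknown cultures fall back to the default implementation
        CharacterCounter counter;
        if (!Implementations.TryGetValue(cultureName, out counter))
            counter = Default;

        return counter.CountCore(text);
    }

    protected abstract int CountCore(string text);

    // Splits the text into text elements (surrogate pairs and combining
    // sequences stay together)
    protected static List<string> TextElements(string text)
    {
        var result = new List<string>();
        TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(text);
        while (enumerator.MoveNext())
            result.Add(enumerator.GetTextElement());
        return result;
    }
}

sealed class DefaultCharacterCounter : CharacterCounter
{
    protected override int CountCore(string text)
    {
        return new HashSet<string>(TextElements(text)).Count;
    }
}

sealed class CzechCharacterCounter : CharacterCounter
{
    protected override int CountCore(string text)
    {
        var unique = new HashSet<string>();
        List<string> elements = TextElements(text);

        for (int i = 0; i < elements.Count; ++i)
        {
            string current = elements[i];

            // Simplistic rule: "c" immediately followed by "h" is one character
            bool isDigraph = i + 1 < elements.Count
                && string.Equals(current, "c", StringComparison.OrdinalIgnoreCase)
                && string.Equals(elements[i + 1], "h", StringComparison.OrdinalIgnoreCase);

            if (isDigraph)
            {
                current += elements[i + 1];
                ++i; // The "h" has been consumed, skip it
            }

            unique.Add(current);
        }

        return unique.Count;
    }
}
```

CharacterCounter.CountCharacters("chata", "cs-CZ") returns 3 (ch, a, t), while the same call with "en-US" returns 4 (c, h, a, t).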