How can I perform a culture-sensitive "starts-with" operation from the middle of a string?

I'll consider the problem of many<->one/many case mappings first, separately from handling different normalization forms.

For example:

x heiße y
  ^--- cursor

Matches heisse, but then moves the cursor 1 too far. And:

x heisse y
  ^--- cursor

Matches heiße, but then moves the cursor 1 short.

This will apply to any character that doesn't have a simple one-to-one mapping.

You would need to know the length of the substring that was actually matched, but Compare, IndexOf etc. throw that information away. It might be possible with regular expressions, but the regex implementation doesn't do full case folding, so it doesn't match ß to ss/SS in case-insensitive mode even though .Compare and .IndexOf do. And it would probably be costly to create new regexes for every candidate anyway.
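To illustrate that asymmetry, here's a quick sketch (my own example; the results are as observed on NLS-backed .NET Framework, and ICU-backed runtimes may behave differently):

using System;
using System.Globalization;
using System.Text.RegularExpressions;

class CaseFoldingDemo
{
    static void Main()
    {
        CompareInfo ci = CultureInfo.GetCultureInfo("de-DE").CompareInfo;

        // CompareInfo does full case folding, so ß matches ss here:
        Console.WriteLine(ci.IndexOf("x heisse y", "heiße",
            CompareOptions.IgnoreCase));              // 2 on NLS-backed .NET

        // Regex only does simple per-character case mapping, so it finds nothing:
        Console.WriteLine(Regex.IsMatch("x heisse y", Regex.Escape("heiße"),
            RegexOptions.IgnoreCase));                // False
    }
}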

The simplest solution to this is to store strings internally in case-folded form and do binary comparisons with case-folded candidates. Then you can move the cursor correctly with just .Length, since the cursor indexes the internal representation. You also get back most of the performance lost to CompareOptions.IgnoreCase, because you no longer need it.

Unfortunately, there is no built-in case fold function, and the poor man's case folding doesn't work either, because there is no full case mapping: the ToUpper method doesn't turn ß into SS.

For example, this works in Java (and even in JavaScript), given a string that is in Normal Form C:

//Poor man's case folding.
//There are some edge cases where this doesn't work
public static String toCaseFold( String input, Locale cultureInfo ) {
    return input.toUpperCase(cultureInfo).toLowerCase(cultureInfo);
}

It's fun to note that Java's ignore-case comparison doesn't do full case folding the way C#'s CompareOptions.IgnoreCase does. The two are exact opposites in this regard: Java does full case mapping but simple case folding, while C# does simple case mapping but full case folding.
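To make the C# side of that concrete, here is a quick check (my own example; behavior as observed on NLS-backed .NET, ICU-backed runtimes can differ):

var de = CultureInfo.GetCultureInfo("de-DE");

// Simple case mapping: ß is left alone, so the Java-style fold is useless here.
Console.WriteLine("heiße".ToUpper(de));   // "HEIßE", not "HEISSE"

// Full case folding in comparison: ß and SS are nevertheless treated as equal.
Console.WriteLine(de.CompareInfo.Compare(
    "heiße", "HEISSE", CompareOptions.IgnoreCase));   // 0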

So it's likely that you need a third-party library to case fold your strings before using them.
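As a sketch of what the internal representation could then look like, with CaseFold standing in for whatever third-party full-folding function you adopt (it is not a BCL API):

// Sketch only: caseFold is a placeholder for a third-party full case
// folding implementation; there is no equivalent in the BCL.
sealed class FoldedString
{
    public readonly string Original;
    public readonly string Folded;

    public FoldedString(string original, Func<string, string> caseFold)
    {
        Original = original;
        Folded = caseFold(original.Normalize(NormalizationForm.FormC));
    }
}

// Matching is ordinal on the folded form, so the cursor can be advanced
// by prefix.Folded.Length because it indexes the internal representation.
static bool StartsWithAt(FoldedString text, FoldedString prefix, int cursor)
{
    return string.CompareOrdinal(
        text.Folded, cursor, prefix.Folded, 0, prefix.Folded.Length) == 0;
}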


Before doing anything, you have to be sure that your strings are in Normal Form C. You can use this preliminary quick check, optimized for Latin script:

public static bool MaybeRequiresNormalizationToFormC(string input)
{
    if( input == null ) throw new ArgumentNullException("input");

    int len = input.Length;
    for (int i = 0; i < len; ++i)
    {
        // Combining marks and everything else that can require
        // normalization lie above U+02FF.
        if (input[i] > 0x2FF)
        {
            return true;
        }
    }

    return false;
}

This gives false positives but not false negatives. I don't expect it to slow down 460k parses/s at all when using Latin-script characters, even though it needs to be performed on every string. On a false positive you would use IsNormalized to get a true negative/positive, and only after that normalize if necessary.
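Put together, the preliminary step might look like this (a sketch; EnsureFormC is my name for it, not a framework method):

static string EnsureFormC(string input)
{
    // Fast path: pure Latin-script input is already in Form C.
    if (!MaybeRequiresNormalizationToFormC(input))
        return input;

    // Possible false positive: let the framework give the definitive answer.
    if (input.IsNormalized(NormalizationForm.FormC))
        return input;

    // True positive: actually normalize.
    return input.Normalize(NormalizationForm.FormC);
}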


So, in conclusion, the processing is to ensure Normal Form C first, then case fold. Do binary comparisons with the processed strings and move the cursor the same way you are moving it currently.

See if this meets the requirement:

public static partial class GlobalizationExtensions {
    public static int IsPrefix(
        this CompareInfo compareInfo,
        String source, String prefix, int startIndex, CompareOptions options
        ) {
        if(compareInfo.IndexOf(source, prefix, startIndex, options)!=startIndex)
            return ~0;
        else
            // source starts with prefix at startIndex, so the loop must exit:
            // grow length2 until that many chars of source, taken from
            // startIndex, compare equal to the whole prefix
            for(int length2=0, length1=prefix.Length; ; )
                if(0==compareInfo.Compare(
                        prefix, 0, length1, 
                        source, startIndex, ++length2, options))
                    return length2;
    }
}

compareInfo.Compare is only performed once we know that source starts with prefix; if it doesn't, IsPrefix returns -1; otherwise, it returns the number of characters consumed in source.

However, I see no better approach than incrementing length2 by 1 for a case like the following:

var culture=CultureInfo.GetCultureInfo("de-DE"); // assumed culture for this example
var candidate="ßssß\u00E9\u0302";
var text="abcd ssßss\u0065\u0301\u0302sss";

var count=
    culture.CompareInfo.IsPrefix(text, candidate, 5, CompareOptions.IgnoreCase);

Update:

I tried to improve performance a little, but it isn't proven whether there's a bug in the following code:

public static partial class GlobalizationExtensions {
    // Probes source at startIndex against the whole prefix, widening or
    // narrowing length2 in steps to converge on the matched length.
    public static int Compare(
        this CompareInfo compareInfo,
        String source, String prefix, int startIndex, ref int length2, 
        CompareOptions options) {
        int length1=prefix.Length, v2, v1;

        if(0==(v1=compareInfo.Compare(
            prefix, 0, length1, source, startIndex, length2, options))
            ) {
            return 0;
        }
        else {
            if(0==(v2=compareInfo.Compare(
                prefix, 0, length1, source, startIndex, 1+length2, options))
                ) {
                ++length2;
                return 0;
            }
            else {
                if(v1<0||v2<0) {
                    length2-=2;
                    return -1;
                }
                else {
                    length2+=2;
                    return 1;
                }
            }
        }
    }

    public static int IsPrefix(
        this CompareInfo compareInfo,
        String source, String prefix, int startIndex, CompareOptions options
        ) {
        if(compareInfo.IndexOf(source, prefix, startIndex, options)
                !=startIndex)
            return ~0;
        else
            for(int length2=
                    Math.Min(prefix.Length, source.Length-(1+startIndex)); ; )
                if(0==compareInfo.Compare(
                        source, prefix, startIndex, ref length2, options))
                    return length2;
    }
}

I tested with the particular case above, and the number of comparisons went down to about 3.


This is actually possible without normalization and without using IsPrefix.

We need to compare the same number of text elements as opposed to the same number of characters, but still return the number of matching characters.
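To illustrate the difference (my own example): "be\u0301d" is four chars but only three text elements:

var enumerator = StringInfo.GetTextElementEnumerator("be\u0301d");
while (enumerator.MoveNext())
    Console.WriteLine((string)enumerator.Current); // "b", then "e\u0301", then "d"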

I've created a copy of the MatchCaseInsensitive method from ValueCursor.cs in Noda Time and modified it slightly so that it can be used in a static context:

// Noda time code from MatchCaseInsensitive in ValueCursor.cs
static int IsMatch_Original(string source, int index, string match, CompareInfo compareInfo)
{
    unchecked
    {
        if (match.Length > source.Length - index)
        {
            return 0;
        }

        // TODO(V1.2): This will fail if the length in the input string is different to the length in the
        // match string for culture-specific reasons. It's not clear how to handle that...
        if (compareInfo.Compare(source, index, match.Length, match, 0, match.Length, CompareOptions.IgnoreCase) == 0)
        {
            return match.Length;
        }

        return 0;
    }
}

(Included just for reference; as you know, it is the code that won't compare properly.)

The following variant of that method uses StringInfo.GetNextTextElement which is provided by the framework. The idea is to compare text element by text element to find a match and if found return the actual number of matching characters in the source string:

// Using StringInfo.GetNextTextElement to match by text elements instead of characters
static int IsMatch_New(string source, int index, string match, CompareInfo compareInfo)
{
    int sourceIndex = index;
    int matchIndex = 0;

    // Loop until we reach the end of source or match
    while (sourceIndex < source.Length && matchIndex < match.Length)
    {
        // Get text elements at the current positions of source and match
        // Normally that will be just one character but may be more in case of Unicode combining characters
        string sourceElem = StringInfo.GetNextTextElement(source, sourceIndex);
        string matchElem = StringInfo.GetNextTextElement(match, matchIndex);

        // Compare the current elements.
        if (compareInfo.Compare(sourceElem, matchElem, CompareOptions.IgnoreCase) != 0)
        {
            return 0; // No match
        }

        // Advance in source and match (by number of characters)
        sourceIndex += sourceElem.Length;
        matchIndex += matchElem.Length;
    }

    // Check if we reached end of source and not end of match
    if (matchIndex != match.Length)
    {
        return 0; // No match
    }

    // Found match. Return number of matching characters from source.
    return sourceIndex - index;
}

That method works just fine, at least according to my test cases (which basically just test a couple of variants of the strings you've provided: "b\u00e9d" and "be\u0301d").

However, the GetNextTextElement method creates a substring for each text element, so this implementation requires a lot of substring comparisons, which will have an impact on performance.

So, I created another variant that does not use GetNextTextElement but instead skips over Unicode combining characters to find the actual match length in characters:

// This should be faster
static int IsMatch_Faster(string source, int index, string match, CompareInfo compareInfo)
{
    int sourceLength = source.Length;
    int matchLength = match.Length;
    int sourceIndex = index;
    int matchIndex = 0;

    // Loop until we reach the end of source or match
    while (sourceIndex < sourceLength && matchIndex < matchLength)
    {
        sourceIndex += GetTextElemLen(source, sourceIndex, sourceLength);
        matchIndex += GetTextElemLen(match, matchIndex, matchLength);
    }

    // Check if we reached end of source and not end of match
    if (matchIndex != matchLength)
    {
        return 0; // No match
    }

    // Check if we've found a match
    if (compareInfo.Compare(source, index, sourceIndex - index, match, 0, matchIndex, CompareOptions.IgnoreCase) != 0)
    {
        return 0; // No match
    }

    // Found match. Return number of matching characters from source.
    return sourceIndex - index;
}

That method uses the following two helpers:

// Returns the number of chars to advance from index: zero or more
// combining characters, plus the first non-combining character after
// them (or until the end of the string is reached).
static int GetTextElemLen(string str, int index, int strLen)
{
    bool stop = false;
    int elemLen;

    for (elemLen = 0; index < strLen && !stop; ++elemLen, ++index)
    {
        stop = !IsCombiningCharacter(str, index);
    }

    return elemLen;
}

static bool IsCombiningCharacter(string str, int index)
{
    switch (CharUnicodeInfo.GetUnicodeCategory(str, index))
    {
        case UnicodeCategory.NonSpacingMark:
        case UnicodeCategory.SpacingCombiningMark:
        case UnicodeCategory.EnclosingMark:
            return true;

        default:
            return false;
    }
}

I haven't done any benchmarking, so I don't really know whether the faster method actually is faster. Nor have I done any extended testing.

But this should answer your question on how to perform culture-sensitive substring matching for strings that may include Unicode combining characters.

These are the test cases I've used:

static Tuple<string, int, string, int>[] tests = new []
{
    Tuple.Create("x b\u00e9d y", 2, "be\u0301d", 3),
    Tuple.Create("x be\u0301d y", 2, "b\u00e9d", 4),

    Tuple.Create("x b\u00e9d", 2, "be\u0301d", 3),
    Tuple.Create("x be\u0301d", 2, "b\u00e9d", 4),

    Tuple.Create("b\u00e9d y", 0, "be\u0301d", 3),
    Tuple.Create("be\u0301d y", 0, "b\u00e9d", 4),

    Tuple.Create("b\u00e9d", 0, "be\u0301d", 3),
    Tuple.Create("be\u0301d", 0, "b\u00e9d", 4),

    Tuple.Create("b\u00e9", 0, "be\u0301d", 0),
    Tuple.Create("be\u0301", 0, "b\u00e9d", 0),
};

The tuple values are:

  1. The source string (haystack).
  2. The starting position in source.
  3. The match string (needle).
  4. The expected match length.
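For reference, a minimal runner for those tuples might look like this (my sketch; the harness wasn't part of the original post, and the culture choice is assumed):

var compareInfo = CultureInfo.InvariantCulture.CompareInfo; // assumed culture

for (int i = 0; i < tests.Length; i++)
{
    var t = tests[i];
    Console.WriteLine("Test #{0}: Original={1}; New={2}; Faster={3}", i,
        IsMatch_Original(t.Item1, t.Item2, t.Item3, compareInfo) == t.Item4 ? "OK" : "BAD",
        IsMatch_New(t.Item1, t.Item2, t.Item3, compareInfo) == t.Item4 ? "OK" : "BAD",
        IsMatch_Faster(t.Item1, t.Item2, t.Item3, compareInfo) == t.Item4 ? "OK" : "BAD");
}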

Running those tests on the three methods yields this result:

Test #0: Original=BAD; New=OK; Faster=OK
Test #1: Original=BAD; New=OK; Faster=OK
Test #2: Original=BAD; New=OK; Faster=OK
Test #3: Original=BAD; New=OK; Faster=OK
Test #4: Original=BAD; New=OK; Faster=OK
Test #5: Original=BAD; New=OK; Faster=OK
Test #6: Original=BAD; New=OK; Faster=OK
Test #7: Original=BAD; New=OK; Faster=OK
Test #8: Original=OK; New=OK; Faster=OK
Test #9: Original=OK; New=OK; Faster=OK

The last two tests cover the case where the source string is shorter than the match string. In that case the original (Noda Time) method succeeds as well.