How does Stack Overflow generate its SEO-friendly URLs?

Here's how we do it. Note that there are probably more edge conditions than you realize at first glance.

This is the second version, unrolled for 5x more performance (and yes, I benchmarked it). I figured I'd optimize it because this function can be called hundreds of times per page.

/// <summary>
/// Produces optional, URL-friendly version of a title, "like-this-one". 
/// hand-tuned for speed, reflects performance refactoring contributed
/// by John Gietzen (user otac0n) 
/// </summary>
public static string URLFriendly(string title)
{
    if (title == null) return "";

    const int maxlen = 80;
    int len = title.Length;
    bool prevdash = false;
    var sb = new StringBuilder(len);
    char c;

    for (int i = 0; i < len; i++)
    {
        c = title[i];
        if ((c >= 'a' && c <= 'z') || (c >= '0' && c <= '9'))
        {
            sb.Append(c);
            prevdash = false;
        }
        else if (c >= 'A' && c <= 'Z')
        {
            // tricky way to convert to lowercase
            sb.Append((char)(c | 32));
            prevdash = false;
        }
        else if (c == ' ' || c == ',' || c == '.' || c == '/' || 
            c == '\\' || c == '-' || c == '_' || c == '=')
        {
            if (!prevdash && sb.Length > 0)
            {
                sb.Append('-');
                prevdash = true;
            }
        }
        else if ((int)c >= 128)
        {
            int prevlen = sb.Length;
            sb.Append(RemapInternationalCharToAscii(c));
            if (prevlen != sb.Length) prevdash = false;
        }
        if (i == maxlen) break;
    }

    if (prevdash)
        return sb.ToString().Substring(0, sb.Length - 1);
    else
        return sb.ToString();
}

To see the previous version of the code this replaced (but is functionally equivalent to, and 5x faster), view revision history of this post (click the date link).

Also, the RemapInternationalCharToAscii method source code can be found here.

Here is my version of Jeff's code. I've made the following changes:

The hyphens were appended in such a way that one could be added, and then need removing as it was the last character in the string. That is, we never want “my-slug-”. This means an extra string allocation to remove it on this edge case. I’ve worked around this by delay-hyphening. If you compare my code to Jeff’s the logic for this is easy to follow.
His approach is purely lookup based and missed a lot of characters I found in examples while researching on Stack Overflow. To counter this, I first peform a normalisation pass (AKA collation mentioned in Meta Stack Overflow question Non US-ASCII characters dropped from full (profile) URL), and then ignore any characters outside the acceptable ranges. This works most of the time...
... For when it doesn’t I’ve also had to add a lookup table. As mentioned above, some characters don’t map to a low ASCII value when normalised. Rather than drop these I’ve got a manual list of exceptions that is doubtless full of holes, but it is better than nothing. The normalisation code was inspired by Jon Hanna’s great post in Stack Overflow question How can I remove accents on a string?.

The case conversion is now also optional.

public static class Slug
{
    public static string Create(bool toLower, params string[] values)
    {
        return Create(toLower, String.Join("-", values));
    }

    /// <summary>
    /// Creates a slug.
    /// References:
    /// http://www.unicode.org/reports/tr15/tr15-34.html
    /// https://meta.stackexchange.com/questions/7435/non-us-ascii-characters-dropped-from-full-profile-url/7696#7696
    /// https://stackoverflow.com/questions/25259/how-do-you-include-a-webpage-title-as-part-of-a-webpage-url/25486#25486
    /// https://stackoverflow.com/questions/3769457/how-can-i-remove-accents-on-a-string
    /// </summary>
    /// <param name="toLower"></param>
    /// <param name="normalised"></param>
    /// <returns></returns>
    public static string Create(bool toLower, string value)
    {
        if (value == null)
            return "";

        var normalised = value.Normalize(NormalizationForm.FormKD);

        const int maxlen = 80;
        int len = normalised.Length;
        bool prevDash = false;
        var sb = new StringBuilder(len);
        char c;

        for (int i = 0; i < len; i++)
        {
            c = normalised[i];
            if ((c >= 'a' && c <= 'z') || (c >= '0' && c <= '9'))
            {
                if (prevDash)
                {
                    sb.Append('-');
                    prevDash = false;
                }
                sb.Append(c);
            }
            else if (c >= 'A' && c <= 'Z')
            {
                if (prevDash)
                {
                    sb.Append('-');
                    prevDash = false;
                }
                // Tricky way to convert to lowercase
                if (toLower)
                    sb.Append((char)(c | 32));
                else
                    sb.Append(c);
            }
            else if (c == ' ' || c == ',' || c == '.' || c == '/' || c == '\\' || c == '-' || c == '_' || c == '=')
            {
                if (!prevDash && sb.Length > 0)
                {
                    prevDash = true;
                }
            }
            else
            {
                string swap = ConvertEdgeCases(c, toLower);

                if (swap != null)
                {
                    if (prevDash)
                    {
                        sb.Append('-');
                        prevDash = false;
                    }
                    sb.Append(swap);
                }
            }

            if (sb.Length == maxlen)
                break;
        }
        return sb.ToString();
    }

    static string ConvertEdgeCases(char c, bool toLower)
    {
        string swap = null;
        switch (c)
        {
            case 'ı':
                swap = "i";
                break;
            case 'ł':
                swap = "l";
                break;
            case 'Ł':
                swap = toLower ? "l" : "L";
                break;
            case 'đ':
                swap = "d";
                break;
            case 'ß':
                swap = "ss";
                break;
            case 'ø':
                swap = "o";
                break;
            case 'Þ':
                swap = "th";
                break;
        }
        return swap;
    }
}

For more details, the unit tests, and an explanation of why Facebook's URL scheme is a little smarter than Stack Overflows, I've got an expanded version of this on my blog.

How does Stack Overflow generate its SEO-friendly URLs?

Related

Recent Posts