How do I count the letters in Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch?

Like many problems to do with strings, this can be done in a simple way with a regex.

>>> word = 'Llanfairpwllgwyn|gyllgogerychwyrndrobwllllantysiliogogogoch'
>>> import re
>>> pattern = re.compile(r'ch|dd|ff|ng|ll|ph|rh|th|[^\W\d_]', flags=re.IGNORECASE)
>>> len(pattern.findall(word))
51

The character class [^\W\d_] (from here) matches word-characters that are not digits or underscores, i.e. letters, including those with diacritics.


You can get the length by replacing all the double letters with a . (or any other character, ? would do just fine), and measuring the length of the resulting string (subtracting the amount of |):

def get_length(name):
    name = name.lower()
    doubles = ['ch', 'dd', 'ff', 'ng', 'll', 'ph', 'rh', 'th']
    for double in doubles:
        name = name.replace(double, '.')
    return len(name) - name.count('|')

name = 'Llanfairpwllgwyn|gyllgogerychwyrndrobwllllantysiliogogogoch'
print(get_length(name))
>>> 51

  1. Step through the string letter by letter
  2. If you are at index n and and s[n:n+2] is a digraph, add or increment a dictionary with the digraph as the key, and increment the index by 1 as well so you don't start on the second digraph character. If it's not a digraph, just add or increment the letter to the dict and go to the next letter.
  3. If you see the | character, don't count it, just skip.
  4. And don't forget to lowercase.

When you've seen all the letters, the loop ends and you add all the counts in the dict.

Here's my code, it works on your three examples:

from collections import defaultdict

digraphs=['ch','dd','ff','ng','ll','ph','rh','th']
breakchars=['|']


def welshcount(word):
    word = word.lower()
    index = 0
    counts = defaultdict(int)  # keys start at 0 if not already present
    while index < len(word):
        if word[index:index+2] in digraphs:
            counts[word[index:index+2]] += 1
            index += 1
        elif word[index] in breakchars:
            pass  # in case you want to do something here later
        else:  # plain old letter
            counts[word[index]] += 1

        index += 1

    return sum(counts.values())

word1='llong'
#ANSWER NEEDS TO BE 3 (ll o ng)

word2='llon|gyfarch'
#ANSWER NEEDS TO BE 9 (ll o n g y f a r ch)

word3='Llanfairpwllgwyn|gyllgogerychwyrndrobwllllantysiliogogogoch'
#ANSWER NEEDS TO BE 51 (Ll a n f a i r p w ll g w y n g y ll g o g e r y ch w y r n d r o b w ll ll a n t y s i l i o g o g o g o ch)

print(welshcount(word1))
print(welshcount(word2))
print(welshcount(word3))

You could use a Combining Grapheme Joiner (+u034F) character to join the letters and then take your character count and take away the number of these joiners * 2.

http://www.comisiynyddygymraeg.cymru/English/Part%203/10%20Locales%20alphabets%20and%20character%20sets/10.2%20Alphabets/Pages/10-2-4-Combining-Grapheme-Joiner.aspx

The Welsh Language Commissioner also addresses the issue here: http://www.comisiynyddygymraeg.cymru/English/Part%203/10%20Locales%20alphabets%20and%20character%20sets/10.2%20Alphabets/Pages/10-2-1-Character-vs--letter-counts.aspx