How to find and count emoticons in a string using Python?
Solution 1:
First, there is no need to encode here at all. You've got a Unicode string, and the re engine can handle Unicode, so just use it.
A character class can include a range of characters by specifying the first and last with a hyphen in between, and you can specify Unicode characters that you don't know how to type with \U escape sequences (in a non-raw string literal, so Python itself expands them). So:
import re
s=u"Smiley emoticon rocks!\U0001f600 I like you.\U0001f601"
count = len(re.findall(ru'[\U0001f600-\U0001f650]', s))
Or, if the string is big enough that building up the whole findall list seems wasteful:
emoticons = re.finditer(u'[\U0001f600-\U0001f650]', s)
count = sum(1 for _ in emoticons)
Counting words, you can do separately:
wordcount = len(s.split())
If you want to do it all at once, you can use an alternation group:
word_and_emoticon_count = len(re.findall(u'\w+|[\U0001f600-\U0001f650]', s))
As @strangefeatures points out, Python versions before 3.3 allowed "narrow Unicode" builds, and, for example, most CPython Windows builds were narrow. In narrow builds, characters can only be in the range U+0000 to U+FFFF. There's no way to search for astral characters like these, but that's OK, because they don't exist to be searched for; you can just assume they don't exist if you get an "invalid range" error compiling the regexp.
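For example, here's a minimal sketch of that fallback, assuming you'd rather get zero matches than a compile error on a narrow build:

import re

try:
    # On a wide build (or Python 3.3+), the astral range compiles fine.
    emoticon_re = re.compile(u'[\U0001f600-\U0001f650]')
except re.error:
    # On a narrow build, the literal turns into surrogate pairs and the
    # class becomes an invalid range; fall back to a never-matching pattern.
    emoticon_re = re.compile(u'(?!)')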
Except, of course, that there's a good chance that wherever you're getting your actual strings from, they're UTF-16-BE or UTF-16-LE, so the characters do exist; they're just encoded into surrogate pairs. And you want to match those surrogate pairs, right? So you need to translate your search into a surrogate-pair search. That is, convert your high and low code points into surrogate pair code units, then (in Python terms) search for:
(lead == low_lead and lead != high_lead and low_trail <= trail <= 0xDFFF or
 lead == high_lead and lead != low_lead and 0xDC00 <= trail <= high_trail or
 low_lead < lead < high_lead and 0xDC00 <= trail <= 0xDFFF)
You can leave off the second condition in the last case if you're not worried about accepting bogus UTF-16.
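As a concrete sketch of that conversion (to_surrogates is my own helper name, not a library function):

def to_surrogates(cp):
    # Split an astral code point (above U+FFFF) into its UTF-16
    # lead (high) and trail (low) surrogate code units.
    cp -= 0x10000
    return 0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)

# For example, U+1F600 encodes as the pair (0xD83D, 0xDE00):
assert to_surrogates(0x1F600) == (0xD83D, 0xDE00)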
If it's not obvious how that translates into a regexp, here's an example for the range [\U0001e050-\U0001fbbf] in UTF-16-BE:
(\ud838[\udc50-\udfff])|([\ud839-\ud83d].)|(\ud83e[\udc00-\udfbf])
Of course, if your range is small enough that low_lead == high_lead, this gets simpler. For example, the original question's range can be searched with:
\ud83d[\ude00-\ude50]
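Putting those pieces together, here's a minimal sketch of a pattern builder (surrogate_range_pattern is another hypothetical helper of mine, reusing to_surrogates from the sketch above and the strict [\udc00-\udfff] trail check):

def surrogate_range_pattern(low, high):
    # Build a UTF-16-BE surrogate-pair regexp for code points in [low, high].
    # Assumes both endpoints are astral (above U+FFFF) and low <= high.
    low_lead, low_trail = to_surrogates(low)
    high_lead, high_trail = to_surrogates(high)
    if low_lead == high_lead:
        # One lead unit: a single trail-range class suffices.
        return u'%s[%s-%s]' % (unichr(low_lead), unichr(low_trail), unichr(high_trail))
    parts = [u'(%s[%s-\udfff])' % (unichr(low_lead), unichr(low_trail)),
             u'(%s[\udc00-%s])' % (unichr(high_lead), unichr(high_trail))]
    if high_lead - low_lead > 1:
        # Leads strictly between the endpoints take any valid trail.
        parts.insert(1, u'([%s-%s][\udc00-\udfff])' %
                         (unichr(low_lead + 1), unichr(high_lead - 1)))
    return u'|'.join(parts)

With that, surrogate_range_pattern(0x1e050, 0x1fbbf) produces the strict form of the three-group pattern above, and surrogate_range_pattern(0x1f600, 0x1f650) gives \ud83d[\ude00-\ude50]. (On Python 3, substitute chr for unichr.)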
One last trick, if you don't actually know whether you're going to get UTF-16-LE or UTF-16-BE (and the BOM is far away from the data you're searching): Because no surrogate lead or trail code unit is valid as a standalone character or as the other end of a pair, you can just search in both directions:
(\ud838[\udc50-\udfff])|([\ud839-\ud83d][\udc00-\udfff])|(\ud83e[\udc00-\udfbf])|
([\udc50-\udfff]\ud838)|([\udc00-\udfff][\ud839-\ud83d])|([\udc00-\udfbf]\ud83e)
Solution 2:
My solution includes the emoji and regex modules. The regex module supports recognizing grapheme clusters (sequences of Unicode code points rendered as a single character), so we can count an emoji like 👨‍👩‍👦‍👦 just once, even though it consists of 4 emojis.
import emoji
import regex

def split_count(text):
    emoji_counter = 0
    # \X matches a single grapheme cluster, so a ZWJ sequence like
    # 👨‍👩‍👦‍👦 comes back as one item.
    data = regex.findall(r'\X', text)
    for word in data:
        if any(char in emoji.UNICODE_EMOJI for char in word):
            emoji_counter += 1
            # Remove the emoji from the text so it isn't counted as a word.
            text = text.replace(word, '')
    words_counter = len(text.split())
    return emoji_counter, words_counter
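One caveat: newer releases of the emoji package (2.0 and later) removed UNICODE_EMOJI; on those versions the membership test would presumably use the replacement EMOJI_DATA dict instead:

# Assumption about the emoji 2.x API: EMOJI_DATA is the lookup
# dict (keyed by emoji string) that replaced UNICODE_EMOJI.
if any(char in emoji.EMOJI_DATA for char in word):
    emoji_counter += 1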
Testing:
line = "hello π©πΎβπ emoji hello π¨βπ©βπ¦βπ¦ how are π you todayπ
π½π
π½"
counter = split_count(line)
print("Number of emojis - {}, number of words - {}".format(counter[0], counter[1]))
Output:
Number of emojis - 5, number of words - 7