I am trying to remove all emoji characters from a string (like a sanitizer), but I cannot find a complete set of emoji values.

What is the complete set of UTF-16 values for emoji characters?


Solution 1:

The Unicode standard's Unicode® Technical Report #51 includes a list of emoji (emoji-data.txt):

...
21A9 ;  text ;  L1 ;    none ;  j   # V1.1 (↩) LEFTWARDS ARROW WITH HOOK
21AA ;  text ;  L1 ;    none ;  j   # V1.1 (↪) RIGHTWARDS ARROW WITH HOOK
231A ;  emoji ; L1 ;    none ;  j   # V1.1 (⌚) WATCH
231B ;  emoji ; L1 ;    none ;  j   # V1.1 (⌛) HOURGLASS
...

I believe you would want to remove each character listed in this document that has a Default_Emoji_Style of emoji.

Other than consulting a definition list like this one, there is no way to identify the emoji characters in Unicode. As the reference to the FAQ says, they are spread across different blocks.
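As a rough illustration of that approach, here is a minimal Python sketch that parses lines in the layout quoted above and removes every character whose second field is emoji. The names SAMPLE, parse_emoji_style and strip_chars are made up for this example, and the field positions are assumed from the excerpt only; other releases of emoji-data.txt may use a different layout, so check the file you actually download.

```python
# Four lines copied from the excerpt above; a real run would read the whole file.
SAMPLE = """\
21A9 ;  text ;  L1 ;    none ;  j   # V1.1 (↩) LEFTWARDS ARROW WITH HOOK
21AA ;  text ;  L1 ;    none ;  j   # V1.1 (↪) RIGHTWARDS ARROW WITH HOOK
231A ;  emoji ; L1 ;    none ;  j   # V1.1 (⌚) WATCH
231B ;  emoji ; L1 ;    none ;  j   # V1.1 (⌛) HOURGLASS
"""


def parse_emoji_style(data):
    """Collect the characters whose second field (assumed to be the style) is 'emoji'."""
    emoji_chars = set()
    for line in data.splitlines():
        # Drop the trailing comment and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) < 2 or fields[1] != "emoji":
            continue
        # Only single code points are handled; multi-code-point entries are skipped.
        code_points = fields[0].split()
        if len(code_points) == 1:
            emoji_chars.add(chr(int(code_points[0], 16)))
    return emoji_chars


def strip_chars(text, chars):
    """Return text without any character contained in chars."""
    return "".join(ch for ch in text if ch not in chars)


emoji_chars = parse_emoji_style(SAMPLE)
cleaned = strip_chars("late \u231a again \u21a9", emoji_chars)
```

With only the four sample lines loaded, the watch character is removed while the arrow, whose style is text, is kept.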

Solution 2:

I have composed a list based on Joe's and Doctor.Who's answers:

U+00A9, U+00AE, U+203C, U+2049, U+20E3, U+2122, U+2139, U+2194-2199, U+21A9-21AA, U+231A, U+231B, U+2328, U+23CF, U+23E9-23F3, U+23F8-23FA, U+24C2, U+25AA, U+25AB, U+25B6, U+25C0, U+25FB-25FE, U+2600-27EF, U+2934, U+2935, U+2B00-2BFF, U+3030, U+303D, U+3297, U+3299, U+1F000-1F02F, U+1F0A0-1F0FF, U+1F100-1F64F, U+1F680-1F6FF, U+1F910-1F96B, U+1F980-1F9E0
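One possible way to use this list is to turn it into a regular expression character class. The sketch below is in Python; RANGES is the list above copied verbatim, while build_pattern and remove_emoji are names invented for this example. It is only as complete as the list itself, so treat it as a starting point rather than a definitive sanitizer.

```python
import re

# The list above, copied as-is.
RANGES = (
    "U+00A9, U+00AE, U+203C, U+2049, U+20E3, U+2122, U+2139, U+2194-2199, "
    "U+21A9-21AA, U+231A, U+231B, U+2328, U+23CF, U+23E9-23F3, U+23F8-23FA, "
    "U+24C2, U+25AA, U+25AB, U+25B6, U+25C0, U+25FB-25FE, U+2600-27EF, "
    "U+2934, U+2935, U+2B00-2BFF, U+3030, U+303D, U+3297, U+3299, "
    "U+1F000-1F02F, U+1F0A0-1F0FF, U+1F100-1F64F, U+1F680-1F6FF, "
    "U+1F910-1F96B, U+1F980-1F9E0"
)


def build_pattern(spec):
    """Compile a character class from a 'U+XXXX' / 'U+XXXX-YYYY' list."""
    parts = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        bounds = item[2:].split("-")  # drop the leading "U+"
        start = int(bounds[0], 16)
        end = int(bounds[-1], 16)
        # \UXXXXXXXX escapes are understood by the re module itself.
        if start == end:
            parts.append(f"\\U{start:08X}")
        else:
            parts.append(f"\\U{start:08X}-\\U{end:08X}")
    return re.compile("[" + "".join(parts) + "]")


EMOJI_PATTERN = build_pattern(RANGES)


def remove_emoji(text):
    """Delete every character that falls in one of the listed ranges."""
    return EMOJI_PATTERN.sub("", text)
```

Note that the list does not include U+200D (zero width joiner) or U+FE0F (variation selector), so those may be left behind after a joined sequence is stripped.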

Solution 3:

unicode-range: U+0080-02AF, U+0300-03FF, U+0600-06FF, U+0C00-0C7F, U+1DC0-1DFF, U+1E00-1EFF, U+2000-209F, U+20D0-214F, U+2190-23FF, U+2460-25FF, U+2600-27EF, U+2900-29FF, U+2B00-2BFF, U+2C60-2C7F, U+2E00-2E7F, U+3000-303F, U+A490-A4CF, U+E000-F8FF, U+FE00-FE0F, U+FE30-FE4F, U+1F000-1F02F, U+1F0A0-1F0FF, U+1F100-1F64F, U+1F680-1F6FF, U+1F910-1F96B, U+1F980-1F9E0;

Solution 4:

Emoji ranges are updated for every new version of Unicode Emoji. The ranges below are correct for version 14.0.

Here is my gist for an advanced version of this code.

def is_contains_emoji(p_string_in_unicode):
    """
    Return True if any character of the text lies in one of the emoji ranges.

    Instead of looking up every character of a text in an emoji dictionary, this
    function only checks whether a character falls in a Unicode emoji range.
    That is much faster than a dictionary lookup for a large text.
    However, it only tells whether the text contains an emoji; it does not
    return the emojis found.
    """
    # (low, high) code point bounds, both inclusive
    emoji_ranges = (
        (ord(u'\U0001F300'), ord(u'\U0001FAF6')),  # 127744..129782
        (126980, 127569),
        (169, 174),
        (8205, 12953),
    )
    if not p_string_in_unicode:
        return False
    for a_char in p_string_in_unicode:
        char_code = ord(a_char)
        if any(low <= char_code <= high for low, high in emoji_ranges):
            return True
    return False
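
Since the question asks about removing emoji rather than detecting them, the same range check can be turned into a filter. This is only a sketch: EMOJI_RANGES repeats the four ranges from the function above, and remove_emoji is a name made up for this example. The ranges are coarse, so the filter also drops non-emoji characters; for instance 8205 to 12953 covers plain arrows such as U+2192 (8594).

```python
# The four inclusive ranges used by the function above.
EMOJI_RANGES = (
    (0x1F300, 0x1FAF6),  # 127744..129782
    (126980, 127569),
    (169, 174),
    (8205, 12953),
)


def remove_emoji(text):
    """Return text without the characters that fall in EMOJI_RANGES."""
    if not text:
        return ""
    return "".join(
        ch
        for ch in text
        if not any(low <= ord(ch) <= high for low, high in EMOJI_RANGES)
    )
```

If keeping such symbols matters, narrower ranges or an explicit list taken from the Unicode emoji data would be a safer basis.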