How to count the correct length of a string with emojis in JavaScript?

I've a little problem.

I'm using Node.js as the backend. A user has a "biography" field where they can write something about themselves.

Suppose this field has a maxlength of 220, and suppose this is the input:

👶🏻👦🏻👧🏻👨🏻👩🏻👱🏻‍♀️👱🏻👴🏻👵🏻👲🏻👳🏻‍♀️👳🏻👮🏻‍♀️👮🏻👷🏻‍♀️👷🏻💂🏻‍♀️💂🏻🕵🏻‍♀️👩🏻‍⚕️👨🏻‍⚕️👩🏻‍🌾👨🏻‍🌾👨🏻‍🌾👨🏻‍🌾👨🏻‍🌾👨🏻‍🌾👨🏻‍🌾👨🏻‍🌾👨🏻‍🌾👨🏻‍🌾👨🏻‍🌾👨🏻‍🌾👨🏻‍🌾👨🏻‍🌾👨🏻‍🌾👨🏻‍🌾 

As you can see, there aren't 220 emojis (there are 37), but if I run this on my Node.js server

console.log(bio.length)

where bio is the input text, I get 221. How can I "parse" the input string to get the correct length? Is this a Unicode problem?

SOLVED

I used this library: https://github.com/orling/grapheme-splitter

I tried this:

var Grapheme = require('grapheme-splitter');
var splitter = new Grapheme();
console.log(splitter.splitGraphemes(bio).length);

and the length is 37. It works very well!


Solution 1:

  1. str.length gives the count of UTF-16 code units.

  2. A Unicode-aware way to get the string length in code points is [...str].length, because the string iterator yields code points rather than UTF-16 code units. Note that one code point is not always one user-perceived character.

  3. If we need the length in graphemes (grapheme clusters), we have these native ways:

    a. Unicode property escapes in RegExp. See for example: Unicode-aware version of \w or Matching emoji.

    b. Intl.Segmenter: at the time of writing, expected to be standardized soon, probably in ES2021. It could be tested behind a flag in recent V8 versions (the implementation was synced with the latest spec in V8 86) and shipped unflagged in V8 87.
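To make the three notions of length concrete, here is a minimal sketch that measures one emoji from the question (a man farmer with a light skin tone) in all three ways. The helper name countGraphemes and the 'en' locale are assumptions for illustration, and the code presumes a runtime where Intl.Segmenter is available.

```javascript
// Man farmer, light skin tone, written with escapes so the invisible
// zero-width joiner (U+200D) cannot get lost in copy and paste.
const str = '\u{1F468}\u{1F3FB}\u{200D}\u{1F33E}';

// 1. UTF-16 code units: what .length reports.
const codeUnits = str.length;

// 2. Code points: string iteration walks code points, not code units.
const codePoints = [...str].length;

// 3. Grapheme clusters (user-perceived characters) via Intl.Segmenter.
// countGraphemes is a hypothetical helper name, not a built-in.
function countGraphemes(text, locale = 'en') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'grapheme' });
  let count = 0;
  for (const _segment of segmenter.segment(text)) count += 1;
  return count;
}

console.log(codeUnits, codePoints, countGraphemes(str)); // 7 4 1
```

Only the third number matches what a user would call "one emoji", which is why a maxlength check aimed at users usually wants the grapheme count.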

See also:

  • The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

  • What every JavaScript developer should know about Unicode

  • JavaScript has a Unicode problem

  • Unicode-aware regular expressions in ES2015

  • ES6 Strings (and Unicode, ❤) in Depth

  • JavaScript for impatient programmers. Unicode – a brief introduction

Solution 2:

TL;DR there are solutions, but they don't work in every case. Unicode can feel like a dark art.

There seem to be limitations in the various solutions I have seen, and the issue goes beyond emojis to other characters in the Unicode range. Consider that é can be stored either as the single code point é (U+00E9) or as e followed by a combining acute accent (U+0301) when combining characters are used. This can even lead to two strings that look the same not being equal. Also note that in certain cases a single emoji can take 11 UTF-16 code units when stored and, as a result, 22 bytes, assuming UTF-16.
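A short sketch of both points, using escape sequences so the exact code points are visible. The variable names are only illustrative. String.prototype.normalize is the standard way to bring the two spellings of é to a common form before comparing.

```javascript
const composed = '\u00E9';    // é as a single code point
const decomposed = 'e\u0301'; // e followed by U+0301 COMBINING ACUTE ACCENT

console.log(composed === decomposed);                  // false, although both render as é
console.log(composed.length, decomposed.length);       // 1 2
console.log(composed === decomposed.normalize('NFC')); // true after normalization

// Family emoji: four people joined by three zero-width joiners (U+200D).
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}';
console.log(family.length);     // 11 UTF-16 code units
console.log(family.length * 2); // 22 bytes when stored as UTF-16
```

Normalizing input (for example to NFC) before storing or comparing it removes the first problem, but it does not shorten ZWJ emoji sequences like the family above.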

The way this is handled and how characters are combined, or displayed, can even vary between browsers and operating systems. So, while you may think you cracked it, there is a risk another environment breaks this. Be sure to test where it matters.

Then there is the front-end vs back-end problem: you have solved the character count so it works well for human users, and now a single emoji blows right past the allocated field size in the database. This is less of an issue with databases such as MongoDB, but it can be one with SQL databases where field allocation was conservative. How you solve your problem therefore depends on where the hardest limitation lies.
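One way to keep both sides honest is to check the user-facing count and the storage size together. The sketch below is only an illustration: the limits MAX_GRAPHEMES and MAX_BYTES, the helper name checkBio, and the assumption that the database stores UTF-8 are all hypothetical and need to be adapted to the real schema.

```javascript
// Hypothetical limits, chosen only for illustration.
const MAX_GRAPHEMES = 220; // what the user sees as "characters"
const MAX_BYTES = 1024;    // what the storage column is assumed to hold (UTF-8)

const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
const encoder = new TextEncoder(); // TextEncoder always produces UTF-8

// checkBio is a hypothetical helper: it reports both measures and whether
// the input fits within both limits.
function checkBio(bio) {
  const graphemes = [...segmenter.segment(bio)].length;
  const bytes = encoder.encode(bio).length;
  return {
    graphemes,
    bytes,
    ok: graphemes <= MAX_GRAPHEMES && bytes <= MAX_BYTES,
  };
}

// 100 farmer emojis: well under 220 visible characters, yet 15 bytes each.
const farmer = '\u{1F468}\u{1F3FB}\u200D\u{1F33E}';
console.log(checkBio(farmer.repeat(100)));
// { graphemes: 100, bytes: 1500, ok: false }
```

Here the input passes the visible-length limit but exceeds the assumed byte budget, which is exactly the mismatch described above.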

Note that a basic solution does involve converting the string to an array and taking its length, accepting the limitations:

Array.from(str)

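This counts code points, so a single astral-plane character is counted correctly, but it falls apart when characters are combined (combining marks, skin-tone modifiers, flags, and zero-width-joiner emoji sequences).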
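A quick check of what Array.from(str) actually counts. It yields one element per code point, so a lone astral code point counts once, while anything built from several code points counts more than once. The helper name codePointLength is only illustrative.

```javascript
// codePointLength is a hypothetical helper name for the Array.from approach.
const codePointLength = (s) => Array.from(s).length;

// One astral code point (U+1F4A9): counted once, although .length is 2.
console.log(codePointLength('\u{1F4A9}')); // 1

// Base letter plus combining acute accent: looks like one character.
console.log(codePointLength('e\u0301')); // 2

// Flag built from two regional indicator symbols.
console.log(codePointLength('\u{1F1EE}\u{1F1F9}')); // 2

// Family emoji: four people joined by three zero-width joiners.
console.log(codePointLength('\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}')); // 7
```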

A few high level approaches, that take into account limitations:

  • use approaches that solve the front-end issue, as best as possible, and then ensure storage issues are resolved
  • be more conservative with the advertised front-end limits, if the database or other storage can't be adjusted
  • limit the character types that can be entered
  • clearly indicate limitations of the length calculation

Additionally, given the complexity of the issue, it may be worth checking whether a popular JS library already deals with this. I did not find one at the time of writing. Hopefully this will become part of core JavaScript at some point.

Other pages to read:

  • https://blog.jonnew.com/posts/poo-dot-length-equals-two
  • https://mathiasbynens.be/notes/javascript-unicode
  • https://www.contentful.com/blog/2016/12/06/unicode-javascript-and-the-emoji-family/
  • https://dmitripavlutin.com/what-every-javascript-developer-should-know-about-unicode/