What's up with these Unicode combining characters and how can we filter them?
Solution 1:
What's up with these unicode characters?
That's a character with a series of combining characters. Because the combining characters in question want to go above the base character, they stack up (literally). For instance, the case of
ก้้้้้้้้้้้้้้้้้้้้
...it's an ก (Thai character ko kai) (U+0E01) followed by 20 copies of the Thai combining character mai tho (U+0E49).
How can we sanitize this?
You could pre-process the text and limit the number of combining characters that can be applied to a single character, but the effort may not be worth the reward. You'd need the data sheets for all the current characters so you'd know whether they were combining or what, and you'd need to be sure to allow at least a few because some languages are written with several diacritics on a single base. Now, if you want to limit comments to the Latin character set, that would be an easier range check, but of course that's only an option if you want to limit comments to just a few languages. More information, code sheets, etc. at unicode.org.
BTW, if you ever want to know how some character was composed, for another question just recently I coded up a quick-and-dirty "Unicode Show Me" page on JSBin. You just copy and paste the text into the text area, and it shows you all of the code points (~characters) that the text is made up of, with links such as those above to the page describing each character. It only works for code points in the range U+FFFF and under, because it's written in JavaScript and to handle characters above U+FFFF in JavaScript you have to do more work than I wanted to do for that question (because in JavaScript, a "character" is always 16 bits, which means for some languages a character can be split across two separate JavaScript "characters" and I didn't account for that), but it's handy for most texts...
Solution 2:
If you have a regex engine with decent Unicode support, it's trivial to sanitize this kind of strings. In Perl, for example, you can remove all but the first combining mark from every (user-perceived) character like this:
#!/usr/bin/perl
use strict;
use utf8;
binmode(STDOUT, ':utf8');
my $string = "กิิ ก้้ ก็็ ก็็ กิิ ก้้ ก็็ กิิ ก้้ กิิ ก้้ ก็็ ก็็ กิิ ก้้ ก็็ กิิ ก้้";
$string =~ s/(\p{Mark})\p{Mark}+/$1/g; # Strip excess combining marks
print("$string\n");
This will print:
กิ ก้ ก็ ก็ กิ ก้ ก็ กิ ก้ กิ ก้ ก็ ก็ กิ ก้ ก็ กิ ก้
Solution 3:
"How can we sanitize this" is best answered above by T.J Crowder
However, I think sanitization is the wrong approach, and Cristy has it right with overflow:hidden
on the css containing element.
At least, that's how I'm solving it.
Solution 4:
Ok this one took me a while to figure out, I was under impression that combining characters to produce zalgo are limited to these. So I expected following regex to catch the freaks.
([\u0300–\u036F\u1AB0–\u1AFF\u1DC0–\u1DFF\u20D0–\u20FF\uFE20–\uFE2F]{2,})
and it didn't work...
The catch is that list in wiki does not cover full range of combining characters.
What gave me a hint is "ก้้้้้้้้้้้้้้้้้้้้".charCodeAt(2).toString(16)
= "e49" which in not within a range of combining, it falls into 'Private use'.
In C# they fall under UnicodeCategory.NonSpacingMark
and following script flushes them out:
[Test]
public void IsZalgo()
{
var zalgo = new[] { UnicodeCategory.NonSpacingMark };
File.Delete("IsModifyLike.html");
File.AppendAllText("IsModifyLike.html", "<table>");
for (var i = 0; i < 65535; i++)
{
var c = (char)i;
if (zalgo.Contains(Char.GetUnicodeCategory(c)))
{
File.AppendAllText("IsModifyLike.html", string.Format("<tr><td>{0}</td><td>{1}</td><td>{2}</td><td>A&#{3};&#{3};&#{3}</td></tr>\n", i.ToString("X"), c, Char.GetUnicodeCategory(c), i));
}
}
File.AppendAllText("IsModifyLike.html", "</table>");
}
By looking at the table generated you should be able to see which ones do stack.
One range that is missing on wiki is 06D6-06DC
another 0730-0749
.
UPDATE:
Here's updated regex that should fish out all the zalgo including ones bypassed in 'normal' range.
([\u0300–\u036F\u1AB0–\u1AFF\u1DC0–\u1DFF\u20D0–\u20FF\uFE20–\uFE2F\u0483-\u0486\u05C7\u0610-\u061A\u0656-\u065F\u0670\u06D6-\u06ED\u0711\u0730-\u073F\u0743-\u074A\u0F18-\u0F19\u0F35\u0F37\u0F72-\u0F73\u0F7A-\u0F81\u0F84\u0e00-\u0eff\uFC5E-\uFC62]{2,})
The hardest bit is to identify them, once you have done that - there's multitude of solutions including some good ones above.
Hope this saves you some time.