Should I use \d or [0-9] to match digits in a Perl regex?

Using \d seems very dangerous to me. It is a poor design decision in the language, since in most cases you want [0-9]. Huffman coding would dictate that the shorter \d be reserved for the common case, ASCII digits.

Most of the previous posters have already highlighted why you should use [0-9], so let me give you a bit more data:

If I read the Unicode charts correctly, '۷۰' is a number (70 in Extended Arabic-Indic digits; don't take my word for it).

Try this:

$ perl -le '$one = chr 0xFF11; print "$one + 1 = ", $one+1;'
１ + 1 = 1

Here is a partial list of valid digits (which may or may not display properly in your browser, depending on the fonts you use). In each row, only the first character is interpreted as a number when doing arithmetic in Perl, as shown above:

 ZERO:  0٠۰߀०০੦૦୦௦౦೦൦๐໐０
 ONE:   1١۱߁१১੧૧୧௧౧೧൧๑໑１
 TWO:   2٢۲߂२২੨૨୨௨౨೨൨๒໒２
 THREE: 3٣۳߃३৩੩૩୩௩౩೩൩๓໓３
 FOUR:  4٤۴߄४৪੪૪୪௪౪೪൪๔໔４
 FIVE:  5٥۵߅५৫੫૫୫௫౫೫൫๕໕５
 SIX:   6٦۶߆६৬੬૬୬௬౬೬൬๖໖６
 SEVEN: 7٧۷߇७৭੭૭୭௭౭೭൭๗໗７
 EIGHT: 8٨۸߈८৮੮૮୮௮౮೮൮๘໘８
 NINE:  9٩۹߉९৯੯૯୯௯౯೯൯๙໙９

Are you still not convinced?

For maximum safety, I'd suggest using [0-9] any time you don't specifically intend to match all Unicode-defined digits.
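To make the difference concrete, here is a minimal sketch that runs the same strings through both character classes. The helper names are made up for this example, and it assumes the strings are ordinary Perl character strings (not raw bytes):

```perl
use strict;
use warnings;

# Hypothetical helper names, for illustration only.
sub matches_backslash_d { my ($s) = @_; return $s =~ /\A\d+\z/    ? 1 : 0 }
sub matches_ascii_range { my ($s) = @_; return $s =~ /\A[0-9]+\z/ ? 1 : 0 }

my @samples = (
    [ 'ASCII 70'                 => "70" ],
    [ 'fullwidth 70'             => "\x{FF17}\x{FF10}" ],
    [ 'Extended Arabic-Indic 70' => "\x{06F7}\x{06F0}" ],
);

for my $sample (@samples) {
    my ($label, $string) = @$sample;
    # 1 means the whole string matched, 0 means it did not.
    printf "%-26s \\d: %d  [0-9]: %d\n",
        $label, matches_backslash_d($string), matches_ascii_range($string);
}
```

All three strings should match \d, but only the first should match [0-9].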

Per perldoc perluniintro, Perl does not support using digits other than [0-9] as numbers, so I would definitely use [0-9] if the following are both true:

You want to use the result as a number (such as performing mathematical operations on it or storing it somewhere that only accepts proper numbers (e.g. an INT column in a database)).
It is possible non-digits [^0-9] would be present in the data in such a way that the regular expression could match them. (Note that this one should always be considered true for untrusted/hostile input.)

If either of these is false, there will rarely be a reason to specifically avoid \d (and you'll probably be able to tell when that is the case), and if you're trying to match all Unicode-defined digits, you'll definitely want to use \d.
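When both conditions hold, the check might look something like the sketch below. The sub names are hypothetical. The second variant relies on the /a regex modifier, which restricts \d to ASCII digits and, as far as I know, requires Perl 5.14 or later:

```perl
use 5.014;      # the /a modifier used below needs Perl 5.14 or later
use warnings;

# Hypothetical helper: returns the input as a number, or nothing (undef in
# scalar context) unless the input is purely a run of ASCII digits.
sub to_int_or_undef {
    my ($input) = @_;
    return unless defined $input && $input =~ /\A[0-9]+\z/;
    return 0 + $input;
}

# The same check written with \d; /a limits \d to [0-9].
sub to_int_or_undef_a {
    my ($input) = @_;
    return unless defined $input && $input =~ /\A\d+\z/a;
    return 0 + $input;
}

# "\x{0664}\x{0662}" is ARABIC-INDIC DIGIT FOUR followed by DIGIT TWO.
for my $input ("42", "\x{0664}\x{0662}", "4 2") {
    my $n = to_int_or_undef($input);
    say defined $n ? "accepted: " . ($n + 1) : "rejected";
}
```

Anchoring with \A and \z rather than ^ and $ matters here, because $ would also accept a trailing newline.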

According to perlreref, '\d' is locale-aware and Unicode-aware.

However, if the codeset you are using is not Unicode, then you don't need to worry about the Unicode digits, and if the codeset you are using is something like Latin-1 (ISO 8859-1, or 8859-15), then the locale-awareness won't hurt you either because the codeset does not include any other digit characters.

So, for many people, much of the time, you can use '\d' without concern. However, if Unicode data is part of your work, then you need to consider what you are after more carefully.

Just like nuking the site from orbit, [0-9] is the only way to be sure. Yeah, it is ugly. Yeah, the choice to make \d Unicode- and locale-aware was stupid. But this is our bed and we have to lie in it.

As for the people burying their heads in the sand and saying it doesn't affect the character set they are using today: you may be using that character set today, but the rest of the world is using UTF-8 now, and you will be using it soon as well. Remember to code as if the guy who maintains your code is a homicidal maniac who knows where you live.

Oh, and as for Perl modules using \d vs [0-9], even the core still has Unicode problems.

If you do in fact mean any digit, but want to be able to do math with the results, you can use Text::Unidecode:

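#!/usr/bin/perl

use strict;
use warnings;

use Text::Unidecode;

# Encode output as UTF-8 so the wide characters print without a warning.
binmode STDOUT, ':encoding(UTF-8)';

# MONGOLIAN DIGIT ONE through MONGOLIAN DIGIT FIVE
my $number = "\x{1811}\x{1812}\x{1813}\x{1814}\x{1815}";
print "$number is ", unidecode($number), "\n";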

After some more testing it looks like Text::Unidecode doesn't handle all digit characters correctly. I am writing a module that will work.
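In the meantime, one alternative (not the module mentioned above, and not Text::Unidecode) is the num function from the core Unicode::UCD module, which I believe has been available since Perl 5.14. A minimal sketch, with a made-up wrapper name:

```perl
use strict;
use warnings;
use Unicode::UCD qw(num);

# Hypothetical wrapper: returns the numeric value of a string of Unicode
# digits, or undef when num() does not consider the whole string a safe number.
sub digits_to_number {
    my ($string) = @_;
    return num($string);
}

# MONGOLIAN DIGIT ONE through MONGOLIAN DIGIT FIVE
my $mongolian = "\x{1811}\x{1812}\x{1813}\x{1814}\x{1815}";
my $value     = digits_to_number($mongolian);

print defined $value ? "value: $value\n" : "no safe numeric value\n";
```

Because num returns undef for strings it does not trust, check the result with defined before doing any math on it.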
