Using alternation or character class for single character matching?
(Note: Title doesn't seem too clear -- if someone can rephrase this I'm all for it!)
Given this regex: (.*_e\.txt), which matches some filenames, I need to add some other single character suffixes in addition to the e. Should I choose a character class or should I use an alternation for this? (Or does it really matter??)
That is, which of the following two seems "better", and why:
a) (.*(e|f|x)\.txt), or
b) (.*[efx]\.txt)
Solution 1:
Use [efx] - that's exactly what character classes are designed for: to match one of the included characters. Therefore it's also the most readable and shortest solution.
I don't know if it's faster, but I would be very much surprised if it wasn't. It definitely won't be slower.
My reasoning (without ever having written a regex engine, so this is pure conjecture):
The regex token [abc] will be applied in a single step of the regex engine: "Is the next character one of a, b, or c?"
(a|b|c) however tells the regex engine to
- remember the current position in the string for backtracking, if necessary
- check if it's possible to match a. If so, success. If not:
- check if it's possible to match b. If so, success. If not:
- check if it's possible to match c. If so, success. If not:
- give up.
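Whatever the engine does internally, the two forms should accept exactly the same strings. A minimal sketch to check that, assuming a handful of made-up filenames (the list in @names is hypothetical, not taken from the question):

```perl
use strict;
use warnings;

# Hypothetical sample filenames, made up for illustration only.
my @names = qw(a_e.txt a_f.txt a_x.txt a_g.txt a_e.csv e.txt);

my $alt = qr/(.*(e|f|x)\.txt)/;    # alternation, as in option a)
my $cls = qr/(.*[efx]\.txt)/;      # character class, as in option b)

my $disagreements = 0;
for my $name (@names) {
    my $by_alt = $name =~ $alt ? 1 : 0;
    my $by_cls = $name =~ $cls ? 1 : 0;
    $disagreements++ if $by_alt != $by_cls;
    printf "%-10s alternation=%d class=%d\n", $name, $by_alt, $by_cls;
}
print "disagreements: $disagreements\n";
```

If the count at the end is 0, the choice between the two is purely about readability and speed, not about what gets matched.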
Solution 2:
Here is a benchmark (updated according to tchrist's comment; the difference is now more significant):
#!/usr/bin/perl
use strict;
use warnings;
use 5.10.1;
use Benchmark qw(:all);

# Test strings: "abc" + one letter + ".txt"
my @l;
foreach (qw/b c d f g h j k l m n ñ p q r s t v w x z B C D F G H J K L M N ñ P Q R S T V W X Z/) {
    push @l, "abc$_.txt";
}

# The same set of letters, once as an alternation and once as a character class
my $re1 = qr/^(.*(b|c|d|f|g|h|j|k|l|m|n|ñ|p|q|r|s|t|v|w|x|z)\.txt)$/;
my $re2 = qr/^(.*[bcdfghjklmnñpqrstvwxz]\.txt)$/;

my $cpt;
my $count = -3;    # negative count: run each sub for at least 3 CPU seconds
my $r = cmpthese($count, {
    'alternation' => sub {
        for (@l) {
            $cpt++ if $_ =~ $re1;
        }
    },
    'class' => sub {
        for (@l) {
            $cpt++ if $_ =~ $re2;
        }
    },
});
Result:
              Rate alternation       class
alternation 2855/s          --        -50%
class       5677/s         99%          --
Solution 3:
With a single character, the difference is going to be so minimal that it won't matter (unless you're doing LOTS of operations).
However, for readability (and a slight performance increase) you should be using the character class method.
For a bit more information - opening a round bracket ( starts a capturing group, and the alternation inside it makes Perl remember the current position so it can backtrack and try the next alternative. As nothing else in your regex depends on that group, you really don't need it. A character class will not do this.
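One visible side effect of the extra parentheses is an extra capture. A small sketch, assuming a made-up filename (report_f.txt is hypothetical); the non-capturing form (?:...) is shown only for comparison and is not something the question asked about:

```perl
use strict;
use warnings;

my $name = 'report_f.txt';    # hypothetical filename for illustration

# In list context a successful match returns the captured groups.
my @alt = $name =~ /(.*(e|f|x)\.txt)/;      # outer group plus the alternation group
my @cls = $name =~ /(.*[efx]\.txt)/;        # outer group only
my @nc  = $name =~ /(.*(?:e|f|x)\.txt)/;    # non-capturing alternation: outer group only

print "alternation:   ", scalar(@alt), " capture(s): @alt\n";
print "class:         ", scalar(@cls), " capture(s): @cls\n";
print "non-capturing: ", scalar(@nc),  " capture(s): @nc\n";
```

The alternation version fills a second group with the single letter, which the character class never creates.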