Using alternation or character class for single character matching?

(Note: The title doesn't seem too clear -- if someone can rephrase it, I'm all for it!)

Given this regex: (.*_e\.txt), which matches some filenames, I need to accept a few other single-character suffixes in addition to the e. Should I use a character class or an alternation for this? (Or does it even matter?)

That is, which of the following two seems "better", and why:

a) (.*(e|f|x)\.txt), or

b) (.*[efx]\.txt)


Solution 1:

Use [efx] - that's exactly what character classes are designed for: to match one of the included characters. Therefore it's also the most readable and shortest solution.

I don't know if it's faster, but I would be very much surprised if it wasn't. It definitely won't be slower.

My reasoning (without ever having written a regex engine, so this is pure conjecture):

The regex token [abc] will be applied in a single step of the regex engine: "Is the next character one of a, b, or c?"

(a|b|c), however, tells the regex engine to:

  • remember the current position in the string for backtracking, if necessary
  • check if it's possible to match a. If so, success. If not:
  • check if it's possible to match b. If so, success. If not:
  • check if it's possible to match c. If so, success. If not:
  • give up.
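
Whatever the engine does internally, the two forms accept exactly the same filenames. Here is a minimal sketch to check that, assuming a handful of made-up filenames (they are not from the question):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample filenames, chosen only for illustration.
my @names = qw(report_e.txt report_f.txt report_x.txt report_g.txt report_e.csv);

my $alt   = qr/^(.*(e|f|x)\.txt)$/;   # alternation, as in option a)
my $class = qr/^(.*[efx]\.txt)$/;     # character class, as in option b)

# Collect the names each pattern accepts.
my @alt_hits   = grep { $_ =~ $alt }   @names;
my @class_hits = grep { $_ =~ $class } @names;

print "alternation: @alt_hits\n";
print "class:       @class_hits\n";
```

Both lines should list the same three .txt names, so the choice between the two comes down to readability and speed rather than to what gets matched.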

Solution 2:

Here is a benchmark:

Updated following tchrist's comment; the difference is now more significant.

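#!/usr/bin/perl
use strict;
use warnings;
use utf8;    # the source below contains a literal ñ
use 5.10.1;
use Benchmark qw(:all);

# Build the test filenames: lowercase suffixes match both regexes,
# uppercase ones match neither.
my @l;
foreach(qw/b c d f g h j k l m n ñ p q r s t v w x z B C D F G H J K L M N ñ P Q R S T V W X Z/) {
    push @l, "abc$_.txt";
}

# Same set of accepted suffixes, written as an alternation and as a character class.
my $re1 = qr/^(.*(b|c|d|f|g|h|j|k|l|m|n|ñ|p|q|r|s|t|v|w|x|z)\.txt)$/;
my $re2 = qr/^(.*[bcdfghjklmnñpqrstvwxz]\.txt)$/;
my $cpt;

# A negative count tells Benchmark to run each sub for at least that many CPU seconds.
my $count = -3;
my $r = cmpthese($count, {
    'alternation' => sub {
        for(@l) {
            $cpt++ if $_ =~ $re1;
        }
    },
    'class' => sub {
        for(@l) {
            $cpt++ if $_ =~ $re2;
        }
    }
});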

Result (the character class runs roughly twice as fast):

              Rate alternation       class
alternation 2855/s          --        -50%
class       5677/s         99%          --

Solution 3:

With a single character, the difference is so small that it won't matter unless you're running the match a very large number of times.

However, for readability (and a slight performance gain), you should use the character class.

For a bit more background: the parentheses in (e|f|x) form a capturing group, so Perl has to record what the group matched and keep track of the position so it can try the next alternative if one fails. Your regex doesn't need either of those, and a character class does neither.
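
To make the extra capture group visible, here is a small sketch. The filename is hypothetical, and the (?:...) non-capturing variant is shown only for comparison; it was not part of the question:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $name = "notes_f.txt";   # hypothetical filename for illustration

# Alternation in capturing parentheses: the inner group is captured too.
my @alt = $name =~ /^(.*(e|f|x)\.txt)$/;

# Character class: only the outer group exists.
my @class = $name =~ /^(.*[efx]\.txt)$/;

# Non-capturing alternation drops the extra group but is still an alternation.
my @noncap = $name =~ /^(.*(?:e|f|x)\.txt)$/;

print scalar(@alt),    " captures with (e|f|x):   @alt\n";
print scalar(@class),  " capture  with [efx]:     @class\n";
print scalar(@noncap), " capture  with (?:e|f|x): @noncap\n";
```

In list context a successful match returns the captured groups, so the alternation version returns two values (the full name and the suffix letter), while the other two return only the full name.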