Regular expression puzzler
Solution 1:
The quantifier {3} in the pattern [iny]{3} means to match a character with that pattern (either i or n or y), and then another character with the same pattern, and then another. Three -- one after another. So your string unify doesn't have that, but can muster two at most, ni.
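To see this directly, here is a small sketch (the variable names are mine, not from the question) that lists the runs of consecutive class characters in unify and then tries the quantifier with 3 and with 2:

```perl
use strict;
use warnings;
use feature 'say';

my $str = 'unify';

# All maximal runs of consecutive characters from the class [iny]
my @runs = $str =~ /[iny]+/g;
say "runs: @runs";    # runs: ni y

# {3} needs a run of at least three such characters; the longest here has two
my $three = $str =~ /[iny]{3}/ ? 'match' : 'no match';
my $two   = $str =~ /[iny]{2}/ ? 'match' : 'no match';
say "{3}: $three";    # {3}: no match
say "{2}: $two";      # {2}: match
```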
That's been explained in other answers already. What I'd like to add is an answer to a clarification in comments: how to check for these characters appearing 3 times in the string, scattered around at will. Apart from matching that whole substring, as shown already, we can use a lookahead:
(?=[iny].*[iny].*[iny])
This does not "consume" any characters but rather "looks" ahead for the pattern, not advancing the engine from its current position. As such it can be very useful as a subpattern, in combination with other patterns in a larger regex.
A Perl example, to copy-paste on the command line:
perl -wE'say "Match" if "unify" =~ /(?=[iny].*[iny].*[iny])/'
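A slightly longer sketch of both points, that the lookahead is zero-width and that it combines with other patterns. The "larger regex" here (an all-lowercase word) is only an assumed example, and the variable names are mine:

```perl
use strict;
use warnings;
use feature 'say';

# Zero-width: the lookahead succeeds at the "n" (offset 1) but consumes nothing
my ($start, $len);
if ('unify' =~ /(?=[iny].*[iny].*[iny])/) {
    ($start, $len) = ($-[0], $+[0] - $-[0]);
    say "offset $start, length $len";    # offset 1, length 0
}

# Assumed larger regex: an all-lowercase word that also has three of the
# characters somewhere. The ^ anchor pins the lookahead to the start of
# the string, hence the extra leading .* inside it.
my $word_re = qr/^(?=.*[iny].*[iny].*[iny])[a-z]+$/;

for my $word (qw(unify tiny unite Unify)) {
    say "$word: ", ($word =~ $word_re ? 'match' : 'no match');
}
```

Here unify and tiny match; unite has only two of the characters, and Unify fails the [a-z]+ part.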
The drawback to this, as well as to consuming the whole such substring, is the literal spelling out of all three subpatterns; what if the number needs to be decided dynamically? Or if it's twelve? The pattern can be built at runtime, of course. In Perl, one way is
my $pattern = '(?=' . join('.*', ('[iny]')x3) . ')';
and then use that in the regex.
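For instance, the same join can be wrapped in a small helper that returns a compiled regex. This is just a sketch; scattered_lookahead is a name I made up, and it takes the character class and the count as arguments:

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical helper: a lookahead for $count occurrences of $class,
# with anything (or nothing) in between
sub scattered_lookahead {
    my ($class, $count) = @_;
    my $pattern = '(?=' . join('.*', ($class) x $count) . ')';
    return qr/$pattern/;
}

my $three = scattered_lookahead('[iny]', 3);
my $four  = scattered_lookahead('[iny]', 4);

for my $word (qw(unify uninitiated)) {
    say "$word: three=", ($word =~ $three ? 'yes' : 'no'),
        ", four=", ($word =~ $four ? 'yes' : 'no');
}
# unify: three=yes, four=no
# uninitiated: three=yes, four=yes
```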
For the sake of performance, for long strings and many repetitions, make that .* non-greedy:
(?=[iny].*?[iny].*?[iny])
(When forming the pattern dynamically, join with .*? instead.)
A simple benchmark for illustration (in Perl)
use warnings;
use strict;
use feature 'say';
use Getopt::Long;
use List::Util qw(shuffle);
use Benchmark qw( cmpthese );
# For how many seconds to run each option (-r N, default 3),
# how many times to repeat for the test string (-n N, default 2)
my ($runfor, $n) = (3, 2);
GetOptions('r:i' => \$runfor, 'n:i' => \$n);
my $str = 'aa'
    . join('', map { (shuffle 'b'..'t')x$n, 'a' } 1..$n)
    . 'a'x($n+1)
    . 'zzz';
my $pat_greedy = '(?=' . join('.*', ('a')x$n) . ')';
my $pat_non_greedy = '(?=' . join('.*?', ('a')x$n) . ')';
#my $pat_greedy = join('.*', ('a')x$n); # test straight match,
#my $pat_non_greedy = join('.*?', ('a')x$n); # not lookahead
sub match_repeated {
    my ($s, $pla) = @_;
    return ( $s =~ /$pla(.*z)/ ) ? "match" : "no match";
}

cmpthese(-$runfor, {
    greedy     => sub { match_repeated($str, $pat_greedy) },
    non_greedy => sub { match_repeated($str, $pat_non_greedy) },
});
(Shuffling of that string is probably unneeded but I feared optimizations intruding.)
When a string is made with the factor of 20 (program.pl -n 20) the output is
              Rate     greedy non_greedy
greedy      56.3/s         --      -100%
non_greedy 90169/s    159926%         --
So ... the non-greedy version is some 1600 times faster. That test string is 7646 characters long and the pattern to match has 20 subpatterns (a) with .* between them (in the greedy case); so there's a lot going on there. With the default of 2, so for a short string and a simpler pattern, the difference is 10%.
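As a check on that length: the string is 'aa', then n blocks of (19 letters repeated n times, plus one a), then n+1 more a characters, then zzz, which adds up to 19n^2 + 2n + 6. A sketch that builds the string the same way (build_test_string is my name for it) and compares:

```perl
use strict;
use warnings;
use feature 'say';
use List::Util qw(shuffle);

# Same construction as in the benchmark, wrapped in a hypothetical helper
sub build_test_string {
    my ($n) = @_;
    return 'aa'
        . join('', map { (shuffle 'b'..'t') x $n, 'a' } 1..$n)
        . 'a' x ($n+1)
        . 'zzz';
}

for my $n (2, 20) {
    my $formula = 19*$n**2 + 2*$n + 6;
    say "n=$n: length ", length(build_test_string($n)), ", formula $formula";
}
# n=2: length 86, formula 86
# n=20: length 7646, formula 7646
```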
Btw, to test for straight-up matches (not using lookahead) just move those comment signs around the pattern variables, and it's nearly twice as bad:
               Rate     greedy non_greedy
greedy       56.5/s         --      -100%
non_greedy 171949/s    304117%         --
Solution 2:
The letters n, i, and y aren't all adjacent. There's an f in between them.
/[iny]{3}/ matches any string that contains a substring of three letters taken from the set {i, n, y}. The letters can be in any order; they can even be repeated.
Choosing one of three characters three times, with replacement, means there are 3^3 = 27 matching substrings:
- iii, iin, iiy, ini, inn, iny, iyi, iyn, iyy
- nii, nin, niy, nni, nnn, nny, nyi, nyn, nyy
- yii, yin, yiy, yni, ynn, yny, yyi, yyn, yyy
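If you'd rather not take the count on faith, a few nested loops will generate the list. This is only an illustrative sketch in Perl (variable names are mine); it builds every three-letter combination and checks each one against the pattern:

```perl
use strict;
use warnings;
use feature 'say';

my @letters = qw(i n y);
my @combos;
for my $first (@letters) {
    for my $second (@letters) {
        for my $third (@letters) {
            push @combos, "$first$second$third";
        }
    }
}

my $matching = grep { /^[iny]{3}$/ } @combos;

say scalar(@combos);    # 27
say $matching;          # 27
say "@combos[0..8]";    # iii iin iiy ini inn iny iyi iyn iyy
```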
To match non-adjacent letters you can use one of these:
- [iny].*[iny].*[iny]
- [iny](.*[iny]){2}
- ([iny].*){3}
(The last option will work fine on its own since your search is unanchored, but might not be suitable as part of a larger regex. The final .* could match more than you intend.)
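To make that caveat concrete, here is a rough Perl sketch showing what each of the three patterns actually consumes. The test string 'unify me' and the matched_text helper are my own, picked so that there is trailing text after the last letter of interest:

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical helper: the text consumed by the overall match (if any)
sub matched_text {
    my ($str, $re) = @_;
    return unless $str =~ $re;
    return substr($str, $-[0], $+[0] - $-[0]);
}

my $str = 'unify me';
my @patterns = (
    qr/[iny].*[iny].*[iny]/,
    qr/[iny](.*[iny]){2}/,
    qr/([iny].*){3}/,
);

for my $re (@patterns) {
    say "'", matched_text($str, $re), "'";
}
# 'nify'
# 'nify'
# 'nify me'
```

The first two stop at the y, while the third runs on to the end of the string because of its trailing .* in the last repetition.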