How to determine if a number is a prime with regex?
You said you understand this part, but just to emphasize, the String generated has a length equal to the number supplied. So the string has three characters if and only if n == 3
.
.?
The first part of the regex says, "any character, zero or one times". So basically, is there zero or one character-- or, per what I mentioned above, n == 0 || n == 1
. If we have the match, then return the negation of that. This corresponds with the fact that zero and one are NOT prime.
(..+?)\\1+
The second part of the regex is a little trickier, relying on groups and backreferences. A group is anything in parentheses, which will then be captured and stored by the regex engine for later use. A backreference is a matched group that is used later on in the same regex.
The group captures 1 character, then 1 or more of any character. (The + character means one or more, but ONLY of the previous character or group. So this is not "two or four or six etc. characters", but rather "two or three etc." The +? is like +, but it tries to match as few characters as possible. + normally tries to gobble the whole string if it can, which is bad in this case because it prevents the backreference part from working.)
The next part is the backreference: That same set of characters (two or more), appearing again. Said backreference appears one or more times.
So. The captured group corresponds to a natural number of characters (from 2 onward) captured. Said group then appears some natural number of times (also from 2 onward). If there IS a match, this implies that it's possible to find a product of two numbers greater than or equal to 2 that match the n-length string... meaning you have a composite n. So again, return the negation of the successful match: n is NOT prime.
If no match can be found, then you can't come up with a your product of two natural numbers greater than or equal to 2... and you have both a non-match and a prime, hence again the returning of the negation of the match result.
Do you see it now? It's unbelievably tricky (and computationally expensive!) but it's also kind of simple at the same time, once you get it. :-)
I can elaborate if you have further questions, like on how regex parsing actually works. But I'm trying to keep this answer simple for now (or as simple as it can ever be).
I will explain the regex part outside of primality testing: the following regex, given a String s
which consists of repeating String t
, finds t
.
System.out.println(
"MamamiaMamamiaMamamia".replaceAll("^(.*)\\1+$", "$1")
); // prints "Mamamia"
The way it works is that the regex captures (.*)
into \1
, and then sees if there's \1+
following it. Using the ^
and $
ensures that a match must be of the whole string.
So, in a way, we're given String s
, which is a "multiple" of String t
, and the regex will find such t
(the longest possible, since \1
is greedy).
Once you understand why this regex works, then (ignoring the first alternate in OP's regex for now) explaining how it's used for primality testing is simple.
- To test primality of
n
, first generate aString
of lengthn
(filled with the samechar
) - The regex captures a
String
of some length (sayk
) into\1
, and tries to match\1+
to the rest of theString
- If there is a match, then
n
is a proper multiple ofk
, and thereforen
is not prime. - If there's no match, then no such
k
exists that dividesn
, andn
is therefore a prime
- If there is a match, then
How does
.?|(..+?)\1+
match prime numbers?
Actually, it doesn't! It matches String
whose length is NOT prime!
-
.?
: The first part of the alternation matchesString
of length0
or1
(NOT prime by definition) -
(..+?)\1+
: The second part of the alternation, a variation of the regex explained above, matchesString
of lengthn
that is "a multiple" of aString
of lengthk >= 2
(i.e.n
is a composite, NOT a prime).- Note that the reluctant modifier
?
is actually not needed for correctness, but it may help speed up the process by trying smallerk
first
- Note that the reluctant modifier
Note the !
boolean
complement operator in the return
statement: it negates the matches
. It's when the regex DOESN'T match, n
is prime! It's a double-negative logic, so no wonder it's kind of confusing!!
Simplification
Here's a simple rewriting of the code to make it more readable:
public static boolean isPrime(int n) {
String lengthN = new String(new char[n]);
boolean isNotPrimeN = lengthN.matches(".?|(..+?)\\1+");
return !isNotPrimeN;
}
The above is essentially the same as the original Java code, but broken apart into multiple statements with assignments to local variables to make the logic easier to understand.
We can also simplify the regex, using finite repetition, as follows:
boolean isNotPrimeN = lengthN.matches(".{0,1}|(.{2,})\\1+");
Again, given a String
of length n
, filled with the same char
,
-
.{0,1}
checks ifn = 0,1
, NOT prime -
(.{2,})\1+
checks ifn
is a proper multiple ofk >= 2
, NOT prime
With the exception of the reluctant modifier ?
on \1
(omitted for clarity), the above regex is identical to the original.
More fun regex
The following regex uses similar technique; it should be educational:
System.out.println(
"OhMyGod=MyMyMyOhGodOhGodOhGod"
.replaceAll("^(.+)(.+)(.+)=(\\1|\\2|\\3)+$", "$1! $2! $3!")
); // prints "Oh! My! God!"
See also
- Regular expressions: who's greedier
Nice regex trick (though very inefficient)... :)
The regex defines non-primes as follows:
N is not prime if and only if N<=1 OR N is divisible by some K>1.
Instead of passing the simple digital representation of N to the regex engine, it is fed with a sequence of length N, composed of a repeating character. The first part of the disjunction checks for N=0 or N=1, and the second one looks for a divisor K>1, using backreferences. It forces the regex engine to find some non empty sub-sequence that can be repeated at least twice in order to form the sequence. If such a subsequence exists, it means that its length divides N, hence N is not prime.
/^1?$|^(11+?)\1+$/
Apply to numbers after conversion to base 1 (1=1, 2=11, 3=111, ...). Non-primes will match this. If it doesn't match, it is prime.
Explanation here.