Match correct addition of two binary numbers with PCRE regex
Solution 1:
TL;DR: Yes, it indeed is possible (use with Jx
flags):
(?(DEFINE)
(?<add> \s*\+\s* )
(?<eq> \s*=\s* )
(?<carry> (?(cl)(?(cr)|\d0)|(?(cr)\d0|(*F))) )
(?<digitadd> (?(?= (?(?= (?(l1)(?(r1)|(*F))|(?(r1)(*F))) )(?&carry)|(?!(?&carry))) )1|0) )
(?<recursedigit>
(?&add) 0*+ (?:\d*(?:0|1(?<r1>)))? (?(ro)|(?=(?<cr>1)?))\k<r> (?&eq) \d*(?&digitadd)\k<f>\b
| (?=\d* (?&add) 0*+ (?:\k<r>(?<ro>)|\d*(?<r>\d\k<r>)) (?&eq) \d*(?<f>\d\k<f>)\b) \d(?&recursedigit)
)
(?<checkdigit> (?:0|1(?<l1>)) (?=(?<cl>1)?) (?<r>) (?<f>) (?&recursedigit) )
(?<carryoverflow>
(?<x>\d+) 0 (?<y> \k<r> (?&eq) 0*\k<x>1 | 1(?&y)0 )
| (?<z> 1\k<r> (?&eq) 0*10 | 1(?&z)0 )
)
(?<recurseoverflow>
(?&add) 0*+ (?(rlast) \k<r> (?&eq) 0*+(?(ro)(?:1(?=0))?|1)\k<f>\b
| (?:(?<remaining>\d+)(?=0\d* (?&eq) \d*(?=1)\k<f>\b)\k<r> (?&eq) (*PRUNE) 0*\k<remaining>\k<f>\b
| (?&carryoverflow)\k<f>\b))
| (?=\d* (?&add) 0*+ (?:\k<r>(?<ro>)|(?=(?:\d\k<r>(?&eq)(?<rlast>))?)\d*(?<r>\d\k<r>)) (?&eq) \d*(?<f>\d\k<f>)\b)
\d(?&recurseoverflow)
)
(?<s>
(| 0*? (?<arg>[01]+) (?&add) 0+ | 0+ (?&add) 0*? (?<arg>[01]+)) (?&eq) (*PRUNE) 0* \k<arg>
| 0*+
(?=(?<iteratedigits> (?=(?&checkdigit))\d (?:\b|(?&iteratedigits)) ))
(?=[01]+ (?&add) [01]+ (?&eq) [01]+ \b)
(?<r>) (?<f>) (?&recurseoverflow)
)
)
\b(?&s)\b
Live demo: https://regex101.com/r/yD1kL7/26
[Update: Due to a bug in PCRE, the code only works for all cases with PCRE JIT active; thanks to Qwerp-Derp for noticing; without JIT e.g. 100 + 101 = 1001
fails to match.]
That's quite a monster regex. So, let's build it step by step to understand what's going on.
Hint: For easier memorization and following through the explanation, let me explain the names of the single or two digit capturing group names (all except the first two are flags [see below]):
r => right; it contains the part of the right operand right to a given offset
f => final; it contains the part of the final operand (the result) right to a given offset
cl => carry left; the preceding digit of the left operand was 1
cr => carry right; the preceding digit of the right operand was 1
l1 => left (is) 1; the current digit of the left operand is 1
r1 => right (is) 1; the current digit of the right operand is 1
ro => right overflow; the right operand is shorter than the current offset
rlast => right last; the right operand is at most as long as the current offset
For a more readable +
and =
with possible leading and trailing spaces, there are two capturing groups (?<add> \s*\+\s*)
and (?<eq> \s*=\s*)
.
We are performing an addition. As it is regex, we need to verify each digit at once. So, have a look at the math behind this:
Checking addition of a single digit
current digit = left operand digit + right operand digit + carry of last addition
How do we know the carry?
We can simply look at the last digit:
carry = left operand digit == 1 && right operand digit == 1
|| (left operand digit == 1 || right operand digit == 1) && result digit == 0
This logic is provided by the capturing group carry
, defined as follows:
(?<carry> (?(cl)(?(cr)|\d0)|(?(cr)\d0|(*F))) )
with cl
being whether the last left operand digit was 1 or not and cr
whether the last right operand digit was 1 or not; \d0
is to check whether the last digit in the result was 0.
Note: (?(cl) ... | ...)
is a conditional construct checking whether a capturing group has been defined or not. Due to capturing groups being scoped to each level of recursion, this is easily usable as a mechanism to set a boolean flag (can be set with (?<cl>)
anywhere) which can later be conditionally acted upon.
Then the actual addition is a simple:
is_one = ((left operand digit == 1) ^ (right operand digit == 1)) ^ carry
expressed by the digitadd
capturing group (using a ^ b == (a && !b) || (!a && b)
, using l1
being whether the digit of the left operand is equal to 1 and r1
equivalently for the right digit:
(?<digitadd> (?(?= (?(?= (?(l1)(?(r1)|(*F))|(?(r1)(*F))) )(?&carry)|(?!(?&carry))) )1|0) )
Checking addition at a given offset
Now, that we can verify, given a defined cr
, cl
, l1
and r1
whether the digit at the result is correct by simply invoking (?&digitadd)
at that offset.
... at that offset. That's the next challenge, finding said offset.
The fundamental issue is, given three strings with a known separator in between, how to find the correct offset from the right.
For example 1***+****0***=****1***
(the separators are +
and =
here, and *
denoting any arbitrary digit).
Or even, more fundamentally: 1***+****0***=1
.
This can be matched with:
(?<fundamentalrecursedigit>
\+ \d*(?:1(?<r1>)|0)\k<r> = (?(r1) (?(l1) 0 | 1) | (?(l1) 1 | 0) ) \b
| (?=\d* + \d*(?<r>\d\k<r>) =) \d (?&fundamentalrecursedigit)
)
(?<fundamentalcheckdigit>
# Note: (?<r>) is necessary to initialize the capturing group to an empty string
(?:1(?<l1>)|0) (?<r>) (?&fundamentalrecursedigit)
)
(Many thanks here to nhahdth for his solution to this issue — altered here a little to fit the example)
First we're storing information about the digit at the current offset ((?:1(?<l1>)|0)
- set the flag (i.e. capturing group which can be checked against with (?(flag) ... | ...)
) l1
when the current digit is 1.
Then we build the string to the right of the searched offset of the right operand recursively with (?=\d* + \d*(?<r>\d\k<r>) =) \d (?&fundamentalrecursedigit)
, which advances by one digit (on the left operand) on each level of recursion and adds one digit more to the right part of the right operand: (?<r> \d \k<r>)
re-defines the r
capturing group and adds another digit to the already existing capture (referenced with \k<r>
).
Thus, as this recurses as long as there are digits on the left operand and exactly one digit is added to the r
capturing group per level of recursion, that capturing group will contain exactly as many characters as there were digits on the left operand.
Now, at the end of the recursion (i.e. when the separator +
is reached), we can go straight to the correct offset via \d*(?:1(?<r1>)|0)\k<r>
, as the searched digit will now be exactly the digit before what the capturing group r
matched.
Having now also the r1
flag conditionally set, we can go to the end, checking the result for correctness with simple conditionals: (?(r1) (?(l1) 0 | 1) | (?(l1) 1 | 0)
.
Given this, it is trivial to extend this to 1***+****0***=****1***
:
(?<fundamentalrecursedigit>
\+ \d*(?:1(?<r1>)|0)\k<r> = \d*(?(r1) (?(l1) 0 | 1) | (?(l1) 1 | 0) )\k<f> \b
| (?=\d* + \d*(?<r>\d\k<r>) = \d*(?<f>\d\k<f>)\b) \d (?&fundamentalrecursedigit)
)
(?<fundamentalcheckdigit>
(?:1(?<l1>)|0) (?<r>) (?<f>) (?&fundamentalrecursedigit)
)
by using the exactly same trick, collecting also the right part of the result in an f
capturing group and accessing the offset right before what this capturing group f
matched.
Let's add support for a carry, which really is just setting the cr
and cl
flags whether the next digit was 1 via (?=(?<cr/cl>1)?)
after the current digits of the left and right operands:
(?<carryrecursedigit>
\+ \d* (?:1(?<r1>)|0) (?=(?<cr>1)?) \k<r> = \d* (?&digitadd) \k<f> \b
| (?=\d* + \d*(?<r>\d\k<r>) = \d*(?<f>\d\k<f>)\b) \d (?&carryrecursedigit)
)
(?<carrycheckdigit>
(?:1(?<l1>)|0) (?=(?<cl>1)?) (?<r>) (?<f>) (?&carryrecursedigit)
)
Checking inputs of equal lengths
Now, we could be done here if we were to pad all the inputs with enough zeroes:
\b
(?=(?<iteratedigits> (?=(?&carrycheckdigit)) \d (?:\b|(?&iteratedigits)) ))
[01]+ (?&add) [01]+ (?&eq) [01]+
\b
i.e. assert for each digit of the left operand recursively that the addition can be performed and then succeed.
But obviously, we aren't done yet. What about:
- the left operand being longer than the right?
- the right operand being longer than the left?
- the left operand being as long or longer than the right operand and the result having a carry at the most significant digit (i.e. needs a leading 1)?
Handle a left operand longer than the right one
That one is pretty trivial, just stop trying to add digits to the r
capturing group when we have captured it completely, set a flag (here: ro
) to not consider it eligible for carry anymore and make a digit leading r
optional (by (?:\d* (?:1(?<r1>)|0))?
):
(?<recursedigit>
\+ (?:\d* (?:1(?<r1>)|0))? (?(ro)|(?=(?<cr>1)?)) \k<r> = \d* (?&digitadd) \k<f> \b
| (?=\d* + (?:\k<r>(?<ro>)|\d*(?<r>\d\k<r>)) = \d*(?<f>\d\k<f>)\b) \d (?&recursedigit)
)
(?<checkdigit>
(?:1(?<l1>)|0) (?=(?<cl>1)?) (?<r>) (?<f>) (?&recursedigit)
)
This is now handling the right operand as if it were zero-padded; r1
and cr
now just never will be set after that point. Which is all what we need.
It might be easy to be confused here why we can set the ro
flag immediately when exceeding the right operators length and immediately ignore the carry; the reason is checkdigit
already consuming the first digit at the current position, thus we are actually already more than one digit past the end of the right operand.
The right operand happens to be longer than the left
This is now a bit harder. We can't cram it into recursedigit
as that one will just iterate as often as there are digits in the left operand. Thus, we need a separate match for that.
There are now a few cases to consider:
- There is a carry from the addition of the most significant digit of the left operand
- There is no carry.
For the former case we want to match 10 + 110 = 1000
, 11 + 101 = 1000
; for the latter case we want to match 1 + 10 = 11
or 1 + 1000 = 1001
.
To simplify our task, we're going to ignore leading zeroes for now. Then we know that the most significant digit is 1. There is now no carry only and only if:
- the digit at the current offset (i.e. the offset of the most significant digit of the left operand) in the right operand is 0
- and there was no carry from the previous offset, meaning the digit at the current in the result is 1.
This translates into the following assertion:
\d+(?=0)\k<r> (?&eq) \d*(?=1)\k<f>\b
In that case, we can capture the first \d+
with (?<remaining>\d+)
and require it to be in front of \k<f>
(the part on the right of the current offset of the result):
(?<remaining>\d+)\k<r> (?&eq) \k<remaining>\k<f>\b
Otherwise, if there is a carry, we need to increment the left part of the right operand:
(?<carryoverflow>
(?<x>\d+) 0 (?<y> \k<r> (?&eq) \k<x>1 | 1(?&y)0 )
| (?<z> 1\k<r> (?&eq) 10 | 1(?&z)0 )
)
and use it as:
(?&carryoverflow)\k<f>\b
This carryoverflow
capturing group works by copying the left part of the right operand, finding the last zero there and replacing all ones less significant than that zero by zeroes and the zero by an one. In case there is no zero in that part, the ones are simply all replaced by a zero and a leading one is added.
Or to express it less wordily (with n being arbitrary, so that it fits):
(?<x>\d+) 0 1^n \k<r> (?&eq) \k<x> 1 0^n \k<f>\b
| 1^n \k<r> (?&eq) 1 0^n \k<f>\b
So, let's apply our usual technique to figure out the parts on the right of the operands:
(?<recurselongleft>
(?&add) (?:(?<remaining>\d+)(?=(?=0)\k<r> (?&eq) \d*(?=1)\k<f>\b)\k<r> (?&eq) (*PRUNE) \k<remaining>\k<f>\b
| (?&carryoverflow)\k<f>\b)
| (?=\d* (?&add) \d*(?<r>\d\k<r>) (?&eq) \d*(?<f>\d\k<f>)\b) \d(?&recurselongleft)
)
Note that I've added a (*PRUNE)
after (?&eq)
in the first branch in order to avoid backtracking into the second branch with carryoverflow
, in case there happens to be no carry and the result not matching.
Note: We do not do any checks against the \k<f>
part here, this is managed by the carrycheckdigit
capturing group from above.
The case of the leading 1
We sure do not want 1 + 1 = 0
to be matched. Which it is though if we go by checkdigit
alone. So, there are different circumstances when that leading 1 is necessary (if not covered yet by the previous case of the right operand being longer):
- Both operands (being without leading zeroes) are of equal length (i.e. they both have an 1 at their most significant digit, which, added together, leaves a carry)
- The left operand is longer and there is a carry at the most significant digit or both strings are just as long.
The former case handles inputs like 10 + 10 = 100
, the second case handles 110 + 10 = 1000
as well as 1101 + 11 = 10100
and the last one trivially 111 + 10 = 1001
.
The first case can be handled by setting a flag ro
whether the left operand is longer than the right one, which then can be checked against at the end of recursion:
(?<recurseequallength>
(?&add) \k<r> (?&eq) (?(ro)|1)\k<f>\b
| (?=\d* (?&add) (?:\k<r>(?<ro>) | \d*(?<r>\d\k<r>)) (?&eq) \d*(?<f>\d\k<f>)\b) \d(?&recurseequallength)
)
The second case means we just need to check for the existence of a carry in case of ro
(i.e. the right operand being shorter). The carry can as usually be checked (as the most significant digit is always 1 and the right operands digit is implicitly 0 then) with a trivial (?:1(?=0))?\k<f>\b
- if there was a carry the digit at the current offset in the result will be 0.
That's easily possible, as, after all, all the other digits up to the current offset will all be verified by checkdigit
before. Hence we could just check the carry locally here.
Thus we can add this to the first branch of the recurseequallength
alternation:
(?<recurseoverflow>
(?&add) (?(rlast) \k<r> (?&eq) (?(ro)(?:1(?=0))?|1)\k<f>\b
| (?:(?<remaining>\d+)(?=0\d* (?&eq) \d*(?=1)\k<f>\b)\k<r> (?&eq) (*PRUNE) \k<remaining>\k<f>\b
| (?&carryoverflow)\k<f>\b))
| (?=\d* (?&add) (?:\k<r>(?<ro>)|(?=(?:\d\k<r>(?&eq)(?<rlast>))?)\d*(?<r>\d\k<r>)) (?&eq) \d*(?<f>\d\k<f>)\b)
\d(?&recurseoverflow)
)
Then to wire everything together: first check checkdigit
for each individual digit (same as for the simple zero-padded case from before) and then initialize the different capturing groups used by recurseoverflow
:
\b
(?=(?<iteratedigits> (?=(?&checkdigit))\d (?:\b|(?&iteratedigits)) ))
(?=[01]+ (?&add) [01]+ (?&eq) [01]+ \b)
(?<r>) (?<f>) (?&recurseoverflow)
\b
What about the zeroes?
0 + x = x
and x + 0 = x
are still unhandled and will fail.
Instead of hacking out big capturing groups to handle it uglily, we resort to handling them manually:
(0*? (?<arg>[01]+) (?&add) 0+ | 0+ (?&add) 0*? (?<arg>[01]+)) (?&eq) 0* \k<arg>
Note: When using in alternation with the main branch, we need to put a (*PRUNE)
after the (?&eq)
in order to avoid jumping into that main branch when any operand is zero and the match failing.
Now, we also always assumed there were no leading zeroes in the inputs for simplifying our expressions. If you look at the initial regex, you will find many occurrences of 0*
and 0*+
(possessive to avoid backtracking into it and … unexpected things happening) in order to skip the leading zeroes as we are assuming at some places that the leftmost digit is an 1.
Conclusion
And that's it. We achieved matching only correct additions of binary numbers.
A small note about the relatively new J
flag: it allows to redefine capturing groups. This is in first place important in order to initialize the capturing groups to an empty value. Additionally it simplifies some conditionals (like addone
) as we only have to check for one value instead of two. Compare (?(a) ... | ...)
vs. (?(?=(?(a)|(?(b)|(*F)))) ... | ...)
. Also, it is not possible to reorder the multiply defined capturing groups arbitrarily inside the (?(DEFINE) ...)
construct.
Final note: Binary addition is not a Chomsky type 3 (i.e. regular) language. This is a PCRE specific answer, using PCRE specific features. [Other regex flavors like .NET may be able to solve it too, but not all may.]
If there are any questions, please leave a comment, I'll then try to clarify this in this answer.