Code Golf: Regex parser
The goal
Today's Code Golf challenge is to create a regex parser in as few characters as possible.
The syntax
No, I'm not asking you to match Perl-style regular expressions. There's already a very reliable interpreter for those, after all! :-)
Here's all you need to know about regex syntax for this challenge:
- A term is defined as a single literal character, or a regular expression within grouping parentheses
()
. - The
*
(asterisk) character represents a Kleene star operation on the previous TERM. This means zero or more of the previous term, concatenated together. - The
+
(plus) character represents a convenient shortcut:a+
is equivalent toaa*
, meaning one or more of the previous term. - The
?
(question mark) character represents zero or one of the previous term. - The
|
(pipe) character represents an alternation, meaning that the REGULAR EXPRESSIONS on either side can be used in the match. - All other characters are assumed to be literal. You may assume that all other characters are within
[0-9A-Za-z]
(i.e., all English alphanumerics).
Or, put another way: *
/+
/?
have highest precedence, then concatenation, then alternation. Since alternation has lower precedence than concatenation, its use within a regex without parentheses causes it to be bound to the full regex on each side. *
and +
and ?
, on the other hand, would just apply to the immediately preceding term.
The challenge
Your challenge is to write a program that will compile or interpret a regular expression (as defined above) and then test a number of strings against it.
I'm leaving input up to you. My recommendation would be that the regex should probably come first, and then any number of strings to be tested against it; but if you want to make it last, that's fine. If you want to put everything in command-line arguments or into stdin, or the regex in command-line and the strings in stdin, or whatever, that's fine. Just show a usage example or two.
Output should be true
or false
, one per line, to reflect whether or not the regex matches.
Notes:
- I shouldn't need to say this... but don't use any regex libraries in your language! You need to compile or interpret the pattern yourself. (Edit: You may use regex if it's required for splitting or joining strings. You just can't use it to directly solve the problem, e.g., converting the input regex into a language regex and using that.)
- The regular expression must COMPLETELY match the input string for this challenge. (Equivalently, if you're familiar with Perl-like regex, assume that start- and end-of-string anchoring is in place for all matches)
- For this challenge, all of the special characters
()*+?|
are not expected to occur literally. If one comes up in the input, it is safe to assume that no pattern can match the string in question. - Input strings to test should be evaluated in a case-sensitive manner.
The examples
For the examples, I'm assuming everything is done in command-line arguments, regex first. (As I said above, input is up to you.) myregex
here represents your invocation of the program.
> myregex easy easy Easy hard
true
false
false
> myregex ab*a aa abba abab b
true
true
false
false
> myregex 0*1|10 1 10 0110 00001
true
true
false
true
> myregex 0*(1|1+0) 1 10 0110 00001
true
true
true
true
> myregex a?b+|(a+b|b+a?)+ abb babab aaa aabba a b
true
true
false
true
false
true
NOTE: Sorry, forgot to make community wiki! :-(
C (interpreting), 630 622 617 615 582 576 573 572 558 554 544 538 529 504 502 500 499 chars
This understands parentheses, and works on the regexp live (i.e. not parsed first)
#define C char
#define M m(s,o
m(C*s,C*o,C*S,C*p,C*P,C T){C*n=P-1,*q=s,h=*P==41,c=1;for(;h*c;c-=*n--==40)c+=*n==41;
c=*P-42;c=p-P?c-82?T&&c&~1&&c-21?h?2:*S==*P&s<S?M,S-1,p,n,2)||(T&4&&M,S-1,p,P,T|1)):
4:M,T?S:o,p,P-1,T?c&~1?3:7-c:0):T&&s==S||M,o,p,P-1,2):T&&s==S;if(c==2)for(c=4;q<=S;q
++)c|=m(q,S,S,n+h,P-h,2)?M,q,p,n,2)||q<S&T/4&&M,q,p,P,T&6):0;return
c&4?c&1?:T&1&&M,S,p,n,2)||M,o,p,n,0):c;}main(C*w,C**v){C*u;for(w=*++v;*++v;)puts(m(*v
-1,u,u=index(*v,0)-1,w-1,index(w,0)-1,2)?"true":"false");}
compiling with -Wall complains about argc being char. Will crash on illegal patterns.
This parses regexp and string from right to left, saving a few chars.
input on argv, output in reverse normal order:
$ ./a.out ab*a aa abba abab b
true
true
false
false
$ ./a.out "0*1|10" 1 10 0110 00001
true
true
false
true
$ ./a.out "0*(1|1+0)" 1 10 0110 00001
true
true
true
true
$ ./a.out "a?b+|(a+b|b+a?)+" abb babab aaa aabba a b
true
true
false
true
false
true
$ ./a.out "((A|B|C)+(a|(bbbbb)|bb|c)+)+" ABCABCaccabbbbbaACBbbb
false
$ ./a.out "((A|B|C)+(a|(bbbbb)|bb|c)+)+" ABCABCaccabbbbbaACBbbbb
true
GolfScript - 254 chars
n%([]:B:$:_"()"@*{:I"()*+|?"[{}/]?[{[[0B$,+:B))\;)]_]+}{B)):ß;:B;qß(:ß;}{8q}{[[0ß0$,)]]+}:8{[[0B-1=:ß)]]+:$q}{ß>$ß<\([0+$,+]\++}:q{[[I$,:ß)]]+}]=~:$}/;{n+[0]:3\{:c;;3:1_:3;{,}{)[$=]_*2/{~\.{c={3|:3}*;}{;.1|1,\:1,<{+0}*;}if}/}/;}/;1$,?)"true""false"if n}%
Somewhat straightforwardly, the first loop converts the regex into an NFA, which the second loop runs.
Sun Aug 22 00:58:24 EST 2010
(271→266) changed variable names to remove spacesSun Aug 22 01:07:11 EST 2010
(266→265) made []
a variableSun Aug 22 07:05:50 EST 2010
(265→259) made null state transitions inlineSun Aug 22 07:19:21 EST 2010
(259→256) final state made implicitMon Feb 7 19:24:19 EST 2011
(256→254) using "()""str"*
$ echo "ab*a aa abba abab b"|tr " " "\n"|golfscript regex.gs
true
true
false
false
$ echo "0*1|10 1 10 0110 00001"|tr " " "\n"|golfscript regex.gs
true
true
false
true
$ echo "0*(1|1+0) 1 10 0110 00001"|tr " " "\n"|golfscript regex.gs
true
true
true
true
$ echo "a?b+|(a+b|b+a?)+ abb babab aaa aabba a b"|tr " " "\n"|golfscript regex.gs
true
true
false
true
false
true
$ echo "((A|B|C)+(a|(bbbbb)|bb|c)+)+ ABCABCaccabbbbbaACBbbb ABCABCaccabbbbbaACBbbbb"|tr " " "\n"|golfscript regex.gs
false
true
C (parsing+matching) 727 670 chars
This is not fully golfed to the bare minimum yet, but I wanted to try and see if parsing the regexp first would make a difference to interpreting it live. It does, because it costs more, although both parsing and matching are easier to write/understand.
#define Q struct q
#define C char
#define R return
Q{Q*u,*n,*a;C c,m};Q*P(C*p,C*e){Q*r=calloc(99,1);C*n=p+1,c=1,w;if(p==e)R
r;if(*p==40){for(;c;)c+=(*n==40)-(*n++==41);r->u=P(p+1,n-1);}else
if(*p=='|'){r->a=P(p+1,e);R r;}else r->c=*p;if(n<e){if(*n==43)*n=42,r->n=P(p,e);else
w=*n==42|*n==63,r->n=P(n+w,e),r->m=w?*n:0;r->a=r->n->a;}R r;}M(Q*r,C*s,C*o,C*z){C*p,
e;e=r?r->m==63|r->m==42&&M(r->n,s,o,z)?:*s&&r->c==*s?M(r->m==42?r:r->n,s+1,o,z):2:s
==z;if(e-2)R e;for(p=s,e=0;!r->c&p<=z;p++)e|=M(r->u,s,s,p)&(r->m!=42|p>s)&&M(r->m==
42?r:r->n,p,p,z);R e||r->a&&M(r->a,o,o,z);}main(C
c,C**v){for(;--c>1;)puts(M(P(v[1],index(v[1],0)),v[c],v[c],index(v[c],0))?"true":"false");}
it parses a regexp into a struct, where each struct has:
- the character to be matched or a pointer to a struct of a subpattern to be matched
- a pointer to the next symbol struct (if any)
- a pointer to the next alternative pattern (if any)
- a multiplier which is \0,
*
or?
--(pat)+
is parsed into(pat)(pat)*
right away which makes matching far easier
Haskell: 610 612 627
main=getLine>>=f.words
d=reverse
u=0<1
j=[]
f(r:s)=mapM_(print.any null.c(d$b$'(':r++")"))s
c%(x,y)=(c:x,y)
s _ _ _[]=(j,j)
s n a b (x:y)|n<1&&x==a=(j,x:y)|x==a=f(-1)|x==b=f 1|u=f 0where f k=x%s(n+k)a b y
b r|m==j=r|u=b$d(o++"(("++x)++")("++z++")/)"++w where(c,m)=s 0'|''!'r;_:v=m;(o,_:x)=s 0'('')'$d c;(z,_:w)=s 0')''('v
(!)g f x=f x>>=g
c[]=(:j)
c r=f!c s where(s,f)=i r
p q@(r:s)|r=='('=(s,(:j))|u=(a,f!g)where(w,f)=i q;(a,g)=p w
_?[]=j
c?(h:r)|c==h=[r]|u=j
i(r:q)=maybe(q,(r?))id$lookup r$zip")/*+?"$p q:zip[e,w,w,w][\s->f s++g s,\s->s:l s,l,\s->s:f s]where(w,f)=i q;(e,g)=i w;l s|f s==j=j|u=f s++(f s>>=l)
Ungolfed:
import Control.Monad
import Data.List
-- (aa|bb|cc) --> )|)cc()|)bb()aa(((
testRegex = "a?b+|(a+b|b+a?)+"
interpret regex = any null . interpret' regex
interpret' regex = compile (rewrite regex)
mapFst f (x, y) = (f x, y)
splitWhileBalanced _ _ _ "" = ("", "")
splitWhileBalanced n b1 b2 (x:xs)
| n < 1 && x == b1 = ("", x:xs)
| x == b1 = f (-1)
| x == b2 = f 1
| otherwise = f 0
where
f k = mapFst (x:) $ splitWhileBalanced (n+k) b1 b2 xs
rewrite regex = reverse $ moveBars $ '(' : regex ++ ")"
moveBars regex
| mtBar == "" = regex
| otherwise = moveBars $ reverse (hOpen ++ "((" ++ tOpen) ++ ")(" ++ hClose ++ ")/)" ++ tClose
where
(hBar, mtBar) = splitWhileBalanced 0 '|' '!' regex -- '!' is a dummy character
b:tBar = mtBar
(hOpen, o:tOpen) = splitWhileBalanced 0 '(' ')' $ reverse hBar
(hClose, c:tClose) = splitWhileBalanced 0 ')' '(' $ tBar
compile "" = \x -> [x]
compile rs = f <=< compile rs'
where
(rs', f) = compile' rs
paren regex@(r:rs0)
| r == '(' = (rs0, \x -> [x])
| otherwise = (rs2, f <=< g)
where
(rs1, f) = compile' regex
(rs2, g) = paren rs1
compile' (r:rs0) = case r of
'/' -> (rs2, bar)
'*' -> (rs1, star)
'+' -> (rs1, plus)
'?' -> (rs1, mark)
')' -> paren rs0
_ -> (rs0, lit)
where
(rs1, f) = compile' rs0
(rs2, g) = compile' rs1
bar str = f str ++ g str
plus str
| null (f str) = []
| otherwise = f str ++ (f str >>= plus)
star str = str : plus str
mark str = str : f str
lit = maybe [] (\x -> [x]) . stripPrefix [r]
Memo:
Modifies |
to a suffix form and then reverses the entire regex. Now the operators are in prefix form, making it easy to parse. The program parses the regex like this. The list monad is used for nondeterminism.
Usage:
> runghc Regex.hs
a?b+|(a+b|b+a?)+ abb babab aaa aabba a b
True
True
False
True
False
True
Ruby 1.9: 709 chars
R=gets.chop;s='';k=[];n=a=0
G={?(=>(A="(a-=1;s<<0)if a>1;")+"k<<[n,a];n=a=0",
Y=?|=>(B="s<<0while 0<a-=1;")+"n+=1",
?)=>B+(C="s<<?|while 0<=n-=1;")+"n,a=k.pop"+F=";a+=1",
?*=>D="s<<c",?+=>D,??=>D}
R.each_char{|c|eval G[c]||A+D+F};eval B+C
def P l,s;l.map{|a|a<<s};end
J={??=>(K="a=k.pop;")+"k<<[{Y=>n=[a[0]]},a[1]<<n]",
?*=>K+(H="P a[1],s={Y=>n=[a[0]]};")+"k<<[s,[n]]",
?+=>K+H+"k<<[a[0],[n]]",
Y=>(I=K+"b=k.pop;")+"k<<[{Y=>[a[0],b[0]]},a[1]+b[1]]",
?\0=>I+"P b[1],a[0];k<<[b[0],a[1]]"}
k=[];s.each_char{|c|eval J[c]||"k<<[{c=>a=[]},[a]]"}
e=k[0];P e[1],R;
p $<.map{|l|s=l.chop;*a=e[0]
s.each_char{|c|eval@f="n=a;a=a.map{|h|h[Y]||[h]}.flatten"while a!=n
a=a.inject([]){|a,h|a+(h[c]||[])}}
eval@f;a.include? R}
(This will also works in Ruby 1.8 with 45 more chars by adding the alias below)
Test with type testcase.txt | ruby regex.rb
An implementation of Russ Cox's NFA parser in Ruby. A state is represented as a hash with a single key which holds an array of next states.
Ungolfed:
#needed to run on ruby 1.8
class String
alias :each_char :each_byte
end
## Infix to Postfix
R=gets.chop
p "regexp = #{R}"
k=[]
s='' #will hold postfix string
n=a=0 #count braNches and Atoms
postfix = R.each_char{|c|
case c
when ?(
(a-=1;s<<0)if a>1
k<<[n,a]
n=a=0
when ?|
s<<0while 0<a-=1
n+=1
when ?)
s<<0while 0<a-=1
s<<?|while 0<=n-=1
n,a=k.pop;a+=1
when ?*,?+,??
s<< c
else
(a-=1;s<<0)if a>1
s<< c
a+=1
end
}
s<<0while 0<a-=1
s<<?|while 0<=n-=1
## Postfix to NFA
# State is {char=>s0=[state,state]]
# Frag is [state, [s0,...]]
Y=?| #choice indicator
X={:true=>[R]} #matcstate
def patch l,s #l is list of arrays, s is state. Put s in each array
l.map{|a| a<< s }
end
k=[]
s.each_char{|c|
case c
when ??
a=k.pop
s={Y=>n=[a[0]]}
k<<[s,a[1]<<n]
when ?*
a=k.pop
s={Y=>n=[a[0]]}
patch(a[1],s)
k<<[s,[n]]
when ?+
a=k.pop
s={Y=>n=[a[0]]}
patch(a[1],s)
k<<[a[0],[n]]
when ?|
b=k.pop
a=k.pop
s={Y=>[a[0],b[0]]}
k<<[s,a[1]+b[1]]
when 0
b=k.pop
a=k.pop
patch(a[1],b[0])
k<<[a[0],b[1]]
else
k<<[{c=>a=[]},[a]]
end
}
e=k.pop
patch(e[1],X)
#Running the NFA
E=[e[0]] #Evaluator
p $<.map{|l|s=l.chop
p "evaluating: #{s}"
a=E
s.each_char{|c|
begin #skip past split nodes
s=a.size
a=a.map{|h|h[?|]||[h]}.flatten
end while a.size!=s
a=a.inject([]){|a,h|
a+(h[c]||[])} #add next state or null
}
a=a.map{|h|h[?|]||[h]}.flatten
r = a.include? X #check for end of pattern
p r
r
}