Recognizing multi-line sections with a Lark grammar
I'm trying to write a simple grammar to parse text with multi-line sections, but I can't wrap my head around how to do it. Here's the grammar I've written so far. I would appreciate any help.
PS: I realize that lark is overkill for this problem, but this is a very simplified version of what I'm actually trying to parse.
from unittest import TestCase
from lark import Lark

text = '''
[section 1]
line 1.1
line 1.2
[section 2]
line 2.1
'''

class TestLexer(TestCase):
    def test_basic(self):
        p = Lark(r"""
            _LB: "["
            _RB: "]"
            _NL: /\n/+
            name: /[^]]+/
            content: /.+/s
            section: _NL* _LB name _RB _NL* content
            doc: section*
        """, parser='lalr', start='doc')
        parsed = p.parse(text)
Solution 1:
The problem is that your content regex can match anywhere and at any length (the s flag lets . match newlines too), so it swallows text that the rest of the grammar needs to see. Instead, you should have a terminal restricted to a single line and give it a lower priority than the rest.
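To illustrate the difference outside of lark, here is a small sketch using only Python's re module. The sample string is my own, shortened from the question's text; it just shows how the same pattern behaves with and without the DOTALL flag (lark's s flag).

```python
import re

# Sample input (an assumption, shortened from the question's text).
sample = "line 1.1\nline 1.2\n[section 2]\nline 2.1\n"

# With DOTALL, `.` also matches newlines, so one match runs to the end
# of the input and consumes the next section header along the way.
greedy = re.compile(r".+", re.DOTALL).match(sample).group()

# Without the flag, `.` stops at the first newline: a single line.
single_line = re.compile(r".+").match(sample).group()

print(repr(greedy))
print(repr(single_line))
```

The second form is what a line-restricted terminal relies on: each match covers one line, which leaves the newline and the following "[" for other terminals.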
p = Lark(r"""
_NL: /\n/+
name: /[^]]+/
content: (ANY_LINE _NL)+
ANY_LINE.-1: /.+/
section: _NL* "[" name "]" _NL* content
doc: section*
""", parser='lalr', start='doc')
You may need some extra work now to convert the content rule into exactly what you want, but since you say this isn't your exact problem, I won't go into the details here.