ANTLR4: catch an entire line of arbitrary data

I have a grammar with command lines starting with a / and "data lines", which are all lines that do not start with a slash.

I just can't get it to parse correctly. The following rule

FM_DATA: ( ('\r' | '\n' | '\r\n') ~'/') -> mode(DATA_MODE);

does almost what I need, but for a data line of

abcde

the following tokens are generated

[@23,170:171='\na',<4>,4:72]
[@24,172:175='bcde',<103>,5:1]

so the first character is swallowed by the rule.

I also tried

FM_DATA: ( {getCharPositionInLine() == 0}? ~'/') -> mode(DATA_MODE);

but this causes even weirder things.

What's the correct rule to get this to work as expected?

TIA - Alex


Solution 1:

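The ... -> more command can be used so that the first char (or the first part of a lexer rule) is matched without producing a token of its own: its text is kept and becomes the start of the next token the lexer emits.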

A quick demo:

lexer grammar FmDataLexer;

NewLine
 : [\r\n]+ -> skip
 ;

CommandStart
 : '/' -> pushMode(CommandMode)
 ;

FmDataStart
 : . -> more, pushMode(FmDataMode)
 ;

mode CommandMode;

 CommandLine
  : ~[\r\n]+ -> popMode
  ;

mode FmDataMode;

 FmData
  : ~[\r\n]+ -> popMode
  ;

If you run the following code:

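import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.Token;

public class Main {
  public static void main(String[] args) {
    FmDataLexer lexer = new FmDataLexer(CharStreams.fromString("abcde\n/mu"));
    CommonTokenStream stream = new CommonTokenStream(lexer);
    stream.fill();
    for (Token t : stream.getTokens()) {
      System.out.printf("%-20s '%s'\n", FmDataLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
    }
  }
}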

you'll get this output:

FmData               'abcde'
CommandStart         '/'
CommandLine          'mu'
EOF                  '<EOF>'
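
To make the token boundaries above easier to follow, here is a rough hand-written sketch in plain Java that mimics what the modes are meant to do. It does not use ANTLR at all: the class ModeSketch and its tokenize method are made up for illustration, and it may differ from the generated lexer on edge cases (for example one-character lines, or a / with nothing after it). The point is only that the first character of a data line ends up inside the FmData text, which is what -> more achieves in the grammar.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of ANTLR: imitates the intended tokenization by hand.
final class ModeSketch {

    // Returns entries of the form "TokenName 'text'" (no EOF entry).
    static List<String> tokenize(String input) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '\r' || c == '\n') {
                i++; // like NewLine -> skip
                continue;
            }
            // Find the end of the current line.
            int end = i;
            while (end < input.length()
                    && input.charAt(end) != '\r'
                    && input.charAt(end) != '\n') {
                end++;
            }
            if (c == '/') {
                // like CommandStart, followed by CommandLine in CommandMode
                out.add("CommandStart '/'");
                if (end > i + 1) {
                    out.add("CommandLine '" + input.substring(i + 1, end) + "'");
                }
            } else {
                // like FmDataStart -> more: the first char stays part of the FmData text
                out.add("FmData '" + input.substring(i, end) + "'");
            }
            i = end;
        }
        return out;
    }

    public static void main(String[] args) {
        for (String token : tokenize("abcde\n/mu")) {
            System.out.println(token);
        }
    }
}
```

Note that the substring for FmData starts at i, not i + 1, whereas the CommandLine text starts after the slash. That is the difference between a start rule using more and one that emits its own token.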

See: https://github.com/antlr/antlr4/blob/master/doc/lexer-rules.md#mode-pushmode-popmode-and-more