[antlr-interest] lexer or parser for comments and remarks?

Sun Apr 15 04:38:30 PDT 2007

Richard Bown wrote:
> Maybe I was blathering a bit too much there - is this so simple and
> stoopid it's in a FAQ or an example that I've missed?  Anyone got any
> recommendations or pointers?   Even a RTFM would do.
> 
> Richard

How about using the tokens section to define a REM token, like so?

tokens {
   REM="rem";
}

And using

REMARK :
   REM ( ~( '\r' | '\n' ) ) * NEWLINE
   { setType(Token.SKIP); }
   ;

should make the lexer favor "rem" as the start of REMARK instead of
recognizing it as an ID token. At least that should work for ANTLR 3 -
I don't know much about how it differs from ANTLR 2.7.7.
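
To illustrate the underlying idea outside of ANTLR, here is a minimal
hand-written sketch in Python. It is not ANTLR-generated code and not
the grammar above. It matches a whole identifier-shaped word first and
only afterwards checks whether that word is exactly "rem", so that
names such as "remainder" stay ordinary IDs. The function name
tokenize, the token type names, the case-insensitive keyword test and
the whitespace handling are all assumptions made for the example.

```python
import re

# Hypothetical patterns for this sketch; none of these names come from ANTLR.
WORD = re.compile(r"[A-Za-z_][A-Za-z_0-9]*")   # identifier-shaped word
DASH_COMMENT = re.compile(r"--[^\r\n]*")       # SQL "--" comment up to end of line
REST_OF_LINE = re.compile(r"[^\r\n]*")         # everything before the line break
SPACE = re.compile(r"\s+")


def tokenize(text):
    """Return (type, text) pairs; comments, remarks and whitespace are dropped."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = SPACE.match(text, pos)
        if m:
            pos = m.end()
            continue
        m = DASH_COMMENT.match(text, pos)
        if m:
            pos = m.end()
            continue
        m = WORD.match(text, pos)
        if m:
            # Decide keyword vs. identifier only after the whole word is known.
            # Assumption: "rem" is matched case-insensitively.
            if m.group().lower() == "rem":
                # Remark: discard the rest of the line, whatever it contains.
                pos = REST_OF_LINE.match(text, m.end()).end()
            else:
                tokens.append(("ID", m.group()))
                pos = m.end()
            continue
        # Anything else is passed through as a single-character token.
        tokens.append(("CHAR", text[pos]))
        pos += 1
    return tokens
```

With this ordering, a line starting with the word "rem" is skipped up
to the line break even if it contains unusual characters, while
"remainder" and "rem_x" come out as ID tokens.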

Best regards,
Johannes Luber

> Richard Bown wrote:
>> Hi,
>>
>> I've got a lexer non-determinism which is making me go back and
>> forward between trying to fix the lexer or trying to write a good
>> parser rule for the problem - neither of which I can seem to get
>> right.  This is with Antlr 2.7.5 producing C#.  Apologies for the
>> lengthy explanation.
>>
>> The rules are for handling SQL comments and similar one liners (I'm
>> treating these all as single line statements).   Whilst this is
>> relatively simple for non-alphabetic characters - you can write a
>> lexer rule such as this:
>>
>> COMMENT :
>>     '-' '-' ( ~('\r' | '\n') )* NEWLINE { setType(Token.SKIP); }
>> ;
>>
>> If you then define say a similar line for "rem" statements:
>>
>> REMARK :
>>     'r' 'e' 'm' ( ~( '\r' | '\n' ) ) * NEWLINE
>> { setType(Token.SKIP); }
>> ;
>>
>> This is fine - but if you also have a lexer rule for matching
>> identifiers i.e. (slightly simplified):
>>
>> ID :
>>     'A'..'Z' | 'a'..'z' | '_' | '0'..'9'
>> ;
>>
>> You get non-determinisms of course with the "rem" rule and any other
>> rules that use alphanumerics.
>>
>> The other way I've attempted to solve this is to just catch the 'rem'
>> elements in the parser.   This is fine (and it would be a more useful
>> solution if I wanted to do some simple processing of the comment
>> lines) but then if there are any interesting and unusual characters in
>> the remainder of the "rem" line then the parser doesn't match on these
>> elements.  I've tried to be exhaustive about the type of 'words' that
>> the remainder of the comment lines can contain but then I start to
>> trip over lexer rules again and we go around in circles.
>>
>> So I've been going back to the lexer to solve this - and whilst with
>> the non-determinisms things almost work, the lexer rules also greedily
>> slurp  up parts of legitimate ids.   One thing to fix that is to force
>> the lexer to match from the start of the line only I s'pose... but it
>> all seems like putting one hack on top of another.
>>
>> Any clues how to handle this elegantly in the parser?   To me it would
>> make sense to dump these lexer rules and handle all of these types in
>> the parser - I just need an equivalent to "match from here to end of
>> line" for the parser perhaps?
>>
>> Rgds,
>> Richard Bown
>>
>