[antlr-interest] lexer or parser for comments and remarks?
Richard Bown
richard.bown at ferventsoftware.com
Wed Apr 4 13:16:40 PDT 2007
Hi,
I've got a lexer non-determinism which has me going back and forth
between trying to fix the lexer and trying to write a good parser rule
for the problem - neither of which I can seem to get right. This is
with ANTLR 2.7.5 generating C#. Apologies for the lengthy explanation.
The rules are for handling SQL comments and similar one-liners (I'm
treating these all as single-line statements). This is relatively
simple for comments that start with non-alphabetic characters - you
can write a lexer rule such as this:
COMMENT :
'-' '-' ( ~('\r' | '\n') )* NEWLINE { setType(Token.SKIP); }
;
If you then define, say, a similar rule for "rem" statements:
REMARK :
'r' 'e' 'm' ( ~( '\r' | '\n' ) ) * NEWLINE
{ setType(Token.SKIP); }
;
This is fine - but if you also have a lexer rule for matching
identifiers i.e. (slightly simplified):
ID :
'A'..'Z' | 'a'..'z' | '_' | '0'..'9'
;
You get non-determinisms of course between the "rem" rule and ID (both
can start with the letter 'r'), and likewise with any other rules that
use alphanumerics.
The other way I've attempted to solve this is to just catch the 'rem'
elements in the parser. This is fine (and it would be a more useful
solution if I wanted to do some simple processing of the comment lines)
but then if there are any interesting and unusual characters in the
remainder of the "rem" line then the parser doesn't match on these
elements. I've tried to be exhaustive about the type of 'words' that
the remainder of the comment lines can contain but then I start to trip
over lexer rules again and we go around in circles.
So I've been going back to the lexer to solve this - and whilst things
almost work despite the non-determinisms, the lexer rules also greedily
slurp up parts of legitimate ids. One way to fix that, I s'pose, is to
force the lexer to match these rules only at the start of a line... but
it all seems like putting one hack on top of another.
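To make the "start of the line only" idea concrete, here is a minimal
sketch in Python rather than ANTLR grammar syntax. It is an illustration
under assumptions, not ANTLR's mechanism: the function name tokenize, the
token type names and the case-insensitive handling of "rem" are all
hypothetical. The point it shows is that "rem" is only treated as a
remark when it is the first thing on a line and is not followed by
another identifier character, so ids such as "remainder" survive.

```python
import re

# Hypothetical patterns; the names loosely mirror the grammar rules above.
# "rem" counts as a remark only if no identifier character follows it.
LINE_START_REM = re.compile(r"[ \t]*rem(?![A-Za-z0-9_])[^\r\n]*", re.IGNORECASE)
DASH_COMMENT = re.compile(r"--[^\r\n]*")
ID = re.compile(r"[A-Za-z0-9_]+")
WS = re.compile(r"[ \t\r\n]+")


def tokenize(text):
    """Return (type, text) pairs, skipping remarks, comments and whitespace."""
    tokens = []
    pos = 0
    at_line_start = True
    while pos < len(text):
        # Remarks are only tried at the start of a line.
        if at_line_start:
            m = LINE_START_REM.match(text, pos)
            if m:
                pos = m.end()
                at_line_start = False
                continue
        # "--" comments may start anywhere and run to the end of the line.
        m = DASH_COMMENT.match(text, pos)
        if m:
            pos = m.end()
            at_line_start = False
            continue
        m = WS.match(text, pos)
        if m:
            # A newline inside the whitespace puts us back at a line start.
            at_line_start = "\n" in m.group() or "\r" in m.group()
            pos = m.end()
            continue
        m = ID.match(text, pos)
        if m:
            tokens.append(("ID", m.group()))
            pos = m.end()
        else:
            # Anything else becomes a single-character token.
            tokens.append(("CHAR", text[pos]))
            pos += 1
        at_line_start = False
    return tokens
```

In this sketch the line-start flag plays the role that a semantic
predicate or a column check might play in a generated lexer; whether
that maps cleanly onto ANTLR 2.7.5 is not something the sketch settles.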
Any clues how to handle this elegantly in the parser? To me it would
make sense to dump these lexer rules and handle all of these types in
the parser - I just need an equivalent to "match from here to end of
line" for the parser perhaps?
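One possible reading of "match from here to end of line" at the parser
level, sketched in Python under assumptions: the lexer is assumed to
attach a line number to every token and to emit some token for any
unusual character instead of failing. The Token tuple and the functions
skip_rest_of_line and drop_remarks are hypothetical names for
illustration only, not part of ANTLR.

```python
from collections import namedtuple

# Hypothetical token shape: a type, the matched text and a 1-based line number.
Token = namedtuple("Token", "type text line")


def skip_rest_of_line(tokens, index):
    """Return the index of the first token on a later line than tokens[index]."""
    line = tokens[index].line
    while index < len(tokens) and tokens[index].line == line:
        index += 1
    return index


def drop_remarks(tokens):
    """Drop every line whose first token is the identifier 'rem'."""
    kept = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        # A token starts a line if it is first overall or the line number changed.
        starts_line = i == 0 or tokens[i - 1].line != tok.line
        if starts_line and tok.type == "ID" and tok.text.lower() == "rem":
            i = skip_rest_of_line(tokens, i)
        else:
            kept.append(tok)
            i += 1
    return kept
```

The remainder of the line is skipped by comparing line numbers rather
than by enumerating the kinds of 'words' a remark may contain.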
Rgds,
Richard Bown