[antlr-interest] lexer or parser for comments and remarks?
Richard Bown
richard.bown at ferventsoftware.com
Sun Apr 15 01:11:00 PDT 2007
Maybe I was blathering a bit too much there - is this so simple and
stoopid it's in a FAQ or an example that I've missed? Anyone got any
recommendations or pointers? Even an RTFM would do.
Richard
Richard Bown wrote:
> Hi,
>
> I've got a lexer non-determinism which has me going back and forth
> between trying to fix the lexer and trying to write a good parser rule
> for the problem - neither of which I can seem to get right. This is
> with ANTLR 2.7.5 producing C#. Apologies for the lengthy explanation.
>
> The rules are for handling SQL comments and similar one-liners (I'm
> treating these all as single-line statements). This is relatively
> simple when the line starts with non-alphabetic characters - you can
> write a lexer rule such as this:
>
> COMMENT :
> '-' '-' ( ~('\r' | '\n') )* NEWLINE { setType(Token.SKIP); }
> ;
>
> You can then define, say, a similar rule for "rem" statements:
>
> REMARK :
> 'r' 'e' 'm' ( ~( '\r' | '\n' ) ) * NEWLINE
> { setType(Token.SKIP); }
> ;
>
> This is fine - but if you also have a lexer rule for matching
> identifiers i.e. (slightly simplified):
>
> ID :
> 'A'..'Z' | 'a'..'z' | '_' | '0'..'9'
> ;
>
> You then get non-determinisms, of course, between the "rem" rule and
> ID (and any other rules that use alphanumerics), since both can start
> with the same letters.
>
> The other way I've attempted to solve this is to just catch the 'rem'
> elements in the parser. This is fine (and it would be a more useful
> solution if I wanted to do some simple processing of the comment lines)
> but then if there are any interesting and unusual characters in the
> remainder of the "rem" line then the parser doesn't match on these
> elements. I've tried to be exhaustive about the type of 'words' that
> the remainder of the comment lines can contain but then I start to trip
> over lexer rules again and we go around in circles.
>
> So I've been going back to the lexer to solve this - and whilst with the
> non-determinisms things almost work, the lexer rules also greedily slurp
> up parts of legitimate ids. One thing to fix that is to force the
> lexer to match from the start of the line only I s'pose... but it all
> seems like putting one hack on top of another.
>
> Any clues how to handle this elegantly in the parser? To me it would
> make sense to dump these lexer rules and handle all of these types in
> the parser - I just need an equivalent to "match from here to end of
> line" for the parser perhaps?
>
> Rgds,
> Richard Bown
>
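To make the "match from the start of the line only" idea in the quoted message concrete, here is a minimal sketch of the behaviour I'm after. It is hand-written Python rather than ANTLR 2 grammar syntax, and everything in it is an assumption made for illustration: the function name tokenize, the token type names, and the choice that "rem" only counts as a remark when it is the first word on a line and is not followed by an identifier character.

```python
import re

# Hypothetical patterns; the names mirror the rules in the quoted message.
# REMARK: optional indent, then "rem" NOT followed by an identifier character,
# then everything up to the end of the line.
REMARK = re.compile(r"[ \t]*rem(?![A-Za-z0-9_])[^\r\n]*", re.IGNORECASE)
COMMENT = re.compile(r"--[^\r\n]*")
ID = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
WS = re.compile(r"[ \t]+")


def tokenize(text):
    """Return (type, lexeme) pairs, skipping '--' comments and 'rem' lines."""
    tokens = []
    for line in text.splitlines():
        # REMARK is only tried at the start of a line, so "remainder" and
        # a "rem" appearing later in a line are left for the ID rule.
        if REMARK.match(line):
            continue  # skip from here to end of line
        pos = 0
        while pos < len(line):
            m = WS.match(line, pos)
            if m:
                pos = m.end()
                continue
            if COMMENT.match(line, pos):
                break  # the rest of the line is a comment
            m = ID.match(line, pos)
            if m:
                tokens.append(("ID", m.group()))
                pos = m.end()
                continue
            # Anything else is passed through as a single-character token.
            tokens.append(("CHAR", line[pos]))
            pos += 1
    return tokens
```

With this, a line such as "rem weird $%^ chars" is dropped whole, whatever characters follow, while "remainder" still comes out as an ordinary identifier. Whether the same check can be expressed cleanly in an ANTLR 2.7.5 lexer rule is exactly what I'm unsure about.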