[antlr-interest] lexer or parser for comments and remarks?

Sun Apr 15 01:11:00 PDT 2007

Maybe I was blathering a bit too much there - is this so simple and 
stoopid it's in a FAQ or an example that I've missed?  Anyone got any 
recommendations or pointers?   Even an RTFM would do.

Richard

Richard Bown wrote:
> Hi,
> 
> I've got a lexer non-determinism which is making me go back and forth 
> between trying to fix the lexer or trying to write a good parser rule 
> for the problem - neither of which I can seem to get right.  This is 
> with Antlr 2.7.5 producing C#.  Apologies for the lengthy explanation.
> 
> The rules are for handling SQL comments and similar one liners (I'm 
> treating these all as single line statements).   Whilst this is 
> relatively simple for non-alphabetic characters - you can write a lexer 
> rule such as this:
> 
> COMMENT :
>     '-' '-' ( ~('\r' | '\n') )* NEWLINE { setType(Token.SKIP); }
> ;
> 
> If you then define say a similar line for "rem" statements:
> 
> REMARK :
>     'r' 'e' 'm' ( ~( '\r' | '\n' ) ) * NEWLINE
> { setType(Token.SKIP); }
> ;
> 
> This is fine - but if you also have a lexer rule for matching 
> identifiers i.e. (slightly simplified):
> 
> ID :
>     ( 'A'..'Z' | 'a'..'z' | '_' | '0'..'9' )+
> ;
> 
> You get non-determinisms of course with the "rem" rule and any other 
> rules that use alphanumerics.
> 
> The other way I've attempted to solve this is to just catch the 'rem' 
> elements in the parser.   This is fine (and it would be a more useful 
> solution if I wanted to do some simple processing of the comment lines) 
> but then if there are any interesting and unusual characters in the 
> remainder of the "rem" line then the parser doesn't match on these 
> elements.  I've tried to be exhaustive about the type of 'words' that 
> the remainder of the comment lines can contain but then I start to trip 
> over lexer rules again and we go around in circles.
> 
> So I've been going back to the lexer to solve this - and whilst with the 
> non-determinisms things almost work, the lexer rules also greedily slurp 
> up parts of legitimate ids.   One thing to fix that is to force the 
> lexer to match from the start of the line only I s'pose... but it all 
> seems like putting one hack on top of another.
> 
> Any clues how to handle this elegantly in the parser?   To me it would 
> make sense to dump these lexer rules and handle all of these types in 
> the parser - I just need an equivalent to "match from here to end of 
> line" for the parser perhaps?
> 
> Rgds,
> Richard Bown
>
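
The "match from the start of the line only" idea mentioned above can be sketched outside ANTLR. What follows is a minimal illustration in Python, not ANTLR 2 grammar and not a tested fix for the grammar in this thread. It assumes that a remark is the word "rem" at the start of a line (optionally indented), in any letter case, followed by a non-identifier character or the end of input. The names TOKEN_RE, SKIPPED and tokenize are hypothetical.

```python
import re

# Hypothetical token table; alternatives are tried in order, first match wins.
TOKEN_RE = re.compile(
    r"""
    (?P<REMARK>^[ \t]*rem\b[^\r\n]*)   # whole word "rem" at line start, up to end of line
  | (?P<COMMENT>--[^\r\n]*)            # "--" up to end of line
  | (?P<ID>[A-Za-z0-9_]+)              # identifier, as in the simplified ID rule
  | (?P<NEWLINE>\r?\n|\r)              # kept separate so WS never swallows a line start
  | (?P<WS>[ \t]+)
  | (?P<OTHER>.)                       # anything else, one character at a time
    """,
    re.VERBOSE | re.MULTILINE | re.IGNORECASE,
)

# Token types that are dropped, like setType(Token.SKIP) in the lexer rules.
SKIPPED = {"REMARK", "COMMENT", "NEWLINE", "WS"}


def tokenize(text):
    """Return (type, text) pairs for every token that is not skipped."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        if match.lastgroup not in SKIPPED:
            tokens.append((match.lastgroup, match.group()))
    return tokens
```

With these assumptions an identifier such as "remove" stays intact, because the word boundary after "rem" fails and the ID alternative wins. A "rem" in the middle of a line is treated as an ordinary identifier, which may or may not suit a given SQL dialect.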