[antlr-interest] lexer or parser for comments and remarks?

Wed Apr 4 13:16:40 PDT 2007

Hi,

I've got a lexer non-determinism that has me going back and forth 
between trying to fix the lexer and trying to write a good parser rule 
for the problem - neither of which I can get right.  This is 
with ANTLR 2.7.5 generating C#.  Apologies for the lengthy explanation.

The rules are for handling SQL comments and similar one-liners (I'm 
treating these all as single-line statements).  This is relatively 
simple when the line starts with non-alphabetic characters: you can 
write a lexer rule such as this:

COMMENT :
	'-' '-' ( ~('\r' | '\n') )* NEWLINE { setType(Token.SKIP); }
;

If you then define a similar rule for "rem" statements:

REMARK :
	'r' 'e' 'm' ( ~('\r' | '\n') )* NEWLINE { setType(Token.SKIP); }
;

This is fine - but if you also have a lexer rule for matching 
identifiers i.e. (slightly simplified):

ID :
	( 'A'..'Z' | 'a'..'z' | '_' | '0'..'9' )+
;

You get non-determinisms of course between ID and the "rem" rule (both 
can start with 'r'), and with any other rules that start with 
alphanumerics.

The other way I've attempted to solve this is to just catch the 'rem' 
elements in the parser.   This is fine (and it would be a more useful 
solution if I wanted to do some simple processing of the comment lines) 
but then if there are any interesting and unusual characters in the 
remainder of the "rem" line then the parser doesn't match on these 
elements.  I've tried to be exhaustive about the type of 'words' that 
the remainder of the comment lines can contain but then I start to trip 
over lexer rules again and we go around in circles.

So I've been going back to the lexer to solve this - and whilst things 
almost work despite the non-determinism warnings, the lexer rules also 
greedily slurp up parts of legitimate ids.  One way to fix that is to 
force the lexer to match from the start of the line only I s'pose... 
but it all seems like putting one hack on top of another.
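
To pin down what I mean by the lexer slurping up parts of ids, here is 
a rough sketch of the behaviour I think I want, written in Python 
rather than ANTLR.  The token names and the tokenize function are made 
up for illustration and are not part of my grammar.  The idea is to 
match the whole identifier first and only then check whether it is 
exactly "rem", instead of matching the three letters as a prefix:

```python
import re

# Hypothetical token type names, for illustration only.
ID, REMARK, OTHER = "ID", "REMARK", "OTHER"

WORD = re.compile(r"[A-Za-z_][A-Za-z_0-9]*")
REST_OF_LINE = re.compile(r"[^\r\n]*")


def tokenize(text):
    """Yield (type, text) pairs, skipping whitespace."""
    pos = 0
    while pos < len(text):
        ch = text[pos]
        if ch.isspace():
            pos += 1
            continue
        m = WORD.match(text, pos)
        if m:
            word = m.group()
            pos = m.end()
            if word.lower() == "rem":
                # The whole word is "rem": swallow the rest of the line,
                # whatever characters it contains.
                rest = REST_OF_LINE.match(text, pos)
                pos = rest.end()
                yield (REMARK, word + rest.group())
            else:
                # Anything longer ("remainder", "remx") stays an identifier.
                yield (ID, word)
            continue
        # Catch-all so that no single character is unlexable.
        yield (OTHER, ch)
        pos += 1
```

With this, "remainder" and "remx" come out as ordinary ids and only a 
standalone "rem" starts a remark.  Dropping the REMARK tokens instead 
of yielding them would be the equivalent of Token.SKIP.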

Any clues how to handle this elegantly in the parser?   To me it would 
make sense to dump these lexer rules and handle all of these types in 
the parser - I just need an equivalent to "match from here to end of 
line" for the parser perhaps?
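
In case it helps to be concrete, the parser-side behaviour I have in 
mind is roughly the following, again sketched in Python with a 
hypothetical token list of (type, text) pairs.  It assumes the lexer 
keeps NEWLINE tokens and has a catch-all type (called OTHER here) so 
that odd characters still produce tokens; both are assumptions, not 
what my grammar does today:

```python
def skip_to_end_of_line(tokens, start):
    """Return (line_tokens, next_index), consuming through the next NEWLINE."""
    i = start
    while i < len(tokens) and tokens[i][0] != "NEWLINE":
        i += 1
    line = tokens[start:i]
    if i < len(tokens):
        i += 1  # consume the NEWLINE itself
    return line, i


def parse_statements(tokens):
    """Split a token list into ('remark', [...]) and ('stmt', [...]) lines."""
    out, i = [], 0
    while i < len(tokens):
        kind, text = tokens[i]
        if kind == "NEWLINE":
            i += 1  # blank line
            continue
        line, nxt = skip_to_end_of_line(tokens, i)
        if kind == "ID" and text.lower() == "rem":
            # Keep everything after "rem", whatever token types appear.
            out.append(("remark", line[1:]))
        else:
            # Placeholder: a real parser would apply its statement rules here.
            out.append(("stmt", line))
        i = nxt
    return out
```

The point is only that the remark branch never looks at the token types 
between "rem" and the NEWLINE, so unusual characters cannot break it.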

Rgds,
Richard Bown