[antlr-interest] lexer or parser for comments and remarks?
Johannes Luber
jaluber at gmx.de
Sun Apr 15 04:38:30 PDT 2007
Richard Bown wrote:
> Maybe I was blathering a bit too much there - is this so simple and
> stoopid it's in a FAQ or an example that I've missed? Anyone got any
> recommendations or pointers? Even a RTFM would do.
>
> Richard
How about using the tokens section to define a REM token, like so?
tokens {
REM="rem";
}
And using
REMARK :
REM ( ~( '\r' | '\n' ) ) * NEWLINE
{ setType(Token.SKIP); }
;
should make the lexer favor "rem" for REMARK instead of recognizing it
as an ID token. At least that should work in ANTLR 3; I don't know much
about how it differs from ANTLR 2.7.7.
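To make the idea concrete outside of ANTLR, here is a minimal sketch in plain Python (not ANTLR, and not generated lexer code). It matches a whole identifier first and only then checks whether the text is exactly "rem"; if so, it discards everything up to the end of the line. The names TOKEN_RE, REST_OF_LINE and tokenize are made up for this illustration, and treating "rem" case-insensitively is my assumption.

```python
import re

# Hypothetical token set for illustration only: "--" comments, identifiers,
# newlines, blanks, and a catch-all for any other single character.
TOKEN_RE = re.compile(r"""
    (?P<COMMENT>--[^\r\n]*)
  | (?P<ID>[A-Za-z_][A-Za-z_0-9]*)
  | (?P<NEWLINE>\r?\n)
  | (?P<WS>[ \t]+)
  | (?P<OTHER>.)
""", re.VERBOSE)

# Everything up to, but not including, the end of the current line.
REST_OF_LINE = re.compile(r"[^\r\n]*")


def tokenize(text):
    """Return (kind, value) pairs, skipping comments, remarks and whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        kind, value = m.lastgroup, m.group()
        pos = m.end()
        # The identifier is matched as a whole word first, so "remainder"
        # never reaches this branch; only the exact word "rem" does.
        # Case-insensitive comparison is an assumption of this sketch.
        if kind == "ID" and value.lower() == "rem":
            pos = REST_OF_LINE.match(text, pos).end()  # drop rest of line
            continue
        if kind in ("COMMENT", "WS", "NEWLINE"):
            continue
        tokens.append((kind, value))
    return tokens


print(tokenize("rem odd chars $%^&\nremainder = 1 -- note\n"))
```

Because the identifier is consumed greedily before the comparison, legitimate IDs that merely start with "rem" stay intact, and the rest of a remark line can contain any characters. Note that this sketch treats "rem" as a remark wherever it appears, not only at the start of a line.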
Best regards,
Johannes Luber
> Richard Bown wrote:
>> Hi,
>>
>> I've got a lexer non-determinism which is making me go back and
>> forward between trying to fix the lexer or trying to write a good
>> parser rule for the problem, neither of which I can seem to get
>> right. This is with ANTLR 2.7.5 producing C#. Apologies for the
>> lengthy explanation.
>>
>> The rules are for handling SQL comments and similar one-liners (I'm
>> treating these all as single-line statements). This is relatively
>> simple for non-alphabetic characters, where you can write a lexer
>> rule such as this:
>>
>> COMMENT :
>> '-' '-' ( ~('\r' | '\n') )* NEWLINE { setType(Token.SKIP); }
>> ;
>>
>> If you then define, say, a similar rule for "rem" statements:
>>
>> REMARK :
>> 'r' 'e' 'm' ( ~( '\r' | '\n' ) ) * NEWLINE
>> { setType(Token.SKIP); }
>> ;
>>
>> This is fine, but if you also have a lexer rule for matching
>> identifiers, e.g. (slightly simplified):
>>
>> ID :
>> ( 'A'..'Z' | 'a'..'z' | '_' | '0'..'9' )+
>> ;
>>
>> You get non-determinisms, of course, between the "rem" rule and any
>> other rules that use alphanumerics.
>>
>> The other way I've attempted to solve this is to just catch the 'rem'
>> elements in the parser. This is fine (and it would be a more useful
>> solution if I wanted to do some simple processing of the comment
>> lines) but then if there are any interesting and unusual characters in
>> the remainder of the "rem" line then the parser doesn't match on these
>> elements. I've tried to be exhaustive about the type of 'words' that
>> the remainder of the comment lines can contain but then I start to
>> trip over lexer rules again and we go around in circles.
>>
>> So I've been going back to the lexer to solve this - and whilst with
>> the non-determinisms things almost work, the lexer rules also greedily
>> slurp up parts of legitimate ids. One thing to fix that is to force
>> the lexer to match from the start of the line only I s'pose... but it
>> all seems like putting one hack on top of another.
>>
>> Any clues how to handle this elegantly in the parser? To me it would
>> make sense to dump these lexer rules and handle all of these types in
>> the parser - I just need an equivalent to "match from here to end of
>> line" for the parser perhaps?
>>
>> Rgds,
>> Richard Bown
>>
>