[antlr-interest] lexer or parser for comments and remarks?
Richard Bown
richard.bown at ferventsoftware.com
Mon Apr 16 11:59:40 PDT 2007
I've defined "rem" as a token but I still get non-determinisms in
the lexer between this and REGULAR_ID. I'm not sure whether this is a red
herring, but REGULAR_ID calls testLiteralsTable as follows
(based on Lubos Vnuk's SqlSQL2/DmlSQL2 grammar):
REGULAR_ID :
    (   ( NATIONAL_CHAR_STRING_LIT {$setType(NATIONAL_CHAR_STRING_LIT);}
        | BIT_STRING_LIT {$setType(BIT_STRING_LIT);}
        | HEX_STRING_LIT {$setType(HEX_STRING_LIT);} )
    |   (SIMPLE_LETTER) (SIMPLE_ID)* {$setType(testLiteralsTable(REGULAR_ID));}
    )
    ;
If I add the following rule:
REMARK :
    "rem" ( ~( '\r' | '\n' ) )* NEWLINE { $setType(Token.SKIP); }
    ;
then I get non-determinisms between the two, even if "rem" is in the
imported token table (or even explicitly defined in tokens {} ).
Conversely, if I handle the "rem" statement in the parser, I can't
define a catch-all token that matches everything up to the end of the line.
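One way to sidestep the conflict, following the keyword-table idea above, is to never let the lexer choose between REMARK and an identifier at the first character: match the whole word first, and only if it turns out to be exactly "rem" consume the rest of the line and emit nothing (the equivalent of Token.SKIP). A toy Python sketch of that strategy (this is a concept illustration, not generated ANTLR code):

```python
import re

def lex_line_aware(s):
    """Toy lexer: match identifiers greedily; if the matched word is exactly
    'rem', swallow the rest of the line as a skipped comment."""
    tokens, pos = [], 0
    while pos < len(s):
        if s[pos].isspace():
            pos += 1
            continue
        m = re.match(r"[A-Za-z_][A-Za-z0-9_]*", s[pos:])
        if m:
            word = m.group(0)
            pos += len(word)
            if word.lower() == "rem":
                # Consume to end of line, emit nothing (comment is skipped).
                nl = s.find("\n", pos)
                pos = len(s) if nl == -1 else nl + 1
            else:
                tokens.append(("REGULAR_ID", word))
        else:
            tokens.append(("CHAR", s[pos]))
            pos += 1
    return tokens

print(lex_line_aware("remainder\nrem anything !@# here\nfoo"))
# → [('REGULAR_ID', 'remainder'), ('REGULAR_ID', 'foo')]
```

Note that "remainder" survives intact because the identifier is matched maximally before the "rem" test, and the unusual characters on the comment line never reach the token stream at all.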
Of course my fundamental problem is that I'm hacking grammars without
actually learning what's going on under the covers (very
naughty) - treating them as regexps etc. So forgive my laziness, but I'm still
happy to take pointers to RTFMs etc.
I might in fact use this as an excuse to download Ter's book. I
presume that ANTLR 3.0 and 2.7.5 have enough similarities to make it
worth my while at this time.
Rgds,
Richard
Johannes Luber wrote:
> Richard Bown wrote:
>> Maybe I was blathering a bit too much there - is this so simple and
>> stoopid it's in a FAQ or an example that I've missed? Anyone got any
>> recommendations or pointers? Even a RTFM would do.
>>
>> Richard
>
> How about using the tokens {} section to define a REM token, like so?
>
> tokens {
>     REM = "rem";
> }
>
> And using
>
> REMARK :
> REM ( ~( '\r' | '\n' ) ) * NEWLINE
> { setType(Token.SKIP); }
> ;
>
> should favor matching "rem" as REMARK instead of recognizing it as an ID
> token. At least that should work for ANTLR 3 - I don't know much about the
> differences from ANTLR 2.7.7.
>
> Best regards,
> Johannes Luber
>
>> Richard Bown wrote:
>>> Hi,
>>>
>>> I've got a lexer non-determinism which is making me go back and
>>> forward between trying to fix the lexer or trying to write a good
>>> parser rule for the problem - neither of which I can seem to get
>>> right. This is with Antlr 2.7.5 producing C#. Apologies for the
>>> lengthy explanation.
>>>
>>> The rules are for handling SQL comments and similar one-liners (I'm
>>> treating these all as single-line statements). For non-alphabetic
>>> characters this is relatively simple - you can write a
>>> lexer rule such as this:
>>>
>>> COMMENT :
>>> '-' '-' ( ~('\r' | '\n') )* NEWLINE { setType(Token.SKIP); }
>>> ;
>>>
>>> If you then define say a similar line for "rem" statements:
>>>
>>> REMARK :
>>> 'r' 'e' 'm' ( ~( '\r' | '\n' ) ) * NEWLINE
>>> { setType(Token.SKIP); }
>>> ;
>>>
>>> This is fine - but if you also have a lexer rule for matching
>>> identifiers i.e. (slightly simplified):
>>>
>>> ID :
>>>     ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')*
>>> ;
>>>
>>> then you get non-determinisms, of course, between the "rem" rule and any
>>> other rules that use alphanumerics.
>>>
>>> The other way I've attempted to solve this is to just catch the 'rem'
>>> elements in the parser. This is fine (and it would be a more useful
>>> solution if I wanted to do some simple processing of the comment
>>> lines), but if there are any interesting or unusual characters in
>>> the remainder of the "rem" line, the parser doesn't match them.
>>> I've tried to be exhaustive about the types of 'words' that
>>> the remainder of the comment lines can contain, but then I start to
>>> trip over lexer rules again and we go around in circles.
>>>
>>> So I've been going back to the lexer to solve this - and whilst things
>>> almost work despite the non-determinisms, the lexer rules also greedily
>>> slurp up parts of legitimate IDs. One fix would be to force
>>> the lexer to match only from the start of the line, I s'pose... but it
>>> all seems like putting one hack on top of another.
>>>
>>> Any clues how to handle this elegantly in the parser? To me it would
>>> make sense to dump these lexer rules and handle all of these types in
>>> the parser - I just need an equivalent to "match from here to end of
>>> line" for the parser perhaps?
>>>
>>> Rgds,
>>> Richard Bown
>>>
>