[antlr-interest] lexer or parser for comments and remarks?

Richard Bown richard.bown at ferventsoftware.com
Mon Apr 16 11:59:40 PDT 2007


I've defined "rem" as a token, but I still get non-determinisms in
the lexer between it and REGULAR_ID.   I'm not sure whether this is a
red herring, but REGULAR_ID calls testLiteralsTable as follows (based
on Lubos Vnuk's SqlSQL2/DmlSQL2 grammar):

REGULAR_ID :
	( ( NATIONAL_CHAR_STRING_LIT	{$setType(NATIONAL_CHAR_STRING_LIT);}
	  | BIT_STRING_LIT		{$setType(BIT_STRING_LIT);}
	  | HEX_STRING_LIT		{$setType(HEX_STRING_LIT);}
	  )
	| (SIMPLE_LETTER) (SIMPLE_ID)*	{$setType(testLiteralsTable(REGULAR_ID));}
	)
;

If I add the following rule:

REMARK :
	"rem" ( ~( '\r' | '\n' ) ) * NEWLINE { $setType(Token.SKIP); }
;

then I get non-determinisms between the two, even if "rem" is in the
imported token table (or even explicitly defined in the tokens {} section).
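One lexer-side approach that might work (an untested sketch - the lookahead
depth is my guess, and ANTLR 2's approximate lookahead may still warn) is to
require a delimiter after "rem" so the lexer can tell "rem " apart from an
identifier like "remark" by looking one character further:

```antlr
class SqlLexer extends Lexer;
options { k = 4; }	// enough lookahead to see one character past "rem"

// "rem" only starts a remark when followed by whitespace;
// longer words such as "remark" fall through to REGULAR_ID.
REMARK
	:	"rem" ( ' ' | '\t' ) ( ~( '\r' | '\n' ) )* NEWLINE
		{ $setType(Token.SKIP); }
	;
```

A bare "rem" immediately followed by a newline would still need a further
alternative in the rule.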

Conversely, if I handle the "rem" statement in the parser, then I can't
define a catch-all-tokens-until-the-end-of-the-line rule.
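(For what it's worth, ANTLR 2 parser rules do accept ~ over token types, so
the parser-side version might look something like the sketch below - assuming
"rem" comes back as a REM token via the literals table and that NEWLINE is a
real, non-skipped token. Any character the lexer can't tokenise at all will
still break it, of course.)

```antlr
// Sketch only: REM via the literals table, NEWLINE not skipped.
remark_line
	:	REM ( ~NEWLINE )* NEWLINE
	;
```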

Of course, my fundamental problem is that I'm hacking grammars without
really learning what's going on under the covers (very naughty) -
treating them as regexps etc.  So forgive my laziness, but I'm still
happy to take pointers to RTFMs etc.

I might in fact use this as an excuse to download Ter's book.   I 
presume that ANTLR 3.0 and 2.7.5 have enough similarities to make it 
worth my while at this time.

Rgds,
Richard



Johannes Luber wrote:
> Richard Bown wrote:
>> Maybe I was blathering a bit too much there - is this so simple and
>> stoopid it's in a FAQ or an example that I've missed?  Anyone got any
>> recommendations or pointers?   Even a RTFM would do.
>>
>> Richard
> 
> How about using the tokens command to define a REM token like so?
> 
> tokens {
>    REM="rem";
> }
> 
> And using
> 
> REMARK :
>    REM ( ~( '\r' | '\n' ) ) * NEWLINE
>    { setType(Token.SKIP); }
>    ;
> 
> should favor recognizing "rem" as a REMARK instead of as an ID token. At
> least that should work for ANTLR 3 - I don't know much about the
> differences from ANTLR 2.7.7.
> 
> Best regards,
> Johannes Luber
> 
>> Richard Bown wrote:
>>> Hi,
>>>
>>> I've got a lexer non-determinism which is making me go back and
>>> forward between trying to fix the lexer or trying to write a good
>>> parser rule for the problem - neither of which I can seem to get
>>> right.  This is with Antlr 2.7.5 producing C#.  Apologies for the
>>> lengthy explanation.
>>>
>>> The rules are for handling SQL comments and similar one liners (I'm
>>> treating these all as single line statements).   Whilst this is
>>> relatively simple for non-alphabetic characters - you can write a
>>> lexer rule such as this:
>>>
>>> COMMENT :
>>>     '-' '-' ( ~('\r' | '\n') )* NEWLINE { setType(Token.SKIP); }
>>> ;
>>>
>>> If you then define say a similar line for "rem" statements:
>>>
>>> REMARK :
>>>     'r' 'e' 'm' ( ~( '\r' | '\n' ) ) * NEWLINE
>>> { setType(Token.SKIP); }
>>> ;
>>>
>>> This is fine - but if you also have a lexer rule for matching
>>> identifiers i.e. (slightly simplified):
>>>
>>> ID :
>>>     ( 'a'..'z' | 'A'..'Z' | '_' | '0'..'9' )+
>>> ;
>>>
>>> You get non-determinisms of course with the "rem" rule and any other
>>> rules that use alphanumerics.
>>>
>>> The other way I've attempted to solve this is to just catch the 'rem'
>>> elements in the parser.   This is fine (and it would be a more useful
>>> solution if I wanted to do some simple processing of the comment
>>> lines) but then if there are any interesting and unusual characters in
>>> the remainder of the "rem" line then the parser doesn't match on these
>>> elements.  I've tried to be exhaustive about the type of 'words' that
>>> the remainder of the comment lines can contain but then I start to
>>> trip over lexer rules again and we go around in circles.
>>>
>>> So I've been going back to the lexer to solve this - and whilst with
>>> the non-determinisms things almost work, the lexer rules also greedily
>>> slurp up parts of legitimate ids.   One thing to fix that would be to
>>> force the lexer to match from the start of the line only, I s'pose...
>>> but it all seems like putting one hack on top of another.
>>>
>>> Any clues how to handle this elegantly in the parser?   To me it would
>>> make sense to dump these lexer rules and handle all of these types in
>>> the parser - I just need an equivalent to "match from here to end of
>>> line" for the parser perhaps?
>>>
>>> Rgds,
>>> Richard Bown
>>>
> 


