[antlr-interest] Generated lexer is affected by parser rules?! A bug?

Mon May 19 05:56:17 PDT 2008

Hi Terence,

I just tested the two grammars quoted below with the latest ANTLR release, in
case this had been fixed since last August.

However, the behaviour is unchanged.

Either I am missing something in the basics, or this is a bug. Could you
please verify which is the case here?
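
To make the symptom concrete, the range checks quoted below can be replayed
outside ANTLR. This is only a sketch: the class, field and method names are
made up for illustration, and the ranges are copied by hand from the generated
code in the quoted messages. It lists the characters each check refuses to
consume and replays the '50' and '00' literals against the combined-grammar
check.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

public class ExcludedCharDemo {
    // Loop decision from the combined grammar: everything except the apostrophe.
    static final IntPredicate DECISION =
        c -> (c >= '\u0000' && c <= '&') || (c >= '(' && c <= '\uFFFE');

    // Match check generated from the combined (lexer + parser) grammar.
    static final IntPredicate COMBINED =
        c -> (c >= '\u0000' && c <= '4') || (c >= '6' && c <= '\uFFFE');

    // Match check generated after converting to a lexer grammar.
    static final IntPredicate LEXER_ONLY =
        c -> (c >= '\u0000' && c <= '\u0014') || (c >= '\u0016' && c <= '\uFFFE');

    // Match check generated from the two-rule testStringLiteral grammar.
    static final IntPredicate MINIMAL =
        c -> (c >= '\u0000' && c <= '\u0003') || (c >= '\u0005' && c <= '\uFFFE');

    // Code points in 0..0xFFFE that the given check refuses to consume.
    static List<Integer> rejected(IntPredicate check) {
        List<Integer> out = new ArrayList<>();
        for (int c = 0; c <= 0xFFFE; c++) {
            if (!check.test(c)) {
                out.add(c);
            }
        }
        return out;
    }

    // Index of the first character of body that the check refuses, or -1 if none.
    static int firstRejected(String body, IntPredicate check) {
        for (int i = 0; i < body.length(); i++) {
            if (!check.test(body.charAt(i))) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println("decision   rejects " + rejected(DECISION));
        System.out.println("combined   rejects " + rejected(COMBINED));
        System.out.println("lexer-only rejects " + rejected(LEXER_ONLY));
        System.out.println("minimal    rejects " + rejected(MINIMAL));
        System.out.println("'50' first rejected index: " + firstRejected("50", COMBINED));
        System.out.println("'00' first rejected index: " + firstRejected("00", COMBINED));
    }
}
```

Only the loop decision rejects code point 39 (the apostrophe). The three match
checks reject 53 ('5'), 21 and 4 respectively, so the body of '50' is refused
at its first character while '00' passes. The rejected value shrinks as the
grammar shrinks, which looks more like a number assigned to Apos than like a
character, but that is only a guess on my part.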

Thanks,
Hari

On 5/17/08, Haralambi Haralambiev <hharalambiev at gmail.com> wrote:
>
> Just revised the very simple grammar.
>
> Could someone point out what the difference is between the following two
> grammars:
> -----------
> lexer grammar testStringLiteral1;
>
> StringLiteral : Apos ~Apos* Apos;
>
> fragment
> Apos : '\'';
> -----------
>
> and
>
> -----------
> lexer grammar testStringLiteral2;
>
> StringLiteral : '\'' ~'\''* '\'';
> -----------
>
> When generated to Java, the two lexers differ, although I expected them to
> be identical!
>
> -Hari
>
> On 5/17/08, Haralambi Haralambiev <hharalambiev at gmail.com> wrote:
>>
>> Hello,
>>
>> A colleague of mine is working on a grammar, and I was bemused when she
>> told me that the string literal '50' was throwing an error while '00' was
>> not.
>>
>> The exception said "mismatched character '5' expecting set null".
>>
>> So, I started investigating... the lexer rule for string literal is the
>> following:
>> -----------
>> fragment
>> Apos : '\'';
>>
>> StringLiteral: Apos ~Apos* Apos;
>> -----------
>>
>> Everything seemed fine, except that in the generated Java code the
>> mStringLiteral method contained the following lines:
>>
>> -----------
>> mApos();
>> // ...NewTest.g:84:9: (~ Apos )*
>> loop2:
>> do {
>> int alt2=2;
>> int LA2_0 = input.LA(1);
>>
>> if ( ((LA2_0>='\u0000' && LA2_0<='&')||(LA2_0>='(' && LA2_0<='\uFFFE')) )
>> {
>> alt2=1;
>> }
>>
>> switch (alt2) {
>> case 1 :
>> // ...NewTest.g:197:9: ~ Apos
>> {
>> if ( (input.LA(1)>='\u0000' && input.LA(1)<='4')||(input.LA(1)>='6' &&
>> input.LA(1)<='\uFFFE') ) {
>> input.consume();
>>
>> }
>> -----------
>>
>> This was totally unexpected (the check excludes the character '5' instead
>> of the apostrophe), so I did the following experiment:
>>
>>    - I removed all the parser rules.
>>    - I changed the grammar to a lexer grammar.
>>
>> When I generated the lexer, the faulty if statement shown above changed to
>> the following:
>>
>> -----------
>> switch (alt2) {
>> case 1 :
>> // ...NewTest.g:84:9: ~ Apos
>> {
>> if ( (input.LA(1)>='\u0000' &&
>> input.LA(1)<='\u0014')||(input.LA(1)>='\u0016' && input.LA(1)<='\uFFFE') ) {
>> input.consume();
>>
>> }
>> -----------
>>
>> So now the situation has changed and the string '50' is accepted, but the
>> check is still clearly wrong: it excludes '\u0015' rather than the
>> apostrophe.
>>
>> I tested a simple grammar with the Apos and StringLiteral lexer
>> rules only:
>> -----------
>> lexer grammar testStringLiteral;
>>
>> StringLiteral : Apos ~Apos* Apos;
>> Apos : '\'';
>> -----------
>>
>> It generates the following if statement, which again looks wrong to me:
>> -----------
>> if ( (input.LA(1)>='\u0000' &&
>> input.LA(1)<='\u0003')||(input.LA(1)>='\u0005' && input.LA(1)<='\uFFFE') ) {
>> input.consume();
>>
>> }
>> -----------
>>
>> Given the above, I have two questions:
>>
>>    - Why do the parser rules affect the lexer class?
>>    - Why is the if clause before the consume() call different from the
>>    if clause that decides the alternative?
>>
>> Of course, I may well have made some stupid mistake, so please excuse me
>> if I have.
>>
>> Best regards,
>> Hari
>>
>
>