[antlr-interest] Generated lexer is affected by parser rules?! A bug?

Tue May 20 10:16:40 PDT 2008

Uh, wait. Sorry. The generated code has a bug, as you say; I was looking  
at the NFA and DFA representations. Adding a bug.

http://www.antlr.org:8888/browse/ANTLR-268

Ter
On May 20, 2008, at 10:11 AM, Terence Parr wrote:

> These both worked perfectly with 3.1b1. In one you are calling a  
> rule, which is the only difference I see.
> Ter
> On May 17, 2008, at 4:35 AM, Haralambi Haralambiev wrote:
>
>> Just revised the very simple grammar.
>>
>> Could someone point out what is the difference between the  
>> following two grammars:
>> -----------
>> lexer grammar testStringLiteral1;
>>
>> StringLiteral : Apos ~Apos* Apos;
>>
>> fragment
>> Apos : '\'';
>> -----------
>>
>> and
>>
>> -----------
>> lexer grammar testStringLiteral2;
>>
>> StringLiteral : '\'' ~'\''* '\'';
>> -----------
>>
>> When generated to Java files, they differ, while I expected them not to!
>>
>> -Hari
>>
>> On 5/17/08, Haralambi Haralambiev <hharalambiev at gmail.com> wrote:  
>> Hello,
>>
>> A colleague of mine is working on a grammar and I was bemused  
>> when she told me that the string literal '50' was throwing an error,  
>> while '00' was not.
>>
>> The exception said "mismatched character '5' expecting set null".
>>
>> So, I started investigating... the lexer rule for string literal is  
>> the following:
>> -----------
>> fragment
>> Apos	:	'\'';
>>
>> StringLiteral:	Apos ~Apos* Apos;
>> -----------
>>
>> Everything seemed fine, except that in the generated Java code, the  
>> mStringLiteral method had the following lines:
>>
>> -----------
>> mApos();
>> // ...NewTest.g:84:9: (~ Apos )*
>> loop2:
>> do {
>> int alt2=2;
>> int LA2_0 = input.LA(1);
>>
>> if ( ((LA2_0>='\u0000' && LA2_0<='&')||(LA2_0>='(' &&  
>> LA2_0<='\uFFFE')) ) {
>> alt2=1;
>> }
>>
>> switch (alt2) {
>> case 1 :
>> // ...NewTest.g:197:9: ~ Apos
>> {
>> if ( (input.LA(1)>='\u0000' && input.LA(1)<='4')||(input.LA(1)>='6'  
>> && input.LA(1)<='\uFFFE') ) {
>> input.consume();
>>
>> }
>> -----------
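The two checks quoted above disagree about '5': the loop decision accepts it, and the match check then rejects it. The following is only a rough sketch of that interaction, not the generated lexer. The class name `StringLiteralCheck` and the method `lex` are hypothetical stand-ins, and a `false` return stands in for the mismatch error the real lexer reports.

```java
class StringLiteralCheck {
    // Loop decision, copied from the quoted code: any character except the apostrophe.
    static boolean loopAccepts(int c) {
        return (c >= '\u0000' && c <= '&') || (c >= '(' && c <= '\uFFFE');
    }

    // Match check, copied from the quoted code: any character except '5'.
    static boolean matchAccepts(int c) {
        return (c >= '\u0000' && c <= '4') || (c >= '6' && c <= '\uFFFE');
    }

    // Hypothetical stand-in for mStringLiteral(): returns false at the point
    // where the quoted code would fail to consume the character.
    static boolean lex(String s) {
        int i = 0;
        if (i >= s.length() || s.charAt(i) != '\'') return false; // opening apostrophe
        i++;
        while (i < s.length() && loopAccepts(s.charAt(i))) {
            if (!matchAccepts(s.charAt(i))) return false; // decision said yes, match says no
            i++;
        }
        return i < s.length() && s.charAt(i) == '\''; // closing apostrophe
    }

    public static void main(String[] args) {
        System.out.println(lex("'00'")); // true
        System.out.println(lex("'50'")); // false
    }
}
```

With these two predicates, '00' passes both checks on every character, while '50' enters the loop on '5' and then fails the match check, which fits the reported behaviour.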
>>
>> This was totally unexpected (it checks whether the character is  
>> different from '5'), so I did the following experiment:
>> 	• I removed all the parser rules.
>> 	• I changed the grammar to a lexer grammar.
>> When I generated the lexer, the corrupt if statement mentioned  
>> above was changed to the following:
>>
>> -----------
>> switch (alt2) {
>> case 1 :
>> // ...NewTest.g:84:9: ~ Apos
>> {
>> if ( (input.LA(1)>='\u0000' && input.LA(1)<='\u0014')|| 
>> (input.LA(1)>='\u0016' && input.LA(1)<='\uFFFE') ) {
>> input.consume();
>>
>> }
>> -----------
>>
>> So now the situation has changed and the string '50' is accepted,  
>> but the check is still obviously wrong.
>>
>> I tested a simple grammar with the Apos and StringLiteral lexer  
>> rules only:
>> -----------
>> lexer grammar testStringLiteral;
>>
>> StringLiteral	:	Apos ~Apos* Apos;	
>> Apos	 :	'\'';
>> -----------
>>
>> It generates the following if, which again I consider wrong:
>> -----------
>> if ( (input.LA(1)>='\u0000' && input.LA(1)<='\u0003')|| 
>> (input.LA(1)>='\u0005' && input.LA(1)<='\uFFFE') ) {
>> input.consume();
>>
>> }
>> -----------
>>
>> Taking into account the above,
>> I have two questions:
>> 	• Why do the parser rules affect the lexer class?
>> 	• Why is the if clause before the consume() call different from  
>> the if clause that decides the alternative?
>> Of course, I assume that I could have made some stupid mistake, so  
>> please excuse me if I have done so.
>>
>> Best regards,
>> Hari
>>
>
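
For readers comparing the generated checks quoted in this thread, a small sketch can work out which single character each pair of ranges leaves out. The class and method names here are hypothetical, and the sketch only does the arithmetic on the quoted bounds. It makes no claim about where ANTLR takes the excluded value from.

```java
class ExcludedChar {
    // For a check of the form (c >= lo1 && c <= hi1) || (c >= lo2 && c <= hi2),
    // returns the one code point between hi1 and lo2, or -1 if the gap is not
    // exactly one character wide.
    static int excluded(char firstRangeEnd, char secondRangeStart) {
        return (secondRangeStart - firstRangeEnd == 2) ? firstRangeEnd + 1 : -1;
    }

    public static void main(String[] args) {
        System.out.println(excluded('&', '('));           // loop decision: 39
        System.out.println(excluded('4', '6'));           // combined grammar: 53
        System.out.println(excluded('\u0014', '\u0016')); // lexer grammar: 21
        System.out.println(excluded('\u0003', '\u0005')); // two-rule grammar: 4
        System.out.println((int) '\'');                   // apostrophe: 39
    }
}
```

Only the loop decision excludes the apostrophe (code 39). The match checks exclude 53 (the character '5'), 21, and 4 respectively, so the excluded value changes with the grammar rather than following the character that Apos matches.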