[antlr-interest] Generated lexer is affected by parser rules?! A bug?

Sat May 17 04:35:19 PDT 2008

I have just revised the very simple test grammar.

Could someone point out the difference between the following two
grammars?
-----------
lexer grammar testStringLiteral1;

StringLiteral : Apos ~Apos* Apos;

fragment
Apos : '\'';
-----------

and

-----------
lexer grammar testStringLiteral2;

StringLiteral : '\'' ~'\''* '\'';
-----------

The Java files generated from them differ, although I expected them to be identical!
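
For comparison, here is a hand-written Java sketch of what both rules are meant to accept: an apostrophe, any run of non-apostrophe characters, and a closing apostrophe. It is not ANTLR output, and the class and method names are made up for illustration only.

```java
public class StringLiteralSketch {
    // Hand-written equivalent of: StringLiteral : '\'' ~'\''* '\'' ;
    // Returns true only if the whole input is exactly one string literal.
    static boolean isStringLiteral(String s) {
        int i = 0;
        // Opening apostrophe.
        if (i >= s.length() || s.charAt(i) != '\'') {
            return false;
        }
        i++;
        // Zero or more characters that are not an apostrophe.
        while (i < s.length() && s.charAt(i) != '\'') {
            i++;
        }
        // Closing apostrophe must be present.
        if (i >= s.length()) {
            return false;
        }
        i++;
        // Nothing may follow the closing apostrophe.
        return i == s.length();
    }

    public static void main(String[] args) {
        System.out.println(isStringLiteral("'50'"));
        System.out.println(isStringLiteral("'00'"));
        System.out.println(isStringLiteral("'5"));
    }
}
```

Under this reading, '50' and '00' are treated identically, which is the behaviour I expected from either grammar.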

-Hari

On 5/17/08, Haralambi Haralambiev <hharalambiev at gmail.com> wrote:
>
> Hello,
>
> A colleague of mine is working on a grammar, and I was puzzled when she
> told me that the string literal '50' caused an error while '00' did not.
>
> The exception said "mismatched character '5' expecting set null".
>
> So, I started investigating... the lexer rule for string literal is the
> following:
> -----------
> fragment
> Apos : '\'';
>
> StringLiteral: Apos ~Apos* Apos;
> -----------
>
> Everything seemed fine, except that in the generated Java code, the
> mStringLiteral method contained the following:
>
> -----------
> mApos();
> // ...NewTest.g:84:9: (~ Apos )*
> loop2:
> do {
> int alt2=2;
> int LA2_0 = input.LA(1);
>
> if ( ((LA2_0>='\u0000' && LA2_0<='&')||(LA2_0>='(' && LA2_0<='\uFFFE')) ) {
> alt2=1;
> }
>
> switch (alt2) {
> case 1 :
> // ...NewTest.g:197:9: ~ Apos
> {
> if ( (input.LA(1)>='\u0000' && input.LA(1)<='4')||(input.LA(1)>='6' && input.LA(1)<='\uFFFE') ) {
> input.consume();
>
> }
> -----------
>
> This was totally unexpected (the check excludes '5' instead of the
> apostrophe), so I ran the following experiment:
>
>    - I removed all the parser rules.
>    - I changed the grammar to a lexer grammar.
>
> When I generated the lexer, the faulty if statement shown above
> changed to the following:
>
> -----------
> switch (alt2) {
> case 1 :
> // ...NewTest.g:84:9: ~ Apos
> {
> if ( (input.LA(1)>='\u0000' && input.LA(1)<='\u0014')||(input.LA(1)>='\u0016' && input.LA(1)<='\uFFFE') ) {
> input.consume();
>
> }
> -----------
>
> Now the string '50' is accepted, but the check is still clearly wrong:
> it excludes '\u0015' instead of the apostrophe.
>
> I tested a simple grammar with the Apos and StringLiteral lexer rules only:
> -----------
> lexer grammar testStringLiteral;
>
> StringLiteral : Apos ~Apos* Apos;
> Apos : '\'';
> -----------
>
> It generates the following if statement, which I again consider wrong:
> -----------
> if ( (input.LA(1)>='\u0000' && input.LA(1)<='\u0003')||(input.LA(1)>='\u0005' && input.LA(1)<='\uFFFE') ) {
> input.consume();
>
> }
> -----------
>
> Given the above, I have two questions:
>
>    - Why do the parser rules affect the lexer class?
>    - Why is the if clause before the consume() call different from the
>    if clause that decides the alternative?
>
> Of course, I may have made some stupid mistake myself, so please
> excuse me if that is the case.
>
> Best regards,
> Hari
>
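
To make the discrepancy concrete, the sketch below copies the four range tests quoted above into plain Java predicates and searches for the one character each of them rejects. The class name, field names, and the `excluded` helper are hypothetical and exist only for this illustration; the range expressions themselves are taken from the generated code as quoted.

```java
import java.util.function.IntPredicate;

public class ExcludedCharDemo {
    // Test that decides the loop alternative (LA2_0 in the quoted code).
    static final IntPredicate DECISION =
        c -> (c >= '\u0000' && c <= '&') || (c >= '(' && c <= '\uFFFE');

    // Test before consume() in the combined (parser + lexer) grammar.
    static final IntPredicate COMBINED_GRAMMAR =
        c -> (c >= '\u0000' && c <= '4') || (c >= '6' && c <= '\uFFFE');

    // Test before consume() after switching to a lexer grammar.
    static final IntPredicate LEXER_ONLY =
        c -> (c >= '\u0000' && c <= '\u0014') || (c >= '\u0016' && c <= '\uFFFE');

    // Test before consume() in the two-rule testStringLiteral grammar.
    static final IntPredicate TWO_RULE_TEST =
        c -> (c >= '\u0000' && c <= '\u0003') || (c >= '\u0005' && c <= '\uFFFE');

    // Returns the single code point in [0, 0xFFFE] rejected by the test,
    // or -1 if nothing is rejected.
    static int excluded(IntPredicate accepts) {
        int found = -1;
        for (int c = 0; c <= 0xFFFE; c++) {
            if (!accepts.test(c)) {
                if (found != -1) {
                    throw new IllegalStateException("more than one excluded character");
                }
                found = c;
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println("decision test excludes:    " + excluded(DECISION));
        System.out.println("combined grammar excludes: " + excluded(COMBINED_GRAMMAR));
        System.out.println("lexer grammar excludes:    " + excluded(LEXER_ONLY));
        System.out.println("two-rule test excludes:    " + excluded(TWO_RULE_TEST));
    }
}
```

Only the decision test excludes code point 39, the apostrophe. The three tests before consume() exclude 53 (the digit '5'), 21, and 4 respectively, so each of them lets an apostrophe through and rejects an unrelated character that changes with the rest of the grammar.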