[antlr-interest] Found bug on lexer with ANTLR 3.0.1 with Python target

Wed Jul 2 02:44:14 PDT 2008

Hi Cesare,

[+antlr-interest, as this might be of interest for more people]

I don't think this is a problem specific to Python.
You define INT as ('0'..'9')*, so an empty token is a valid INT.
Technically the behavior is correct, albeit not very useful. The
correct solution would be to emit a warning or even an error for tokens
that could match an empty sequence.
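
To illustrate the kind of check I mean, here is a minimal sketch in plain
Python. It uses regular expressions as a stand-in for lexer rules, so it is
not how ANTLR itself analyzes a grammar. The helper can_match_empty and the
ID pattern are assumptions made up for the example (the ID pattern is only
guessed from the generated code quoted below).

```python
import re

def can_match_empty(pattern):
    """Return True if the pattern accepts the empty string."""
    return re.fullmatch(pattern, "") is not None

# Hypothetical regex equivalents of two token rules.
rules = {
    "INT": r"[0-9]*",                 # ('0'..'9')* can match nothing
    "ID": r"[A-Za-z_][A-Za-z_0-9]*",  # needs at least one character
}

# Collect the names of rules that deserve a warning.
suspicious = [name for name, pat in rules.items() if can_match_empty(pat)]
print(suspicious)
```

Only INT is flagged here, because it is the only rule that accepts zero
characters.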

The quick fix is to use ('0'..'9')+ for INT.
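
The effect of the fix can be seen in a toy lexer loop. This is only a sketch
in plain Python with the re module, not the ANTLR runtime: the function
tokenize, the max_tokens cap (there only to stop the loop) and the
skip-one-character error recovery are all assumptions for illustration.

```python
import re

def tokenize(text, int_pattern, max_tokens=20):
    """Toy lexer that always tries INT at the current position."""
    rule = re.compile(int_pattern)
    pos, tokens = 0, []
    while pos < len(text) and len(tokens) < max_tokens:
        m = rule.match(text, pos)
        if m is None:
            # No match: record the bad character and skip it.
            tokens.append(("ERROR", text[pos]))
            pos += 1
        else:
            tokens.append(("INT", m.group()))
            pos = m.end()  # stays put when the match is empty
    return tokens

stuck = tokenize(";12", r"[0-9]*")  # empty INT tokens until the cap is hit
fixed = tokenize(";12", r"[0-9]+")  # ";" is reported, then "12" is lexed
print(len(stuck), stuck[0])
print(fixed)
```

With the * version the position never moves past the ";", so without the cap
the token list would grow forever. With + the rule fails on ";" and the
lexer gets a chance to report the character and move on.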

-Ben

On Wed, Jul 2, 2008 at 10:07 AM, Cesare Di Mauro
<cesare.dimauro at a-tono.com> wrote:
> Hi
>
> I think I have found a bug that makes lexer rules cause an infinite loop.
>
> I have attached the grammar, a test application, and a simple text file that shows the bug (break it quickly with Ctrl-C, because after a few seconds it will hog all available memory).
>
> The text file has a simple line:
>
> TEST;: 12345
>
> which fails on reaching the ";", a character unknown to the lexer. With the ";" removed (TEST: 12345), parsing completes correctly.
>
> Tracing with the debugger, I found that the problem lies in the generated mTokens method in the lexer:
>
>    def mTokens(self):
>        # D:\\Test\\BadSyntax.g:1:8: ( T8 | ID | INT | NEWLINE | WHITE_SPACE )
>        alt5 = 5
>        LA5 = self.input.LA(1)
>        if LA5 == u':':
>            alt5 = 1
>        elif LA5 == u'A' or LA5 == u'B' or LA5 == u'C' or LA5 == u'D' or LA5 == u'E' or LA5 == u'F' or LA5 == u'G' or LA5 == u'H' or LA5 == u'I' or LA5 == u'J' or LA5 == u'K' or LA5 == u'L' or LA5 == u'M' or LA5 == u'N' or LA5 == u'O' or LA5 == u'P' or LA5 == u'Q' or LA5 == u'R' or LA5 == u'S' or LA5 == u'T' or LA5 == u'U' or LA5 == u'V' or LA5 == u'W' or LA5 == u'X' or LA5 == u'Y' or LA5 == u'Z' or LA5 == u'_' or LA5 == u'a' or LA5 == u'b' or LA5 == u'c' or LA5 == u'd' or LA5 == u'e' or LA5 == u'f' or LA5 == u'g' or LA5 == u'h' or LA5 == u'i' or LA5 == u'j' or LA5 == u'k' or LA5 == u'l' or LA5 == u'm' or LA5 == u'n' or LA5 == u'o' or LA5 == u'p' or LA5 == u'q' or LA5 == u'r' or LA5 == u's' or LA5 == u't' or LA5 == u'u' or LA5 == u'v' or LA5 == u'w' or LA5 == u'x' or LA5 == u'y' or LA5 == u'z':
>            alt5 = 2
>        elif LA5 == u'\n' or LA5 == u'\r':
>            alt5 = 4
>        elif LA5 == u'\t' or LA5 == u' ':
>            alt5 = 5
>        else:
>            alt5 = 3
>        if alt5 == 1:
>            # D:\\Test\\BadSyntax.g:1:10: T8
>            self.mT8()
>
>
>
>        elif alt5 == 2:
>            # D:\\Test\\BadSyntax.g:1:13: ID
>            self.mID()
>
>
>
>        elif alt5 == 3:
>            # D:\\Test\\BadSyntax.g:1:16: INT
>            self.mINT()
>
>
>
>        elif alt5 == 4:
>            # D:\\Test\\BadSyntax.g:1:20: NEWLINE
>            self.mNEWLINE()
>
>
>
>        elif alt5 == 5:
>            # D:\\Test\\BadSyntax.g:1:28: WHITE_SPACE
>            self.mWHITE_SPACE()
>
>
> After the check for the tab character fails, the code executes the following else branch:
>
>        else:
>            alt5 = 3
>
> which assumes that we have found (the beginning of) an INT token, so the lexer goes into an infinite loop.
> That's because an "empty" token (a token whose text attribute is an empty string) is generated, but the ";" character is not consumed, so the next time fillbuffer() invokes mTokens(), the same thing happens again and the token list grows without bound.
>
> Is there any way to fix this problem?
>
> Thanks in advance
>
> Cesare
>
> --
> Dott. Cesare Di Mauro
> A-Tono S.r.l.
> T.: (+39)095-7365314