[antlr-interest] python-lang parser to python target

Tue Jul 22 19:32:02 PDT 2008

On Sunday 20 July 2008 02:35:16 am Johannes Luber wrote:
> Benjamin Niemann schrieb:
> > there is a Python grammar in examples package. It's the 2.3 grammar,
> > but you may use parts of both to get a working Python2.5 grammar.

> Regarding the sample grammars: In the repository there are sample
> grammars for these languages as well. Ter is probably planning to update

Hi all,

Benjamin, Johannes, thanks for the advice.  Using it, I have a
partial port of the Python 2.5 grammar to the Python language target,
for ANTLR 3.1:

http://redsymbol.net/files/antlr/Python-python-2.5-2008-07-22.tgz

I'd like to get this into a state where other people can use it.  Please
let me know of any changes you think are needed.  So far I have only
tested parsing, not any code generation based on it; the tarball's
README describes exactly what I did.

I tested the resulting parser on the 2034 Python files in a recent
checkout of the Jython ASM branch.  The README goes into detail; in
short, two of those files triggered errors, and all the others parsed
without errors, though at least a few percent produced one or more
warnings.

One big problem I see is that the generated PythonLexer.py has a
dangling elif clause, i.e. one with no body.  The generated code is:

{{{
            elif alt28 == 2:
                # Python.g:615:10: 
<that's it - no statements in the block>
}}}

Because Python delimits blocks by indentation, this is not valid
Python syntax: the block needs at least a "pass" statement, or the elif
clause needs to be omitted altogether.  The offending rule in the
grammar is:

{{{
CONTINUED_LINE
    :    '\\' ('\r')? '\n' (' '|'\t')*  { $channel=HIDDEN; }
         ( nl=NEWLINE {self.emit(ClassicToken(type=NEWLINE,text=nl.getText()))}
         |
         )
    ;
}}}

I tried removing the empty "|" alternative, like this:

{{{
CONTINUED_LINE
    :    '\\' ('\r')? '\n' (' '|'\t')*  { $channel=HIDDEN; }
         ( nl=NEWLINE {self.emit(ClassicToken(type=NEWLINE,text=nl.getText()))}
         )
    ;
}}}

With that change the generated lexer code is syntactically correct.
However, the parser can then no longer parse lines that are continued
with a backslash (i.e., one logical line split over two physical
lines).  For example:

{{{
** ./CPythonLib/plat-sunos5/STROPTS.py
line 836:8 required (...)+ loop did not match anything at character u'('
line 1396:24 required (...)+ loop did not match anything at character u'A'
line 1397:24 required (...)+ loop did not match anything at character u'A'

}}}
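
To make clear what "one logical line split over two lines" means at the token level, here is a small sketch using Python's own standard tokenize module rather than ANTLR. It assumes a Python 3 interpreter; the helper name newline_count and the sample strings are made up for illustration. A backslash continuation should not produce an end-of-logical-line (NEWLINE) token between the two physical lines, which is the behavior I understand the CONTINUED_LINE rule is meant to provide.

```python
import io
import tokenize


def newline_count(source):
    """Count NEWLINE tokens, i.e. ends of logical lines, in source."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return sum(1 for tok in tokens if tok.type == tokenize.NEWLINE)


# One logical line spread over two physical lines with a backslash.
CONTINUED = "total = 1 + \\\n    2\n"

# Two separate logical lines, for comparison.
TWO_STATEMENTS = "a = 1\nb = 2\n"

if __name__ == "__main__":
    print(newline_count(CONTINUED))
    print(newline_count(TWO_STATEMENTS))
```

The continued statement contains a single NEWLINE token, while the two plain statements contain two.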

Can someone suggest a fix?  I tried putting { pass;} in the empty
alternative, but the generated statement is not placed at the correct
indentation level, and it feels like a hack anyway.
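
For anyone who wants to reproduce the syntax problem without running ANTLR, here is a minimal sketch. Only the name alt28 and the comment line come from the generated lexer; the surrounding if branch, the helper name compiles, and the constants are made up for illustration. It checks three variants: the empty elif body, the same body with a correctly indented "pass", and a "pass" emitted at the wrong indentation level.

```python
# Hand-written stand-in for the generated lexer code (illustrative only).
DANGLING = (
    "alt28 = 2\n"
    "if alt28 == 1:\n"
    "    x = 1\n"
    "elif alt28 == 2:\n"
    "    # Python.g:615:10:\n"
)

# Same code with a "pass" at the indentation level of the elif body.
WITH_PASS = DANGLING + "    pass\n"

# Same code with a "pass" emitted at the wrong (outer) indentation level.
MISINDENTED_PASS = DANGLING + "pass\n"


def compiles(source):
    """Return True if source is syntactically valid Python."""
    try:
        compile(source, "<sketch>", "exec")
        return True
    except SyntaxError:  # IndentationError is a subclass of SyntaxError
        return False


if __name__ == "__main__":
    for name in ("DANGLING", "WITH_PASS", "MISINDENTED_PASS"):
        print(name, compiles(globals()[name]))
```

Only the variant with a correctly indented "pass" compiles, so a comment alone does not count as a block body.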

Any other feedback appreciated.

Thanks,
Aaron

-- 
Aaron Maxwell
http://redsymbol.net