Re: [antlr-interest] Newbie - How to count indentation level?
Koehne Kai
Kai.Koehne at student.hpi.uni-potsdam.de
Fri Jun 9 03:06:12 PDT 2006
Hi,
Have a look at the Python grammar. It introduces a separate stage, a token stream, between the lexer and the parser. This stage converts the explicit indentation information into virtual INDENT and DEDENT symbols.
Imagine the following input:
A
 B
C
Your lexer could scan this as:
[A]
[NEWLINE]
[LEADING_WS, ' ']
[B]
[NEWLINE]
[C]
The token stream stage would generate:
[A]
[NEWLINE]
[INDENT]
[B]
[NEWLINE]
[DEDENT]
[C]
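As a rough illustration of what such a token stream stage does, here is a minimal sketch in Python rather than ANTLR/Java. The function name add_indent_tokens and the (type, text) tuple representation are assumptions made for this example only; they are not taken from the Python grammar linked below. Inconsistent dedents (a width that matches no open block) are not checked.

```python
def add_indent_tokens(tokens):
    """Replace LEADING_WS tokens with virtual INDENT/DEDENT tokens.

    `tokens` is a list of (type, text) pairs, as a lexer might emit them.
    Hypothetical helper for illustration only.
    """
    out = []
    stack = [0]           # indentation widths of the currently open blocks
    width = 0             # width of the leading whitespace on this line
    at_line_start = True
    for kind, text in tokens:
        if kind == "LEADING_WS":
            width = len(text)          # remember the width, drop the token
            continue
        if kind == "NEWLINE":
            out.append((kind, text))
            at_line_start = True
            width = 0
            continue
        if at_line_start:
            if width > stack[-1]:      # deeper than before: open one block
                stack.append(width)
                out.append(("INDENT", ""))
            while width < stack[-1]:   # shallower: close blocks until it fits
                stack.pop()
                out.append(("DEDENT", ""))
            at_line_start = False
        out.append((kind, text))
    while len(stack) > 1:              # close blocks still open at end of input
        stack.pop()
        out.append(("DEDENT", ""))
    return out


lexed = [("A", "A"), ("NEWLINE", "\n"),
         ("LEADING_WS", " "), ("B", "B"), ("NEWLINE", "\n"),
         ("C", "C")]
print([kind for kind, _ in add_indent_tokens(lexed)])
```

The stack of widths is what removes the hard-coded nesting limit: each deeper line pushes one entry, and each shallower line pops as many as needed. For the example input this yields the A, NEWLINE, INDENT, B, NEWLINE, DEDENT, C sequence listed above.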
This token stream can be handled quite nicely by the parser (NOT TESTED!):
entity: A
      | B
      | C
      ;
statements: e:entity (NEWLINE! | EOF!) (in:INDENT! (statements)+ DEDENT!)?
    {
        if (#in != null)
            #statements = #(#e, #in.getNextSibling());
    }
    ;
program: (statements)+
    {
        #program = #([ROOT], #program);
    }
    ;
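To show why INDENT/DEDENT tokens make the nesting depth unlimited, here is a small recursive-descent sketch in Python that mirrors the shape of the rules above. It is not generated by ANTLR; parse_program and the (name, children) tuples are assumptions for illustration, and unlike the grammar it treats the NEWLINE after an entity as optional.

```python
def parse_program(tokens):
    """Build a nested (name, children) tree from a list of token type names.

    Hypothetical helper for illustration; mirrors
    statements: entity NEWLINE (INDENT statements+ DEDENT)?
    """
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else "EOF"

    def statement():
        nonlocal pos
        name = peek()
        if name in ("EOF", "NEWLINE", "INDENT", "DEDENT"):
            raise SyntaxError("expected an entity, got " + name)
        pos += 1
        if peek() == "NEWLINE":        # assumption: newline is optional here
            pos += 1
        children = []
        if peek() == "INDENT":         # an indented block becomes the children
            pos += 1
            while peek() not in ("DEDENT", "EOF"):
                children.append(statement())
            if peek() != "DEDENT":
                raise SyntaxError("missing DEDENT")
            pos += 1
        return (name, children)

    roots = []
    while peek() != "EOF":
        roots.append(statement())
    return ("ROOT", roots)


tree = parse_program(["A", "NEWLINE", "INDENT", "B", "NEWLINE", "DEDENT", "C"])
print(tree)
```

Because statement() calls itself for every indented block, no rule per indentation level is needed; the recursion depth simply follows the INDENT/DEDENT pairs.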
Python grammar: http://www.antlr.org/grammar/1078018002577/python.tar.gz
TokenStream interface: http://www.antlr.org/javadoc/antlr/TokenStream.html
Regards,
Kai Koehne
________________________________
From: antlr-interest-bounces at antlr.org on behalf of Juho Jussila
Sent: Fri 09.06.2006 11:25
To: antlr-interest at antlr.org
Subject: [antlr-interest] Newbie - How to count indentation level?
Hi
I'm trying to parse the following simple text and build an AST.
------------------------------
E01
H01
H04
H05
H06
H06
H07
H02
H05
H03
H08
H81
H09
H22
------------------------------
AST should be like this:
      Root
     /    \
  E01      H01 ...
          / | \
         /  |  \
      H04  H05  H07 ...
           / \
        H06   H06
I managed to create a grammar, but the problem is that the maximum
indentation level is hard-coded. Is there a way to make this more generic
and allow an unlimited indentation level?
------------------------------
class P extends Parser;
options {
    buildAST=true;
    k=4;
}
tab1 : TAB;
tab2 : tab1 TAB;
start : (level1)* { ## = #([ROOT,"Root"], ##); }
    ;
level1 :
    TUNNUS^ newline! (level2)*
    ;
level2:
    tab1! TUNNUS^ newline! (level3)*
    ;
level3:
    tab2! TUNNUS newline!
    ;
newline:
    NEWLINE | EOF
    ;
class L extends Lexer;
options {
    caseSensitive = false;
}
protected LETTER: ('a'..'ö');
protected NUMBER: ('0'..'9');
TUNNUS: LETTER (LETTER|NUMBER)*;
NEWLINE
    : '\r' '\n' // DOS
    | '\n' // UNIX
    { newline(); };
WS : (' ') { $setType(Token.SKIP); };
TAB : '\t';
------------------------------
Another attempt:
------------------------------
...
start : (level1)* { ## = #([ROOT,"Root"], ##); }
    ;
level[int i]
    { int count = 0; }
    :
    TUNNUS^ newline!
    ( { count < (i+1) }?
        TAB
        { count++; }
    )*
    ({ count == (i+1) }? (level[i+1]))*
    ;
...
------------------------------
But it doesn't work. The result, in XML format:
<Root>
  <E01/>
  <H01>
    <H04/>
    <H05>
      <H06/>
      <H06/>
      <H07/>
      <H02/>
      <H05/>
      <H03/>
      <H08/>
      <H81/>
      <H09/>
      <H22/>
    </H05>
  </H01>
</Root>
--
Thanks in advance
Juho Jussila