Re: [antlr-interest] Newbie - How to count indentation level?

Koehne Kai Kai.Koehne at student.hpi.uni-potsdam.de
Fri Jun 9 03:06:12 PDT 2006


Hi,
 
Have a look at the Python grammar. It introduces a separate stage, a token stream filter, between the lexer and the parser. This stage converts the explicit indentation information (the leading whitespace) into virtual INDENT and DEDENT tokens.
 
Imagine the following input:
 
A
  B
C
 
Your lexer could scan this as:
 
[A]
[NEWLINE]
[LEADING_WS, '  ']
[B]
[NEWLINE]
[C]
 
The token stream stage would generate:
 
[A]
[NEWLINE]
[INDENT]
[B]
[NEWLINE]
[DEDENT]
[C] 
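
To make the conversion step concrete, here is a minimal standalone sketch in Python (not ANTLR code, and not taken from the Python grammar). It assumes the input is already split into lines, that every whitespace character counts as one column, and that each line holds a single name; the function name indent_tokens and the (kind, text) tuples are made up for illustration. Unlike the listing above, it also emits a NEWLINE after the last line.

```python
def indent_tokens(lines):
    """Turn raw lines into a flat token list with virtual INDENT/DEDENT tokens.

    Assumption: each whitespace character (space or tab) counts as one column.
    """
    stack = [0]      # widths of the indentation levels that are currently open
    tokens = []
    for raw in lines:
        line = raw.rstrip()
        text = line.lstrip(" \t")
        if not text:
            continue  # blank lines do not change the nesting
        width = len(line) - len(text)
        if width > stack[-1]:
            # deeper than the current level: open exactly one new level
            stack.append(width)
            tokens.append(("INDENT", None))
        while width < stack[-1]:
            # shallower: close levels until we are back at a known width
            stack.pop()
            tokens.append(("DEDENT", None))
        if width != stack[-1]:
            raise ValueError("inconsistent dedent in line: %r" % raw)
        tokens.append(("NAME", text))
        tokens.append(("NEWLINE", None))
    # close every level that is still open at end of input
    while len(stack) > 1:
        stack.pop()
        tokens.append(("DEDENT", None))
    return tokens
```

Because the stack grows with the input, the nesting depth is not limited by the grammar. In ANTLR the same logic would probably live in a class implementing the TokenStream interface that wraps the lexer.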
 
This token sequence can be handled quite nicely by the parser (NOT TESTED!):
 
entity: A
       |  B
       |  C
       ;
 
statements: e:entity (NEWLINE! | EOF!) (in:INDENT! (statements)+ DEDENT!)?
                   {
                     if (#in != null)
                       #statements = #(#e, #in.getNextSibling());
                   }
          ;
          
program: (statements)+
              {
                #program = #([ROOT], #program);
              }
       ;
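
Since the grammar above is untested, here is a small hand-written recursive-descent sketch in Python that follows the same two rules, just to show that one recursive rule is enough for any nesting depth. It is only an illustration, not ANTLR output: the token format (kind, text), the NAME token kind and the function name parse_program are assumptions of mine, and the tree is built from plain tuples instead of AST nodes.

```python
def parse_program(tokens):
    """program: (statements)+   ->   ("Root", [subtrees])"""
    pos = 0

    def peek():
        # kind of the next token, or "EOF" when the input is exhausted
        return tokens[pos][0] if pos < len(tokens) else "EOF"

    def expect(kind):
        nonlocal pos
        if peek() != kind:
            raise SyntaxError("expected %s, found %s at token %d" % (kind, peek(), pos))
        text = tokens[pos][1]
        pos += 1
        return text

    def statements():
        # statements: entity (NEWLINE | EOF) (INDENT (statements)+ DEDENT)?
        name = expect("NAME")
        if peek() != "EOF":
            expect("NEWLINE")
        children = []
        if peek() == "INDENT":
            expect("INDENT")
            children.append(statements())      # at least one nested statement
            while peek() == "NAME":
                children.append(statements())
            expect("DEDENT")
        return (name, children)

    roots = [statements()]
    while peek() == "NAME":
        roots.append(statements())
    if peek() != "EOF":
        raise SyntaxError("unexpected %s at token %d" % (peek(), pos))
    return ("Root", roots)


# Token sequence for the A / B / C example from this mail
example = [("NAME", "A"), ("NEWLINE", None), ("INDENT", None),
           ("NAME", "B"), ("NEWLINE", None), ("DEDENT", None),
           ("NAME", "C")]
tree = parse_program(example)
```

Here tree has A and C as children of the root, with B nested below A. The indented block is attached to the entity that precedes it, which is what the tree-building action in the grammar is meant to do.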
 
 
Python grammar: http://www.antlr.org/grammar/1078018002577/python.tar.gz
TokenStream interface: http://www.antlr.org/javadoc/antlr/TokenStream.html
 
Regards,
 
Kai Koehne

________________________________

From: antlr-interest-bounces at antlr.org on behalf of Juho Jussila
Sent: Fri 09.06.2006 11:25
To: antlr-interest at antlr.org
Subject: [antlr-interest] Newbie - How to count indentation level?



Hi

I'm trying to parse the following simple text and build an AST.

------------------------------
E01
H01
        H04
        H05
                H06
                H06
        H07
        H02
                H05
                H03
H08
H81
H09
        H22
------------------------------

The AST should look like this:

          Root
          / \
       E01  H01  ...
           / | \
          /  |  \
        H04 H05 H07 ...
            / \
          H06 H06


I managed to create a grammar, but the problem is that the maximum
indentation level is hard-coded. Is there a way to make this more generic
and allow an unlimited indentation level?

------------------------------
class P extends Parser;
options {
    buildAST=true;
    k=4;
}
tab1 : TAB;
tab2 : tab1 TAB;

start : (level1)* { ## = #([ROOT,"Root"], ##); }
     ;
level1 :
        TUNNUS^ newline! (level2)*
        ;
level2:
        tab1! TUNNUS^ newline! (level3)*
        ;
level3:
        tab2! TUNNUS newline!
        ;
newline:
        NEWLINE | EOF
        ;


class L extends Lexer;
options {
    caseSensitive = false;
}
protected LETTER: ('a'..'ö');
protected NUMBER: ('0'..'9');
TUNNUS:     LETTER (LETTER|NUMBER)*;
NEWLINE
    :   '\r' '\n'    // DOS
    |   '\n'        // UNIX   
    { newline(); };
WS  :   (' ') { $setType(Token.SKIP); };
TAB : '\t';
------------------------------


Another attempt:
------------------------------
...
start : (level1)* { ## = #([ROOT,"Root"], ##); }
     ;

level[int i]
{ int count = 0; }
    :
        TUNNUS^ newline!
        ( { count < (i+1) }?
            TAB
            { count++; }    
        )*
        ({ count == (i+1) }? (level[i+1]))*
    ;
...
------------------------------

But it doesn't work. The result in XML format:

<Root>
  <E01/>
  <H01>
    <H04/>
    <H05>
      <H06/>
      <H06/>
      <H07/>
      <H02/>
      <H05/>
      <H03/>
      <H08/>
      <H81/>
      <H09/>
      <H22/>
    </H05>
  </H01>
</Root>


--
Thanks in advance

Juho Jussila




