Re: [antlr-interest] Newbie - How to count indentation level?

Fri Jun 9 03:06:12 PDT 2006

Hi,

Have a look at the Python grammar. It introduces a separate stage, a token stream, between the lexer and the parser. This stage converts the explicit indentation information into virtual INDENT and DEDENT tokens.

Imagine the following input:

A
  B
C

Your lexer could scan this as:

[A]
[NEWLINE]
[LEADING_WS, '  ']
[B]
[NEWLINE]
[C]
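
To make the lexer output above concrete, here is a rough hand-written sketch in plain Java. It is not ANTLR-generated code: the class name LineLexer and the string encoding of tokens ("NEWLINE", "WS:" followed by the whitespace itself) are assumptions made up for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the lexer; tokens are plain strings.
final class LineLexer {

    // Splits the input into word tokens, "NEWLINE" tokens, and one
    // "WS:<text>" token per run of leading whitespace on a line.
    static List<String> scan(String input) {
        List<String> out = new ArrayList<>();
        boolean atLineStart = true;
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '\n') {
                out.add("NEWLINE");
                atLineStart = true;
                i++;
            } else if (c == '\r') {
                i++;                       // DOS line ending: the '\n' emits NEWLINE
            } else if (c == ' ' || c == '\t') {
                int start = i;
                while (i < input.length()
                        && (input.charAt(i) == ' ' || input.charAt(i) == '\t')) {
                    i++;
                }
                // Only leading whitespace matters; whitespace elsewhere is skipped.
                if (atLineStart) {
                    out.add("WS:" + input.substring(start, i));
                }
            } else {
                int start = i;
                while (i < input.length()
                        && !Character.isWhitespace(input.charAt(i))) {
                    i++;
                }
                out.add(input.substring(start, i));
                atLineStart = false;
            }
        }
        return out;
    }
}
```

Calling LineLexer.scan("A\n  B\nC") yields the six tokens listed above, with the leading whitespace encoded as "WS:  ".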

The token stream stage would generate:

[A]
[NEWLINE]
[INDENT]
[B]
[NEWLINE]
[DEDENT]
[C] 
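
The conversion from leading whitespace to INDENT/DEDENT can be sketched with a stack of open indentation widths. The Java below is only an illustration of that bookkeeping under some assumptions: IndentFilter is a made-up name, it does not implement antlr.TokenStream, tokens are plain strings, and leading whitespace is encoded as "WS:" followed by the whitespace itself. Each whitespace character counts as one column (tabs are not expanded), and inconsistent dedents are not reported.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical stage between lexer and parser; tokens are plain strings.
final class IndentFilter {

    static List<String> filter(List<String> in) {
        List<String> out = new ArrayList<>();
        Deque<Integer> levels = new ArrayDeque<>();
        levels.push(0);                    // column 0 is always open
        boolean atLineStart = true;
        int width = 0;                     // leading whitespace seen on this line

        for (String tok : in) {
            if (tok.equals("NEWLINE")) {
                out.add(tok);
                atLineStart = true;
                width = 0;
            } else if (atLineStart && tok.startsWith("WS:")) {
                width += tok.length() - "WS:".length();
            } else {
                if (atLineStart) {
                    // First real token of a line: compare with the open levels.
                    if (width > levels.peek()) {
                        levels.push(width);
                        out.add("INDENT");
                    }
                    while (width < levels.peek()) {
                        levels.pop();
                        out.add("DEDENT");
                    }
                    atLineStart = false;
                }
                out.add(tok);
            }
        }
        // Close any blocks still open at end of input.
        while (levels.peek() > 0) {
            levels.pop();
            out.add("DEDENT");
        }
        return out;
    }
}
```

Feeding it A, NEWLINE, "WS:  ", B, NEWLINE, C produces the seven tokens shown above. In a real ANTLR 2 setup the same logic would presumably sit in a class implementing TokenStream.nextToken() and buffer the pending virtual tokens.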

Which can be handled quite nicely by the parser (NOT TESTED!):

entity: A
       |  B
       |  C
       ;

statements: e:entity (NEWLINE! | EOF!) (in:INDENT! (statements)+ DEDENT!)?
                   {
                     if (#in != null)
                       #statements = #(#e, #in.getNextSibling());
                   }
       ;

program: (statements)+
              {
                #program = #([ROOT], #program);
              }
       ;
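
For readers who want to see the intended tree shape without running ANTLR, the recursion in the statements rule can be mimicked by a small hand-written recursive-descent routine. This is only a sketch under assumptions: Node and TreeBuilder are made-up names, tokens are plain strings, and there is no error handling for malformed input such as blank lines or stray DEDENTs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical AST node; not an antlr.collections.AST.
final class Node {
    final String text;
    final List<Node> children = new ArrayList<>();

    Node(String text) { this.text = text; }

    // LISP-style rendering: a leaf prints as its text, an inner node
    // as "( text child child ... )".
    @Override
    public String toString() {
        if (children.isEmpty()) return text;
        StringBuilder sb = new StringBuilder("( ").append(text);
        for (Node c : children) sb.append(' ').append(c);
        return sb.append(" )").toString();
    }
}

// Hypothetical hand-written counterpart of the grammar rules.
final class TreeBuilder {
    private final List<String> tokens;
    private int pos = 0;

    TreeBuilder(List<String> tokens) { this.tokens = tokens; }

    private String peek() {
        return pos < tokens.size() ? tokens.get(pos) : "EOF";
    }

    // program: (statements)+ , wrapped in an artificial ROOT node
    Node program() {
        Node root = new Node("ROOT");
        while (!peek().equals("EOF")) {
            root.children.add(statements());
        }
        return root;
    }

    // statements: entity (NEWLINE | EOF) (INDENT (statements)+ DEDENT)?
    private Node statements() {
        Node entity = new Node(tokens.get(pos++));
        if (peek().equals("NEWLINE")) pos++;
        if (peek().equals("INDENT")) {
            pos++;
            // Everything up to the matching DEDENT becomes a child.
            while (!peek().equals("DEDENT") && !peek().equals("EOF")) {
                entity.children.add(statements());
            }
            if (peek().equals("DEDENT")) pos++;
        }
        return entity;
    }
}
```

For the token sequence A, NEWLINE, INDENT, B, NEWLINE, DEDENT, C, calling new TreeBuilder(tokens).program().toString() gives "( ROOT ( A B ) C )". Because statements() calls itself for every INDENT, the nesting depth is not limited by the grammar.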

Python grammar: http://www.antlr.org/grammar/1078018002577/python.tar.gz
TokenStream interface: http://www.antlr.org/javadoc/antlr/TokenStream.html

Regards,

Kai Koehne

________________________________

From: antlr-interest-bounces at antlr.org on behalf of Juho Jussila
Sent: Fri 09.06.2006 11:25
To: antlr-interest at antlr.org
Subject: [antlr-interest] Newbie - How to count indentation level?

Hi

I'm trying to parse the following simple text and build an AST.

------------------------------
E01
H01
        H04
        H05
                H06
                H06
        H07
        H02
                H05
                H03
H08
H81
H09
        H22
------------------------------

The AST should look like this:

          Root
          / \
       E01  H01  ...
           / | \
          /  |  \
        H04 H05 H07 ...
            / \
          H06 H06

I managed to create a grammar, but the problem is that the maximum indentation
level is hard-coded. Is there a way to make this more generic and
allow an unlimited indentation level?

------------------------------
class P extends Parser;
options {
    buildAST=true;
    k=4;
}
tab1 : TAB;
tab2 : tab1 TAB;

start : (level1)* { ## = #([ROOT,"Root"], ##); }
     ;
level1 :
        TUNNUS^ newline! (level2)*
        ;
level2:
        tab1! TUNNUS^ newline! (level3)*
        ;
level3:
        tab2! TUNNUS newline!
        ;
newline:
        NEWLINE | EOF
        ;

class L extends Lexer;
options {
    caseSensitive = false;
}
protected LETTER: ('a'..'ö');
protected NUMBER: ('0'..'9');
TUNNUS:     LETTER (LETTER|NUMBER)*;
NEWLINE
    :   (   '\r' '\n'    // DOS
        |   '\n'         // UNIX
        )
        { newline(); }   // runs for both alternatives
    ;
WS  :   (' ') { $setType(Token.SKIP); };
TAB : '\t';
------------------------------

Another attempt:
------------------------------
...
start : (level1)* { ## = #([ROOT,"Root"], ##); }
     ;

level[int i]
{ int count = 0; }
    :
        TUNNUS^ newline!
        ( { count < (i+1) }?
            TAB
            { count++; }    
        )*
        ({ count == (i+1) }? (level[i+1]))*
    ;
...
------------------------------

But it doesn't work. The result, in XML format:

<Root>
  <E01/>
  <H01>
    <H04/>
    <H05>
      <H06/>
      <H06/>
      <H07/>
      <H02/>
      <H05/>
      <H03/>
      <H08/>
      <H81/>
      <H09/>
      <H22/>
    </H05>
  </H01>
</Root>

--
Thanks in advance

Juho Jussila