[antlr-interest] How to handle python-like indented code blocks

Thu Nov 26 12:13:45 PST 2009

Eric,

What you need to do is either let the TABs through and count them in the parser or, if they can only be at the start of the line, set a flag to true when the lexer starts and again whenever you hit a newline. Based on that flag, you either let the tabs through, count them and issue LEVEL1, LEVEL2, etc. (flag true), or skip() them (flag false):

@lexer::members {
boolean countTabs = true;
}

fragment LEVEL1:;
fragment LEVEL2:;
fragment LEVEL3:;

TAB
@init {
int tabCount = 0;
}
: ('\t' { tabCount++; })+
  {
    if (countTabs) {
        switch (tabCount) {
          case 1: $type = LEVEL1; break;
          case 2: $type = LEVEL2; break;
          case 3: $type = LEVEL3; break;
          default: skip(); // too many levels error
        }
    }
    else {
        skip();
    }
    countTabs = false;
  }
;

NL : '\r'? '\n' { countTabs = true; skip(); } ;
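
To see which tabs end up counted, here is a minimal plain-Java sketch of the same flag logic. It is not ANTLR-generated code: the class name TabLevelSketch and the method tokenize are made up for illustration, it assumes spaces are insignificant, it emits LEVELn for any depth rather than stopping at 3, and (unlike the TAB rule above, which only clears the flag after a run of tabs) it also clears the flag after any other token.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone illustration of the countTabs flag; not ANTLR output.
class TabLevelSketch {

    // Returns "LEVELn" for a run of n tabs at the start of a line and the raw
    // text for every other non-whitespace chunk. Tabs elsewhere are dropped,
    // which corresponds to skip() in the grammar.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        boolean countTabs = true; // true at start of input and after each newline
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '\t') {
                int tabCount = 0;
                while (i < input.length() && input.charAt(i) == '\t') {
                    tabCount++;
                    i++;
                }
                if (countTabs) {
                    tokens.add("LEVEL" + tabCount);
                }
                countTabs = false;
            } else if (c == '\n' || c == '\r') {
                countTabs = true; // next run of tabs is leading again
                i++;
            } else if (c == ' ') {
                i++; // assumption: spaces carry no meaning
            } else {
                int start = i;
                while (i < input.length() && " \t\r\n".indexOf(input.charAt(i)) < 0) {
                    i++;
                }
                tokens.add(input.substring(start, i));
                countTabs = false; // assumption: tabs after other text are not indentation
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [a, LEVEL1, b, c, LEVEL2, d]
        System.out.println(tokenize("a\n\tb\tc\n\t\td"));
    }
}
```

The tab between b and c is dropped because the flag was already cleared on that line, while the two tabs before d come right after a newline and so become LEVEL2.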

Then your parser says:

struct : element+;

element : (elementAtom | level1Element)+ ;
level1Element : (LEVEL1 elementAtom | level2Element)+ ;

Or something similar to that - you can be smarter than that of course, for the sake of error processing/recovery.
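
If you would rather not hard-code one rule per level, one possible alternative is to keep a stack of open nodes and attach each line to the nearest node one level up. The following is only a rough sketch under stated assumptions: IndentTreeSketch, Node and parse are hypothetical names, the input is already split into lines, indentation is one tab per level, and the rest of each line is kept as opaque text rather than parsed.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical illustration: builds a tree from tab-indented lines.
class IndentTreeSketch {

    static final class Node {
        final String text;
        final List<Node> children = new ArrayList<>();
        Node(String text) { this.text = text; }
    }

    static Node parse(List<String> lines) {
        Node root = new Node("<root>"); // artificial parent of all level-0 nodes
        Deque<Node> open = new ArrayDeque<>();
        open.push(root); // invariant: open.size() - 1 is the level of the next possible child
        for (String line : lines) {
            int level = 0;
            while (level < line.length() && line.charAt(level) == '\t') {
                level++;
            }
            String text = line.substring(level).trim();
            if (text.isEmpty()) {
                continue; // ignore blank lines
            }
            if (level + 1 > open.size()) {
                throw new IllegalArgumentException("indent jumps more than one level: " + text);
            }
            // Close nodes that are at the same level or deeper than this line.
            while (open.size() > level + 1) {
                open.pop();
            }
            Node node = new Node(text);
            open.peek().children.add(node);
            open.push(node);
        }
        return root;
    }

    public static void main(String[] args) {
        Node root = parse(List.of("s4", "\ts4s4", "\ts4s5"));
        // prints s4 has 2 children
        System.out.println(root.children.get(0).text + " has "
                + root.children.get(0).children.size() + " children");
    }
}
```

Rejecting a line that jumps more than one level gives you a single place to report indentation errors, which is the kind of error handling the fixed LEVEL1/LEVEL2 rules make awkward.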

Jim

PS: There used to be a Python example that overrode nextToken() as well; look in the grammars section of the web site.

> -----Original Message-----
> From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-bounces at antlr.org] On Behalf Of Eric Bell
> Sent: Thursday, November 26, 2009 11:42 AM
> To: antlr-interest at antlr.org
> Subject: [antlr-interest] How to handle python-like indented code blocks
> 
> Could someone get me pointed in the right direction for how to parse
> grammars that use indenting to identify code blocks ... like in Python?
> 
> I searched around and looked through the source code for boo, which uses a
> python-like grammar, but I am a newbie to this and it's too much code for me
> to figure out.
> 
> I am trying to parse a file that defines nodes in a tree. Indenting is used
> to show that nodes are children of a parent, like this:
> 
>               s4\0 [n c] [r 0\22\33]
>                      s4s4\# [n t] [p s4]
>                      s4s5\.1 [n t] [p s5]
> 
> "s4", "s4s4", "s4s5" are node-names, with "s4s4" and "s4s5" being children
> of node "s4". The indenting uses a tab character, with one tab per indent
> level.
> 
> Thanks,
> 
> --eric