[antlr-interest] Context-sensitive lexer

Fri Jun 17 11:37:03 PDT 2011

Hi Jonas,

I would not put so much responsibility inside the lexer. This is really the
task of the parser.
How about something like this:

grammar test;

options {
  output=AST;
}

tokens {
  FILE;
  SECTIONS;
  LINE;
}

parse
  :  title (section NL)+ EOF -> ^(FILE title ^(SECTIONS section+))
  ;

title
  :  TITLE NL (anyWord+ NL)+ NL -> ^(TITLE anyWord+)
  ;

section
  :  SECTION NL (anyWordExceptEnd+ NL)+ END NL -> ^(SECTION anyWordExceptEnd+)
  ;

anyWordExceptEnd
  :  WORD
  |  SECTION
  |  TITLE
  ;

anyWord
  :  anyWordExceptEnd
  |  END
  ;

SECTION
  :  'SECTION' '0'..'9'+
  ;

END
  :  'END'
  ;

TITLE
  :  'TITLE'
  ;

WORD
  :  ('a'..'z' | 'A'..'Z')+
  ;

NL
  :  '\r'? '\n'
  |  '\r'
  ;

SPACE
  :  (' ' | '\t') {$channel=HIDDEN;}
  ;

A small test class:

import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;

public class Main {
  public static void main(String[] args) throws Exception {
    String source =
        "TITLE            \n" +
        "some             \n" +
        "title            \n" +
        "text             \n" +
        "                 \n" +
        "SECTION1         \n" +
        " a b             \n" +
        " c               \n" +
        "END              \n" +
        "                 \n" +
        "SECTION2         \n" +
        "  SECTION2 text  \n" +
        "END              \n" +
        "                 \n" +
        "SECTION3         \n" +
        "  more text      \n" +
        "END              \n" +
        "\n";
    testLexer lexer = new testLexer(new ANTLRStringStream(source));
    testParser parser = new testParser(new CommonTokenStream(lexer));
    CommonTree tree = (CommonTree)parser.parse().getTree();
    DOTTreeGenerator gen = new DOTTreeGenerator();
    StringTemplate st = gen.toDOT(tree);
    System.out.println(st);
  }
}

Running it prints the DOT source for the AST attached to this message.
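
To show the same idea without ANTLR, here is a minimal hand-written Java sketch. The lexer assigns token types from the characters alone, and only the parser decides whether TITLE, SECTION or END acts as a keyword or as ordinary text. The class ContextFreeLexerSketch and all of its helpers are hypothetical names made up for this illustration, and, unlike the WORD rule above, the sketch assumes that any run of non-whitespace characters is a word.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not ANTLR output: the lexer is context-free and the
// parser alone decides when a keyword is just an ordinary word.
class ContextFreeLexerSketch {
    enum Kind { TITLE, SECTION, END, WORD, NL, EOF }

    static final class Token {
        final Kind kind;
        final String text;
        Token(Kind kind, String text) { this.kind = kind; this.text = text; }
    }

    static boolean isBreak(char c) { return c == ' ' || c == '\t' || c == '\r' || c == '\n'; }

    // A token's kind depends only on its own characters, never on lexer state.
    // Assumption: any run of non-whitespace characters counts as one word.
    static List<Token> lex(String src) {
        List<Token> out = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (c == ' ' || c == '\t') { i++; continue; }  // skipped, like SPACE
            if (c == '\r' || c == '\n') {                  // "\r\n", "\n" or "\r"
                if (c == '\r' && i + 1 < src.length() && src.charAt(i + 1) == '\n') i++;
                i++;
                out.add(new Token(Kind.NL, "\n"));
                continue;
            }
            int start = i;
            while (i < src.length() && !isBreak(src.charAt(i))) i++;
            String text = src.substring(start, i);
            Kind kind = text.equals("TITLE") ? Kind.TITLE
                      : text.equals("END") ? Kind.END
                      : text.matches("SECTION[0-9]+") ? Kind.SECTION
                      : Kind.WORD;
            out.add(new Token(kind, text));
        }
        out.add(new Token(Kind.EOF, ""));
        return out;
    }

    private final List<Token> tokens;
    private int pos = 0;

    ContextFreeLexerSketch(List<Token> tokens) { this.tokens = tokens; }

    Kind peek() { return tokens.get(pos).kind; }

    void expect(Kind kind) {
        if (peek() != kind) throw new IllegalStateException("expected " + kind + ", found " + peek());
        pos++;
    }

    // anyWord (allowEnd = true) or anyWordExceptEnd (allowEnd = false).
    String word(boolean allowEnd) {
        Token t = tokens.get(pos);
        boolean ok = t.kind == Kind.WORD || t.kind == Kind.SECTION || t.kind == Kind.TITLE
                || (allowEnd && t.kind == Kind.END);
        if (!ok) throw new IllegalStateException("unexpected " + t.kind);
        pos++;
        return t.text;
    }

    // Shared shape of title and section bodies: (word+ NL)+ up to a stop token.
    List<String> lines(boolean allowEnd, Kind stop) {
        List<String> words = new ArrayList<>();
        do {
            do { words.add(word(allowEnd)); } while (peek() != Kind.NL);
            expect(Kind.NL);
        } while (peek() != stop);
        return words;
    }

    // parse : title (section NL)+ EOF ; the first list is the title text.
    static List<List<String>> parse(String src) {
        ContextFreeLexerSketch p = new ContextFreeLexerSketch(lex(src));
        List<List<String>> result = new ArrayList<>();
        p.expect(Kind.TITLE); p.expect(Kind.NL);
        result.add(p.lines(true, Kind.NL));    // title ends at a blank line
        p.expect(Kind.NL);
        do {
            p.expect(Kind.SECTION); p.expect(Kind.NL);
            result.add(p.lines(false, Kind.END));  // section ends at END
            p.expect(Kind.END); p.expect(Kind.NL);
            p.expect(Kind.NL);
        } while (p.peek() != Kind.EOF);
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("TITLE\nsome title\n\nSECTION1\n a b\n c\nEND\n\n"));
    }
}
```

Note that the SECTION2 token inside the body of the second section of the test input is accepted as plain text by word(false), which is the same job anyWordExceptEnd does in the grammar.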

Regards,

Bart.

On Fri, Jun 17, 2011 at 2:15 PM, Jonas <jonas.hagmar at gmail.com> wrote:

> Hi,
>
> I'm developing a parser for a file format where context is very
> important. I'm looking to
> 1) understand why my ANTLR parser gets into infinite loops
> 2) find out if there is any better way to implement context
> sensitivity than what I am doing with semantic predicates.
>
> A typical beginning of a file looks like this:
> TITLE
> some title text
>
> SECTION1
>  a=b*c
> END
>
> SECTION2
> ...
>
> SECTION3
> ...
>
> The syntax differs from section to section; the 'TITLE' section is
> terminated by the newline after the title text line, while other
> sections can e.g. use single quote string literals and be terminated
> by a keyword like 'END'. Here is a sample grammar, that gets into an
> infinite loop:
>
> grammar test;
>
> options {
>  output=AST;
> }
>
> @lexer::members {
>  static final int STATE_AT_BEGINNING = 0;
>  static final int STATE_IN_TITLE = 1;
>  static final int STATE_AFTER_TITLE = 2;
>  int lexerState = STATE_AT_BEGINNING;
> }
>
> file    :       title;
>
> title   :       BEGIN_TITLE TITLE_TEXT END_TITLE;
>
> BEGIN_TITLE
>        : {(lexerState == STATE_AT_BEGINNING)}? 'TITLE' WS_NL {lexerState=STATE_IN_TITLE;}
>        ;
>
> TITLE_TEXT
>        : {lexerState == STATE_IN_TITLE}? TEXT
>        ;
>
> END_TITLE
>        : {lexerState == STATE_IN_TITLE}? NL {lexerState=STATE_AFTER_TITLE;}
>        ;
>
> BLANK_ROW
>        : {!(lexerState == STATE_IN_TITLE)}? WS_NL
>        ;
>
> REMARK  : {!(lexerState == STATE_IN_TITLE)}? 'REMA' .* NL
>        ;
>
> fragment
> WS_NL   :       (' ' | '\t')* NL;
>
> fragment
> NL      :       '\r'? '\n';
>
> fragment
> TEXT    :       (~('\r' | '\n'))*;
>
> Best Regards,
> Jonas
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ast.png
Type: image/png
Size: 8574 bytes
Desc: not available
Url : http://www.antlr.org/pipermail/antlr-interest/attachments/20110617/0a7ec400/attachment.png