[antlr-interest] Context-sensitive lexer

Fri Jun 17 14:09:18 PDT 2011

Hi Bart,

Thank you for the excellent input on the problem. I hope your approach
can be adapted to overcome all the difficulties coming from the
context sensitivity of the file format I have to deal with. For
example, the title text can be any character sequence, leading to a
definition of your WORD token that I fear might clash with patterns
needed to pick out identifiers in, e.g., algebraic expressions later
in the file. Moreover, the whitespace in the title text is actually
significant. If the title text is "foo$3        bar__!" (without the
quotes), that is exactly what the user expects to see in the program
that reads the file. In other places, whitespace acts as a list
separator, and in some places it should just be ignored. With your
approach, wouldn't that mean that I have to include the whitespace in
all relevant parser rules, even when it should be ignored?

As an alternative, I am considering using a JFlex lexer, which can
easily handle lexer state, coupled with an ANTLR parser and tree
parser. I have almost figured out how to do that, but to really get it
flying, it would be great to be able to run the ANTLRWorks debugger on
the resulting lexer-parser combination. I have seen some posts saying
that this is possible, but not how to do it. If I don't figure it out
myself, I might post a separate question regarding that.
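
To illustrate what I mean by lexer state, here is a minimal hand-written
sketch in plain Java. It is not JFlex and not the real file format: the
Mode names and the token labels (TITLE_TEXT, ITEM) are made up for
illustration, and it works line by line, which a real lexer would not.
It only shows that the same whitespace can be kept verbatim in the title
and treated as a separator inside a section, depending on the state.

```java
import java.util.ArrayList;
import java.util.List;

public class ModalLexerSketch {
    // Hypothetical lexer states, loosely mirroring the parts of the file.
    enum Mode { TOP, TITLE, SECTION }

    public static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Mode mode = Mode.TOP;
        // Split on any newline convention; -1 keeps trailing empty lines.
        for (String line : input.split("\r\n|\r|\n", -1)) {
            switch (mode) {
                case TITLE:
                    // Whitespace is significant here: keep the line verbatim.
                    tokens.add("TITLE_TEXT(" + line + ")");
                    mode = Mode.TOP;
                    break;
                case SECTION:
                    if (line.trim().equals("END")) {
                        tokens.add("END");
                        mode = Mode.TOP;
                    } else {
                        // Whitespace is only a list separator here.
                        for (String word : line.trim().split("[ \t]+")) {
                            if (!word.isEmpty()) {
                                tokens.add("ITEM(" + word + ")");
                            }
                        }
                    }
                    break;
                default: // TOP: between sections
                    String t = line.trim();
                    if (t.equals("TITLE")) {
                        tokens.add("TITLE");
                        mode = Mode.TITLE;
                    } else if (t.matches("SECTION[0-9]+")) {
                        tokens.add(t);
                        mode = Mode.SECTION;
                    }
                    // Blank rows (and anything else) are ignored in this sketch.
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String src = "TITLE\nfoo$3        bar__!\n\nSECTION1\n a   b\n c\nEND\n";
        for (String tok : tokenize(src)) {
            System.out.println(tok);
        }
    }
}
```

With the title line above, the TITLE_TEXT token carries the eight spaces
unchanged, while " a   b" in the section yields two separate items. In
JFlex the same idea would be expressed with lexical states instead of the
switch statement.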

Best Regards,
Jonas

On Fri, Jun 17, 2011 at 8:37 PM, Bart Kiers <bkiers at gmail.com> wrote:
> Hi Jonas,
> I would not put so much responsibility inside the lexer. This is really the
> task of the parser.
> How about something like this:
>
> grammar test;
> options {
>   output=AST;
> }
> tokens {
>   FILE;
>   SECTIONS;
>   LINE;
> }
> parse
>   :  title (section NL)+ EOF -> ^(FILE title ^(SECTIONS section+))
>   ;
> title
>   :  TITLE NL (anyWord+ NL)+ NL -> ^(TITLE anyWord+)
>   ;
> section
>   :  SECTION NL (anyWordExceptEnd+ NL)+ END NL -> ^(SECTION anyWordExceptEnd+)
>   ;
>
> anyWordExceptEnd
>   :  WORD
>   |  SECTION
>   |  TITLE
>   ;
> anyWord
>   :  anyWordExceptEnd
>   |  END
>   ;
>
> SECTION
>   :  'SECTION' '0'..'9'+
>   ;
> END
>   :  'END'
>   ;
> TITLE
>   :  'TITLE'
>   ;
> WORD
>   :  ('a'..'z' | 'A'..'Z')+
>   ;
>
> NL
>   :  '\r'? '\n'
>   |  '\r'
>   ;
>
> SPACE
>   :  (' ' | '\t') {$channel=HIDDEN;}
>   ;
>
> A small test class:
>
> import org.antlr.runtime.*;
> import org.antlr.runtime.tree.*;
> import org.antlr.stringtemplate.*;
> public class Main {
>   public static void main(String[] args) throws Exception {
>     String source =
>         "TITLE            \n" +
>         "some             \n" +
>         "title            \n" +
>         "text             \n" +
>         "                 \n" +
>         "SECTION1         \n" +
>         " a b             \n" +
>         " c               \n" +
>         "END              \n" +
>         "                 \n" +
>         "SECTION2         \n" +
>         "  SECTION2 text  \n" +
>         "END              \n" +
>         "                 \n" +
>         "SECTION3         \n" +
>         "  more text      \n" +
>         "END              \n" +
>         "\n";
>     testLexer lexer = new testLexer(new ANTLRStringStream(source));
>     testParser parser = new testParser(new CommonTokenStream(lexer));
>     CommonTree tree = (CommonTree)parser.parse().getTree();
>     DOTTreeGenerator gen = new DOTTreeGenerator();
>     StringTemplate st = gen.toDOT(tree);
>     System.out.println(st);
>   }
> }
>
> will produce the AST attached to this message.
> Regards,
> Bart.
>
>
> On Fri, Jun 17, 2011 at 2:15 PM, Jonas <jonas.hagmar at gmail.com> wrote:
>>
>> Hi,
>>
>> I'm developing a parser for a file format where context is very
>> important. I'm looking to
>> 1) understand why my ANTLR parser gets into infinite loops
>> 2) find out if there is any better way to implement context
>> sensitivity than what I am doing with semantic predicates.
>>
>> A typical beginning of a file looks like this:
>> TITLE
>> some title text
>>
>> SECTION1
>>  a=b*c
>> END
>>
>> SECTION2
>> ...
>>
>> SECTION3
>> ...
>>
>> The syntax differs from section to section; the 'TITLE' section is
>> terminated by the newline after the title text line, while other
>> sections can e.g. use single quote string literals and be terminated
>> by a keyword like 'END'. Here is a sample grammar that gets into an
>> infinite loop:
>>
>> grammar test;
>>
>> options {
>>  output=AST;
>> }
>>
>> @lexer::members {
>>  static final int STATE_AT_BEGINNING = 0;
>>  static final int STATE_IN_TITLE = 1;
>>  static final int STATE_AFTER_TITLE = 2;
>>  int lexerState = STATE_AT_BEGINNING;
>> }
>>
>> file    :       title;
>>
>> title   :       BEGIN_TITLE TITLE_TEXT END_TITLE;
>>
>> BEGIN_TITLE
>>        : {(lexerState == STATE_AT_BEGINNING)}? 'TITLE' WS_NL {lexerState=STATE_IN_TITLE;}
>>        ;
>>
>> TITLE_TEXT
>>        : {lexerState == STATE_IN_TITLE}? TEXT
>>        ;
>>
>> END_TITLE
>>        : {lexerState == STATE_IN_TITLE}? NL {lexerState=STATE_AFTER_TITLE;}
>>        ;
>>
>> BLANK_ROW
>>        : {!(lexerState == STATE_IN_TITLE)}? WS_NL
>>        ;
>>
>> REMARK  : {!(lexerState == STATE_IN_TITLE)}? 'REMA' .* NL
>>        ;
>>
>> fragment
>> WS_NL   :       (' ' | '\t')* NL;
>>
>> fragment
>> NL      :       '\r'? '\n';
>>
>> fragment
>> TEXT    :       (~('\r' | '\n'))*;
>>
>> Best Regards,
>> Jonas
>>