[antlr-interest] Context-sensitive lexer
Bart Kiers
bkiers at gmail.com
Fri Jun 17 11:37:03 PDT 2011
Hi Jonas,
I would not put so much responsibility inside the lexer. Deciding what a
token means in its context is really the task of the parser.
How about something like this:
grammar test;

options {
  output=AST;
}

tokens {
  FILE;
  SECTIONS;
  LINE;
}

parse
  :  title (section NL)+ EOF -> ^(FILE title ^(SECTIONS section+))
  ;

title
  :  TITLE NL (anyWord+ NL)+ NL -> ^(TITLE anyWord+)
  ;

section
  :  SECTION NL (anyWordExceptEnd+ NL)+ END NL -> ^(SECTION anyWordExceptEnd+)
  ;

anyWordExceptEnd
  :  WORD
  |  SECTION
  |  TITLE
  ;

anyWord
  :  anyWordExceptEnd
  |  END
  ;

SECTION
  :  'SECTION' '0'..'9'+
  ;

END
  :  'END'
  ;

TITLE
  :  'TITLE'
  ;

WORD
  :  ('a'..'z' | 'A'..'Z')+
  ;

NL
  :  '\r'? '\n'
  |  '\r'
  ;

SPACE
  :  (' ' | '\t') {$channel=HIDDEN;}
  ;

A small test class:

import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;

public class Main {
  public static void main(String[] args) throws Exception {
    String source =
        "TITLE \n" +
        "some \n" +
        "title \n" +
        "text \n" +
        " \n" +
        "SECTION1 \n" +
        " a b \n" +
        " c \n" +
        "END \n" +
        " \n" +
        "SECTION2 \n" +
        " SECTION2 text \n" +
        "END \n" +
        " \n" +
        "SECTION3 \n" +
        " more text \n" +
        "END \n" +
        "\n";
    // Lex and parse the sample input with the classes generated from test.g.
    testLexer lexer = new testLexer(new ANTLRStringStream(source));
    testParser parser = new testParser(new CommonTokenStream(lexer));
    CommonTree tree = (CommonTree)parser.parse().getTree();
    // Print the AST in Graphviz DOT format.
    DOTTreeGenerator gen = new DOTTreeGenerator();
    StringTemplate st = gen.toDOT(tree);
    System.out.println(st);
  }
}

This prints the DOT source of the AST attached to this message.
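
If it helps to see the same division of labour without ANTLR, here is a
minimal hand-written Java sketch. It is not generated code and not how ANTLR
works internally. The class SectionSketch and its methods tokenize and
parseSource are hypothetical names made up for this illustration. The
tokenizer is context-free: END is the same token everywhere. Only the parser
decides whether END is an ordinary word (inside the title) or a terminator
(inside a section), which mirrors the anyWord / anyWordExceptEnd rules in the
grammar above. As a simplification, tokens are plain strings and are
recognised by comparing their text rather than a token type.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical hand-written sketch; none of these names come from ANTLR.
final class SectionSketch {
    private static final String NL = "\n"; // the line-break token
    private final List<String> tokens;
    private int pos = 0;

    private SectionSketch(List<String> tokens) { this.tokens = tokens; }

    // Context-free tokenizer: spaces and tabs are dropped, every line break
    // becomes NL, everything else (END, TITLE, SECTION1, ...) is kept as is.
    static List<String> tokenize(String source) {
        List<String> tokens = new ArrayList<>();
        for (String line : source.split("\r?\n|\r", -1)) {
            for (String word : line.trim().split("[ \t]+")) {
                if (!word.isEmpty()) tokens.add(word);
            }
            tokens.add(NL);
        }
        tokens.remove(tokens.size() - 1); // split() yields one piece more than there are line breaks
        return tokens;
    }

    // Returns the title words first, then one list per section (name, then body words).
    static List<List<String>> parseSource(String source) {
        return new SectionSketch(tokenize(source)).parse();
    }

    private String peek() { return pos < tokens.size() ? tokens.get(pos) : null; } // null means EOF

    private void expect(String expected) {
        if (!expected.equals(peek())) {
            throw new IllegalStateException("unexpected token at index " + pos);
        }
        pos++;
    }

    // The only context-dependent decision, and it lives in the parser.
    private static boolean isWord(String tok, boolean endIsWord) {
        return tok != null && !tok.equals(NL) && (endIsWord || !tok.equals("END"));
    }

    // (word+ NL)+ : one or more non-empty lines.
    private List<String> lines(boolean endIsWord) {
        List<String> words = new ArrayList<>();
        do {
            int start = words.size();
            while (isWord(peek(), endIsWord)) words.add(tokens.get(pos++));
            if (words.size() == start) throw new IllegalStateException("empty line at index " + pos);
            expect(NL);
        } while (isWord(peek(), endIsWord));
        return words;
    }

    // parse : title (section NL)+ EOF
    private List<List<String>> parse() {
        List<List<String>> result = new ArrayList<>();
        expect("TITLE");
        expect(NL);
        result.add(lines(true));  // title body: END is an ordinary word here
        expect(NL);               // a blank line closes the title
        do {
            String name = peek();
            if (name == null || !name.matches("SECTION[0-9]+")) {
                throw new IllegalStateException("expected a section at index " + pos);
            }
            pos++;
            expect(NL);
            List<String> section = new ArrayList<>();
            section.add(name);
            section.addAll(lines(false)); // section body: END terminates it
            expect("END");
            expect(NL);
            expect(NL);                   // blank line after each section
            result.add(section);
        } while (peek() != null);
        return result;
    }

    public static void main(String[] args) {
        String source = "TITLE\nthe END of time\n\n"
                + "SECTION1\n a b\n c\nEND\n\n"
                + "SECTION2\n SECTION2 text\nEND\n\n";
        System.out.println(parseSource(source));
    }
}
```

With the sample input in main, the END inside the title is kept as a word,
while each END on its own line inside a section closes that section. The lexer
never needed to know which of the two cases it was looking at.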
Regards,
Bart.
On Fri, Jun 17, 2011 at 2:15 PM, Jonas <jonas.hagmar at gmail.com> wrote:
> Hi,
>
> I'm developing a parser for a file format where context is very
> important. I'm looking to
> 1) understand why my ANTLR parser gets into infinite loops
> 2) find out if there is any better way to implement context
> sensitivity than what I am doing with semantic predicates.
>
> A typical beginning of a file looks like this:
> TITLE
> some title text
>
> SECTION1
> a=b*c
> END
>
> SECTION2
> ...
>
> SECTION3
> ...
>
> The syntax differs from section to section; the 'TITLE' section is
> terminated by the newline after the title text line, while other
> sections can e.g. use single quote string literals and be terminated
> by a keyword like 'END'. Here is a sample grammar, that gets into an
> infinite loop:
>
> grammar test;
>
> options {
>   output=AST;
> }
>
> @lexer::members {
>   static final int STATE_AT_BEGINNING = 0;
>   static final int STATE_IN_TITLE = 1;
>   static final int STATE_AFTER_TITLE = 2;
>   int lexerState = STATE_AT_BEGINNING;
> }
>
> file : title;
>
> title : BEGIN_TITLE TITLE_TEXT END_TITLE;
>
> BEGIN_TITLE
>   : {(lexerState == STATE_AT_BEGINNING)}? 'TITLE' WS_NL {lexerState=STATE_IN_TITLE;}
>   ;
>
> TITLE_TEXT
>   : {lexerState == STATE_IN_TITLE}? TEXT
>   ;
>
> END_TITLE
>   : {lexerState == STATE_IN_TITLE}? NL {lexerState=STATE_AFTER_TITLE;}
>   ;
>
> BLANK_ROW
>   : {!(lexerState == STATE_IN_TITLE)}? WS_NL
>   ;
>
> REMARK
>   : {!(lexerState == STATE_IN_TITLE)}? 'REMA' .* NL
>   ;
>
> fragment
> WS_NL : (' ' | '\t')* NL;
>
> fragment
> NL : '\r'? '\n';
>
> fragment
> TEXT : (~('\r' | '\n'))*;
>
> Best Regards,
> Jonas
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ast.png
Type: image/png
Size: 8574 bytes
Desc: not available
Url : http://www.antlr.org/pipermail/antlr-interest/attachments/20110617/0a7ec400/attachment.png