[antlr-interest] yet more on v4 lexer progress
Terence Parr
parrt at cs.usfca.edu
Sat May 1 14:54:51 PDT 2010
Wow. Got recursive lexer rule invocation working in a few hours, starting from the "no calls allowed" core (it mimics LL(*) analysis, but at runtime).
@Test public void testRecursiveCall() throws Exception {
    LexerGrammar g = new LexerGrammar(
        "lexer grammar L;\n" +
        "ACTION : '{' (ACTION|.)* '}' ;\n");
    String expecting = "ACTION, EOF";
    checkMatches(g, "{hi}", expecting);
    checkMatches(g, "{{hi}}", expecting);
    checkMatches(g, "{{x}{y}}", expecting);
    checkMatches(g, "{{{{{{x}}}}}}", expecting);
}
Note how simple the bytecodes are for the grammar:
ACTION : '{' (ACTION | .)* '}' ;
gives:
0000: split 5
0005: match8 '{' // Start of ACTION
0007: split 14, 31
0014: split 21, 27
0021: call 5 // call ACTION
0024: jmp 28
0027: wildcard
0028: jmp 7
0031: match8 '}'
0033: accept 4
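To make the listing concrete, here is a minimal sketch of an interpreter for exactly those ten instructions. It is not the v4 code: the class and method names are made up, it uses naive backtracking rather than whatever the real VM does, and it assumes that "accept" behaves like a return when the instruction was reached through "call" (the listing shows no separate return instruction, so that part is a guess).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not ANTLR code: every name below is invented for illustration.
class TinyLexerVM {
    enum Op { SPLIT, MATCH, CALL, JMP, WILDCARD, ACCEPT }

    // a, b: operands (jump targets or a character); next: address of the following instruction.
    static final class Instr {
        final Op op; final int a, b, next;
        Instr(Op op, int a, int b, int next) { this.op = op; this.a = a; this.b = b; this.next = next; }
    }

    // Immutable return-address stack, so backtracking never has to undo a push.
    static final class Frame {
        final int ret; final Frame parent;
        Frame(int ret, Frame parent) { this.ret = ret; this.parent = parent; }
    }

    // The program from the listing, keyed by address.
    static final Map<Integer, Instr> PROG = new HashMap<>();
    static {
        PROG.put(0,  new Instr(Op.SPLIT, 5, -1, 5));
        PROG.put(5,  new Instr(Op.MATCH, '{', -1, 7));
        PROG.put(7,  new Instr(Op.SPLIT, 14, 31, 14));
        PROG.put(14, new Instr(Op.SPLIT, 21, 27, 21));
        PROG.put(21, new Instr(Op.CALL, 5, -1, 24));
        PROG.put(24, new Instr(Op.JMP, 28, -1, 27));
        PROG.put(27, new Instr(Op.WILDCARD, -1, -1, 28));
        PROG.put(28, new Instr(Op.JMP, 7, -1, 31));
        PROG.put(31, new Instr(Op.MATCH, '}', -1, 33));
        PROG.put(33, new Instr(Op.ACCEPT, 4, -1, -1));
    }

    /** Returns the end offset of the longest match reachable from address pc, or -1 if none. */
    static int run(String in, int pc, int pos, Frame stack) {
        Instr i = PROG.get(pc);
        switch (i.op) {
            case SPLIT: {
                // Try every alternative and keep the longest result.
                int first = run(in, i.a, pos, stack);
                int second = i.b < 0 ? -1 : run(in, i.b, pos, stack);
                return Math.max(first, second);
            }
            case MATCH:
                return pos < in.length() && in.charAt(pos) == i.a ? run(in, i.next, pos + 1, stack) : -1;
            case WILDCARD:
                return pos < in.length() ? run(in, i.next, pos + 1, stack) : -1;
            case JMP:
                return run(in, i.a, pos, stack);
            case CALL:
                return run(in, i.a, pos, new Frame(i.next, stack));
            case ACCEPT:
                // Assumption: inside a call, accept resumes the caller; at top level it ends the token.
                return stack == null ? pos : run(in, stack.ret, pos, stack.parent);
            default:
                throw new IllegalStateException("unknown op " + i.op);
        }
    }

    static int longestMatch(String in) { return run(in, 0, 0, null); }

    public static void main(String[] args) {
        System.out.println(longestMatch("{hi}"));     // 4
        System.out.println(longestMatch("{{x}{y}}")); // 8
        System.out.println(longestMatch("{hi"));      // -1, no closing brace
    }
}
```

Backtracking like this is exponential in the worst case, so it is only meant to show what each instruction does, not how to build a fast lexer.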
v4 does what you'd expect now: longest match with priority given to earlier rules upon match of same length.
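As a rough illustration of that policy (longest match wins, earlier rule wins a tie), here is a small sketch built on java.util.regex instead of the lexer VM. The rule names IF and ID and the helper names are hypothetical, and the regex engine reports its own preferred match rather than a guaranteed longest one, which is good enough for simple patterns like these.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the tie-breaking policy only; this is not how the v4 lexer is built.
class LongestMatchSketch {
    /** Returns the name of the winning rule at pos, or null if no rule matches there. */
    static String pickRule(Map<String, Pattern> rules, String input, int pos) {
        String best = null;
        int bestLen = 0;
        // Rules are visited in definition order.
        for (Map.Entry<String, Pattern> e : rules.entrySet()) {
            Matcher m = e.getValue().matcher(input);
            m.region(pos, input.length());
            if (m.lookingAt()) {
                int len = m.end() - pos;
                // Strictly greater: a later rule of equal length never displaces an earlier one.
                if (len > bestLen) { best = e.getKey(); bestLen = len; }
            }
        }
        return best;
    }

    /** Two made-up rules, in priority order. */
    static Map<String, Pattern> demoRules() {
        Map<String, Pattern> rules = new LinkedHashMap<>();
        rules.put("IF", Pattern.compile("if"));
        rules.put("ID", Pattern.compile("[a-z]+"));
        return rules;
    }

    public static void main(String[] args) {
        System.out.println(pickRule(demoRules(), "if", 0));   // IF: same length, earlier rule wins
        System.out.println(pickRule(demoRules(), "iffy", 0)); // ID: longer match wins
    }
}
```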
v4 also handles the case where it must remember all possible matches and rewind if it fails further on. This was highlighted in
http://www.antlr.org/jira/browse/ANTLR-189
Now it works automatically:
@Test public void testRewindBackToLastGoodMatch_DOT_vs_NUM() throws Exception {
    LexerGrammar g = new LexerGrammar(
        "lexer grammar L;\n" +
        "NUM: '0'..'9'+ ('.' '0'..'9'+)? ;\n"+
        "DOT : '.' ;\n"+
        "WS : ' ' ;\n");
    checkMatches(g, "3.14 .", "NUM, WS, DOT, EOF");
    checkMatches(g, "9", "NUM, EOF");
    checkMatches(g, ".1", "DOT, NUM, EOF");
    checkMatches(g, "1.", "NUM, DOT, EOF");
}
Here, "1." starts NUM and enters the ('.' '0'..'9'+)? subrule because of the '.' after '1'. Oops, no digit after the '.'. So it rewinds to the spot where the input still looked like an integer, "1", and the next match then sees the '.' as DOT. Cool.
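The rewind idea can be sketched by hand for this one grammar. The code below is not the v4 lexer; it is a hand-written scanner with invented names that hard-codes NUM, DOT and WS and keeps a "last good" offset while it speculates past the '.'.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical hand-written scanner for the NUM / DOT / WS grammar; names are invented.
class RewindSketch {
    static boolean isDigit(char c) { return c >= '0' && c <= '9'; }

    /** Length of the longest NUM match at pos, or 0 if NUM does not match there. */
    static int matchNum(String s, int pos) {
        int i = pos;
        while (i < s.length() && isDigit(s.charAt(i))) i++;
        if (i == pos) return 0;          // NUM needs at least one digit
        int lastGood = i;                // the integer part alone is a complete NUM
        if (i < s.length() && s.charAt(i) == '.') {
            int j = i + 1;
            while (j < s.length() && isDigit(s.charAt(j))) j++;
            if (j > i + 1) lastGood = j; // fraction had digits, so accept the longer match
            // otherwise no digit followed the '.', and we fall back to lastGood
        }
        return lastGood - pos;
    }

    /** Returns the token names as a comma-separated string, in the style of checkMatches. */
    static String tokenize(String s) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < s.length()) {
            int n = matchNum(s, pos);
            if (n > 0) { out.add("NUM"); pos += n; }
            else if (s.charAt(pos) == '.') { out.add("DOT"); pos++; }
            else if (s.charAt(pos) == ' ') { out.add("WS"); pos++; }
            else throw new IllegalArgumentException("no rule matches at offset " + pos);
        }
        out.add("EOF");
        return String.join(", ", out);
    }

    public static void main(String[] args) {
        System.out.println(tokenize("3.14 .")); // NUM, WS, DOT, EOF
        System.out.println(tokenize("1."));     // NUM, DOT, EOF
    }
}
```

A general lexer cannot hard-code the fallback point like this; it has to record every accept position it passes while running ahead, which is presumably where the extra runtime memory goes.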
The Java impl of this more complicated VM is still only about 1200 bytes (in Java bytecodes). It can also use lots more memory at runtime than the "no rule invocation" version.
Woohoo!
Ter