[antlr-interest] yet more on v4 lexer progress

Sat May 1 14:54:51 PDT 2010

Wow. Got recursive lexer rule invocation working in a few hours, starting from the "no calls allowed" core (it mimics LL(*) analysis, but at runtime).

	@Test public void testRecursiveCall() throws Exception {
		LexerGrammar g = new LexerGrammar(
			"lexer grammar L;\n" +
			"ACTION : '{' (ACTION|.)* '}' ;\n");
		String expecting = "ACTION, EOF";
		checkMatches(g, "{hi}", expecting);
		checkMatches(g, "{{hi}}", expecting);
		checkMatches(g, "{{x}{y}}", expecting);
		checkMatches(g, "{{{{{{x}}}}}}", expecting);
	}

Note how simple the bytecodes are for the grammar:

ACTION : '{' (ACTION | .)* '}' ;

gives:

0000:	split         5
0005:	match8        '{'      // Start of ACTION
0007:	split         14, 31
0014:	split         21, 27
0021:	call          5 // call ACTION
0024:	jmp           28
0027:	wildcard        
0028:	jmp           7
0031:	match8        '}'
0033:	accept        4
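
To make the listing above a bit more concrete, here is a minimal sketch of how such instructions could be interpreted. It is not the actual v4 VM: the Instr class, the address-to-instruction map, and the exhaustive backtracking strategy are all made up for illustration. In particular, it assumes that "accept" behaves like a return whenever a call is still pending, since the listing shows no separate return instruction.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

class LexerVmSketch {
    enum Op { SPLIT, MATCH, CALL, JMP, WILDCARD, ACCEPT }

    // Hypothetical instruction record: opcode, address of the following
    // instruction, and operands (jump targets, a character, or a rule type).
    static final class Instr {
        final Op op; final int next; final int[] args;
        Instr(Op op, int next, int... args) { this.op = op; this.next = next; this.args = args; }
    }

    // The program from the listing, keyed by the addresses shown there.
    static final Map<Integer, Instr> CODE = new HashMap<>();
    static {
        CODE.put(0,  new Instr(Op.SPLIT, 5, 5));
        CODE.put(5,  new Instr(Op.MATCH, 7, '{'));
        CODE.put(7,  new Instr(Op.SPLIT, 14, 14, 31));
        CODE.put(14, new Instr(Op.SPLIT, 21, 21, 27));
        CODE.put(21, new Instr(Op.CALL, 24, 5));
        CODE.put(24, new Instr(Op.JMP, 27, 28));
        CODE.put(27, new Instr(Op.WILDCARD, 28));
        CODE.put(28, new Instr(Op.JMP, 31, 7));
        CODE.put(31, new Instr(Op.MATCH, 33, '}'));
        CODE.put(33, new Instr(Op.ACCEPT, -1, 4));
    }

    // Explores every path from pc and returns the largest input position at
    // which a top-level accept is reached, or -1 if no path accepts.
    static int run(int pc, int pos, Deque<Integer> calls, String in) {
        Instr i = CODE.get(pc);
        switch (i.op) {
            case SPLIT: {
                int best = -1;
                for (int target : i.args) {
                    // Each alternative gets its own copy of the call stack.
                    best = Math.max(best, run(target, pos, new ArrayDeque<>(calls), in));
                }
                return best;
            }
            case MATCH:
                return pos < in.length() && in.charAt(pos) == i.args[0]
                        ? run(i.next, pos + 1, calls, in) : -1;
            case WILDCARD:
                return pos < in.length() ? run(i.next, pos + 1, calls, in) : -1;
            case JMP:
                return run(i.args[0], pos, calls, in);
            case CALL:
                calls.push(i.next);                 // remember where to resume
                return run(i.args[0], pos, calls, in);
            case ACCEPT:
                if (calls.isEmpty()) return pos;    // top-level match
                return run(calls.pop(), pos, calls, in); // assumed: acts as return
            default:
                throw new IllegalStateException("unknown op " + i.op);
        }
    }

    // Length of the longest ACTION match at the start of the input, or -1.
    static int match(String in) {
        return run(0, 0, new ArrayDeque<>(), in);
    }

    public static void main(String[] args) {
        for (String s : new String[] {"{hi}", "{{hi}}", "{{x}{y}}", "{{{{{{x}}}}}}"}) {
            System.out.println(s + " -> " + match(s));
        }
    }
}
```

Trying every path like this is exponential in the worst case, so it only shows what the instructions mean, not how to run them efficiently.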

v4 does what you'd expect now: longest match wins, and when two rules match the same length, the rule listed earlier in the grammar gets priority.
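
As a rough illustration of that policy (not how v4 implements it), the sketch below tries every rule at the current position with java.util.regex, keeps the longest match, and lets the earlier rule win a tie. The IF, ID, and WS rules and the LongestMatchSketch class are hypothetical and chosen only so that a tie actually occurs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LongestMatchSketch {
    // Hypothetical rules in grammar order; LinkedHashMap preserves that order.
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("IF", Pattern.compile("if"));
        RULES.put("ID", Pattern.compile("[a-z]+"));
        RULES.put("WS", Pattern.compile(" "));
    }

    // Returns the token types as a comma-separated string ending in EOF.
    static String tokenize(String input) {
        List<String> types = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestRule = null;
            int bestLen = 0;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // Strictly longer only: on a tie the earlier rule is kept.
                if (m.lookingAt() && m.end() - pos > bestLen) {
                    bestLen = m.end() - pos;
                    bestRule = rule.getKey();
                }
            }
            if (bestRule == null) {
                throw new IllegalArgumentException("no rule matches at index " + pos);
            }
            types.add(bestRule);
            pos += bestLen;
        }
        types.add("EOF");
        return String.join(", ", types);
    }

    public static void main(String[] args) {
        System.out.println(tokenize("if"));   // IF and ID tie at length 2
        System.out.println(tokenize("iffy")); // ID is longer than IF
    }
}
```

With these rules, "if" comes out as IF because both IF and ID match two characters and IF is listed first, while "iffy" comes out as ID because ID matches more input.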

It also handles the case where it must remember all possible matches and rewind to the last good one if it fails further on.  This was highlighted in

http://www.antlr.org/jira/browse/ANTLR-189

Now it works automatically:

	@Test public void testRewindBackToLastGoodMatch_DOT_vs_NUM() throws Exception {
		LexerGrammar g = new LexerGrammar(
			"lexer grammar L;\n" +
			"NUM: '0'..'9'+ ('.' '0'..'9'+)? ;\n"+
			"DOT : '.' ;\n"+
			"WS : ' ' ;\n");
		checkMatches(g, "3.14 .", "NUM, WS, DOT, EOF");
		checkMatches(g, "9", "NUM, EOF");
		checkMatches(g, ".1", "DOT, NUM, EOF");
		checkMatches(g, "1.", "NUM, DOT, EOF");
	}

Here, "1." starts NUM and enters the ('.' '0'..'9'+)? subrule due to the '.' after '1'.  Oops, no digit after '.'.  Rewind to the spot where the input looked like an integer, "1"; the next match then sees '.' and yields DOT. Cool.
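
The rewind idea can be sketched by hand for this one grammar. The code below is not the v4 VM, just a hard-coded scanner for the NUM, DOT, and WS rules from the test; the RewindSketch class and its method names are made up. The point is the lastGood variable: it records the last position at which NUM could legally end, and the scanner falls back to it when the optional fraction turns out to have no digits.

```java
import java.util.ArrayList;
import java.util.List;

class RewindSketch {
    static boolean isDigit(String s, int i) {
        return i < s.length() && s.charAt(i) >= '0' && s.charAt(i) <= '9';
    }

    // Returns the end index of NUM starting at 'start', or -1 if NUM does not match.
    static int matchNum(String s, int start) {
        int i = start;
        while (isDigit(s, i)) i++;
        if (i == start) return -1;        // NUM needs at least one digit
        int lastGood = i;                 // the integer part alone is a valid NUM
        if (i < s.length() && s.charAt(i) == '.') {
            int j = i + 1;                // tentatively enter the optional fraction
            while (isDigit(s, j)) j++;
            if (j > i + 1) lastGood = j;  // fraction had digits: extend the match
            // otherwise keep lastGood, which rewinds to just before the '.'
        }
        return lastGood;
    }

    // Returns the token types as a comma-separated string ending in EOF.
    static String tokenize(String s) {
        List<String> types = new ArrayList<>();
        int pos = 0;
        while (pos < s.length()) {
            int numEnd = matchNum(s, pos);
            if (numEnd > pos) {
                types.add("NUM");
                pos = numEnd;
            } else if (s.charAt(pos) == '.') {
                types.add("DOT");
                pos++;
            } else if (s.charAt(pos) == ' ') {
                types.add("WS");
                pos++;
            } else {
                throw new IllegalArgumentException("no rule matches at index " + pos);
            }
        }
        types.add("EOF");
        return String.join(", ", types);
    }

    public static void main(String[] args) {
        for (String s : new String[] {"3.14 .", "9", ".1", "1."}) {
            System.out.println("\"" + s + "\" -> " + tokenize(s));
        }
    }
}
```

A general VM has to track this per rule and per alternative rather than in one local variable, which is presumably where the extra runtime memory goes.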

The Java impl of this more complicated VM is still only about 1200 bytes (of Java bytecode).  It can also use lots more memory at runtime than the "no rule invocation" version.

Woohoo!
Ter