[antlr-interest] Lexer Rule Ordering, how to obtain a default token rule??

Daniel Shane lachinois at hotmail.com
Mon Jun 19 13:32:36 PDT 2006


Hi!

I'm writing a lexer for a new Lucene query parser, and I thought I'd give ANTLR a try for this project. However, I've run into a problem I can't seem to resolve...

To keep the problem simple, imagine that you have 3 types of tokens:

  a) AND (matches the string "AND")
  b) PREFIXED_STRING (matches any string ending with *, e.g. google*)
  c) STRING (anything that is separated by WS and is not one of the above)

The priority is *, then AND, then STRING. Here are a few examples:

google* -> PREFIXED_STRING(google)
AND* -> PREFIXED_STRING(AND)
google AND gogle -> STRING("google") AND STRING("gogle")
alpha beta -> STRING("alpha") STRING("beta")
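
The priority described above can be sketched in plain Java. This is only an illustration of the expected result for each whitespace-separated word, not ANTLR-generated code and not a fix for the grammar; the class and method names (QueryTokenSketch, classify, tokenize) are hypothetical and exist only for this example.

```java
import java.util.ArrayList;
import java.util.List;

public class QueryTokenSketch {
    // Classify a single whitespace-delimited word.
    // Priority: trailing '*' first, then the keyword AND, then plain STRING.
    static String classify(String word) {
        if (word.endsWith("*")) {
            // Drop the trailing '*' from the token text, as the grammar does.
            return "PREFIXED_STRING(" + word.substring(0, word.length() - 1) + ")";
        }
        if (word.equals("AND")) {
            return "AND";
        }
        // Default case: anything else is a STRING.
        return "STRING(" + word + ")";
    }

    // Split the input on whitespace (which is skipped) and classify each word.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        for (String word : input.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                tokens.add(classify(word));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("google AND gogle"));
        System.out.println(tokenize("AND* alpha beta"));
    }
}
```

The point of the sketch is that the default STRING case is only reached after the other two checks have failed, which is the ordering the lexer should follow.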

I'm having an easy time with a) and b) and I have a grammar for something like this at the end of my message.

As you can see, the problem is that I simply want rule c) to be the last rule the lexer tries, because it's the default case.
ANTLR always complains that my grammar is ambiguous, but I really don't care, since I want the offending rule STRING_OR_PREFIXED_STRING to be matched LAST anyway, so I should be able to disregard the ambiguity warning in that case.

Unfortunately, there is no way to predict how ANTLR will order the lexer rules. In my real grammar, it inserts the offending rule (STRING_OR_PREFIXED_STRING) right in the middle of AND, OR, NOT, etc., so my lexer works perfectly for all the keywords above it, but all the keywords below it get recognized as STRING.

Is there any way I can make this grammar work? The only solution I could come up with is to have one huge predicate and make all my rules protected; that way I'm sure of the ordering... but it's a lexer, so there must be a way to make this work?


-----------------------------

tokens {
   STRING;
   PREFIXED_STRING;
}

// Match the AND token. If something follows AND, return
// either STRING or PREFIXED_STRING instead, since it's not a real AND.
AND	: 
	"AND" 
	(t:STRING_OR_PREFIXED_STRING { $setType(t.getType()); })?
	;

//Match a STRING or a PREFIXED_STRING
STRING_OR_PREFIXED_STRING:
	(~('*' | ' ' | '\t' | '\n' | '\r'))+
	(
  		t:POSSIBLE_PREFIXED_STRING { $setType(t.getType()); }
		|
		{ $setType(STRING); }
	)
	|
	t2:POSSIBLE_PREFIXED_STRING { $setType(t2.getType()); }
	;

protected 
POSSIBLE_PREFIXED_STRING:
	STAR
	(
		(
			WS
			| { LA(1) == EOF_CHAR }?
		)=> { $setType(PREFIXED_STRING); text.setLength(text.length() - 1); }
		|
		t:STRING_OR_PREFIXED_STRING { $setType(t.getType()); }
	)
	;

WS	:	(' ' | '\t' | '\n' | '\r')+ { $setType(Token.SKIP); }
	;







