[antlr-interest] Lexing XQuery in antlr 3

Jim Idle jimi at temporal-wave.com
Thu Sep 24 09:49:59 PDT 2009


On 09/24/2009 09:12 AM, Josh Spiegel wrote:
> Hi,
>
> I am trying to migrate our XQuery lexer from antlr 2 to antlr 3.
>
> The match for a given token depends on the text before or after it.  
> For example, in a certain state the string "declare" should be a 
> keyword token if it is followed by "namespace" and otherwise it should 
> be a QName token.  We successfully handled this in antlr 2 using 
> syntactic predicates.  Eg:
>
> Keywords :
>       ('ancestor-or-self' (C|S1)* '::')=> 'ancestor-or-self' {
>         $type = ANCESTOR_OR_SELF_AXIS;
>       }
>     | ...
>     | ('declare' (C|S1)+ 'namespace' Del)=> 'declare' {
>         $type = DECLARE;
>       }
>     |
>     ....
>     | QName {
>          $type = QNAME;
>     }
>     ;
>
>
> Note: C and S1 are fragments that match comments and whitespace 
> respectively.  We also intermix some gated semantic predicates to 
> disable certain keywords depending on the state of the lexer (a state 
> that we maintain).  I have omitted that code for brevity.
>
> The rule is pretty long as there are many keywords in XQuery.  
> Unfortunately, in antlr3 the method specialStateTransition associated 
> with this Keywords rule exceeds the 64K limit and I get the Java "code 
> too large" error.  I have looked at composite grammars and searched 
> many of the "code too large" postings.  Is there a way to break up 
> this kind of rule?
Move this into the parser rather than the lexer, and create an id rule 
that accepts the keywords as identifiers.

However, I think that the complexity of your grammar would be greatly 
reduced if you placed the logic in the action of a single identifier 
rule and did the keyword check yourself with a hash table/map:

QName: ('a'..'z'|'_')+ { $type = lookup($text); } ;

Then create a map of the keywords and, in your lookup, manually look 
ahead with LA() for the indicators that tell you whether this is a 
keyword or not.
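
A minimal sketch of what such a lookup might look like follows. All of 
the names in it (KeywordLookup, Rule, the token type constants) are made 
up for illustration, and it scans a plain String from a given offset; in 
a real ANTLR 3 lexer you would read the upcoming characters with 
input.LA(i) instead. It only skips whitespace, so comment skipping and 
the trailing delimiter check from the original rule are left out.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: every name here is hypothetical, not ANTLR-generated code.
final class KeywordLookup {
    // Stand-ins for the token types the generated lexer would define.
    static final int QNAME = 1;
    static final int DECLARE = 2;
    static final int ANCESTOR_OR_SELF_AXIS = 3;

    // Describes what must follow a candidate word for it to be a keyword.
    static final class Rule {
        final int type;               // token type to assign on a match
        final String next;            // text that must follow the word
        final boolean needsSeparator; // true if whitespace is required in between

        Rule(int type, String next, boolean needsSeparator) {
            this.type = type;
            this.next = next;
            this.needsSeparator = needsSeparator;
        }
    }

    private static final Map<String, Rule> KEYWORDS = new HashMap<String, Rule>();
    static {
        KEYWORDS.put("declare", new Rule(DECLARE, "namespace", true));
        KEYWORDS.put("ancestor-or-self", new Rule(ANCESTOR_OR_SELF_AXIS, "::", false));
    }

    // text:  the word just matched by the identifier rule
    // input: the whole character stream (assumption: available as a String)
    // pos:   index of the first character after the matched word
    static int lookup(String text, String input, int pos) {
        Rule rule = KEYWORDS.get(text);
        if (rule == null) {
            return QNAME; // not a keyword candidate at all
        }
        int i = pos;
        // Skip whitespace; a real lexer would skip comments here too.
        while (i < input.length() && Character.isWhitespace(input.charAt(i))) {
            i++;
        }
        if (rule.needsSeparator && i == pos) {
            return QNAME; // required separator is missing
        }
        return input.startsWith(rule.next, i) ? rule.type : QNAME;
    }
}
```

For example, KeywordLookup.lookup("declare", src, 7) returns DECLARE 
when src is "declare namespace x" and QNAME when src is "declare = 1". 
Each new keyword becomes one map entry rather than another predicate 
alternative, so the generated prediction code should stay small.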

XQuery is yet another language designed by people that don't understand 
languages, I am afraid. Unless there is a real reason this must be done 
in the lexer, transfer it to the parser and I think you will have 
better luck.

Jim

