[antlr-interest] Lexer Rule Ordering, how to obtain a default token rule??

Tue Jun 20 06:34:59 PDT 2006

Hi John and thanks for the input.

Yes, you are correct: 4 tokens including WS. I was not very clear, but you figured it out. I was not aware of the ANTLR option testLiterals; it does the job very well for tokens like "AND", "NOT", "OR" and others, and is much less complicated than my original lexer.

However, I have another token, called lexical proximity, that is represented by the following:

N_PROXIMITY : "/"! INT 
{... rule to match the rest in case its a string i.e.;

protected
DIGIT	: '0'..'9';

protected
INT     : ( options { greedy=true; } : DIGIT)+;

The priority is: a trailing * (PREFIXED_STRING) first, then N_PROXIMITY, then STRING.

Here are some examples:

/12 -> N_PROXIMITY(12)
/12* -> PREFIXED_STRING("/12")
/12*google -> STRING("/12*google")
A * A -> STRING("A") STRING("*") STRING("A")   // WS implicitly ignored here

I will again get a clash between STRING and N_PROXIMITY, but if I could simply tell ANTLR to try the STRING rule last, since it has the lowest priority, that would be fine.
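
To make the intended priority concrete, here is a rough sketch in plain Java (not an ANTLR grammar) that classifies one whitespace-separated word at a time. The class and method names are made up for illustration, and two points are assumptions that the examples above do not settle: a word counts as PREFIXED_STRING only if its single '*' is the last character and something precedes it, and anything else that is not "/" followed by digits falls through to STRING.

```java
import java.util.ArrayList;
import java.util.List;

public class ProximityPrioritySketch {

    // Classify one whitespace-delimited word using the stated priority:
    // trailing '*' first, then N_PROXIMITY, then STRING as the fallback.
    static String classify(String word) {
        int firstStar = word.indexOf('*');
        // PREFIXED_STRING: the only '*' is the last character and the prefix is non-empty.
        if (word.length() > 1 && firstStar == word.length() - 1) {
            return "PREFIXED_STRING(\"" + word.substring(0, firstStar) + "\")";
        }
        // N_PROXIMITY: '/' followed by one or more digits and nothing else.
        if (word.length() > 1 && word.charAt(0) == '/' && allDigits(word.substring(1))) {
            return "N_PROXIMITY(" + word.substring(1) + ")";
        }
        // Lowest priority: everything else is a plain STRING.
        return "STRING(\"" + word + "\")";
    }

    // True if every character is an ASCII digit (callers pass a non-empty string).
    static boolean allDigits(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < '0' || c > '9') {
                return false;
            }
        }
        return true;
    }

    // Split the input on whitespace (WS is ignored) and classify each word.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        String trimmed = input.trim();
        if (trimmed.isEmpty()) {
            return tokens;
        }
        for (String word : trimmed.split("\\s+")) {
            tokens.add(classify(word));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String[] samples = { "/12", "/12*", "/12*google", "A * A" };
        for (String sample : samples) {
            System.out.println(sample + " -> " + tokenize(sample));
        }
    }
}
```

The checks run in priority order, so STRING never has to be described positively; it is simply whatever the earlier checks reject. That is the "default rule at the end" behaviour being asked for, done here by hand rather than by the lexer generator.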

 ----------------------------------------
> To: lachinois at hotmail.com
> CC: antlr-interest at antlr.org
> Subject: Re: [antlr-interest] Lexer Rule Ordering,	how to obtain a default token rule??
> From: jbb at acm.org
> Date: Mon, 19 Jun 2006 18:28:13 -0400
> 
> on Mon, 19 Jun 2006 16:32:36, Daniel Shane asked:
> >Hi!
> 
> Greetings!
> 
> >I'm writing a lexer for a new Lucene query parser, and I thought of giving
> >ANTLR a try with my project. However, I'm faced with a problem I can't seem to
> >resolve...
> >
> >To make the problem simple, imagine that you have 4 types of tokens :
> >
> >  a) AND (matches the string "AND")
> >  b) PREFIXED_STRING (matches any string ending with *, i.e. google*)
> >  c) STRING (anything that is separated by WS and is not one of the above)
> >
> >...other info, including a complex trial lexer, snipped...
> 
> (the 4 tokens are AND, STRING, PREFIXED_STRING, and WS; where WS is to be
> ignored, right?)
> 
> I do not think that Antlr has the concept of a default token.
> 
> However, in this case, your reserved word - "AND" - is matched by your general
> pattern for STRING; so you are good to go for the use of the testLiterals
> option. 
> 
> Well, maybe testLiterals can be thought of as a default token rule, but with a
> twist: first match the general string (or identifier) pattern and then
> see if that result should be specialized into one of the reserved words,
> rather than trying all the special-case reserved words first and then
> supplying a default as the result when they all fail.
> 
> Anyway, does this Lexer do what you need?
> 
> -------------------------
> class LuceneLexer extends Lexer;
> 
> tokens {
>     AND = "AND";
>     STRING;
>     PREFIXED_STRING;
> }
> 
> STRING options{ testLiterals=true; } :
>     ( ~( '*' | ' ' | '\t' | '\n' | '\r' ) )+
>     ( '*' { $setType(PREFIXED_STRING); text.setLength(text.length() - 1); } )?
>     ;
> 
> WS  : ( ' ' | ('\t' { tab(); }) ) { $setType(Token.SKIP); } ;
> EOL : ( '\r' ( '\n' )? | '\n' ) { newline(); $setType(Token.SKIP); } ;
> 
> -------------------------
> 
> Note: you did not say how the input strings "a * b" or "c*d" should be
> handled, so the above Lexer probably does not do the Right Thing on those
> inputs.
> 
> Hope this helps...
>    -jbb
