[antlr-interest] Lexer Rule Ordering, how to obtain a default token rule??

Tue Jun 20 06:44:31 PDT 2006

Ahhh... let me know if I'm wrong, but is the only solution to use testLiterals for all fixed string literals and, for the few cases where the keyword is not fixed (like N_PROXIMITY), to use a predicate like:

STRING : 
     (N_PROXIMITY) => ... setType(N_PROXIMITY) ...
   | ( ~( '*' | ' ' | '\t' | '\n' | '\r' ) )+ ... previous STRING definition...

and mark N_PROXIMITY as protected?
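
A rough way to check the intended priority (*, then N_PROXIMITY, then STRING) is to model it outside ANTLR. The following is only a sketch in plain Python, not lexer code: `classify`, `tokenize` and `N_PROXIMITY_RE` are made-up names, and it assumes the input is simply split on whitespace. The cases it is meant to reproduce are the four examples in the quoted message below.

```python
import re

# Hypothetical model of the intended priority: '*' first, then N_PROXIMITY,
# then STRING. Assumes tokens are separated by whitespace only.
N_PROXIMITY_RE = re.compile(r"/([0-9]+)")


def classify(chunk):
    """Return (token_type, text) for one whitespace-separated chunk."""
    # Highest priority: a trailing '*' marks a prefixed string and is
    # dropped from the token text. A lone "*" does not count.
    if len(chunk) > 1 and chunk.endswith("*"):
        return ("PREFIXED_STRING", chunk[:-1])
    # Next: '/' followed only by digits. The '/' is dropped from the text,
    # mirroring the "/"! in the N_PROXIMITY rule.
    match = N_PROXIMITY_RE.fullmatch(chunk)
    if match:
        return ("N_PROXIMITY", match.group(1))
    # Lowest priority: everything else is a plain STRING.
    return ("STRING", chunk)


def tokenize(line):
    """Split on whitespace (which is ignored) and classify each chunk."""
    return [classify(chunk) for chunk in line.split()]
```

With this model, tokenize("A * A") gives three STRING tokens and tokenize("/12*") gives a PREFIXED_STRING whose text is "/12".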

I don't see anything better, so I think this is the best solution. Without testLiterals, it would be pretty ugly, with all the keyword rules embedded in STRING.

Daniel

----------------------------------------
> From: lachinois at hotmail.com
> To: jbb at acm.org
> Subject: RE: Re: [antlr-interest] Lexer Rule Ordering, how to obtain a default token rule??
> Date: Tue, 20 Jun 2006 09:34:59 -0400
> CC: antlr-interest at antlr.org
> 
> Hi John and thanks for the input.
> 
> Yes, you are correct: 4 tokens including WS. I was not very clear, but you figured it out. I was not aware of the ANTLR option testLiterals. It does the job very well for tokens like "AND", "NOT", "OR" and others, and it is much less complicated than my original lexer.
> 
> However, I have another token, called lexical proximity, that is represented by the following:
> 
> N_PROXIMITY : "/"! INT 
> {... rule to match the rest in case its a string i.e.;
> 
> protected
> DIGIT	: '0'..'9';
> 
> protected
> INT     : ( options { greedy=true; } : DIGIT)+;
> 
> The priority is *, then N_PROXIMITY, then STRING. 
> 
> So here are some examples:
> 
> /12 -> N_PROXIMITY(12)
> /12* -> PREFIXED_STRING("/12")
> /12*google -> STRING("/12*google")
> A * A -> STRING("A") STRING("*") STRING("A")   // WS implicitly ignored here
> 
> I will again get a clash between STRING and N_PROXIMITY, but if I could simply tell ANTLR "put this rule at the end, since STRING has the lowest priority", it would be fine.
> 
>  ----------------------------------------
> > To: lachinois at hotmail.com
> > CC: antlr-interest at antlr.org
> > Subject: Re: [antlr-interest] Lexer Rule Ordering, how to obtain a default token rule??
> > From: jbb at acm.org
> > Date: Mon, 19 Jun 2006 18:28:13 -0400
> > 
> > on Mon, 19 Jun 2006 16:32:36, Daniel Shane asked:
> > >Hi!
> > 
> > Greetings!
> > 
> > >I'm writing a lexer for a new Lucene query parser, and I thought of giving
> > >ANTLR a try with my project. However, I'm faced with a problem I can't seem to
> > >resolve...
> > >
> > >To make the problem simple, imagine that you have 4 types of tokens :
> > >
> > >  a) AND (matches the string "AND")
> > >  b) PREFIXED_STRING (matches any string ending with *, i.e. google*)
> > >  c) STRING (anything that is separated by WS and is not one of the above)
> > >
> > >...other info, including a complex trial lexer, snipped...
> > 
> > (the 4 tokens are AND, STRING, PREFIXED_STRING, and WS; where WS is to be
> > ignored, right?)
> > 
> > I do not think that Antlr has the concept of a default token.
> > 
> > However, in this case, your reserved word - "AND" - is matched by your general
> > pattern for STRING; so you are good to go for the use of the testLiterals
> > option. 
> > 
> > Well, maybe testLiterals can be thought of as a default token rule, but with a
> > twist: first match the general string (or identifier) pattern and then
> > see if that result should be specialized into one of the reserved words,
> > rather than trying all the special-case reserved words first and then
> > supplying a default as the result when they all fail.
> > 
> > Anyway, does this Lexer do what you need?
> > 
> > -------------------------
> > class LuceneLexer extends Lexer;
> > 
> > tokens {
> >     AND = "AND";
> >     STRING;
> >     PREFIXED_STRING;
> > }
> > 
> > STRING options{ testLiterals=true; } :
> >     ( ~( '*' | ' ' | '\t' | '\n' | '\r' ) )+
> >     ( '*' { $setType(PREFIXED_STRING); text.setLength(text.length() - 1); } )?
> >     ;
> > 
> > WS  : ( ' ' | ('\t' { tab(); }) ) { $setType(Token.SKIP); } ;
> > EOL : ( '\r' ( '\n' )? | '\n' ) { newline(); $setType(Token.SKIP); } ;
> > 
> > -------------------------
> > 
> > Note: you did not say how the input strings "a * b" or "c*d" should be
> > handled, so the above Lexer probably does not do the Right Thing on those
> > inputs.
> > 
> > Hope this helps...
> >    -jbb
> 