[antlr-interest] newbie lookahead question

Lance Gutteridge lance at thinkingworks.com
Fri Apr 21 17:29:10 PDT 2006


Thanks for the help. It got me going.

The issue with defining the literals in the parser is that ANTLR then 
sets up the token type constants as prefixed by LITERAL_
e.g.LITERAL_office
Nothing wrong with this but I wanted the token to be OFFICE.


I found the solution to my problem by putting a tokens section in the lexer.

tokens {
     OF = "of";
     OFF= "off";
     OFFICE= "office";
}

This allows you to set k=1 with no problem.

This seems to solve my problem. Would this work for the orignal question 
regarding CALIBRATION_METHOD etc.?


Lance

Mike Quilleash wrote:

>Usually the way I handle keywords is to define them in the parser rather
>than the lexer.
>
>For example, a snippet from a lexer I'm building right now
>
>ID: (LETTER | '_') (LETTER | DIGIT | '_')*;
>
>// internal tokens only
>protected
>LETTER: 'a'..'z';
>
>protected
>DIGIT: '0'..'9';
>
>Which recognises basic identifiers that may start with a letter or a _
>followed by zero or more letters/digits/_
>
>I also have the following options defined in my lexer..
>
>Options
>{
>	caseSensitive = false;
>	caseSensitiveLiterals = false;
>	testLiterals = true;
>}
>
>
>Then in the parser use literal keywords
>
>MyParseRule:
>   "on"
> | "off"
> | "office"
> ;
>
>And so on. The lexer will recognise full word tokens and pass them to
>the parser which will then do the keyword matching for you.  No need for
>elaborate look ahead rules. I believe that with the testLiterals set to
>true literal strings in the parser are automatically added as token
>types to the lexer and a literal lookup map is built.  The lexer will
>then test each token it reads from the input stream against this literal
>map and if it finds a matching entry return that token type instead of
>ID.
>
>HTH.
>
>-----Original Message-----
>From: antlr-interest-bounces at antlr.org
>[mailto:antlr-interest-bounces at antlr.org] On Behalf Of Lance Gutteridge
>Sent: 21 April 2006 20:41
>Cc: antlr-interest at antlr.org
>Subject: Re: [antlr-interest] newbie lookahead question
>
>Hi,
>I'm a user of ANTLR but I'm nowhere close to being a real pro with it. 
>I've just learnt what I had to implement my script language. So please
>excuse me if this is a dumb question.
>
>I saw the question regarding lookahead and it is something I have been
>wrestling with as well.
>
>I have a large number of keywords in my language.
>
>For purpose of example here are five:
>
>ON:         "on";
>OF:          "of";
>OFF:        "off";
>OFFICE:  "office";
>ACTIVATE:         "activate";
>ACTIVATED:      "activated";
>
>To disambiguate between ACTIVATE and ACTIVATED requires k = 9.
>
>That seems inefficient although it works.
>
>On the other hand Martin's solution of matching the leading part of the
>word and then using $setType seems difficult. If I want k=1, I would
>need to have a rule for any words that share a leading letter (e.g. ON
>and OF  and OFF and OFFICE).
>
>To  handle the ON, OF, OFF, and OFFICE case in the manner Martin
>suggests would be a fairly complicated rule, because it has to say that
>the token is OF is it matches "OF" and then a whitespace, to disambiuate
>it from OFF and OFFICE. Then the same has to be done to decide between
>OFF and OFFICE. (BTW I would be grateful for an example of such a rule
>as I have had trouble constructing one that works for this kind of
>situation).
>
>Is there no way to handle this in general, other than setting k to be as
>long as the longest prefix that two keywords have in common?  I would
>really like a technique that every time I add a new keyword I don't have
>to figure out all the ones it could conflct with and write some kind of
>long statement that descends through the letters and peels off
>ambiguities.
>
>
>Thanks,
>Lance
>
>
>
>Martin Probst wrote:
>
>  
>
>>Hi,
>>
>>you can either increase your lookahead (which is not advisable in this
>>    
>>
>
>  
>
>>case), or rather do it manually:
>>
>>CALIBRATION_THINGY:
>>  "CALIBRATION_" ( "METHOD" { $setType(CAL_METHOD);} |  "HANDLE" { 
>>$setType(CAL_HANDLE);} );
>>
>>This parses the CALIBRATION_ part and then decides what kind of token 
>>type this is later. You'll may want to add "CAL_METHOD" and 
>>"CAL_HANDLE" to the tokens section of your grammar because they aren't
>>    
>>
>
>  
>
>>declared automatically if used like this - you can use that to give 
>>them a proper help message later on.
>>
>>Martin
>>
>>
>>Am 17.04.2006 um 06:30 schrieb Lucien Stals:
>>
>>    
>>
>>>Hi,
>>>
>>>I've been learning ANTLR for about two weeks and want to be able to 
>>>parse (then transform into XML) an input file in a specific markup 
>>>language (ASAP2). I have not worked with parsers before and I feel 
>>>like I'm in a little over my head. It's sink or swim time for me.
>>>
>>>I have some basic stuff working, but I'm getting lots of warnings 
>>>about ambiguity.
>>>
>>>Part of a sample input file might look like:
>>>
>>>...
>>>/begin CALIBRATION_METHOD
>>>    "Slewing"
>>>    1
>>>    /begin CALIBRATION_HANDLE
>>>        0x1ffbf8
>>>        0x400
>>>        0
>>>        AllSlews
>>>    /end CALIBRATION_HANDLE
>>>/end CALIBRATION_METHOD
>>>...
>>>
>>>
>>>My question is about look ahead.
>>>In my parser, I have rules like:
>>>
>>>
>>>calibrationMethod:BEGIN CAL_METH
>>>            (calibrationHandle)*
>>>            END CAL_METH;
>>>           
>>>calibrationHandle:BEGIN CAL_HAND
>>>            END CAL_HAND;
>>>
>>>Where my lexer rules are:
>>>
>>>protected
>>>SLASH        :'/';
>>>
>>>BEGIN        :SLASH "begin";
>>>
>>>END        :SLASH "end";
>>>
>>>CAL_METH    :"CALIBRATION_METHOD";
>>>
>>>CAL_HAND    :"CALIBRATION_HANDLE";
>>>
>>>
>>>(I'm just dealing with the tag structure first. Parsing the actual 
>>>data is my next step. I have filter=true for now so I can ignore what
>>>      
>>>
>
>  
>
>>>I can't parse yet)
>>>
>>>And I'm getting ambiguity warnings. Should I set my lookahead to 
>>>something silly like 13 just to look past "CALIBRATION_"? (I read 
>>>that bigger lookaheads are performance killers) Or is there a smarter
>>>      
>>>
>
>  
>
>>>way to do this? Should I use predicates like:
>>>
>>>calibrationMethod:BEGIN CAL_METH {this.inCalMeth=true;}
>>>            (calibrationHandle)*
>>>            END CAL_METH {this.inCalMeth=false;};
>>>           
>>>calibrationHandle:{this.inCalMeth}?
>>>        BEGIN CAL_HAND
>>>        END CAL_HAND;
>>>
>>>Perhaps I'm completely off base. If it looks like I'm really  missing
>>>      
>>>
>
>  
>
>>>something, you might be right. Feel free to let me know.
>>>
>>>This is only one of the problems I'm having, but I'll just keep it to
>>>      
>>>
>
>  
>
>>>one question per post ;)
>>>
>>>BTW, if anyone is aware of a grammar that is similar which I can  get
>>>      
>>>
>
>  
>
>>>inspiration from, can you let me know?
>>>
>>>Thanks,
>>>
>>>Lucien.
>>>
>>>      
>>>
>>    
>>
>
>  
>


More information about the antlr-interest mailing list