[antlr-interest] newbie lookahead question

Lance Gutteridge lance at thinkingworks.com
Fri Apr 21 18:04:39 PDT 2006


Well maybe not. It seems I was wrong about the tokens section. It 
doesn't specify lexer rules so the tokens aren't detected and put into 
the token stream for the parser. Ah well. It seemed like a good idea at 
the time.

Lance

Lance Gutteridge wrote:

> Thanks for the help. It got me going.
>
> The issue with defining the literals in the parser is that ANTLR then 
> sets up the token type constants as prefixed by LITERAL_
> e.g.LITERAL_office
> Nothing wrong with this but I wanted the token to be OFFICE.
>
>
> I found the solution to my problem by putting a tokens section in the 
> lexer.
>
> tokens {
>     OF = "of";
>     OFF= "off";
>     OFFICE= "office";
> }
>
> This allows you to set k=1 with no problem.
>
> This seems to solve my problem. Would this work for the orignal 
> question regarding CALIBRATION_METHOD etc.?
>
>
> Lance
>
> Mike Quilleash wrote:
>
>> Usually the way I handle keywords is to define them in the parser rather
>> than the lexer.
>>
>> For example, a snippet from a lexer I'm building right now
>>
>> ID: (LETTER | '_') (LETTER | DIGIT | '_')*;
>>
>> // internal tokens only
>> protected
>> LETTER: 'a'..'z';
>>
>> protected
>> DIGIT: '0'..'9';
>>
>> Which recognises basic identifiers that may start with a letter or a _
>> followed by zero or more letters/digits/_
>>
>> I also have the following options defined in my lexer..
>>
>> Options
>> {
>>     caseSensitive = false;
>>     caseSensitiveLiterals = false;
>>     testLiterals = true;
>> }
>>
>>
>> Then in the parser use literal keywords
>>
>> MyParseRule:
>>   "on"
>> | "off"
>> | "office"
>> ;
>>
>> And so on. The lexer will recognise full word tokens and pass them to
>> the parser which will then do the keyword matching for you.  No need for
>> elaborate look ahead rules. I believe that with the testLiterals set to
>> true literal strings in the parser are automatically added as token
>> types to the lexer and a literal lookup map is built.  The lexer will
>> then test each token it reads from the input stream against this literal
>> map and if it finds a matching entry return that token type instead of
>> ID.
>>
>> HTH.
>>
>> -----Original Message-----
>> From: antlr-interest-bounces at antlr.org
>> [mailto:antlr-interest-bounces at antlr.org] On Behalf Of Lance Gutteridge
>> Sent: 21 April 2006 20:41
>> Cc: antlr-interest at antlr.org
>> Subject: Re: [antlr-interest] newbie lookahead question
>>
>> Hi,
>> I'm a user of ANTLR but I'm nowhere close to being a real pro with 
>> it. I've just learnt what I had to implement my script language. So 
>> please
>> excuse me if this is a dumb question.
>>
>> I saw the question regarding lookahead and it is something I have been
>> wrestling with as well.
>>
>> I have a large number of keywords in my language.
>>
>> For purpose of example here are five:
>>
>> ON:         "on";
>> OF:          "of";
>> OFF:        "off";
>> OFFICE:  "office";
>> ACTIVATE:         "activate";
>> ACTIVATED:      "activated";
>>
>> To disambiguate between ACTIVATE and ACTIVATED requires k = 9.
>>
>> That seems inefficient although it works.
>>
>> On the other hand Martin's solution of matching the leading part of the
>> word and then using $setType seems difficult. If I want k=1, I would
>> need to have a rule for any words that share a leading letter (e.g. ON
>> and OF  and OFF and OFFICE).
>>
>> To  handle the ON, OF, OFF, and OFFICE case in the manner Martin
>> suggests would be a fairly complicated rule, because it has to say that
>> the token is OF is it matches "OF" and then a whitespace, to disambiuate
>> it from OFF and OFFICE. Then the same has to be done to decide between
>> OFF and OFFICE. (BTW I would be grateful for an example of such a rule
>> as I have had trouble constructing one that works for this kind of
>> situation).
>>
>> Is there no way to handle this in general, other than setting k to be as
>> long as the longest prefix that two keywords have in common?  I would
>> really like a technique that every time I add a new keyword I don't have
>> to figure out all the ones it could conflct with and write some kind of
>> long statement that descends through the letters and peels off
>> ambiguities.
>>
>>
>> Thanks,
>> Lance
>>
>>
>>
>> Martin Probst wrote:
>>
>>  
>>
>>> Hi,
>>>
>>> you can either increase your lookahead (which is not advisable in this
>>>   
>>
>>
>>  
>>
>>> case), or rather do it manually:
>>>
>>> CALIBRATION_THINGY:
>>>  "CALIBRATION_" ( "METHOD" { $setType(CAL_METHOD);} |  "HANDLE" { 
>>> $setType(CAL_HANDLE);} );
>>>
>>> This parses the CALIBRATION_ part and then decides what kind of 
>>> token type this is later. You'll may want to add "CAL_METHOD" and 
>>> "CAL_HANDLE" to the tokens section of your grammar because they aren't
>>>   
>>
>>
>>  
>>
>>> declared automatically if used like this - you can use that to give 
>>> them a proper help message later on.
>>>
>>> Martin
>>>
>>>
>>> Am 17.04.2006 um 06:30 schrieb Lucien Stals:
>>>
>>>   
>>>
>>>> Hi,
>>>>
>>>> I've been learning ANTLR for about two weeks and want to be able to 
>>>> parse (then transform into XML) an input file in a specific markup 
>>>> language (ASAP2). I have not worked with parsers before and I feel 
>>>> like I'm in a little over my head. It's sink or swim time for me.
>>>>
>>>> I have some basic stuff working, but I'm getting lots of warnings 
>>>> about ambiguity.
>>>>
>>>> Part of a sample input file might look like:
>>>>
>>>> ...
>>>> /begin CALIBRATION_METHOD
>>>>    "Slewing"
>>>>    1
>>>>    /begin CALIBRATION_HANDLE
>>>>        0x1ffbf8
>>>>        0x400
>>>>        0
>>>>        AllSlews
>>>>    /end CALIBRATION_HANDLE
>>>> /end CALIBRATION_METHOD
>>>> ...
>>>>
>>>>
>>>> My question is about look ahead.
>>>> In my parser, I have rules like:
>>>>
>>>>
>>>> calibrationMethod:BEGIN CAL_METH
>>>>            (calibrationHandle)*
>>>>            END CAL_METH;
>>>>           calibrationHandle:BEGIN CAL_HAND
>>>>            END CAL_HAND;
>>>>
>>>> Where my lexer rules are:
>>>>
>>>> protected
>>>> SLASH        :'/';
>>>>
>>>> BEGIN        :SLASH "begin";
>>>>
>>>> END        :SLASH "end";
>>>>
>>>> CAL_METH    :"CALIBRATION_METHOD";
>>>>
>>>> CAL_HAND    :"CALIBRATION_HANDLE";
>>>>
>>>>
>>>> (I'm just dealing with the tag structure first. Parsing the actual 
>>>> data is my next step. I have filter=true for now so I can ignore what
>>>>     
>>>
>>
>>  
>>
>>>> I can't parse yet)
>>>>
>>>> And I'm getting ambiguity warnings. Should I set my lookahead to 
>>>> something silly like 13 just to look past "CALIBRATION_"? (I read 
>>>> that bigger lookaheads are performance killers) Or is there a smarter
>>>>     
>>>
>>
>>  
>>
>>>> way to do this? Should I use predicates like:
>>>>
>>>> calibrationMethod:BEGIN CAL_METH {this.inCalMeth=true;}
>>>>            (calibrationHandle)*
>>>>            END CAL_METH {this.inCalMeth=false;};
>>>>           calibrationHandle:{this.inCalMeth}?
>>>>        BEGIN CAL_HAND
>>>>        END CAL_HAND;
>>>>
>>>> Perhaps I'm completely off base. If it looks like I'm really  missing
>>>>     
>>>
>>
>>  
>>
>>>> something, you might be right. Feel free to let me know.
>>>>
>>>> This is only one of the problems I'm having, but I'll just keep it to
>>>>     
>>>
>>
>>  
>>
>>>> one question per post ;)
>>>>
>>>> BTW, if anyone is aware of a grammar that is similar which I can  get
>>>>     
>>>
>>
>>  
>>
>>>> inspiration from, can you let me know?
>>>>
>>>> Thanks,
>>>>
>>>> Lucien.
>>>>
>>>>     
>>>
>>>   
>>
>>
>>  
>>
>


More information about the antlr-interest mailing list