[antlr-interest] newbie lookahead question

Fri Apr 21 13:07:41 PDT 2006

Usually the way I handle keywords is to define them in the parser rather
than the lexer.

For example, a snippet from a lexer I'm building right now

ID: (LETTER | '_') (LETTER | DIGIT | '_')*;

// internal tokens only
protected
LETTER: 'a'..'z';

protected
DIGIT: '0'..'9';

Which recognises basic identifiers that may start with a letter or a _
followed by zero or more letters/digits/_

I also have the following options defined in my lexer..

Options
{
	caseSensitive = false;
	caseSensitiveLiterals = false;
	testLiterals = true;
}

Then in the parser use literal keywords

MyParseRule:
   "on"
 | "off"
 | "office"
 ;

And so on. The lexer will recognise full word tokens and pass them to
the parser which will then do the keyword matching for you.  No need for
elaborate look ahead rules. I believe that with the testLiterals set to
true literal strings in the parser are automatically added as token
types to the lexer and a literal lookup map is built.  The lexer will
then test each token it reads from the input stream against this literal
map and if it finds a matching entry return that token type instead of
ID.

HTH.

-----Original Message-----
From: antlr-interest-bounces at antlr.org
[mailto:antlr-interest-bounces at antlr.org] On Behalf Of Lance Gutteridge
Sent: 21 April 2006 20:41
Cc: antlr-interest at antlr.org
Subject: Re: [antlr-interest] newbie lookahead question

Hi,
I'm a user of ANTLR but I'm nowhere close to being a real pro with it. 
I've just learnt what I had to implement my script language. So please
excuse me if this is a dumb question.

I saw the question regarding lookahead and it is something I have been
wrestling with as well.

I have a large number of keywords in my language.

For purpose of example here are five:

ON:         "on";
OF:          "of";
OFF:        "off";
OFFICE:  "office";
ACTIVATE:         "activate";
ACTIVATED:      "activated";

To disambiguate between ACTIVATE and ACTIVATED requires k = 9.

That seems inefficient although it works.

On the other hand Martin's solution of matching the leading part of the
word and then using $setType seems difficult. If I want k=1, I would
need to have a rule for any words that share a leading letter (e.g. ON
and OF  and OFF and OFFICE).

To  handle the ON, OF, OFF, and OFFICE case in the manner Martin
suggests would be a fairly complicated rule, because it has to say that
the token is OF is it matches "OF" and then a whitespace, to disambiuate
it from OFF and OFFICE. Then the same has to be done to decide between
OFF and OFFICE. (BTW I would be grateful for an example of such a rule
as I have had trouble constructing one that works for this kind of
situation).

Is there no way to handle this in general, other than setting k to be as
long as the longest prefix that two keywords have in common?  I would
really like a technique that every time I add a new keyword I don't have
to figure out all the ones it could conflct with and write some kind of
long statement that descends through the letters and peels off
ambiguities.

Thanks,
Lance

Martin Probst wrote:

> Hi,
>
> you can either increase your lookahead (which is not advisable in this

> case), or rather do it manually:
>
> CALIBRATION_THINGY:
>   "CALIBRATION_" ( "METHOD" { $setType(CAL_METHOD);} |  "HANDLE" { 
> $setType(CAL_HANDLE);} );
>
> This parses the CALIBRATION_ part and then decides what kind of token 
> type this is later. You'll may want to add "CAL_METHOD" and 
> "CAL_HANDLE" to the tokens section of your grammar because they aren't

> declared automatically if used like this - you can use that to give 
> them a proper help message later on.
>
> Martin
>
>
> Am 17.04.2006 um 06:30 schrieb Lucien Stals:
>
>> Hi,
>>
>> I've been learning ANTLR for about two weeks and want to be able to 
>> parse (then transform into XML) an input file in a specific markup 
>> language (ASAP2). I have not worked with parsers before and I feel 
>> like I'm in a little over my head. It's sink or swim time for me.
>>
>> I have some basic stuff working, but I'm getting lots of warnings 
>> about ambiguity.
>>
>> Part of a sample input file might look like:
>>
>> ...
>> /begin CALIBRATION_METHOD
>>     "Slewing"
>>     1
>>     /begin CALIBRATION_HANDLE
>>         0x1ffbf8
>>         0x400
>>         0
>>         AllSlews
>>     /end CALIBRATION_HANDLE
>> /end CALIBRATION_METHOD
>> ...
>>
>>
>> My question is about look ahead.
>> In my parser, I have rules like:
>>
>>
>> calibrationMethod:BEGIN CAL_METH
>>             (calibrationHandle)*
>>             END CAL_METH;
>>            
>> calibrationHandle:BEGIN CAL_HAND
>>             END CAL_HAND;
>>
>> Where my lexer rules are:
>>
>> protected
>> SLASH        :'/';
>>
>> BEGIN        :SLASH "begin";
>>
>> END        :SLASH "end";
>>
>> CAL_METH    :"CALIBRATION_METHOD";
>>
>> CAL_HAND    :"CALIBRATION_HANDLE";
>>
>>
>> (I'm just dealing with the tag structure first. Parsing the actual 
>> data is my next step. I have filter=true for now so I can ignore what

>> I can't parse yet)
>>
>> And I'm getting ambiguity warnings. Should I set my lookahead to 
>> something silly like 13 just to look past "CALIBRATION_"? (I read 
>> that bigger lookaheads are performance killers) Or is there a smarter

>> way to do this? Should I use predicates like:
>>
>> calibrationMethod:BEGIN CAL_METH {this.inCalMeth=true;}
>>             (calibrationHandle)*
>>             END CAL_METH {this.inCalMeth=false;};
>>            
>> calibrationHandle:{this.inCalMeth}?
>>         BEGIN CAL_HAND
>>         END CAL_HAND;
>>
>> Perhaps I'm completely off base. If it looks like I'm really  missing

>> something, you might be right. Feel free to let me know.
>>
>> This is only one of the problems I'm having, but I'll just keep it to

>> one question per post ;)
>>
>> BTW, if anyone is aware of a grammar that is similar which I can  get

>> inspiration from, can you let me know?
>>
>> Thanks,
>>
>> Lucien.
>>
>
>