[antlr-interest] QUESTION on: How do I handle abbreviated keywords?

Fri Oct 31 22:53:42 PDT 2008

Thank you, Gavin, for taking the time to reply.

>Am I supposed to write an initialization routine that builds a dictionary?
So yes, this is what I have to do.

In the code generated for my CSharp2 target, both components needed for 
this dictionary *already* exist: the string values of the tokens and the 
corresponding integer token types.
It appears I have to duplicate some of that to build the dictionary, which 
is OK but surprising, since the ANTLR documentation and publications stress 
efficiency. That is, it seems the target could have organized the generated 
code so that it provides this mapping, rather than requiring me to duplicate 
it by hand.  Just thinking out loud, not complaining... overall, I'm loving 
ANTLR.  :-)
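
For the archive, here is a minimal sketch of how I understand the 
suggestion. The TokenTypes class and its numeric values are placeholders 
made up for illustration; in a real lexer the constants would come from the 
generated code. The case-insensitive comparer is also an assumption on my 
part, not something the wiki page requires.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for the token type constants that ANTLR would
// generate from the tokens block; the numeric values are made up.
static class TokenTypes
{
    public const int IDENTIFIER = 4;
    public const int BEGIN = 5;
    public const int END = 6;
    public const int WHILE = 7;
}

static class KeywordLookup
{
    // Maps keyword *text* (not token names) to token types.
    static readonly Dictionary<string, int> keywordTable = BuildTable();

    static Dictionary<string, int> BuildTable()
    {
        // Assumption: keywords are case-insensitive. Drop the comparer
        // if the language being lexed is case-sensitive.
        Dictionary<string, int> table =
            new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
        table.Add("begin", TokenTypes.BEGIN);
        table.Add("end", TokenTypes.END);
        table.Add("while", TokenTypes.WHILE);
        return table;
    }

    // Returns the keyword's token type, or IDENTIFIER when the text
    // is not in the table.
    public static int CheckKeywordsTable(string text)
    {
        int type;
        return keywordTable.TryGetValue(text, out type)
            ? type
            : TokenTypes.IDENTIFIER;
    }
}
```

Because the table is keyed by the text that appears in the input, nothing 
from tokenNames needs to be filtered; only the keywords I add by hand can 
ever match.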

Regards,
Ben

----- Original Message ----- 
From: "Gavin Lambert" <antlr at mirality.co.nz>
To: "Ben Gillis" <wbgillis at gmail.com>; <antlr-interest at antlr.org>
Sent: Friday, October 31, 2008 9:52 PM
Subject: Re: [antlr-interest] QUESTION on: How do I handle abbreviated 
keywords?

> At 14:00 1/11/2008, Ben Gillis wrote:
>>see http://www.antlr.org/wiki/pages/viewpage.action?pageId=1802308.
>>
>>It's not clear to me the connection between the tokens block (and its 
>>auto-gen'd code), and this statement in the above URL:
>>
>>"might simply consult an IDictionary<string,int> map of all keywords (incl 
>>abbreviations). "
>>
>>The tokens block ends up in a string array named tokenNames (CSharp2 
>>target).  My tokens keywords are mixed with other entries related to the 
>>grammar definition.
>>
>>Am I supposed to write an initialization routine that builds a dictionary? 
>>If so, I have to filter through the auto-gen'd tokenNames making sure to 
>>enter only my keywords, otherwise I'll get false hits in my 
>>CheckKeywordsTable routine.  Somehow, this doesn't seem quite right, ???
>
> The tokenNames array is a list of token *names*, which is useless for that 
> purpose, since for that particular keyword matching strategy what you're 
> after is a mapping of keyword *text* to token *type* (the integer value), 
> which is an entirely different thing.
>
> Say you have the keywords "begin", "end", and "while".  Your tokens block 
> declares imaginary token types like this:
>
> tokens {
>   BEGIN;
>   END;
>   WHILE;
> }
>
> These carry no text and can't do any matching by themselves, but each one 
> *does* get a token ID allocated.  In your lexer's constructor, you 
> additionally set up a dictionary mapping like so:
>
>   keywordTable.Add("begin", BEGIN);
>   keywordTable.Add("end", END);
>   keywordTable.Add("while", WHILE);
>
> Then in the CheckKeywordsTable function you use that mapping to return the 
> "real" token type, be that one listed in the table or the catch-all 
> IDENTIFIER (when it doesn't look like a keyword).  For more complicated 
> cases you may need to do something smarter than a dictionary lookup, but 
> that's up to you.
>
> (To handle abbreviations, which is what that page is primarily focused on, 
> you'll obviously have to add the valid abbreviations of the keywords 
> to the table as well.)
>
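
On the abbreviations point, a rough sketch of one way to fill the table, 
assuming an abbreviation is any prefix of a keyword with some minimum 
length. That rule, the AddWithAbbreviations helper, and the token type 
values are all hypothetical; a real language may define its abbreviations 
differently.

```csharp
using System;
using System.Collections.Generic;

static class AbbreviatedKeywords
{
    // Hypothetical token type values; real ones come from the generated lexer.
    public const int BEGIN = 5;
    public const int END = 6;
    public const int WHILE = 7;

    // Registers the full keyword plus every prefix of at least
    // minLength characters.
    public static void AddWithAbbreviations(
        Dictionary<string, int> table, string keyword, int minLength, int tokenType)
    {
        for (int length = minLength; length <= keyword.Length; length++)
        {
            // Add throws ArgumentException on a duplicate key, which
            // flags two keywords that share an abbreviation.
            table.Add(keyword.Substring(0, length), tokenType);
        }
    }

    public static Dictionary<string, int> BuildTable()
    {
        Dictionary<string, int> table = new Dictionary<string, int>();
        AddWithAbbreviations(table, "begin", 3, BEGIN); // beg, begi, begin
        AddWithAbbreviations(table, "end", 3, END);     // end only
        AddWithAbbreviations(table, "while", 2, WHILE); // wh, whi, whil, while
        return table;
    }
}
```

Letting Add throw on a duplicate key means ambiguous abbreviations are 
caught when the table is built rather than showing up later as a wrong 
token type.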