[antlr-interest] Tokenising for context specific reserved words

Thu Jul 17 18:58:32 PDT 2008

----- Original Message ----
From: Jim Idle <jimi at temporal-wave.com>
To: Loring Craymer <lgcraymer at yahoo.com>
Cc: antlr-interest <antlr-interest at antlr.org>
Sent: Thursday, July 17, 2008 6:19:24 PM
Subject: Re: [antlr-interest] Tokenising for context specific reserved words

On Thu, 2008-07-17 at 17:36 -0700, Loring Craymer wrote: 
For Yggdrasil, I hide the sempred behind doubly-quoted keywords.  As to performance:  the sempred is called less often than id (as a rule--YMMV) and usually much less often.  The issue is aggregate performance, not local performance; the general principle for performance tweaking is to worry less about the cost of infrequent calls than about the cost of frequent calls.  Basically, the id approach adds a method call and a bitset inclusion test for every ID, while the sempred costs three calls per keyword test.
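
As a rough illustration of the two approaches being compared, here is a minimal Python sketch (not ANTLR output, and not code from either poster). The token types, the `select ... from ...` mini-grammar, and every function name are assumptions made up for the example. In an ANTLR 3 Java grammar the sempred would presumably look something like `{input.LT(1).getText().equals("select")}?`, which is where a count of three calls per keyword test could come from.

```python
from dataclasses import dataclass

# Hypothetical token types; a generated lexer would define its own.
ID, KW_SELECT, KW_FROM = range(3)
KEYWORDS = {"select": KW_SELECT, "from": KW_FROM}
# Token types the id rule accepts: plain IDs plus keywords usable as names.
ID_LIKE = frozenset({ID, KW_SELECT, KW_FROM})


@dataclass
class Token:
    type: int
    text: str


def lex_typed(text):
    """Lexer for the id-rule approach: keywords get their own token types."""
    return [Token(KEYWORDS.get(word, ID), word) for word in text.split()]


def lex_plain(text):
    """Lexer for the sempred approach: every word is a plain ID."""
    return [Token(ID, word) for word in text.split()]


def match_id(tok):
    """id rule: one call plus one set-membership test for every identifier."""
    if tok.type not in ID_LIKE:
        raise ValueError(f"expected identifier, got {tok.text!r}")
    return tok.text


def parse_with_id_rule(tokens):
    """Parse 'select <id> from <id>' using keyword token types and match_id."""
    if tokens[0].type != KW_SELECT or tokens[2].type != KW_FROM:
        raise ValueError("expected: select <id> from <id>")
    return match_id(tokens[1]), match_id(tokens[3])


def is_keyword(tok, word):
    """Sempred: compare token text, paid only where a keyword is expected."""
    return tok.type == ID and tok.text == word


def parse_with_sempred(tokens):
    """Parse the same statement, testing text only at the keyword positions."""
    if not (is_keyword(tokens[0], "select") and is_keyword(tokens[2], "from")):
        raise ValueError("expected: select <id> from <id>")
    return tokens[1].text, tokens[3].text
```

Both parsers accept `select from from select`, reading `from` as the column and `select` as the table. The difference is where the cost falls: `match_id` runs for every identifier, while `is_keyword` runs only at the two keyword positions.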

OK - I see where you are going. However, most of the cases I come across mean that you would be doing those 3 calls for every keyword and I think it would be quickly unreadable. Most languages where this happens allow almost all keywords to be used as identifiers when they are not in fact the actual keyword. The lesson then is probably to step back from the solution before implementing either one and see which makes sense for your particular situation. I can imagine that cases where a few new keywords are introduced in a new version of the language but for backward compatibility reasons they are allowed to be identifiers, may well qualify as a sempred candidate for instance. 

True, but keyword invocations are less common than user-defined identifiers, and only conditional keywords (like "if") appear at high frequency in code.  The disadvantage of the simple sempred is that you really want to "hoist and negate" because of the equality test (!= is faster than == for strings), but that is an implementation detail.

There are probably better generic solutions for the whole keyword vs ID issue. Double quoting keywords seems like a reasonable way to flag something as also being available as an identifier, but then it forces the sempred route unless it is further adorned with constructs that may well then inextricably link the parser and lexer, which is probably/possibly best avoided. 
I do tag doubly quoted keywords with unique token types for tree walking as they are identified by the parser; that effectively decreases the frequency of identifier versus keyword conflicts over the translation cycle.  The converse (have id set non-IDs to ID in the conglomeration approach you described) is also useful.
--Loring
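
The retagging idea in the last reply, and its converse, might be sketched as follows. This is an illustrative Python model only: the token types, the `select <id> from <id>` statement shape, and all function names are hypothetical, and a real ANTLR grammar would do the retagging in parser actions instead.

```python
from dataclasses import dataclass

# Hypothetical token types, invented for this sketch.
ID, KW_SELECT, KW_FROM = range(3)
KEYWORD_TYPES = {"select": KW_SELECT, "from": KW_FROM}
ID_LIKE = frozenset(KEYWORD_TYPES.values()) | {ID}


@dataclass
class Token:
    type: int
    text: str


def accept_keyword(tok, word):
    """Sempred-style test; on success give the token its unique keyword type."""
    if tok.type == ID and tok.text == word:
        tok.type = KEYWORD_TYPES[word]
        return True
    return False


def accept_id(tok):
    """Converse: accept ID or a keyword-typed token and reset its type to ID."""
    if tok.type in ID_LIKE:
        tok.type = ID
        return True
    return False


def retag_from_plain_ids(text):
    """Lexer emits only IDs; the parser tags keywords at keyword positions."""
    tokens = [Token(ID, word) for word in text.split()]
    ok = accept_keyword(tokens[0], "select") and accept_keyword(tokens[2], "from")
    if not ok:
        raise ValueError("expected: select <id> from <id>")
    return [tok.type for tok in tokens]


def retag_from_typed_keywords(text):
    """Lexer emits keyword types; the id rule resets name positions to ID."""
    tokens = [Token(KEYWORD_TYPES.get(word, ID), word) for word in text.split()]
    if tokens[0].type != KW_SELECT or tokens[2].type != KW_FROM:
        raise ValueError("expected: select <id> from <id>")
    if not (accept_id(tokens[1]) and accept_id(tokens[3])):
        raise ValueError("expected identifiers after select and from")
    return [tok.type for tok in tokens]
```

Under these assumptions both directions leave the same token types behind for `select from from select`, so a later tree walker can tell keywords from names by token type alone, without repeating the text comparison.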

Jim
