[antlr-interest] Languages where keywords can be used as identifiers

Tue Feb 7 16:14:40 PST 2006

> Are there any hooks (from the parser) into the lexer, to tell it to
> switch off testLiterals, or (due to lookahead) is it already too late
> once the parser is parsing a rule?

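Due to the lookahead, it's already too late: by the time a parser action runs, the lexer has usually already produced and classified the next token or tokens. Having the parser trigger state switches in the lexer just leads to a world of hurt.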
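
To illustrate the timing problem, here is a minimal sketch in plain Java. It is a toy model, not ANTLR's own classes: LookaheadDemo, its testLiterals flag, fill, consume and parseGoto are all hypothetical names made up for this example. It only shows that a flag flipped by the parser cannot affect tokens that are already sitting in the lookahead buffer.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Toy model (NOT ANTLR's classes) of a lexer feeding a parser through a lookahead buffer. */
class LookaheadDemo {
    /** Hypothetical stand-in for the lexer's testLiterals switch. */
    static boolean testLiterals = true;

    private final Deque<String> words = new ArrayDeque<>();
    private final Deque<String> buffer = new ArrayDeque<>();

    LookaheadDemo(String input) {
        for (String w : input.trim().split("\\s+")) {
            words.add(w);
        }
    }

    /** The "lexer": classifies a word using the flag's value at the moment of lexing. */
    private String lex(String word) {
        boolean keyword = word.equals("goto") || word.equals("loop");
        return (keyword && testLiterals) ? "KEYWORD(" + word + ")" : "ID(" + word + ")";
    }

    /** Lexes ahead until k tokens are buffered (or the input runs out). */
    void fill(int k) {
        while (buffer.size() < k && !words.isEmpty()) {
            buffer.add(lex(words.poll()));
        }
    }

    /** Returns the next token, lexing it first only if it is not buffered yet. */
    String consume() {
        fill(1);
        return buffer.poll();
    }

    /** Simulates a parser with k tokens of lookahead handling the rule: "goto" ID. */
    static List<String> parseGoto(String input, int k) {
        testLiterals = true;
        LookaheadDemo p = new LookaheadDemo(input);
        List<String> seen = new ArrayList<>();
        p.fill(k);              // prediction: the parser looks k tokens ahead
        seen.add(p.consume());  // match "goto"
        testLiterals = false;   // parser action: "the next token is a label"
        seen.add(p.consume());  // too late if the label was already buffered
        return seen;
    }
}
```

In this model, parseGoto("goto loop", 2) still sees the second token as a keyword, because it was lexed during prediction, before the flag changed. With k = 1 the switch happens to work, which is exactly the kind of fragile behaviour that makes this approach painful.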

I agree about the maintenance issue with regard to keeping a list of unreserved keywords. The grammar I maintain (I wrote a parser for a language called ABL from Progress Software) has over 1000 token types now, and most of those are unreserved keywords. I had hoped that your situation was different.

I don't know of any way around the fact that all of the unreserved keywords need to be listed as a rule in your grammar. The parser that ANTLR generates needs that rule so that its lookahead decisions know which keyword tokens may appear where an identifier is expected.
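
The idea of generating that rule from the exported literals file, rather than maintaining it by hand, could be sketched roughly as follows. This is only a sketch under assumptions: it assumes literal entries in the file look like NAME="text"=NUMBER and skips every other line, and the class name UnreservedRuleGen, the method buildRule and the "reserved" set are hypothetical names invented for this example. Check the pattern against the file your ANTLR version actually writes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: emit the text of an unreservedKeyword rule from token vocabulary lines. */
class UnreservedRuleGen {
    // ASSUMPTION: a literal entry has the shape NAME="text"=NUMBER.
    // Comment lines, the vocabulary name, and entries such as ID=6 do not match.
    private static final Pattern LITERAL =
            Pattern.compile("(\\w+)=\"([^\"]*)\"=(\\d+)");

    /**
     * @param vocabLines lines of the exported token vocabulary file
     * @param reserved   keyword texts that must never be usable as identifiers
     * @return grammar text for a rule listing every remaining keyword
     */
    static String buildRule(List<String> vocabLines, Set<String> reserved) {
        List<String> alternatives = new ArrayList<>();
        for (String line : vocabLines) {
            Matcher m = LITERAL.matcher(line.trim());
            if (!m.matches()) {
                continue; // not a literal entry
            }
            String text = m.group(2);
            if (!reserved.contains(text)) {
                alternatives.add("\"" + text + "\"");
            }
        }
        if (alternatives.isEmpty()) {
            throw new IllegalArgumentException("no unreserved keywords found");
        }
        return "unreservedKeyword\n  : "
                + String.join("\n  | ", alternatives)
                + "\n  ;\n";
    }
}
```

The generated text would still have to be pasted or included into the grammar as a build step. The list of truly reserved words remains a manual decision, but it is normally much shorter than the list of unreserved ones.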

John at joanju dot com

Adam Bishop (DSLWN) wrote:
> Thanks.
> The problem with this is that the list of (unreserved) keywords is
> expanding, so I would need to maintain the unreservedKeyword rule.  I
> need some way of guaranteeing that all of the keywords are in the rule,
> so I could use the literals txt file generated to generate the
> unreservedKeyword rule and import that rule into the grammar...
> 
> I have been trying a different approach, and have made a method that
> (greedily) fetches and matches the next token:
> 
> 	/**
> 	 * Returns the token for the identifier.
> 	 * <p>Should be used instead of the ID token, since the ID token
> 	 * will only be returned by the lexer if the identifier is not a
> 	 * keyword.
> 	 * <p>Show caution in the use of this method, particularly if ID
> 	 * is only one of many options. If it is, then getID should be the
> 	 * last option, as it will force the parser to consume the next
> 	 * token regardless (i.e. it always matches).
> 	 * @throws TokenStreamException
> 	 * @throws MismatchedTokenException
> 	 */
> 	private Token getID() throws MismatchedTokenException,
> 			TokenStreamException {
> 		Token result = LT(1);
> 		// Matching LT(1) against its own type always succeeds.
> 		match(result.getType());
> 		return result;
> 	}
> 
> It works in the case that the rule actually can be greedy, but the
> obvious downsides are that getID needs to be the last option
> within any selection, and if it is part of an optional clause it will
> fail.  I could modify it so that it does not call match unconditionally
> (possibly by passing in an exception set of tokens).
> 
> But both of these approaches seem to be... less than elegant.
> 
> Are there any hooks (from the parser) into the lexer, to tell it to
> switch off testLiterals, or (due to lookahead) is it already too late
> once the parser is parsing a rule?
> 
> P.S. I'm leaning towards your solution.
> 
> 
> -----Original Message-----
> From: John Green [mailto:greenj at ix.netcom.com] 
> Sent: Wednesday, 8 February 2006 12:13 p.m.
> To: Adam Bishop (DSLWN)
> Cc: antlr-interest at antlr.org
> Subject: Re: [antlr-interest] Languages where keywords can be used as
> identifiers
> 
> I went through the same thing a long time ago. To do it similar to what
> I did:
> 
> The lexer would always recognize "loop" as a keyword token LOOP.
> 
> The grammar would have a rule like:
>   unreservedkeyword: LOOP | etc | etc ;
> 
> The grammar would use a rule named "id":
>   id: ID | unreservedkeyword ;
> 
> But enhance that last rule a bit, so that when you add it to the tree,
> you change the type from LOOP (or whatever keyword) to ID:
>   id: ID | urk:unreservedkeyword { #urk.setType(ID); } ;
> I probably have the syntax wrong for setType, sorry, this is off the top
> of my head.
> 
> Now your grammar can use:
>   "goto" id
> and
>   datatype id
> 
> 
> HTH,
> John
> john at joanju dot com
> 
> 
> Adam Bishop (DSLWN) wrote:
>> I am parsing a language where "loop" is a keyword, however a label can
>> be named loop.  The rule for label expects an identifier token, but the
>> lexer will return a loop token.  Is there any way to switch testLiterals
>> for a particular rule?
>>
>> Ideally the Lexer shouldn't be doing testLiterals for any usage of the
>> token ID in the parser.
>>
>> NOTE:  To make things worse, I am having this problem wherever I have a
>> rule in the parser that expects an identifier
>>
>> e.g.
>>
>>   "goto" ID
>>
>> will fail for input "goto loop"
>>
>> and
>>
>>   datatype ID
>>
>> will fail for "Number length" (since length is a keyword in another rule)