[antlr-interest] How to handle subset relations between grammar elements?

Mon Jan 2 11:32:55 PST 2012

Greetings,
I am in the process of porting a grammar to ANTLR. The original tool for
the grammar is unknown; it may not even have been meant to be consumed by
a tool at all, and could be a pseudo-grammar.

In the process, I've come across a requirement where some grammar elements
are subsets of other elements. I'm using the term "elements" in a generic
way here, since I'm not sure how they should be represented in ANTLR (in
the lexer or in the parser).
An example is NonZeroDigit, which is supposed to be Digit (0..9) without 0.
There are other examples, such as the ones copied below:

{String Char} = {Printable} - ["] - {quote}
{NonZeroDigit} = {Digit} - [0]

{LetterMinusA} = {Letter} - [aA]
{LetterMinusT} = {Letter} - [tT]
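
To make the subset relations concrete, here is a minimal Java sketch of
three of the definitions above as character predicates. The class name
CharSets and its method names are hypothetical, and {Letter} is assumed to
mean the ASCII letters a-z and A-Z, which the original grammar does not
state. {String Char} is left out because {Printable} and {quote} are not
defined here.

```java
// Hypothetical helper, not part of the original grammar.
final class CharSets {
    private CharSets() {}

    // {Digit}: 0..9
    static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    // {NonZeroDigit} = {Digit} - [0]
    static boolean isNonZeroDigit(char c) {
        return isDigit(c) && c != '0';
    }

    // {Letter}: assumed to be ASCII letters only.
    static boolean isLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    // {LetterMinusA} = {Letter} - [aA]
    static boolean isLetterMinusA(char c) {
        return isLetter(c) && c != 'a' && c != 'A';
    }

    // {LetterMinusT} = {Letter} - [tT]
    static boolean isLetterMinusT(char c) {
        return isLetter(c) && c != 't' && c != 'T';
    }
}
```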

I've been working on building this grammar in ANTLR, and so far the only
solution I could find is to use parser rules (thanks to a nice response to
my question at Stack Overflow:
http://stackoverflow.com/questions/8695693/how-to-handle-tokens-when-one-is-subset-of-another-in-antlr).
This is an example grammar that shows my current solution:

------------------------------------------------------------------------------
grammar AQLJV;

tokens{

}

expr    :    numbers_wo_zero;

expr_wn    :    numbers;

numbers    :
        NUMBER;

numbers_wo_zero
    :    NUMBER
         {
             if ($NUMBER.text.indexOf("0") > -1) {
                 throw new RuntimeException("numbers_wo_zero expected, input contains 0");
             }
         }
    ;

NUMBER     :    (INT)+ ;

// A single digit; NUMBER supplies the repetition.
fragment INT :    '0'..'9'
    ;
------------------------------------------------------------------------------

As you can see, the rule numbers_wo_zero puts a constraint on the NUMBER
token, but returns the token if it complies with the constraint. This way,
I am not creating tokens such as NUMBER_WO_ZERO, and if I use the numbers
rule, both 123 and 103 work fine.
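
The check performed by the action in numbers_wo_zero can also be sketched
as a standalone Java helper, which makes it easy to test outside the
parser. The class name NumberChecks and the method requireNoZero are
hypothetical; such a helper could, for instance, be placed in the grammar's
@members section and called from the action.

```java
// Hypothetical helper, not part of the original grammar.
final class NumberChecks {
    private NumberChecks() {}

    // Returns the text unchanged if it contains no '0'; otherwise throws
    // the same exception as the action in numbers_wo_zero.
    static String requireNoZero(String text) {
        if (text.indexOf('0') > -1) {
            throw new RuntimeException(
                "numbers_wo_zero expected, input contains 0");
        }
        return text;
    }
}
```

With this sketch, requireNoZero("123") returns its argument, while
requireNoZero("103") throws a RuntimeException.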

I am, however, curious about the appropriateness of this solution. Is
there a better or recommended way of doing this? I've tried using the
lexer's capabilities, but that did not end well.

My problem with the lexer-based solution is that, if I create tokens for
more specific elements (such as nonzerodigit), the lexer rule ends up
producing the specific token every time, since it knows nothing about the
parser. This means that parser rules expecting the generic element may
sometimes get the specific one. For example, numbers means all numbers,
with or without zero, but if I modify the token type in the lexer rule, a
rule expecting to match numbers won't always work, since it will get
numbers_without_zero for input 123.
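
The mismatch can be illustrated with a toy Java model. This is not ANTLR
code: the class ToyLexer, its token types, and both methods are
hypothetical stand-ins for a lexer that always emits the most specific
token type and a parser rule that accepts exactly one type. Input is
assumed to consist of digits only.

```java
// Toy model of the problem; hypothetical, not generated by ANTLR.
final class ToyLexer {
    private ToyLexer() {}

    enum TokenType { NUMBER, NUMBER_WO_ZERO }

    // Mimics a lexer rule that switches to the more specific token type
    // whenever the matched digits contain no '0'.
    static TokenType classify(String digits) {
        return digits.indexOf('0') < 0
            ? TokenType.NUMBER_WO_ZERO
            : TokenType.NUMBER;
    }

    // Mimics a parser rule "numbers : NUMBER ;" that matches only the
    // generic token type.
    static boolean matchesNumbersRule(String digits) {
        return classify(digits) == TokenType.NUMBER;
    }
}
```

In this model, matchesNumbersRule("103") succeeds but
matchesNumbersRule("123") fails, because the lexer has already committed
to NUMBER_WO_ZERO before the parser gets a say.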

So, what would be the right way of handling this use case?