[antlr-interest] Can lexer take hints

Thu Jan 19 03:57:38 PST 2006

I think your best bet will be to declare your lexer rules as protected
and use syntactic predicates in a "main" rule as follows:

protected ALPHA : 'a'..'z' | 'A'..'Z';
protected DIGIT : '0'..'9';

protected ID1: (ALPHA)+;
protected ID2: (DIGIT)+;
protected ID3: (ALPHA | DIGIT)+;

protected TOKEN: "MY_TOKEN";

ID_or_TOKEN // This is the "main" rule
  : ( ID3 ) => ( ID3 { $setType( ID3 ); } )
  | ( ID1 ) => ( ID1 { $setType( ID1 ); } )
  | ( ID2 ) => ( ID2 { $setType( ID2 ); } )
  | ( TOKEN ) => ( TOKEN { $setType( TOKEN ); } )
;

So if you try to parse a string like "test1 test2 testtree 44
MY_TOKEN", the lexer will match this to "ID3 ID3 ID1 ID2 TOKEN".
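
As a sanity check on that expected result, here is a minimal Python sketch
of the intended classification. It is not ANTLR code and says nothing about
how ANTLR orders or evaluates the predicates: it simply tests whole,
whitespace-separated words. The names classify and tokenize are
hypothetical helpers made up for this illustration.

```python
import re

# Hypothetical helper, not part of ANTLR: map one whitespace-free word to
# the token type the grammar above is meant to assign to it.
def classify(word):
    if word == "MY_TOKEN":                    # the literal keyword
        return "TOKEN"
    if re.fullmatch(r"[A-Za-z]+", word):      # letters only
        return "ID1"
    if re.fullmatch(r"[0-9]+", word):         # digits only
        return "ID2"
    if re.fullmatch(r"[A-Za-z0-9]+", word):   # letters and digits mixed
        return "ID3"
    raise ValueError("no token type matches %r" % word)

# Assumption: tokens are separated by whitespace, which is skipped.
def tokenize(text):
    return [classify(word) for word in text.split()]

print(tokenize("test1 test2 testtree 44 MY_TOKEN"))
```

Because each check covers the whole word, the most specific types are
tested before the mixed letters-and-digits case.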

Note that the first production in the ID_or_TOKEN rule is ID3. This is
because otherwise tokens of (ALPHA | DIGIT)+ type will never be
matched.

I hope this is what you were after. If not, or if you have other
questions, let me know.

Kind regards,
Gabriel

On 18/01/06, Artem Dmytrenko <admytren at engin.umich.edu> wrote:
> Hello Antlr experts.
>
> I'm an antlr newbie struggling with all these pesky nondeterminism
> warnings. I'm trying to implement a parser for ABNF grammar that has
> overlapping tokens and matching rules. For example, it may have a token
> "media" as well as matching rules a="a..z" and b="a..z0..9". Essentially
> token "media" will match rule a and rule b, while a string like "blah"
> will match rule a and rule b. To make it even worse, tokens have a long
> and short term notation (e.g. "media" and "m" mean the same thing).
>
> My question is if it's possible for parser to instruct lexer to use only a
> subset of tokens. For example, let's say I have the following tokens
> defined in lexer:
>
> ID1: (ALPHA)+;
> ID2: (DIGIT)+;
> ID3: (ALPHA | DIGIT)+;
> TOKEN: "MY_TOKEN";
>
> Now I know in the parser that at a particular point I only expect ID2
> or TOKEN, and want to ask the lexer not to match ID1 and ID3. For example:
>
> messageStart:
>    (ID2 | TOKEN)
>    { System.out.println("Detected message start"); }
>    ;
>
> When I compile code similar to the one above lexer matches all 4 (ID1,
> ID2, ID3, TOKEN) giving me unexpected results. So I don't think it works.
>
> Essentially what I'm trying to do is create a list of all possible lexer
> tokens and then specify in parser which ones to expect at any particular
> time. Is it possible to do with some sort of custom lexer/parser? If not,
> what would be the best approach to implementing this? I suspect that
> states are the only way - but they look very messy and I'm afraid they will
> cause the grammar to depart even further from original ABNF syntax and
> make it difficult to read.
>
> Thank you in advance for any help/pointers/examples on this topic.
>
> Similar questions must have been posted a million times on this forum, I
> apologize if mine is not much different (although it appears so to me!).
>
> Art.
>