[antlr-interest] Can lexer take hints

Fri Jan 20 15:23:48 PST 2006

Thank you, Gabriel - it works like a charm! I vote for adding your
description to the ANTLR manual's section on syntactic predicates :)

I've also found it helpful to split the tokens/IDs across several smaller
lexers and then use TokenStreamSelector to switch between them.
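
To make the idea of switching between smaller lexers concrete, here is a
rough plain-Java sketch. It does not use ANTLR at all: the class
ModeSwitchSketch, the nested Mode class, and the two mode tables are
hypothetical names made up for illustration, and it classifies whole
whitespace-separated words instead of matching prefixes the way a generated
lexer does. Within each mode the most specific pattern is listed first,
which is an assumption that suits whole-word matching and is not meant to
mirror the alternative order in the quoted grammar.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Plain-Java illustration only: none of these classes come from ANTLR.
public class ModeSwitchSketch {

    // One "small lexer": an ordered table of token type -> pattern.
    static final class Mode {
        private final Map<String, Pattern> rules = new LinkedHashMap<>();

        Mode rule(String type, String regex) {
            rules.put(type, Pattern.compile(regex));
            return this;
        }

        // Returns the first token type whose pattern matches the whole word,
        // or "UNKNOWN" when no rule in this mode accepts it.
        String classify(String word) {
            for (Map.Entry<String, Pattern> e : rules.entrySet()) {
                if (e.getValue().matcher(word).matches()) {
                    return e.getKey();
                }
            }
            return "UNKNOWN";
        }
    }

    // Mode that knows every token type from the thread.
    static final Mode DEFAULT_MODE = new Mode()
            .rule("TOKEN", "MY_TOKEN")
            .rule("ID1", "[a-zA-Z]+")
            .rule("ID2", "[0-9]+")
            .rule("ID3", "[a-zA-Z0-9]+");

    // Restricted mode for a point where only ID2 or TOKEN is expected.
    static final Mode MESSAGE_START_MODE = new Mode()
            .rule("TOKEN", "MY_TOKEN")
            .rule("ID2", "[0-9]+");

    // Classifies each whitespace-separated word using the selected mode.
    static List<String> tokenize(Mode mode, String input) {
        List<String> types = new ArrayList<>();
        for (String word : input.trim().split("\\s+")) {
            types.add(mode.classify(word));
        }
        return types;
    }

    public static void main(String[] args) {
        // prints [ID3, ID1, ID2, TOKEN]
        System.out.println(tokenize(DEFAULT_MODE, "test1 testtree 44 MY_TOKEN"));
        // prints [ID2, TOKEN, UNKNOWN]
        System.out.println(tokenize(MESSAGE_START_MODE, "44 MY_TOKEN test1"));
    }
}
```

In a real ANTLR 2 setup the two tables would presumably be separate
generated lexers sharing one input, with the parser selecting between them;
the sketch only shows why restricting the active rule set removes the
overlap.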

Regards,
Art.

On Thu, 19 Jan 2006, Gabriel Radu wrote:

> I think your best bet will be to declare your lexer rules as protected
> and use syntactic predicates in a "main" rule as follows:
>
> protected ALPHA : 'a'..'z' | 'A'..'Z';
> protected DIGIT : '0'..'9';
>
> protected ID1: (ALPHA)+;
> protected ID2: (DIGIT)+;
> protected ID3: (ALPHA | DIGIT)+;
>
> protected TOKEN: "MY_TOKEN";
>
> ID_or_TOKEN // This is the "main" rule
>  : ( ID3 ) => ( ID3 { $setType( ID3 ); } )
>  | ( ID1 ) => ( ID1 { $setType( ID1 ); } )
>  | ( ID2 ) => ( ID2 { $setType( ID2 ); } )
>  | ( TOKEN ) => ( TOKEN { $setType( TOKEN ); } )
> ;
>
> So if you try to parse a string like "test1 test2 testtree 44
> MY_TOKEN", the lexer will match this to "ID3 ID3 ID1 ID2 TOKEN".
>
> Note that the first production in the ID_or_TOKEN rule is ID3. This is
> because otherwise tokens of (ALPHA | DIGIT)+ type will never be
> matched.
>
> I hope this is what you were after. If not, or if you have other
> questions, let me know.
>
>
> Kind regards,
> Gabriel
>
>
> On 18/01/06, Artem Dmytrenko <admytren at engin.umich.edu> wrote:
>> Hello Antlr experts.
>>
>> I'm an antlr newbie struggling with all these pesky nondeterminism
>> warnings. I'm trying to implement a parser for an ABNF grammar that has
>> overlapping tokens and matching rules. For example, it may have a token
>> "media" as well as matching rules a="a..z" and b="a..z0..9". Essentially
>> token "media" will match rule a and rule b, and a string like "blah"
>> will also match rule a and rule b. To make it even worse, tokens have a
>> long and a short form (e.g. "media" and "m" mean the same thing).
>>
>> My question is whether it's possible for the parser to instruct the lexer
>> to use only a subset of tokens. For example, let's say I have the
>> following tokens defined in the lexer:
>>
>> ID1: (ALPHA)+;
>> ID2: (DIGIT)+;
>> ID3: (ALPHA | DIGIT)+;
>> TOKEN: "MY_TOKEN";
>>
>> Now I know in the parser that at a particular point I only expect ID2
>> or TOKEN, and I want to ask the lexer not to match ID1 and ID3. For
>> example:
>>
>> messageStart:
>>    (ID2 | TOKEN)
>>    { System.out.println("Detected message start"); }
>>    ;
>>
>> When I compile code similar to the above, the lexer still matches all 4
>> (ID1, ID2, ID3, TOKEN), giving me unexpected results. So I don't think it
>> works.
>>
>> Essentially what I'm trying to do is create a list of all possible lexer
>> tokens and then specify in the parser which ones to expect at any
>> particular time. Is this possible with some sort of custom lexer/parser?
>> If not, what would be the best approach to implementing this? I suspect
>> that states are the only way - but they look very messy and I'm afraid
>> they will cause the grammar to depart even further from the original ABNF
>> syntax and make it difficult to read.
>>
>> Thank you in advance for any help/pointers/examples on this topic.
>>
>> Similar questions must have been posted a million times on this forum; I
>> apologize if mine is not much different (although it appears so to me!).
>>
>> Art.
>>
>
>