[antlr-interest] Parsing advice: predicate needed?

Wed Dec 9 23:53:53 PST 2009

At 04:18 10/12/2009, Rick Schumeyer wrote:
>I have a lexer rule
>
>LETTER  :    ('a'..'z')|('A'..'Z');
>
>At a certain point in my parser, I want to know if the next 
>several letters are the word 'Article'.  I don't want an article 
>token, because that word can appear at other points, and I don't 
>care.
>
>Can I simply define a parser rule
>
>article_string : 'A' 'r' 't' 'i' 'c' 'l' 'e';
>
>If this works, then I'm really confused because I would think 
>that the lexer would have already decided that there are several 
>LETTER tokens, so I'm not sure how the parser would see an 'A' 
>and an 'r', etc.

As you guessed, that's not the way to do it.  What that would 
actually do is define a new token for each of the letters you 
quoted, so those letters would no longer be lexed as LETTER tokens 
(possibly raising an ambiguity warning, and almost certainly 
breaking other parts of your grammar).  (This little surprise is 
why I usually recommend never using quoted literals in parser 
rules.)

Normally one of the functions of the lexer is to group individual 
characters into larger units (e.g. identifiers or words).  If you 
had a WORD token, for example, then an appropriate rule might be:
   article : { LT(1).text == "Article" }? WORD ;
   WORD : LETTER+ ;
   fragment LETTER : 'A'..'Z' | 'a'..'z' ;
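
To make the idea concrete outside of ANTLR, here is a minimal Python 
sketch of what the predicate is doing: the "lexer" groups letters 
into WORD tokens, and the "parser" looks at the text of the next 
token before accepting it as an article.  The names tokenize and 
is_article_next are made up for this illustration and are not part 
of ANTLR; skipping everything that is not a letter is also just an 
assumption to keep the sketch short.

```python
import re

def tokenize(text):
    """Hypothetical stand-in for the lexer rule WORD : LETTER+ ;
    Anything that is not an ASCII letter is simply skipped here."""
    return re.findall(r"[A-Za-z]+", text)

def is_article_next(tokens, pos):
    """Hypothetical stand-in for { LT(1).text == "Article" }? WORD
    True only if there is a next token and its text is exactly 'Article'."""
    return pos < len(tokens) and tokens[pos] == "Article"

tokens = tokenize("See Article 5 of the treaty")
print(tokens)
print(is_article_next(tokens, 0))  # next token is "See"
print(is_article_next(tokens, 1))  # next token is "Article"
```

The point is that the word is still an ordinary WORD token 
everywhere else; only the rule that cares about it inspects the 
text.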

You can do the same thing with individual LETTER tokens but it 
gets a bit more noisy:

   article : { LT(1).text == "A" && LT(2).text == "r" ... }?
             LETTER LETTER ... ;

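(The predicate code above is only a sketch and will need to vary 
depending on the target language; with the Java target, for 
instance, the comparison would be written something like 
input.LT(1).getText().equals("Article").  See the examples.)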

Also note that if you are producing individual LETTER tokens, you 
should not be hiding or skipping any tokens (eg. whitespace), or 
the parser won't be able to tell the difference between "Article" 
and "A   r  tic  le".
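
The whitespace problem is easy to see with a small Python sketch 
(again not ANTLR code; letter_tokens and its skip_whitespace flag 
are hypothetical names used only for this illustration).  It emits 
one token per character and optionally drops whitespace, the way a 
skipped or hidden WS rule would:

```python
import string

def letter_tokens(text, skip_whitespace):
    """Emit one (kind, char) token per character.
    With skip_whitespace=True, whitespace never reaches the 'parser'."""
    tokens = []
    for ch in text:
        if ch.isspace():
            if not skip_whitespace:
                tokens.append(("WS", ch))
            continue
        kind = "LETTER" if ch in string.ascii_letters else "OTHER"
        tokens.append((kind, ch))
    return tokens

a = "Article"
b = "A   r  tic  le"

# With whitespace skipped, both inputs yield the same seven LETTER tokens.
print(letter_tokens(a, True) == letter_tokens(b, True))

# With whitespace kept as tokens, the two streams differ.
print(letter_tokens(a, False) == letter_tokens(b, False))
```

With a WORD token this issue does not come up, since the lexer 
only groups letters that are actually adjacent.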