[antlr-interest] PEG-style "and" predicates

Mon Jul 2 08:14:36 PDT 2007

In my lexer I am trying to use PEG-style "and" predicates ("match X
only if it is followed by Y, without consuming Y") so that I can take
a rule like this:

   FOO : 'bar' ;

And make it so that it will "match 'bar' only if it is the last
non-whitespace thing on the line".

So, based on the 11 May 2006 notes on this page:

   <http://www.antlr.org/blog/antlr3/lookahead.tml>

I tried making a rule like this:

   FOO : ('bar' ' '* ('\n' | '\r' | EOF))=> 'bar' ;

But it doesn't work. Looking at the generated code, I can see that  
the syntactic predicate is broken; on lookahead of 'b', the lexer  
fires off a syntactic predicate which tries to match only 'bar' (and  
not the whitespace etc in the predicate). As a result, this rule will  
match for any occurrence of 'bar', not just for 'bar' when it is the  
last non-whitespace thing on the line. This is happening in a  
filtering lexer, but a quick test suggests that exactly the same  
thing happens in non-filtering lexers as well.

My understanding of predicates (notes at
<http://wincent.com/knowledge-base/ANTLR_predicates>) leads me to
believe that this is
failing because in ANTLR, syntactic predicates are used to order a  
rule's alternatives. They are not like gated semantic predicates  
which can be used to turn an entire rule off depending on context. It  
seems that syntactic predicates only make sense when a rule has more  
than one alternative, and so they can't be used for the purposes of  
writing PEG-style "and" predicates.

I could change my rule to this:

   FOO : 'bar' ' '* ('\n' | '\r' | EOF) ;

But that isn't what I want; I want a 'bar' token and I don't want the  
trailing whitespace and newline to be included in the token.

So it seems the only workaround is to do a (very ugly) manual  
lookahead inside a gated semantic predicate:

   FOO : { foo_helper(ctx) }?=> 'bar' ;

The helper method is needed because you can only put a single  
expression inside a semantic predicate and I need to loop... It would  
go something like this (in C):

   ANTLR3_BOOLEAN foo_helper(pMyLexer ctx)
   {
       ANTLR3_UCHAR c;
       int i = 4;

       // check for presence of 'bar'
       if (LA(1) == 'b' && LA(2) == 'a' && LA(3) == 'r')
       {
          // skip over any spaces
          while (c = LA(i++), c == ' ');

          // check for newline or EOF
          if (c == '\n' || c == '\r' || c == ANTLR3_CHARSTREAM_EOF)
             return ANTLR3_TRUE;
       }
       return ANTLR3_FALSE;
   }

That was typed directly (untested) into the mail program so it might  
need some tweaks to get it to work, but you get the idea...
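
To check the lookahead logic outside of ANTLR, here is a minimal standalone sketch of the same idea. It is not the ANTLR C runtime: la_at() is a hypothetical stand-in for LA() that reads from an ordinary NUL-terminated buffer and returns -1 past the end (playing the role of ANTLR3_CHARSTREAM_EOF), and bar_ends_line() is likewise a name made up for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for ANTLR's LA(): the character i positions ahead
   of pos (1-based, so i == 1 is the character at pos), or -1 once we run
   past the end of the buffer. */
static int la_at(const char *buf, size_t len, size_t pos, size_t i)
{
    size_t idx = pos + i - 1;
    return idx < len ? (unsigned char)buf[idx] : -1;
}

/* True if "bar" starts at pos and only spaces separate it from a newline
   or the end of input. Nothing is consumed; this is pure lookahead. */
static bool bar_ends_line(const char *buf, size_t pos)
{
    size_t len = strlen(buf);
    size_t i = 4;   /* first lookahead position after 'b', 'a', 'r' */
    int c;

    /* check for presence of 'bar' */
    if (la_at(buf, len, pos, 1) != 'b' ||
        la_at(buf, len, pos, 2) != 'a' ||
        la_at(buf, len, pos, 3) != 'r')
        return false;

    /* skip over any spaces, stopping on the first non-space */
    do {
        c = la_at(buf, len, pos, i++);
    } while (c == ' ');

    /* check for newline or EOF */
    return c == '\n' || c == '\r' || c == -1;
}
```

With this, bar_ends_line("bar   \n", 0) is true and bar_ends_line("bar baz", 0) is false, which is the behaviour I want from the gated predicate.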

I guess the other alternative is to use a validating semantic  
predicate after 'bar' is matched:

   FOO: 'bar' { last_thing_on_line(ctx) }? ;

But that once again requires an ugly helper method, although not  
quite as ugly (again, untested but you get the idea):

   ANTLR3_BOOLEAN last_thing_on_line(pMyLexer ctx)
   {
       ANTLR3_UCHAR c;
       int i = 1;

       // 'bar' has already been matched, so LA(1) is the character after it
       // skip over any spaces
       while (c = LA(i++), c == ' ');

       // check for newline or EOF
       if (c == '\n' || c == '\r' || c == ANTLR3_CHARSTREAM_EOF)
          return ANTLR3_TRUE;

       return ANTLR3_FALSE;
   }

So my questions are really threefold:

1. Is there a better way of doing this than using an ugly helper method?

2. Could ANTLR be changed so that an explicitly-specified syntactic  
predicate is executed in rules for which there is only one alternative?

3. Better still, could ANTLR be extended to accept a PEG-style "&"  
notation for situations like this?

The third option would allow you to write rules like:

   FOO : 'bar' &(' '* ('\n' | '\r' | EOF));

If an "&" notation were to be added then it would be good to add a  
"!" notation for completeness as well. I know that you can use "~" in  
ANTLR but I don't think it's exactly equivalent to the PEG "not".
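
To make the semantics I have in mind concrete, here is a rough sketch in plain C of how "&" and "!" behave in a PEG: both test an arbitrary sub-expression at the current position and neither consumes any input. As far as I understand it, ANTLR's "~" instead matches (and consumes) a single character outside a set, which is why I doubt it is equivalent. All of the names here (matcher_fn, match_line_tail, and_pred, not_pred, match_foo) are made up for illustration and have nothing to do with ANTLR's generated code; the terminating NUL stands in for EOF.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* A matcher reports how many characters it would consume starting at s,
   or -1 if it does not match there. */
typedef int (*matcher_fn)(const char *s);

/* ' '* ('\n' | '\r' | EOF), with the terminating NUL standing in for EOF. */
static int match_line_tail(const char *s)
{
    int n = 0;
    while (s[n] == ' ')
        n++;
    if (s[n] == '\0')
        return n;               /* EOF: nothing more to consume */
    if (s[n] == '\n' || s[n] == '\r')
        return n + 1;
    return -1;
}

/* PEG "&e": succeeds if e matches at s, but consumes nothing. */
static bool and_pred(matcher_fn e, const char *s)
{
    return e(s) >= 0;
}

/* PEG "!e": succeeds if e does NOT match at s; also consumes nothing. */
static bool not_pred(matcher_fn e, const char *s)
{
    return e(s) < 0;
}

/* FOO : 'bar' &(' '* ('\n' | '\r' | EOF)) ;
   Returns the token length (3 whenever the rule applies), or -1. */
static int match_foo(const char *s)
{
    if (strncmp(s, "bar", 3) != 0)
        return -1;
    return and_pred(match_line_tail, s + 3) ? 3 : -1;
}
```

The point is that match_foo("bar  \n") yields a token of length 3: the trailing spaces and newline are examined by the predicate but are not part of the token.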

Cheers,
Wincent