[antlr-interest] PEG-style "and" predicates
Wincent Colaiuta
win at wincent.com
Mon Jul 2 08:14:36 PDT 2007
In my lexer I am trying to use PEG-style "and" predicates ("match X
if it is followed by Y") so that I can take a rule like this:
FOO : 'bar' ;
And make it so that it will "match 'bar' only if it is the last non-
whitespace thing on the line".
So, based on the 11 May 2006 notes on this page:
<http://www.antlr.org/blog/antlr3/lookahead.tml>
I tried making a rule like this:
FOO : ('bar' ' '* ('\n' | '\r' | EOF))=> 'bar' ;
But it doesn't work. Looking at the generated code, I can see that
the syntactic predicate is broken; on lookahead of 'b', the lexer
fires off a syntactic predicate which tries to match only 'bar' (and
not the whitespace etc in the predicate). As a result, this rule will
match for any occurrence of 'bar', not just for 'bar' when it is the
last non-whitespace thing on the line. This is happening in a
filtering lexer, but a quick test suggests that exactly the same
thing happens in non-filtering lexers as well.
My understanding of predicates (notes at <http://wincent.com/
knowledge-base/ANTLR_predicates>) leads me to believe that this is
failing because in ANTLR, syntactic predicates are used to order a
rule's alternatives. They are not like gated semantic predicates
which can be used to turn an entire rule off depending on context. It
seems that syntactic predicates only make sense when a rule has more
than one alternative, and so they can't be used for the purposes of
writing PEG-style "and" predicates.
I could change my rule to this:
FOO : 'bar' ' '* ('\n' | '\r' | EOF) ;
But that isn't what I want; I want a 'bar' token and I don't want the
trailing whitespace and newline to be included in the token.
So it seems the only workaround is to do a (very ugly) manual
lookahead inside a gated semantic predicate:
FOO : { foo_helper(ctx) }?=> 'bar' ;
The helper method is needed because you can only put a single
expression inside a semantic predicate and I need to loop... It would
go something like this (in C):
ANTLR3_BOOLEAN foo_helper(pMyLexer ctx)
{
ANTLR3_UCHAR c;
int i = 4;
// check for presence of 'bar'
if (LA(1) == 'b' && LA(2) == 'a' && LA(3) == 'r')
{
// skip over any spaces
while (c = LA(i++), c == ' ');
// check for newline or EOF
if (c == '\n' || c == '\r' || c == ANTLR3_CHARSTREAM_EOF)
return ANTLR3_TRUE;
}
return ANTLR3_FALSE;
}
That was typed directly (untested) into the mail program so it might
need some tweaks to get it to work, but you get the idea...
I guess the other alternative is to use a validating semantic
predicate after 'bar' is matched:
FOO: 'bar' { last_thing_on_line(ctx) }? ;
But that once again requires an ugly helper method, although not
quite as ugly (again, untested but you get the idea):
ANTLR3_BOOLEAN foo_helper(pMyLexer ctx)
{
ANTLR3_UCHAR c;
int i = 1;
// skip over any spaces
while (c = LA(i++), c == ' ');
// check for newline or EOF
if (c == '\n' || c == '\r' || c == ANTLR3_CHARSTREAM_EOF)
return ANTLR3_TRUE;
return ANTLR3_FALSE;
}
So my questions are really threefold:
1. Is there a better way of doing this than using an ugly helper method?
2. Could ANTLR be changed so that an explicitly-specified syntactic
predicate is executed in rules for which there is only one alternative?
3. Better still, could ANTLR be extended to accept a PEG-style "&"
notation for situations like this?
The latter would allow you to write rules like:
FOO : 'bar' &(' '* ('\n' | '\r' | EOF));
If an "&" notation were to be added then it would be good to add a
"!" notation for completeness as well. I know that you can use "~" in
ANTLR but I don't think it's exactly equivalent to the PEG "not".
Cheers,
Wincent
More information about the antlr-interest
mailing list