[antlr-interest] Problem and commentary on semantic predicates and prediction

Thu Mar 31 08:15:08 PDT 2011

I'm trying to use semantic predicates in a lexer. I'll try to explain
everything as I walk through the things I tried. I started with this:

TEXT
    :   (   ~('}'|'<'|'\n'|'\r'|'\\'|'"'|'>')
        |   '\\' .
        |   {AnonymousTemplateLevel==0}? '}'
        |   {Outermost!=String}? '"'
        |   {Outermost!=BigString || input.LA(2)!='>'}? '>'
        )+
    ;

The TEXT token continues to consume characters until the end of the TEXT
section is complete. Depending on context (stored as fields in the lexer),
the text section could end with '}', '"', or '>>'. When I used the above
form of the rule, the prediction section of mTokens() never evaluated my
predicates, so when the next character of the input stream was any of '}',
'"', or '>', it would predict the TEXT rule. Here's a shortened form of the
prediction. For simplicity, I'll use (x in SetA) as syntax for x in
~('}'|'<'|'\n'|'\r'|'"'|'>'), which is a character from the first alt of
TEXT or the '\' character of the second alt. I'll use SetB for
~('<'|'\n'|'\r'), which is any possible character in TEXT without
considering the predicates.

if ((LA5_0 in SetB)) {
	alt5 = 5; // TEXT
}

I realized shortly after that no other rule in the lexer could start with
these characters, so by process of elimination I supposed this kind of
prediction is a reasonable result. I tried to correct the situation by
adding the following rule after TEXT:

ANYCHAR : . ;

This changed the prediction in mTokens, but certainly didn't fix it.

if ((LA5_0=='\"')) {
	int LA5_6 = input.LA(2);

	if (LA5_6 in SetB) {
		alt5 = 5;
	} else if (((Outermost!=String))) {
		alt5 = 5;
	} else if ((true)) {
		alt5 = 6;
	}
}

Next, looking at the ACyclicDFACodeGenerator in the tool, I noticed that
predicates are only emitted in the prediction phase if they are gated
(assigned to the "predicates" attribute of the "dfaEdge" template in
WalkFixedDFAGeneratingStateMachine). So I added gates to my predicates. In
this case, the prediction is correct, but the Boolean expression is
expressed as ((A && B) || B), which reduces to simply (B). In this case, the
expense of computing B (a semantic predicate) has no bearing on the result
because there is no way to avoid evaluating it.

if ((LA5_0=='\"')) {
	int LA5_6 = input.LA(2);

	if ((LA5_6 in SetB) && ((Outermost!=String))) {
		alt5 = 5;
	} else if (((Outermost!=String))) {
		alt5 = 5;
	} else if ((true)) {
		alt5 = 6;
	}
}

I originally didn't see the correctness of this test, so I went on to try
removing the ANYCHAR rule while keeping the gates on the predicates. The
results are interesting, so I'll note them here. The result in mTokens
looked like this:

... } else if (LA5_0 in SetA) {
	alt5 = 5;
} else if ((LA5_0=='}') && ((AnonymousTemplateLevel==0))) {
	alt5 = 5;
} else if ((LA5_0=='\"') && ((Outermost!=String))) {
	alt5 = 5;
} else if ((LA5_0=='>') && ((Outermost!=BigString || input.LA(2)!='>'))) {
	alt5 = 5;
} else ...

In this case I observed the following:

1. The LL(1) prediction that now appears is faster.
2. This lexer is used as a syntax highlighting component, where removing the
ANYCHAR rule can lead to substantial performance degradation. However, since
the Tokens rule is implicitly created by the grammar compiler, I have no way
to specify that the decision on TEXT can/should be treated as LL(1).
3. The check (LA5_0 in SetA) is overly restrictive because an conditions
check for (LA_5 in ('<'|'\n'|'\r')). Relaxing the test reduces the number of
comparisons in the false case (say (LA5_0=='{')) from 14 to 8.
4. The test (LA5_0 in SetB) expands to the following. Edit here: I meant to
show this for SetA, but since I'll accidentally used SetB, we'll go with
that.

((LA5_5>='\u0000' && LA5_5<='\t')||(LA5_5>='\u000B' &&
LA5_5<='\f')||(LA5_5>='\u000E' && LA5_5<=';')||(LA5_5>='=' &&
LA5_5<='\uFFFF'))

A simple transform (moving a few parenthesis) of this already ordered
expression gives:

((LA5_5 >= '\u0000' && (LA5_5 <= '\t' || (LA5_5 >= '\u000B' && (LA5_5 <=
'\f' || (LA5_5 >= '\u000E' && (LA5_5 <= ';' || (LA5_5 >= '=' && LA5_5 <=
'\uFFFF'))))))))

Which reduces the number of comparisons from 5 to 3 for eliminating '\n', 6
to 5 for eliminating '\r', and remains at 7 for eliminating '}'. For sets
with a very large number of ranges such as the ID start character in the
Java grammar, the time savings for this change could be tremendous
considering nearly all identifiers start with a character in the range
0..127 which is tested in the early portion of the expression.

Thanks,
Sam