[antlr-interest] Confusing, hopefully-final, problems
Gavin Lambert
antlr at mirality.co.nz
Sun Mar 1 23:33:24 PST 2009
At 11:21 2/03/2009, Sam Barnett-Cormack wrote:
>NUMBER : ;
Defining non-fragment rules that can successfully match nothing
(i.e. the empty string) is Bad(tm).
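To see why, here is a rough Python model (not ANTLR itself) of a
lexer loop with a rule that may match nothing. The rule names, the
first-match ordering and the max_tokens guard are assumptions made
for the sketch; a real lexer has no such guard.

```python
import re

# Hypothetical token rules; NUMBER is deliberately allowed to match nothing.
RULES = [("NUMBER", re.compile(r"[0-9]*")), ("WORD", re.compile(r"[a-z]+"))]

def tokenize(text, max_tokens=5):
    """Try rules in order at each position; stop after max_tokens as a guard."""
    pos, out = 0, []
    while pos < len(text) and len(out) < max_tokens:
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            if m:
                out.append((name, m.group()))
                pos = m.end()  # an empty match leaves pos unchanged
                break
        else:
            raise ValueError(f"no rule matches at {pos}")
    return out

# On "abc" the empty NUMBER match wins every time and pos never advances.
stuck = tokenize("abc")
```

Without the guard the loop would keep emitting empty NUMBER tokens
and never reach the WORD rule.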
>BSTRING : '\'' BSTRINGCONT '\'B';
[...]
>HSTRING : '\'' HSTRINGCONT '\'H';
These are going to give you grief. The lexer cannot backtrack,
and since HSTRINGCONT/BSTRINGCONT can be arbitrarily long, no
fixed amount of static lookahead is sufficient to disambiguate
these automatically.
What you should do is to define a generic
single-quoted-string-with-optional-trailing-B/H lexer rule, and
then either put some trailing code in the lexer rule to check the
content and change the type (or report an error), or defer that
until parse time.
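As a rough illustration of that match-first, classify-afterwards
idea, here is a Python sketch rather than ANTLR code. The QUOTED
token name, the allowed digit sets and the whitespace handling are
assumptions, not taken from the grammar in question.

```python
import re

# Hypothetical generic rule: quoted body plus an optional B or H suffix.
QUOTED = re.compile(r"'([^']*)'([BH]?)")

def classify(text):
    """Match the generic shape first, then pick the token type from the content."""
    m = QUOTED.fullmatch(text)
    if m is None:
        raise ValueError("not a quoted string")
    body, suffix = m.group(1), m.group(2)
    digits = "".join(body.split())  # assumption: whitespace in the body is ignored
    if suffix == "B":
        if set(digits) <= set("01"):
            return ("BSTRING", digits)
        raise ValueError("invalid binary string")
    if suffix == "H":
        if set(digits) <= set("0123456789ABCDEF"):
            return ("HSTRING", digits)
        raise ValueError("invalid hex string")
    return ("QUOTED", body)  # no suffix: plain quoted string
```

For example, classify("'0101'B") yields a BSTRING, while
classify("'012'B") is rejected only after the whole string has been
matched, so no lookahead decision is needed up front.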
>fragment
>CSTRINGNL : WSNONL* NL WSNONL* {setText("");};
setText has no effect in fragment rules.
>fragment
>XMLATTVAL : XMLDATTVAL | XMLSATTVAL ;
>
>fragment
>XMLATTRIB : XMLNAME '=' XMLATTVAL ;
>
>fragment
>WSBLOCK : WS+;
>
>fragment
>XMLATTRIBS
> : XMLATTRIB
> | (XMLATTRIB WS)=>XMLATTRIB WSBLOCK XMLATTRIBS;
>
>fragment
>XMLTAGATTS
> : WSBLOCK XMLATTRIBS ;
>
>fragment
>XMLOPENTAG : '<' XMLNAME XMLTAGATTS? WS* '>';
>
>fragment
>XMLCLOSETAG : '</' XMLNAME '>';
>
>fragment
>XMLSCLOSETAG : '<' XMLNAME XMLTAGATTS? WS* '/>';
>
>fragment
>XMLNONEMPTYELEMENT : XMLOPENTAG XMLCONTENT XMLCLOSETAG;
>
>fragment
>XMLEMPTYELEMENT : XMLSCLOSETAG;
>
>fragment
>XMLELEMENT options {
> backtrack=true;
>}
> : XMLEMPTYELEMENT | XMLNONEMPTYELEMENT ;
>
>fragment
>XMLCONTENT : (XMLELEMENT | XMLENTREF | ~INVALIDINXML) *;
>
>XMLFRAG : XMLELEMENT;
These really seem like they shouldn't be lexer rules. (Or
possibly that you should go look at the island grammar example.)
>extensionAdditionGroup : '[[' versionNumber componentTypeList ']]' ;
[...]
>tag : '[' encodingReference class classNumber ']' ;
You should be very careful when using quoted literals in parser
rules (in fact if you're not used to their quirks you should
probably avoid using them).
The above will define four new lexer rules, similar to these:
T62: '[[';
T63: ']]';
T64: '[';
T65: ']';
In particular, note that the '[[' and ']]' produce unique tokens,
not two occurrences of the '[' or ']' token. This in turn means
that if you happen to have [[ or ]] in your input where the
grammar is expecting [ [ or ] ], then it will fail. (This,
incidentally, is the same issue behind C++'s need to be careful
with >s when nesting templates.)
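A small Python model of the effect, assuming a lexer that always
takes the longest literal available at each position (a
simplification of how the generated lexer predicts tokens):

```python
LITERALS = ["[[", "]]", "[", "]"]  # the four literals from the rules above

def tokenize(text):
    """Greedy lexer: at each position take the longest literal that matches."""
    pos, out = 0, []
    while pos < len(text):
        if text[pos].isspace():
            pos += 1  # assumption: whitespace only separates tokens
            continue
        match = max((lit for lit in LITERALS if text.startswith(lit, pos)),
                    key=len, default=None)
        if match is None:
            raise ValueError(f"unexpected character at {pos}")
        out.append(match)
        pos += len(match)
    return out

adjacent = tokenize("[[")   # one '[[' token
spaced = tokenize("[ [")    # two '[' tokens
```

A parser rule expecting two '[' tokens sees only the spaced form as
valid; the adjacent form arrives as a single, different token.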
>[18:09:42] warning(200): ASN_1.g:518:15: Decision can match input
>such as "'...'" using multiple alternatives: 1, 2
>As a result, alternative(s) 2 were disabled for that input
>[18:09:42] error(201): ASN_1.g:518:15: The following alternatives
>can never be matched: 2
This one is fairly self-explanatory. The rule in question is
this:
setType : '{' (componentTypeLists | extensionAndException
optionalExtensionMarker)? '}' ;
Now, let's look at the alternatives. The optional group must
match either 'componentTypeLists' or 'extensionAndException
optionalExtensionMarker', and extensionAndException must begin
with a '...' token. Now let's drill into componentTypeLists. One
of its alternatives is ctlExtensionStuff, and that also begins
with an extensionAndException. (It can also be followed by an
optionalExtensionMarker.) So now that's two alts with a common
left prefix, and one must therefore die. The second error is
basically saying that they don't just have a common prefix: one
is actually a subset of the other, so the second alt can never
be chosen.
I'm sure you'll find that the other errors occur for similar
reasons. (It's also interesting to note that most of them occur
in places where you've turned backtracking on. You should usually
try to avoid doing that, in favour of rewriting your grammar to
remove the ambiguities.)
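One way to spot this kind of clash by hand is to compare the tokens
each alternative can start with. The Python sketch below does that
for a heavily simplified, hypothetical version of the setType
alternatives; the rule bodies and the ID token are assumptions, and
it ignores rules that can match empty input.

```python
# Hypothetical, simplified grammar: each rule maps to a list of alternatives.
GRAMMAR = {
    "componentTypeLists": [["componentTypeList"], ["ctlExtensionStuff"]],
    "ctlExtensionStuff": [["extensionAndException", "optionalExtensionMarker"]],
    "extensionAndException": [["..."]],
    "componentTypeList": [["ID"]],
}

def first(symbol):
    """Tokens that can start `symbol`; assumes no rule can match empty input."""
    if symbol not in GRAMMAR:
        return {symbol}  # anything without a rule is treated as a token
    result = set()
    for alt in GRAMMAR[symbol]:
        result |= first(alt[0])
    return result

# The two alternatives inside setType's optional group.
alt1 = ["componentTypeLists"]
alt2 = ["extensionAndException", "optionalExtensionMarker"]
overlap = first(alt1[0]) & first(alt2[0])  # {'...'}
```

A non-empty overlap means the decision cannot be made on the first
token alone. If the first alternative really does cover everything
the second can match, dropping the second may be all that is needed.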