[antlr-interest] Confusing, hopefully-final, problems

Mon Mar 2 01:32:50 PST 2009

Gavin Lambert wrote:
> At 11:21 2/03/2009, Sam Barnett-Cormack wrote:
>  >NUMBER : ;
> 
> Defining non-fragment rules that can succesfully match nothing (ie. 
> empty string) is Bad(tm).
> 
>  >BSTRING : '\'' BSTRINGCONT '\'B';
> [...]
>  >HSTRING : '\'' HSTRINGCONT '\'H';
> 
> These are going to give you grief.  The lexer cannot backtrack, and 
> since HSTRINGCONT/BSTRINCONT can be infinite length, it cannot determine 
> sufficient static lookahead to disambiguate these automatically.

Okay... I'd figured those were okay, as ANTLR didn't complain about them...

> What you should do is to define a generic 
> single-quoted-string-with-optional-trailing-B/H lexer rule, and then 
> either put some trailing code in the lexer rule to check the content and 
> change the type (or report an error), or defer that until parse time.

A rule for both should be pretty easy. Add BSTRING and HSTRING to the 
tokens section, and do

BorHSTRING : '\'' content=HSTRINGCONT '\'' ( 'B' {$type=BSTRINGCONT;} | 
'H' {$type=HSTRINGCONT;} ) {($type == HSTRINGCONT | 
is_valid_binary_string($content))}?

Preferably with better error reporting than that.

>  >fragment
>  >CSTRINGNL : WSNONL* NL WSNONL* {setText("");};
> 
> setText has no effect in fragment rules.

That's one of the bits I know about but was leaving until I sort out the 
parser stuff - it doesn't affect the tokenization, after all, just the 
content of the tokens' text.

>  >fragment
>  >XMLATTVAL : XMLDATTVAL | XMLSATTVAL ;
>  >
>  >fragment
>  >XMLATTRIB : XMLNAME '=' XMLATTVAL ;
>  >
>  >fragment
>  >WSBLOCK : WS+;
>  >
>  >fragment
>  >XMLATTRIBS
>  >      : XMLATTRIB
>  >      | (XMLATTRIB WS)=>XMLATTRIB WSBLOCK XMLATTRIBS;
>  >
>  >fragment
>  >XMLTAGATTS
>  >      : WSBLOCK XMLATTRIBS ;
>  >
>  >fragment
>  >XMLOPENTAG : '<' XMLNAME XMLTAGATTS? WS* '>';
>  >
>  >fragment
>  >XMLCLOSETAG : '</' XMLNAME '>';
>  >
>  >fragment
>  >XMLSCLOSETAG : '<' XMLNAME XMLTAGATTS? WS* '/>';
>  >
>  >fragment
>  >XMLNONEMPTYELEMENT : XMLOPENTAG XMLCONTENT XMLCLOSETAG;
>  >
>  >fragment
>  >XMLEMPTYELEMENT : XMLSCLOSETAG;
>  >
>  >fragment
>  >XMLELEMENT options {
>  >  backtrack=true;
>  >}
>  >      : XMLEMPTYELEMENT | XMLNONEMPTYELEMENT ;
>  >
>  >fragment
>  >XMLCONTENT : (XMLELEMENT | XMLENTREF | ~INVALIDINXML) *;
>  >
>  >XMLFRAG : XMLELEMENT;
> 
> These really seem like they shouldn't be lexer rules.  (Or possibly that 
> you should go look at the island grammar example.)

The intention with these is to make blocks of embedded XML be tokens, to 
be processed separately. The parser code doesn't even support them yet, 
but I want the lexer to handle it for now. In the event of supporting 
embedded XML, the structure isn't really suitable for parsing except at 
around the same stage as the eventual (planned) tree parsing.

>  >extensionAdditionGroup : '[[' versionNumber componentTypeList ']]'
>  >;
> [...]
>  >tag : '[' encodingReference class classNumber ']' ;
> 
> You should be very careful when using quoted literals in parser rules 
> (in fact if you're not used to their quirks you should probably avoid 
> using them).
> 
> The above will define four new lexer rules, similar to these:
> T62: '[[';
> T63: ']]';
> T64: '[';
> T65: ']';
> 
> In particular, note that the '[[' and ']]' produce unique tokens, not 
> two occurrences of the '[' or ']' token.  This in turn means that if you 
> happen to have [[ or ]] in your input where the grammar is expecting [ [ 
> or ] ], then it will fail.  (This, incidentally, is the same issue 
> behind C++s need to be careful with >s when nesting templates.)

The grammar defines '[[' and '[' (and the closing ones) as separate 
lexical elements, so I assume the standard authors knew what they were 
doing there. Oh, and despite appearances, I get what's happening enough 
to know what happens to literals in the parser rules.

>  >[18:09:42] warning(200): ASN_1.g:518:15: Decision can match input
>  >such as "'...'" using multiple alternatives: 1, 2
>  >As a result, alternative(s) 2 were disabled for that input
>  >[18:09:42] error(201): ASN_1.g:518:15: The following alternatives
>  >can never be matched: 2
> 
> This one is fairly self-explanatory.  The rule in question is this:
> setType : '{' (componentTypeLists | extensionAndException 
> optionalExtensionMarker)? '}' ;
> 
> Now, let's look at the alternatives.  extensionAndException must begin 
> with a '...' token.  It must match either 'componentTypeLists' or 
> 'extensionAndException'.  Now let's drill into componentTypeLists.  One 
> of the alternatives is ctlExtensionStuff; and that also begins with an 
> extensionAndException.  (It can also be followed by an 
> optionalExtensionMarker.)  So now that's two alts with a common left 
> prefix -- one must therefore die.  The second error is basically saying 
> that they don't just have a common prefix -- one is actually a subset of 
> the other.

Damn, I could've sworn I checked deeply for that. I can't alter the 
language itself, but it does appear that the extensionAndException 
optionalExtensionMarker choice is actually redundant. I want to do nasty 
things to the people who wrote the grammar in the standard...

> I'm sure you'll find that the other errors occur for similar reasons.  
> (It's also interesting to note that most of them occur in places where 
> you've turned backtracking on.  You should usually try to avoid doing 
> that, in favour of rewriting your grammar to remove the ambiguities.)

I only turned on backtracking in cases where I couldn't see what the 
heck to do to remove ambiguities.

Thanks for the help

Sam