[antlr-interest] Predicate hoisting pain

Mon Apr 13 08:59:23 PDT 2009

Sam Barnett-Cormack wrote:
> Jim Idle wrote:
>> Sam Barnett-Cormack wrote:
>>> Hi all,
>>>
>>> So, in my grammar I have need to re-use rules so they ultimately refer 
>>> to a different rule (so I don't have to duplicate 
>>> intersection/union/exception rules). I use a parameter and gated 
>>> predicates, like so:
>>>
>>> elements[boolean os]
>>>    : {!$os}?=>subtypeElements
>>>    | {$os}?=>objectSetElements
>>>    | LPAREN! elementSetSpec[$os] RPAREN!
>>>    ;
>>>
>>> This is ultimately referred to from two places. The first, which 
>>> generates code that's just fine, is:
>>>
>>> elementSetSpecs
>>>    : rootElementSetSpec[false]
>>>      (COMMA EXTMARK (COMMA additionalElementSetSpec[false])?)?
>>>    -> ^(ELEMENTSET rootElementSetSpec EXTMARK? additionalElementSetSpec?)
>>>    ;
>>>
>>> However, in the *slightly* more complex case:
>>>
>>> objectSetSpec
>>>    : rootElementSetSpec[true]
>>>      (COMMA EXTMARK additionalElementSetSpec[true]?)?
>>>    | EXTMARK (COMMA additionalElementSetSpec[true])?
>>>    ;
>>>
>>> The predicates get hoisted in the generated code, and then there are 
>>> compile errors with undefined variable 'os'.
>>>
>>> I'm not sure why it happens in one case and not the other, and I'm even 
>>> less clear on how to fix it. Can anyone help?
>>>
>>>   
>> This is basically an FAQ, and you answer your own question as to why: 
>> the parameter is local to the rule, but the predicate must be (or can 
>> be) hoisted out of that rule for some decisions, where the parameter 
>> is not in scope.
>>
>> The solution is relatively simple, but it probably isn't the correct 
>> solution as your need for this indicates that you are probably going 
>> wrong in the way you are constructing the parser. What you should really 
>> do is merge these two possibilities in the parser, then in your tree 
>> walk, if you detect the use of a construct that is not valid for the 
>> context, then you parse it anyway but issue a really good semantic 
>> error along the lines of "Element specs like FOO cannot be used within 
>> specs for BARs". If you do not do this then your users will just get 
>> "Syntax error at FOO!", and unless they are already very knowledgeable 
>> about the language, then they won't really know what this means.
> 
>> However, remember the rules of good construction:
>>
>> 1) Anything that can be moved from a syntax error in the lexer to a 
>> semantic error, or left to the parser, should be;
>> 2) Anything that can be moved from a syntax error in the parser to a 
>> semantic error in the tree walker, should be;
>>
>> In general this means that error messages from your front end will be as 
>> good as they can be:
>>
>> 1) "Unknown character '\u8290'" in the lexer becomes: "Line 20, offset 
>> 42: The character '\u8290' is not a valid character for use in a variable 
>> name!"
>> 2) "No viable alt at 'FOO'" becomes: "Line 42, offset 22: The construct 
>> FOO cannot be used within a BAR, only within a BAZ, try specifying as a 
>> BARRY."
> 
> So I would merge the two in the parser, and then separate them again in 
> the tree parser, and then do the context-sensitive validation there? In 
> this case, a user would be more likely to make a mistake that looks like 
> a mixture of a valueSet and an objectSet, rather than use one in the 
> place of another. They look different in any but the simplest cases 
> (where all the values or objects in a set are references, i.e. names).
> 
> However, something else in the language requires differentiation between 
> valueSets and objectSets to be deferred until semantic-building time 
> (when the type of the LHS of an expression is known), so I guess I'll 
> have to do that. It just sticks in my craw to let the parser allow 
> through something that isn't valid as *either*. However, there's a way 
> around that as well... boolean flags that get set on seeing a value 
> literal or object literal (the things that can't be mixed). Then a mixed 
> case won't get passed. However, I suspect that might be better left to 
> the semantic stage, where each element of the set can be validated, 
> based on the LHS that it goes with.

Which doesn't entirely work unless I extend that concept further than I 
have time to do, as there's too much ambiguity between the two possible 
parse styles... *sigh*

Sam
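
A minimal sketch of the approach discussed above: let the parser accept both kinds of element, then check each element against its context once the left-hand side is known. This is plain Java with no ANTLR runtime involved, and every name in it (SetSpecCheck, Kind, Element, validate) is hypothetical rather than taken from the grammar in this thread. It assumes references are valid in either context and that only literals can be of the wrong kind.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: none of these names come from the grammar in this thread.
final class SetSpecCheck {
    // Stand-ins for the node types a tree walker would see.
    enum Kind { VALUE_LITERAL, OBJECT_LITERAL, REFERENCE }

    // One element of a parsed set, with the position needed for a useful message.
    static final class Element {
        final Kind kind;
        final String text;
        final int line;
        final int offset;

        Element(Kind kind, String text, int line, int offset) {
            this.kind = kind;
            this.text = text;
            this.line = line;
            this.offset = offset;
        }
    }

    // Checks every element against the context, which is assumed to be known
    // only after parsing (expectObjects is true for an object set).
    // References are accepted in both contexts.
    static List<String> validate(List<Element> elements, boolean expectObjects) {
        List<String> errors = new ArrayList<>();
        for (Element e : elements) {
            boolean wrongKind = expectObjects
                    ? e.kind == Kind.VALUE_LITERAL
                    : e.kind == Kind.OBJECT_LITERAL;
            if (wrongKind) {
                errors.add(String.format(
                        "Line %d, offset %d: the %s literal '%s' cannot be used within %s",
                        e.line, e.offset,
                        expectObjects ? "value" : "object",
                        e.text,
                        expectObjects ? "an object set" : "a value set"));
            }
        }
        return errors;
    }

    public static void main(String[] args) {
        // A set that mixes a reference with a value literal, checked as an object set.
        List<Element> mixed = Arrays.asList(
                new Element(Kind.REFERENCE, "foo", 1, 0),
                new Element(Kind.VALUE_LITERAL, "42", 3, 7));
        for (String error : validate(mixed, true)) {
            System.out.println(error);
        }
    }
}
```

Because the check runs per element, a mixed set is reported at the exact element that does not fit, which is the case a parser-level flag would only reject with a generic syntax error.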