[antlr-interest] should sempred questions be asked on trailing optional tokens?

Thu May 8 09:39:19 PDT 2008

>  >simple_type_specifier
>  >        :   {sa.isUnsigned((TokenStream)input)}?
>  >                'unsigned'
>  >        |   {sa.isUnsignedInt((TokenStream)input)}?
>  >                'unsigned' 'int'
>  >        |   {sa.isSigned((TokenStream)input)}?
>  >                'signed'
>  >        |   {sa.isSignedInt((TokenStream)input)}?
>  >                'signed' 'int'
>  >        |   {sa.isInt((TokenStream)input)}?
>  >                'int'
>  >        ;  
> 
> On Fri, 09 May 2008 01:43:14 +1200
> Gavin Lambert <antlr at mirality.co.nz> wrote:
>
> Just out of curiosity, what are these sempred functions actually 
> examining?

Hello Gavin,

They just simply look at the tokens with a hand coded
finite state machine.  Which is like a hand coded version
of the code that ANTLR normally produces with simpler
grammars.

If the rest of the grammar was simpler, then ANTLR would
have no problems generating a DFA for this entire rule, which
is larger than the fragment above.

> Is it just the one-or-two token lookahead sufficient 
> to disambiguate these alts,

Yes, but it is the rest of the grammar surrounding it
with ambiguities that require lots of dis-ambiguating
semantic predicates to resolve, that causes ANTLR to go
back to k=1 lookahead on this rule, requiring dis-ambiguating
semantic predicates on this rule.

I think ANTLR is awesome that it can handle my grammar.
Its no bother for me to write the simple dis-ambiguating
semantic predicates that are required on this rule.

> or are you looking even further ahead 
> or doing something more esoteric?

No, not on this rule.

There are other rules that scan further ahead, including
scanning past stuff in this rule.

> Because if it's just those, then I'd use a synpred instead -- I 
> think it's cleaner:
> 
> simple_type_specifier
>    : ('unsigned' 'int') => 'unsigned' 'int'
>    | ('unsigned') => 'unsigned'
>    | ('signed' 'int') => 'signed' 'int'
>    | ('signed') => 'signed'
>    | 'int'
>    ;
> 
> Or even:
> 
> simple_type_specifier
>    : (('unsigned') => 'unsigned') 'int'?
>    | (('signed') => 'signed') 'int'?
>    | 'int'
>    ;

Sure if synpreds are cleaner in your grammar, and they probably are
cleaner in lots of grammars, then it sounds like a great idea to use
synpreds.  Using synpreds is probably more appealing in lots of
grammars than hand coding thousands of lines of dis-ambiguating
semantic predicates using little recursive descent compilers,
however for my grammar the latter seemed more appealing to me.

There are reasons I am not using synpreds:

* I need to look up stuff in symbol tables do resolve ambiguities,
hence dis-ambiguating semantic predicates are necessary.

* I would have needed to use nested backtracking (nested synpreds).
At the time I was concerned that nested backtracking may be difficult
to understand and debug.

* I think for my grammar that using dis-ambiguating sempreds results
in it looking cleaner than it would with a combination of
dis-ambiguating sempreds and synpreds.

* I was concerned that I would find it difficult to understand and
debug the ANTLR generated parser code using synpreds.  When I pass
-debug to ANTLR, it encounters an error.  The netbeans debugger can
not open my 2.7MB generated parser Java file.  I thought it would be
easier to debug code that I wrote than to debug ANTLR generated
parser code.

* I think it is easier for me to understand, develop and debug it
where I just use dis-ambiguating semantic predicates for everything.

* The dis-ambiguating semantic predicates are in a functional decomposition
that matches the grammar rule, and they call each other.  So while I
was developing this, my logic was: why introduce a synpred when I can
just continue writing more dis-ambiguating semantic predicates, including
little recursive descent compilers.  After writing thousands of lines of
dis-ambiguating sempreds, I did not need any synpreds.

Earlier when I writing lots of sempreds, and avoiding synpreds,
I was concerned that nested synpreds may be slow.  However I
do not know if they would be any slower than what I am doing.
There is no way I am going to write it again using synpreds
to find out.

> Although normally you shouldn't even need to use any predicates 
> here at all; just order the alts correctly.

Thanks, I tried commenting the sempreds and re-ordering the alts
starting with the longest ones first, however this gave the
same number of unmatched alts as the original ad hoc order with
the sempreds commented.

> I know you said they 
> were required for some reason in your larger grammar, but the only 
> explanation for that I can imagine at the moment is if 'int' were 
> a valid variable name (which seems like a silly thing to 
> allow).  Maybe there's another case I haven't thought of?

If I just comment the one isInt() sempred, ANTLR
still compiles it OK, so it does not need that one.

If I comment the others though, then ANTLR tells me it is ambiguous, with
errors like:

java -Xmx512m -classpath /h/goanna/2/eng/dev/tntdbo/java_src:/h/goanna/2/eng/dev/tntdbo:/h/goanna/2/ts/antlr/antlr-2008-05-07.18/lib/antlr-2008-05-07.18.jar:/h/goanna/2/ts/antlr/antlr-2008-05-07.18/lib/runtime-2008-05-07.18.jar:/h/goanna/2/ts/antlr/antlr-2008-05-07.18/lib/stringtemplate-3.1b1.jar:/h/goanna/2/ts/antlr/antlr-2008-05-07.18/lib/antlr-2.7.7.jar org.antlr.Tool -Xconversiontimeout 1200000 -report Tntdbo.g
ANTLR Parser Generator  Version 3.1b1 (??)  1989-2007
warning(200): Tntdbo.g:866:9: Decision can match input such as "UNSIGNED" using multiple alternatives: 4, 9, 10, 14, 15, 16, 17, 18, 19
As a result, alternative(s) 9,10,14,15,16,17,18,19 were disabled for that input
Semantic predicates were present but were hidden by actions.
error(201): Tntdbo.g:866:9: The following alternatives can never be matched: 9,10,14,15,16,17,18,19

So of course I put the sempreds back in and its fine.

Thanks, Mark