[antlr-interest] Fwd: beginner question - 'unexpected ast node' when generating from combined grammar

Sun Feb 13 14:33:58 PST 2011

Hi Kevin,

Thanks very much for your email - I did find it very helpful.

> I find that when I can't figure out why ANTLR is doing what it is doing
> that a gander at the generated source code of the parser sheds a lot of
> light on the subject.  Then all I have to do is figure out how to change
> my parser definitions to get ANTLR to do what *I* want it to do.

The problem I posted below was preventing ANTLR from generating any
code, which made it particularly hard for me to understand what was
going on.

Given the example I posted earlier:

  anyCmd  :    cmdSp | cmdIf;
  nonCmd  :    ~(anyCmd);

would you happen to know what could cause these errors to be generated?

  Test.g:0:0: syntax error: buildnfa: <AST>:19:16: unexpected AST node: anyCmd
  Test.g:19:14: set complement is empty
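
One possible reading of these two messages, offered as an assumption
rather than a confirmed diagnosis: in ANTLR 3 the ~ operator complements
a set of tokens (or characters, inside a lexer rule), so a reference to
a parser rule such as anyCmd leaves it with nothing to complement. The
Python sketch below is not ANTLR, and the token names in it are
hypothetical; it only illustrates "complement is defined over token
types, not over rules".

```python
# Hypothetical token vocabulary; the names are illustrative only and are
# not taken from the grammar in this thread.
TOKEN_TYPES = frozenset({"LCURLY", "RCURLY", "IDENT", "WS", "TEXT"})


def complement(excluded):
    """Return every known token type that is not in `excluded`."""
    excluded = set(excluded)
    unknown = excluded - TOKEN_TYPES
    if unknown:
        # A rule name such as "anyCmd" is not a token type, so there is
        # no set to subtract it from.
        raise ValueError(f"not token types: {sorted(unknown)}")
    return TOKEN_TYPES - excluded
```

Under these assumptions, complement({"LCURLY"}) gives the other four
token types, while complement({"anyCmd"}) is rejected.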

> Think about what TOKENS your parser needs to see, and how they can be
> generated, then think about in what order those tokens need to be
> recognized by the parser, and how to do it.

This comment was pretty helpful - I changed my grammar to use tokens a
lot more, e.g. from:

   cmdIf : '{if' WS* '}' content ('{elseif}' content)*
           ('{else}' content)? WS* '{/if}';

to

   cmdIf: IF_OPEN WS* CURLY_CLOSE content (ELSEIF content)*
          (ELSE content)? WS* IF_CLOSE;

This allows the grammar to compile, and it can now recognize some of
the basic content I pass it - thanks.
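
A minimal sketch of the token-first idea, written in Python rather than
ANTLR. It assumes the token names above map onto the literal spellings
from the earlier version of the rule; the TEXT catch-all and the
tokenize helper are hypothetical additions for illustration.

```python
import re

# Longer literals come first so that '{elseif}' is tried before '{else}'.
TOKEN_SPEC = [
    ("ELSEIF", r"\{elseif\}"),
    ("ELSE", r"\{else\}"),
    ("IF_CLOSE", r"\{/if\}"),
    ("IF_OPEN", r"\{if"),
    ("CURLY_CLOSE", r"\}"),
    ("WS", r"[ \t\r\n]"),
    ("TEXT", r"[^{} \t\r\n]+"),  # hypothetical catch-all for plain content
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(text):
    """Split `text` into (token_type, lexeme) pairs."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens
```

With the input "{if }a{else}b{/if}" this produces IF_OPEN, WS,
CURLY_CLOSE, TEXT, ELSE, TEXT, IF_CLOSE, which is the sequence a rule
shaped like cmdIf would then consume.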

I also started using barriers so that things like IDENT are only
matched when within a command.
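
A rough Python sketch of that gating idea, assuming "barrier" here
amounts to a flag that is switched on when a command opens and off when
it closes (in ANTLR 3 one common way to express such a gate is a gated
semantic predicate, though that is not necessarily what the grammar
uses). The token names LCURLY, RCURLY, CHAR and TEXT are hypothetical.

```python
import re

IDENT = re.compile(r"[A-Za-z_][A-Za-z_0-9]*")


def tokenize(text):
    """Emit IDENT tokens only between '{' and '}'; everything else is TEXT."""
    tokens = []
    in_command = False  # the "barrier": True only while inside { ... }
    pos = 0
    while pos < len(text):
        ch = text[pos]
        if ch == "{":
            in_command = True
            tokens.append(("LCURLY", ch))
            pos += 1
        elif ch == "}":
            in_command = False
            tokens.append(("RCURLY", ch))
            pos += 1
        elif in_command:
            match = IDENT.match(text, pos)
            if match:
                tokens.append(("IDENT", match.group()))
                pos = match.end()
            elif ch in " \t\r\n":
                pos += 1  # whitespace inside a command is skipped
            else:
                tokens.append(("CHAR", ch))
                pos += 1
        else:
            # Outside a command: take raw text up to the next '{'.
            end = text.find("{", pos)
            if end == -1:
                end = len(text)
            tokens.append(("TEXT", text[pos:end]))
            pos = end
    return tokens
```

For "hello {print name} world" the words "hello" and "world" come out as
TEXT, and only "print" and "name" are reported as IDENT.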

> Also, DOTTED_IDENT looks like it should be a parser rule rather than a
> lexer rule....
>
> dottedIdent: IDENT ( '.' IDENT )* ;

That makes sense - I was getting problems where something that matched
both IDENT and DOTTED_IDENT came up as an IDENT, and so caused the
parsing to fail.
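
A small Python sketch of the parser-rule version, assuming the lexer
only ever hands over IDENT and a separate dot token (called DOT here,
a hypothetical name). The function mirrors
dottedIdent: IDENT ( '.' IDENT )* and works on (type, text) pairs.

```python
def parse_dotted_ident(tokens, pos=0):
    """Match IDENT ('.' IDENT)* starting at `pos`.

    Returns (parts, next_pos), where parts is the list of identifier
    strings and next_pos is the index of the first unconsumed token.
    """

    def expect(kind, index):
        if index >= len(tokens) or tokens[index][0] != kind:
            found = tokens[index][0] if index < len(tokens) else "EOF"
            raise SyntaxError(f"expected {kind} at token {index}, found {found}")
        return tokens[index][1]

    parts = [expect("IDENT", pos)]
    pos += 1
    # Each further component needs a DOT immediately followed by an IDENT.
    while pos < len(tokens) and tokens[pos][0] == "DOT":
        parts.append(expect("IDENT", pos + 1))
        pos += 2
    return parts, pos
```

Because a lone identifier and a dotted one are both built from IDENT
tokens, the lexer no longer has to choose between two overlapping token
types; the choice moves into the parser.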

> So, input like:
> ...
> would be considered illegal since there is WS between the { and the name?

I actually took this out to make the grammar easier to read for the
purpose of posting to this list! My original did have this (thanks for
pointing it out though!)

Thanks,
Nick

On 11 February 2011 21:50, Kevin J. Cummings
<cummings at kjchome.homeip.net> wrote:
> On 02/11/2011 02:42 PM, Nick C wrote:
>> Hi,
>>
>> I'm trying to learn ANTLR by writing a parser for a simple HTML
>> templating language (in combination with reading the Definitive ANTLR
>> Reference book - I'm only just past the calculator example so far
>> though.)
>>
>> The parser should handle something like this:
>
> "something like this" is not a very accurate description of what you
> intend to recognize.
>
>>   {namespace My.Namespace}
>>   {template MyTemplate}
>>       hello
>>       {if $name}
>>           {print $name}
>>       {else}
>>           world
>>       {/if}
>>       <br/>
>>    {/template}
>
> So, input like:
>
>        { namespace My.namespace }
>        { template MyTemplate }
>                hello
>                { if $name }
>                        { print $name }
>                { else }
>                        world
>                { /if }
>                <br/>
>        { /template }
>
> would be considered illegal since there is WS between the { and the name?
>
> Unless your language specification specifically forbids it, I would
> separate the '{' out as its own token and change the parser so that WS
> is not significant, except where it is needed to separate similar
> tokens.  You then have to deal with reserved words or keywords such as
> "namespace", "template", "if", and "else", and make sure they don't
> conflict with IDENT.
>
> WS:  ( ' ' | '\t' | '\r' | '\n' )
>     { skip(); }
>  ;
>
> ns: '{' 'namespace' dottedIdent '}' ;
>
> etc...
>
> Now you have the problem of ambiguity among all the rules that start
> with '{', which can be solved by increasing k to 2 (or more), or by
> combining them into a single parser rule like:
>
> xml: '{' ( ns | template | if | else | ... ) ;
>
> Also, DOTTED_IDENT looks like it should be a parser rule rather than a
> lexer rule....
>
> dottedIdent: IDENT ( '.' IDENT )* ;
>
> Think about what TOKENS your parser needs to see, and how they can be
> generated, then think about in what order those tokens need to be
> recognized by the parser, and how to do it.
>
> If you try to do too much at a low level, you will run into problems
> (like you have been).
>
>> My current attempt is to first build a simple version of the parser
>> without any actions, just to get it to parse valid input correctly:
>>
>>     grammar Test;
>>
>>     options { language = 'CSharp2'; }
>>
>>     doc    :     ns
>>                  WS*
>>                  (template)+
>>                  WS* ;
>>
>>     ns      :    '{namespace' WS+ DOTTED_IDENT WS* '}';
>>
>>     template:    '{template' WS+ IDENT WS* '}' content '{/template}';
>>
>>     cmdSp   :    '{sp' WS* '/}';
>>
>>     cmdIf   :    '{if' WS* '}' content ('{elseif}' content)* ('{else}' content)? WS* '{/if}' ;
>>
>>     anyCmd  :    cmdSp | cmdIf;
>>     nonCmd  :    ~(anyCmd); /* ~('{')*;*/
>>     content :    (anyCmd | nonCmd)*;
>>
>>     WS      :    ' '|'\t'|'\r'|'\n';
>>     IDENT   :    ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'_'|'0'..'9')*;
>>     DOTTED_IDENT
>>             :    IDENT ((WS)* '.' (WS)* IDENT)*;
>>
>>
>> I was hoping to parse the first example minus the {$name} print
>> statement and the conditional in the {if} statement (i.e. the $name in
>> the {if}.)
>>
>> When I try to generate the parser with the above definition, I get the
>> following warnings/errors:
>>
>>     Test.g:0:0: syntax error: buildnfa: <AST>:19:16: unexpected AST node: anyCmd
>>     Test.g:19:14: set complement is empty
>>
>> I'm guessing my use of ~(anyCmd) is incorrect, but I don't understand why.
>>
>> If I try replacing that with ~('{')* as per the comment above, I get
>> these errors:
>>
>>     Test.g:19:18: Decision can match input such as "'{else}'" using multiple alternatives: 1, 2
>>     As a result, alternative(s) 2 were disabled for that input
>>     ... more errors like this ...
>>     Test.g:19:18: The following alternatives can never be matched: 2
>>
>> I thought this was specifying 'any character apart from {', so I don't
>> understand how '{else}' could be a match (or why there are multiple
>> alternatives - I thought I only specified one, unless * counts as
>> many?)
>
> I find that when I can't figure out why ANTLR is doing what it is doing
> that a gander at the generated source code of the parser sheds a lot of
> light on the subject.  Then all I have to do is figure out how to change
> my parser definitions to get ANTLR to do what *I* want it to do.
>
> Sometimes that involves the use of syntactic or semantic predicates when
> things get complicated.
>
>> A brief explanation and/or a pointer to the section of the book I
>> should be reading would be most welcome.
>
> This is not necessarily covered by a specific part of the book.  It is
> usually covered by a very detailed specification of what you are trying
> to parse.  Once you figure out what your language's delimiters and
> separators are, the proper token definitions should just fall out.
>
>> Thanks,
>> Nick
>
> Just my $0.02.  I hope you find some of it helpful.
>
> --
> Kevin J. Cummings
> kjchome at verizon.net
> cummings at kjchome.homeip.net
> cummings at kjc386.framingham.ma.us
> Registered Linux User #1232 (http://counter.li.org)
>