[antlr-interest] Antlr3 C runtime woes.

Thu Nov 6 11:55:41 PST 2008

On Thu, 2008-11-06 at 09:48 -0800, Vamsi Juvvi wrote:
> Hi all.
> 
>    I hope this gets scanned by jim or any other C runtime expert and I
> get some advice.

Are you using output=AST ? I would need to see more of your grammar to
see whether any of these are bugs are just a misuse of the grammar to be
honest. I think that you might be trying to do things that would work if
you had output=AST (or the other way around).

> 
>    I had sent out an email containing a whole lot of text (in
> retrospect) and probably created an overload for whoever read it. I am
> simplifying it and adding some other problems I am facing. 
> 
>   I am trying to write a replacement for the MSIDL compiler with a
> custom simplified and customized syntax for my company's needs. I am
> using Antlr 3.1.1 with the C runtime. The problems started when I
> decided to carry some comments over from the IDL file to the generated
> files. Thats when I lost a few days tracking the following issues and
> figuring out workarounds.
> 
>   //-- Problem 1 -----------------
>   In a rule (import_decl say), I want to have a validating predicate
> to ensure that the rule starts on a line all by itself. The antlr book
> gave me the impression that it can expressed thus.
> 
>       import_decl : IMPORT_DECL { $start.pos == 0 } ?

Here you would be trying to reference the first token of the rule
return, which if your outputting the AST will be the tree span.

There are a few things wrong here. But first review what you are trying
to do. It seems that you want whatever IMPORT_DECL is, to start at
column position 0 only. What do you do with this token when it isn't at
column 0? Your construct will probably just throw a confusing error to
the user about what to them looks like valid code. Don't try to encode
errors in the grammar syntax like this, you should be going the other
way and parsing everything, then issuing specific things about errors
that you can check for. Then again, if the parser can accept it even if
it isn't at the start of a line, why make this an error anyway?

If you want to know something about IMPORT_DECL, then reference it
explicitly:

import_decl: i=IMPORT_DECL  { if ($i.whatever ...

off the top of my head the pos attribute isn't the start position of the
token, but the token number anyway (check that though I have not
looked).

If a token takes a different meaning when not at the start of the line,
then perform this token change in the lexer, not the parser. The parser
is really about token sequences, not their position in text lines.

Also, remember that you might not want:

import

to be any different than:

     import

>   There is a workaround though. I can hack it to use the actual
> members in the C runtime. I can replace the predicate with
>    { retval.start->charPosition == 0 }?

It usually means that you are doing something wrong if you start
resorting to this.

>   //-- Problem 2---------------------
>   Similar problems exist when I want to generate an imaginary token in
> my parser initialized with a real token.

You are probably repeating the same mistake to be honest.

> 
>   tokens
>    {
>       INTERFACE_BLOCK; // imaginary token.
>    }
> 
>   // The book indicates that the following should work.
>   interface
>     :^(INTERFACE_BLOCK[$interface.start] interface_decl
> interface_body_line*)

You mean -> ^(  

right?

>  The C code generated however has two issues.
>    It has the same problem as above where it tries to cast the rule
> retvalue to pANTLR3_COMMON_TOKEN instead of casting the retval.start
> item.

This may be a bug in the start template then.

>    It uses ADAPTER->createTypeTokenText or ADAPTER->createTypeText
> instead of ADAPTER->createTypeToken which is what I expect

This stuff can be confusing. One of the issues is that the Java runtime
uses overloaded method calls, which I don't have in C, so the template
has to guess from the parameters it is given. However, I have not known
it to be wrong. Try explicitly constructing what you want to know in the
interface rule and returning it from the rule, this will be much less
confusing in your grammar anyway.

> 
>  I am doing all of this just so I can encode the start or a rule
> ( INTERFACE_BLOCK[$start] is expected to copy the start token's data
> into the INTERFACE_BLOCK token) so I can later start searching for
> comments on a hidden channel.

Why not keep your interface block IMAGINARY and just use:

-> ^(INTERFACE_BLOCK $interface $interface_decl) 

INTERFACE_BLOCK will then contain start and end tokens anyway in your
AST and the start will logically be the interface token. I think that
you are trying to solve problems that don't exist ;-)

> 
>   I worked around this by adding even more suspect C code and ended up
> making the antlr grammar file lose some of it's declarative beauty. In
> my parser I did this to work around.
> 
> interface
> @after

@after isn't really supported in C targets at the moment anyway. There
are sequencing issues.

> {         
>     // Save the line-num/charPos of the first token to match in this
> interface

Yeah - I think that you are doing all this code but the imaginary token
has this anyway. The imaginary token will give you the start and end
tokens that it spans, regardless of how you re-write them. Hence in your
AST, you get the start token, and ask it in turn for the line and column
position, just like in Java. Then, if you don't like it, you can give a
nice semantic error and not have the parser say "NVA at 'interface'" ;-)
It is also a lot more efficient as until you ask for anythign, it is all
just passing pointers around.

>     // rule to the members of our imaginary token.
>     pANTLR3_COMMON_TOKEN pTok = retval.tree->getToken(retval.tree);
>     if(pTok != 0)
>     {
>       pTok->line = retval.start->line;
>       pTok->charPosition = retval.start->charPosition;
>     }    
> }
>     : interface_decl
>         OPEN_BRACE
>             interface_body_line*
>         CLOSE_BRACE SEMI_COLON
>     -> ^(INTERFACE_BLOCK interface_decl interface_body_line*)    
>     ;
>   
>   Is this related to the unsupported tokenReWrite feature documented
> in the C runtime or a bug ?

I think you are just doing things wrong ;-) It isn't token rewrite that
is unsupported, but the tokenStreamRewrite engine.

> 
>   This makes me feel vaguely dirty. I will have an interresting time
> explaining why these workarounds are neccesary. I have hyped antlr to
> the point where any mention of bugs will make my boss cringe. I will
> just gloss over all the C code telling him that its just the way it
> has to be done. 

No, basically, just take it out, and let ANTLR do what it does
naturally.

> 
>   //-- Problem 3 -----
>   Finally armed with these workaround, I expected searching the token
> stream for comments before/after would be straighforward. Frustration
> strikes again. After several days of pulling my hair out, I realize
> that there is yet another bug (which was reported on the antlr mailing
> lists last year) which needs to be fixed for it to work.
> 
>   First thing I did was allow hidden channel tokens to be retained. I
> wasn't aware that this needed to be done till I started tracing
> through the run time (Extra documentation maybe)
> 
>  pTokenStream = antlr3CommonTokenStreamSourceNew(ANTLR3_SIZE_HINT,
> TOKENSOURCE(pLexerCtx));
>  pTokenStream->discardOffChannel     = ANTLR3_FALSE;
> 
>  After this was done, there seems to be a very subtle bug with cache
> coherency (occacional issue when using caching for performance) which
> was reported on the mainling list. This was causing a token that was
> supposed to be hidden to be sent to parser failing the parse. The
> first token in my file was a hidden token and luckily exposed the
> problem right away. The fix (as reported by the original reporter of
> the problem) was to make sure that in the last few lines of the
> *fillBuffer* methods in antlr3tokenstream.c.
> 
> replace 
> 
>      /* Set the consume pointer to the first token that is on our
> channel */
>     tokenStream->p  = 0;
>     tokenStream->p  = skipOffTokenChannels(tokenStream,
> tokenStream->p);
> 
>     /* Cache the size so we don't keep doing indirect method calls*/
>     tokenStream->tstream->istream->cachedSize =
> tokenStream->tokens->count;
> 
>   With
> 
>      /* Cache the size so we don't keep doing indirect method calls
>      * Relocated to above the following two lines following bug report
>      * at http://markmail.org/search/list:antlr?q=token+stream
> +comments#query:list%3Aantlr%20token%20stream%20comments+page:6
> +mid:5olqv5itjde27uhi+state:results
>      */
>     tokenStream->tstream->istream->cachedSize =
> tokenStream->tokens->count;
> 
>     /* Set the consume pointer to the first token that is on our
> channel
>      */
>     tokenStream->p  = 0;
>     tokenStream->p  = skipOffTokenChannels(tokenStream,
> tokenStream->p);
> 
>   //----------------------------------------------------------------------------------------------------------------------------------------------------------------------
> 
>  These are minor problems on the whole, but made the experience of
> developing non trivial Antlr3 stuff surprinsgly tricky and not nearly
> as much fun as it was with Antlr2. 

Well, I think you are doing a bit of running before walking really ;-)

> I understand the fact that this is a volunteer effort and that much
> like in a commercial setting, bugs can slip through in a volunteer
> driven runtime and it is amazing that there are so few bugs.
> 
>   I can volunteer to start adding test cases if someone can point me
> to the resources to do so. 

I have been asking for someone to do this for 3 years. All you need to
do is write them. I want to use something like C-unit, but at the end of
the day, I don't care so long as someone other than me produces them.

> If there is an effort to start with a CPP runtime, I will be more than
> glad to assist in whatever way is needed as reading the current C code
> makes me want to cry! 

Normally, you don't go anywhere near this, but just use your own C++
code. However, I have started the C++ runtime and I will finish it
shortly.

> I feel more manly just for digging through it and getting it to work
> for me :-).

Try writing it :-)

> 
>  I hope someone can tell me where I am doing idiotic things or put
> these bugs in for eventual fixes. I can help with this but I will
> probably have to learn stringtemplate stuff and get an overview
> document or some such. I am willing however.

Mostly you are over complicating things I believe. 

I have some time set aside this week for latest bugs and some C++ work,
so I will check these things, probably tomorrow.

Jim