[antlr-interest] [C target] ANTLR 3.1 issues with token offsets and generated AST return types

Wed Aug 20 23:59:18 PDT 2008

On Wed, 2008-08-20 at 08:21 -0700, Jim Idle wrote:
> On Wed, 2008-08-20 at 11:13 +0200, Sven Van Echelpoel wrote: 
> > Hi,
> > 
> > I started out last week with ANTLR 3.1b2 to generate a parser with the C
> > target. All went very well and I must say that I was very impressed with
> > it. But then I wanted to get a hold of the token offsets (start and
> > stop). For that I need the functions getStartIndex() and getStopIndex()
> > of ANTLR3_COMMON_TOKEN_struct, right?
> 
> It depends where you are asking from. Also, if your input is really
> UTF16 and not UCS2, are you using the built-in conversions? The
> supplied input streams handle latin-1 (well, actually anything 8 bit)
> and UCS2 and don't handle surrogates. If you want UTF32 (this is what
> is handled internally) or UTF16 then you need to roll your own. 
> > 
Well, technically our input is UTF-16, but we won't be supporting
languages outside the BMP, so UCS2 would do just fine.
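
A minimal sketch of how that BMP-only assumption could be checked before
the text is handed to a UCS2 input stream. It assumes the text is held as
16-bit code units; is_bmp_only is a hypothetical helper for illustration,
not part of the ANTLR runtime.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper (not an ANTLR function): returns true when no
// 16-bit code unit is a surrogate (0xD800..0xDFFF), meaning every code
// unit is a complete BMP character and UTF-16 and UCS2 coincide.
bool is_bmp_only(const std::vector<std::uint16_t>& text) {
    for (std::uint16_t unit : text) {
        if (unit >= 0xD800 && unit <= 0xDFFF) {
            return false;  // surrogate found: text is not plain UCS2
        }
    }
    return true;
}
```

If the check fails, the input contains characters outside the BMP and a
UCS2 stream would see each surrogate half as a separate character.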

> > When I parse my input, the token offsets are all rubbish. 
> 
> You realize that the indexes are in characters and not bytes, right?

Yes, I do, but even in characters the indexes I get back are way too
big. As stated, for the simple input "12376 87562356", the indexes are in
the millions.
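
For reference, a small sketch of the two conversions that could be
involved here, assuming 16-bit code units. Both helpers are hypothetical
and not part of the ANTLR runtime. The second one rests on an assumption
this thread does not confirm: that a reported position might be an
address inside the input buffer rather than a zero-based index.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: a byte offset into UCS2 text becomes a character
// index when divided by the size of one 16-bit code unit.
std::size_t char_index_from_bytes(std::size_t byte_offset) {
    return byte_offset / sizeof(std::uint16_t);
}

// Hypothetical helper: if a reported position is really an address inside
// the buffer (an assumption), pointer subtraction against the start of
// the buffer recovers the zero-based character index.
std::ptrdiff_t char_index_from_address(const std::uint16_t* base,
                                       std::uintptr_t address) {
    return reinterpret_cast<const std::uint16_t*>(address) - base;
}
```

For the input "12376 87562356", the second number starts at character
index 6, which is byte offset 12 in UCS2.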

> What does your driver program look like?

I just followed the basic example in the docs (and sprinkled some C++
sauce over it):

  Antlr::Pointer<ANTLR3_INPUT_STREAM_struct> input(
      antlr3NewUCS2StringInPlaceStream( text.begin(), text.size(), 0 ) );

  Antlr::Pointer<WarpLexer> lexer( WarpLexerNew( input.get() ) );

  Antlr::Pointer<ANTLR3_COMMON_TOKEN_STREAM_struct> tokens(
      antlr3CommonTokenStreamSourceNew( ANTLR3_SIZE_HINT,
                                        TOKENSOURCE( lexer.get() ) ) );

  Antlr::Pointer<WarpParser> parser = WarpParserNew( tokens.get() );

  WarpParser_translation_unit_return parser_return
      = parser->translation_unit( parser.get() );

[...]
> > 
> > Naturally I was working with 3.1b2 and not the official release, so when
> > I saw that 3.1 was released I went ahead and tried that one. This was
> > even worse! 
> 
> Are you sure you don't need a few more exclamation marks to get across
> your disdain with more emphasis?

My apologies if this was the tone you picked up in my posting. There is
no disdain whatsoever on my part. As I said, I'm very impressed with ANTLR
and I like it a lot.

> > 3.1 with the C target does not even generate the type of the
> > AST in the return structs of the rules. 
> 
> I think that you meant to say: "I looked in the past posts of the last
> two weeks and saw that this question was answered 6 times already and
> that when producing trees you need:
> 
> options
> {
>    ASTLabelType=pANTLR3_BASE_TREE;
> } 

Yes, Gavin Lambert pointed this out to me yesterday on this very list.

> > Clearly we are missing something important here. Or maybe I am missing
> > something obvious. I used the C-runtime from the ANTLR source
> > distribution and tried it also with the stand-alone C lib distro. I'm
> > building on Ubuntu 7.1 with gcc 3.4 (64-bit).
> 
> Try the released runtime, correct your grammar, and make sure that
> you are using the UCS2 input stream. Use the built-in references for
> $pos and so on in the lexer and see what you get.

Well, I would use the released runtime if the code generated from the
grammar compiled (see my reply to Gavin Lambert's post yesterday). I have
no clue what's going on there either.

By the way, what's $pos? I found no mention of it in the book.
> 
Thanks,

Sven