[antlr-interest] [C target] ANTLR 3.1 issues with token offsets and generated AST return types

Wed Aug 20 08:21:45 PDT 2008

On Wed, 2008-08-20 at 11:13 +0200, Sven Van Echelpoel wrote:

> Hi,
> 
> I started out last week with ANTLR 3.1b2 to generate a parser with the C
> target. All went very well and I must say that I was very impressed with
> it. But then I wanted to get a hold of the token offsets (start and
> stop). For that I need the functions getStartIndex() and getStopIndex()
> of ANTLR3_COMMON_TOKEN_struct, right?

It depends where you are asking from. Also, if your input is really
UTF16 and not UCS2, are you using the built-in conversions? The supplied
input streams handle Latin-1 (well, actually anything 8-bit) and UCS2 and
don't handle surrogates. If you want UTF32 (this is what is handled
internally) or UTF16 then you need to roll your own.
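
As a rough illustration of the UTF16 versus UCS2 distinction above: a
buffer of 16-bit code units can be read as UCS2 only if it contains no
surrogate code units (0xD800 to 0xDFFF). The sketch below checks for
that. The function name contains_surrogates is hypothetical and is not
part of the ANTLR C runtime; it assumes the input is already available
as an array of 16-bit code units in host byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not an ANTLR runtime function.
 * Returns 1 if any code unit in the buffer is a UTF-16 surrogate
 * (0xD800..0xDFFF), meaning the text cannot be treated as plain UCS2.
 * Returns 0 if every code unit stands for one character on its own.
 */
static int contains_surrogates(const uint16_t *units, size_t count)
{
    size_t i;

    /* A NULL buffer is only acceptable when it is empty. */
    assert(units != NULL || count == 0);

    for (i = 0; i < count; i++) {
        if (units[i] >= 0xD800 && units[i] <= 0xDFFF) {
            return 1;
        }
    }
    return 0;
}
```

For a digits-only input such as the one discussed in this thread the
check returns 0, so a UCS2 stream would be sufficient for that input.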

> 
> When I parse my input, the token offsets are all rubbish. 

You realize that the indexes are in characters and not bytes, right? What
does your driver program look like?

> Although in my
> grammar I'm using rewrite rules to generate the AST, I can reproduce it
> with a small grammar as well. Here's what I tried:
> 
> grammar MyGrammar;
> 
> options {
>   /* Generate C code */
>   language = C ;
>   /* Build an AST */
>   output=AST ;
> }
> 
> translation_unit
>   : NUMBER+
>   ;
> 
> fragment
> DIGIT_CHAR
>   : '0'..'9'
>   ;
>   
> fragment
> DIGIT_CHAR_WITHOUT_ZERO
>   : '1'..'9'
>   ;
> fragment
> WHITESPACE_CHAR
>   : ' ' |'\n' |'\r' | '\t'
>   ;
>   
> NUMBER
>   : ( '0' | DIGIT_CHAR_WITHOUT_ZERO ) DIGIT_CHAR*
>   ;
> 
> 
> WHITESPACE
>   : WHITESPACE_CHAR {$channel = HIDDEN;} 
>   ;
> 
> For an input "12376 87562356" (utf-16), the parse succeeds, but the
> start and stop index of the tokens associated with the AST nodes are way
> off the mark. Here's what I print out for each tree node (ts is token
> start, te is token end):
> 
> ts: 6649008 te: 6649017
> ts: 6649020 te: 6649035
> 
> Slightly bigger than the string I sent in. :-)

I need to see how you are asking for them. There have not been any
reports about this going wrong and I know that at least some people are
using UCS2. I intended to provide a universal input stream for 3.1 but
ran out of time. I will supply it later.
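
One consistency check that can be made on the reported numbers, without
claiming anything definite about what the runtime returns: for the
input "12376 87562356" the two NUMBER tokens cover characters 0..4 and
6..13. With 2-byte code units those are bytes 0..9 and 12..27. The
reported values differ from the first ts (6649008) by 0, 9, 12 and 27,
which is what one would see if the values were byte addresses into the
input buffer. The sketch below does that arithmetic. Both helper names
are hypothetical, and treating 6649008 as the buffer base is an
assumption made only for this check.

```c
#include <assert.h>
#include <stddef.h>

/* Inclusive span, either in characters or in bytes. */
typedef struct {
    size_t start;
    size_t stop;
} span_t;

/*
 * Hypothetical helper: converts an inclusive character span into an
 * inclusive byte span for fixed-width code units of unit_size bytes.
 */
static span_t char_span_to_byte_span(span_t chars, size_t unit_size)
{
    span_t bytes;

    assert(unit_size > 0 && chars.stop >= chars.start);

    bytes.start = chars.start * unit_size;
    /* The last byte of the last character, hence the "+ 1 ... - 1". */
    bytes.stop = (chars.stop + 1) * unit_size - 1;
    return bytes;
}

/*
 * Hypothetical helper: maps a raw value back to a character index,
 * ASSUMING the raw value is a byte address and base is the address of
 * the first byte of the input buffer. Integer division means a value
 * pointing at any byte of a character yields that character's index.
 */
static size_t raw_to_char_index(size_t raw, size_t base, size_t unit_size)
{
    assert(unit_size > 0 && raw >= base);
    return (raw - base) / unit_size;
}
```

Under those assumptions, the four reported values map back to character
indexes 0, 4, 6 and 13, which are the expected start and stop positions
of the two tokens.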

> 
> Naturally I was working with 3.1b2 and not the official release, so when
> I saw that 3.1 was released I went ahead and tried that one. This was
> even worse! 

Are you sure you don't need a few more exclamation marks to get across
your disdain with more emphasis?

> 3.1 with the C target does not even generate the type of the
> AST in the return structs of the rules. 

I think that you meant to say: "I looked in the past posts of the last
two weeks and saw that this question was answered 6 times already and
that when producing trees you need the following option":

options
{
   ASTLabelType=pANTLR3_BASE_TREE;
}

> Clearly we are missing something important here. Or maybe I am missing
> something obvious. I used the C-runtime from the ANTLR source
> distribution and tried it also with the stand-alone C lib distro. I'm
> building on Ubuntu 7.1 with gcc 3.4 (64-bit).

Try the released runtime, correct your grammar, and make sure that you
are using the UCS2 input stream. Use the built-in references for $pos
and so on in the lexer and see what you get.

Jim