[antlr-interest] [C target] ANTLR 3.1 issues with token offsets and generated AST return types
Jim Idle
jimi at temporal-wave.com
Wed Aug 20 08:21:45 PDT 2008
On Wed, 2008-08-20 at 11:13 +0200, Sven Van Echelpoel wrote:
> Hi,
>
> I started out last week with ANTLR 3.1b2 to generate a parser with the C
> target. All went very well and I must say that I was very impressed with
> it. But then I wanted to get a hold of the token offsets (start and
> stop). For that I need the functions getStartIndex() and getStopIndex()
> of ANTLR3_COMMON_TOKEN_struct, right?
It depends where you are asking from. Also, if your input is really
UTF16 and not UCS2, are you using the built-in conversions? The supplied
input streams handle Latin-1 (well, actually anything 8-bit) and UCS2, and
they don't handle surrogates. If you want UTF32 (this is what is handled
internally) or UTF16 then you need to roll your own.
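
A quick way to tell whether a buffer is safe to treat as UCS2 is to scan
it for surrogate code units. The following is only a minimal sketch in
plain C and does not use the ANTLR runtime; the function name
has_surrogates is hypothetical, and it assumes the buffer has already
been read into host byte order as 16-bit units.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of the ANTLR C runtime.
 * Returns 1 if the buffer contains any UTF-16 surrogate code unit
 * (0xD800..0xDFFF), which means it is not plain UCS2; returns 0 otherwise.
 * ASSUMPTION: units are already in host byte order.
 */
int has_surrogates(const uint16_t *units, size_t count)
{
    size_t i;

    assert(units != NULL || count == 0);

    for (i = 0; i < count; i++) {
        if (units[i] >= 0xD800 && units[i] <= 0xDFFF) {
            return 1;
        }
    }
    return 0;
}
```

If this returns 0 for the whole input, every character occupies exactly
one 16-bit unit, so a UCS2 stream should see the same characters that a
UTF16 decoder would.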
>
> When I parse my input, the token offsets are all rubbish.
You realize that the indexes are in characters and not bytes, right? What
does your driver program look like?
> Although in my
> grammar I'm using rewrite rules to generate the AST, I can reproduce it
> with a small grammar as well. Here's what I tried:
>
> grammar MyGrammar;
>
> options {
> /* Generate C code */
> language = C ;
> /* Build an AST */
> output=AST ;
> }
>
> translation_unit
> : NUMBER+
> ;
>
> fragment
> DIGIT_CHAR
> : '0'..'9'
> ;
>
> fragment
> DIGIT_CHAR_WITHOUT_ZERO
> : '1'..'9'
> ;
> fragment
> WHITESPACE_CHAR
> : ' ' |'\n' |'\r' | '\t'
> ;
>
> NUMBER
> : ( '0' | DIGIT_CHAR_WITHOUT_ZERO ) DIGIT_CHAR*
> ;
>
>
> WHITESPACE
> : WHITESPACE_CHAR {$channel = HIDDEN;}
> ;
>
> For an input "12376 87562356" (utf-16), the parse succeeds, but the
> start and stop index of the tokens associated with the AST nodes are way
> off the mark. Here's what I print out for each tree node (ts is token
> start, te is token end):
>
> ts: 6649008 te: 6649017
> ts: 6649020 te: 6649035
>
> Slightly bigger than the string I sent in. :-)
I need to see how you are asking for them. There have not been any
reports about this going wrong and I know that at least some people are
using UCS2. I intended to provide a universal input stream for 3.1 but
ran out of time. I will supply it later.
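
As a guess rather than a diagnosis: the four printed values are at least
consistent with byte addresses inside a 2-byte-per-character buffer,
where the stop value is the last byte of the token. The sketch below is
plain C with no ANTLR calls and only checks that arithmetic. The names
char_range and to_char_range are hypothetical, and it assumes the input
buffer begins at the first printed start value (6649008).

```c
#include <assert.h>
#include <stddef.h>

/* Inclusive range of character positions covered by one token. */
typedef struct {
    size_t first;
    size_t last;
} char_range;

/*
 * Hypothetical helper, not part of the ANTLR C runtime.
 * ASSUMPTIONS: raw_start and raw_stop are byte addresses inside the
 * input buffer, raw_stop is the address of the token's last byte, base
 * is the address of the first byte of the input, and every character
 * is exactly bytes_per_char wide.
 */
char_range to_char_range(unsigned long raw_start, unsigned long raw_stop,
                         unsigned long base, unsigned long bytes_per_char)
{
    char_range r;

    assert(bytes_per_char > 0);
    assert(raw_start >= base && raw_stop >= raw_start);

    /* Integer division maps any byte of a character to that character. */
    r.first = (size_t)((raw_start - base) / bytes_per_char);
    r.last  = (size_t)((raw_stop - base) / bytes_per_char);
    return r;
}
```

Under those assumptions, to_char_range(6649008, 6649017, 6649008, 2)
gives characters 0 to 4 ("12376") and to_char_range(6649020, 6649035,
6649008, 2) gives characters 6 to 13 ("87562356"), which lines up with
the input "12376 87562356".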
>
> Naturally I was working with 3.1b2 and not the official release, so when
> I saw that 3.1 was released I went ahead and tried that one. This was
> even worse!
Are you sure you don't need a few more exclamation marks to get across
your disdain with more emphasis?
> 3.1 with the C target does not even generate the type of the
> AST in the return structs of the rules.
I think that you meant to say: "I looked in the past posts of the last
two weeks and saw that this question was answered 6 times already and
that when producing trees you need the following option":
options
{
ASTLabelType=pANTLR3_BASE_TREE;
}
> Clearly we are missing something important here. Or maybe I am missing
> something obvious. I used the C-runtime from the ANTLR source
> distribution and tried it also with the stand-alone C lib distro. I'm
> building on Ubuntu 7.1 with gcc 3.4 (64-bit).
Try the released runtime, correct your grammar, and make sure that you
are using the UCS2 input stream. Use the built-in references for $pos
and so on in the lexer and see what you get.
Jim