[antlr-interest] 3.0.1 C target woes

Tue Oct 16 10:33:22 PDT 2007

Michelangelo was almost done with the Sistine chapel, and allowed Leonardo
to come in and have a look. After looking around a bit, Leonardo said:
"You've missed a bit!".

Basically, you should use the methods provided to get the text of tokens;
I don't intend that you should need to know what the changelists are in
order to use the runtime. For various reasons that I did not agree with,
the original source used 8-space tabs, and I now want it to use 4-space
tabs. In general I commit these whitespace changes, and spelling
corrections in comments, as separate checkins, but not always, as I correct
some of this as I go along. Ironically, both the change to use absolute
pointers (though it does sound like that isn't quite right for UCS2) and
the corrections to spelling errors in comments were done for you!
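
To illustrate why going through an accessor insulates calling code from this
kind of change, here is a minimal sketch in C. The struct and function names
(index_token, pointer_token, text_span, span_from_index, span_from_pointer)
are hypothetical stand-ins invented for this example and are not the
runtime's actual types. It models the two layouts described in the quoted
message: inclusive character indexes versus inclusive absolute byte pointers.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types for illustration only; NOT the ANTLR3 C runtime API. */

/* Index-style token: start/stop are inclusive character indexes. */
typedef struct {
    size_t start;
    size_t stop;
} index_token;

/* Pointer-style token: start is the address of the first byte,
 * stop is the address of the last byte (inclusive). */
typedef struct {
    const uint8_t *start;
    const uint8_t *stop;
} pointer_token;

/* What a caller actually wants: where the text is and how many bytes. */
typedef struct {
    const uint8_t *bytes;
    size_t len;
} text_span;

/* Index layout: scale by the character width (2 for UCS-2). */
static text_span span_from_index(const uint8_t *data, index_token t,
                                 size_t char_width)
{
    text_span s;
    s.bytes = data + t.start * char_width;
    s.len   = (t.stop - t.start + 1) * char_width;
    return s;
}

/* Pointer layout: no scaling, the addresses are already in bytes. */
static text_span span_from_pointer(pointer_token t)
{
    text_span s;
    s.bytes = t.start;
    s.len   = (size_t)(t.stop + 1 - t.start);
    return s;
}
```

Code that only ever asks for a text_span gets the same answer from either
layout, so a change of internal representation would not reach it.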

I will take your suggestions under advisement, but let's not lose track of
the fact that it is free and unencumbered and a little more complicated than
Git ;-)

Jim

> -----Original Message-----
> From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-
> bounces at antlr.org] On Behalf Of Wincent Colaiuta
> Sent: Tuesday, October 16, 2007 2:56 AM
> To: Antlr Interest
> Subject: [antlr-interest] 3.0.1 C target woes
> 
> Had some problems trying to get my lexer that worked under the 3.0 C
> target runtime to work under the 3.0.1 runtime.
> 
> The problem is solved now, so I'm posting this here for others in
> case they run into similar issues.
> 
> Basically, the lexer was crashing after lines like this:
> 
>    start = (const char *)(stream->data + (token->start * 2));
>    len   = (token->stop - token->start + 1) * 2;
> 
> Here I'm just trying to get a pointer to the start of the token text,
> and its length in bytes. (Note that this is with a UCS-2 stream; the
> multiply-by-two operations are because each UCS-2 character occupies
> 2 bytes.)
> 
> Inspecting the values of the variables revealed that while under 3.0,
> token->start was a character index (the number of characters, not
> bytes, relative to the start of the stream), in 3.0.1 it is an
> absolute pointer.
> 
> Similarly, where token->stop was a character index in 3.0 (the number
> of characters, not bytes, relative to the start of the stream), in
> 3.0.1 it is an absolute pointer as well. Strangely, it is not a
> pointer to the end of the token text, but to the byte immediately
> preceding it. In the case of UCS-2 that means that it's a pointer to
> the second half of a character and isn't valid. Although this is
> correct for ASCII streams, it seems like a bug for UCS-2 streams.
> 
> That is, in 3.0, given a character "a" at address 0x0f00:
> 
> - let's say stream->data is 0x0f00
> - token->start is 0
> - token->stop is 0
> - the token's address is 0x0f00 + 0
> - and its length is (0 - 0 + 1) * 2 (2 bytes)
> 
> But in 3.0.1:
> 
> - let stream->data be 0x0f00
> - token->start is now 0x0f00
> - token->stop is 0x0f01
> - the token's address is 0x0f00
> - and its length is (stop + 1) - start
> 
> So I was able to get my recognizer running by changing:
> 
>    start = (const char *)(stream->data + (token->start * 2));
>    len   = (token->stop - token->start + 1) * 2;
> 
> To:
> 
>    start = (const char *)token->start;
>    len = (token->stop + 1 - token->start);
> 
> Jim, is there anywhere where this kind of API-level change is
> documented in the release notes? It would be nice if this kind of
> information were included with future releases (or if it is already
> included, it would be nice if the info were made more prominent).
> 
> Another thing is that although the behaviour of the API changed, the
> documentation in the header files did not. The start field in
> "antlr3commontoken.h" is still documented as being "The character
> offset in the input stream where the text for this token starts."
> 
> I spent several hours last night trying to find the changeset which
> introduced these changes and I had little success. In the spirit of
> constructive criticism, there are a couple of things you could do to
> make the development history easier to search:
> 
> - in many changesets the commit message describes what sounds like a
> limited fix but the actual diff includes very extensive whitespace
> fixes; this makes it much harder to see the actual substantive change
> underneath all the cosmetic changes. Keeping your whitespace changes
> in separate commits would be a huge help.
> 
> - the same goes for spelling errors in comments; sometimes the number
> of corrections drowns out the changes to the non-comment lines in the
> source files. It would be great if you could keep such corrections in
> separate changesets.
> 
> - often it seems that unrelated topics are bundled together in single
> changesets, making it harder to understand the nature of the changes
> because they're all mixed in together.
> 
> - the commit messages tend to be fairly brief and it can be quite
> hard to figure out the purpose of a given changeset.
> 
> Hope I haven't caused any offense with this feedback; I think I've
> been spoilt lately by observing the Git development history. Check
> out their changelog for a shining example of ultra-clean development
> history:
> 
>    <http://repo.or.cz/w/git.git?a=log>
> 
> Cheers,
> Wincent
> 
