[antlr-interest] Re: Is there an ANTLR trick/hack to specify "NEWLINE or EOF" in Lexer

Tom Moog tmoog at polhode.com
Tue Feb 11 19:12:26 PST 2003


Unicode assigns each character a number (its code
point).  A code point has multiple representations
(encodings): utf-8, utf-16, ucs-2, etc.
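
To make the code point vs. encoding distinction concrete, here is a
small Python sketch. The code point 0x20AC (EURO SIGN) and the names
`code_point`, `ch` and `encodings` are picked only for illustration;
the codec names are the ones Python ships with.

```python
# One code point, several encodings (byte sequences shown as hex).
code_point = 0x20AC          # EURO SIGN, an arbitrary example
ch = chr(code_point)

encodings = {
    name: ch.encode(name).hex()
    for name in ("utf-8", "utf-16-be", "utf-32-be")
}

for name, hexbytes in encodings.items():
    print(f"{name:10s} {hexbytes}")
```

The number stays the same; only the bytes used to store it differ.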

With utf-8:

1 byte sequence = 0xxx xxxx
2 byte sequence = 110x xxxx 10xx xxxx
3 byte sequence = 1110 xxxx 10xx xxxx 10xx xxxx
4 byte sequence = 1111 0xxx 10xx xxxx 10xx xxxx 10xx xxxx
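
As a rough sketch of how those bit patterns are filled in, the Python
below encodes a single code point by hand using the standard utf-8
layout (lead bytes 0xxxxxxx, 110xxxxx, 1110xxxx, 11110xxx, with every
continuation byte 10xxxxxx). The function name `utf8_encode` is made up
for this example; in real code you would just call `str.encode`.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point by hand, following the utf-8 bit layout."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp < 0x80:                       # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# Compare the hand-rolled encoder with Python's built-in codec.
for cp in (0x41, 0xB7, 0x20AC, 0x1F600):
    print(f"U+{cp:04X} -> {utf8_encode(cp).hex()}")
```

Note that 0xB7 (binary 10110111) comes out as the two bytes c2 b7,
i.e. 11000010 10110111.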

UCS-2 is just the first 2**16 Unicode characters, except
for some holes: 0xfeff is reserved for the byte order
mark, and 0xfffe is left unassigned so that a byte-swapped
mark can be recognized.
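
For what it's worth, the byte order mark is 0xfeff; 0xfffe is what a
reader sees when the two bytes arrive swapped. A minimal Python sketch
of using it to guess the byte order of 16-bit data follows. The helper
name `utf16_byte_order` is hypothetical; `codecs.BOM_UTF16_BE` and
`codecs.BOM_UTF16_LE` are standard-library constants.

```python
import codecs


def utf16_byte_order(data: bytes) -> str:
    """Guess the byte order of 16-bit data from a leading U+FEFF, if any."""
    if data.startswith(codecs.BOM_UTF16_BE):    # bytes FE FF
        return "big-endian"
    if data.startswith(codecs.BOM_UTF16_LE):    # bytes FF FE
        return "little-endian"
    return "unknown (no byte order mark)"


# The same one-character text, stored in both byte orders with a mark.
big = codecs.BOM_UTF16_BE + "A".encode("utf-16-be")
little = codecs.BOM_UTF16_LE + "A".encode("utf-16-le")

print(big.hex(), utf16_byte_order(big))
print(little.hex(), utf16_byte_order(little))
```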

UTF-16 is similar to ucs-2, but has something called
surrogate pairs for representing Unicode values >=
2**16.  With surrogate pairs, the upper 6 bits of two
adjacent words contain special codes (110110 in the
first word, 110111 in the second) and the low
order 10 bits of each word are combined to create
a 20-bit value, which is then added to 2**16.  The
word values with those upper 6 bits (0xd800 through
0xdfff) would otherwise be ambiguous, but the matching
Unicode values are reserved for just this purpose, so
they never stand for characters themselves.
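
The surrogate pair arithmetic can be sketched in Python as below. The
function names `to_surrogates` and `from_surrogates` are made up for
this example; the constants (0xD800, 0xDC00, the 0x10000 offset) are
the standard utf-16 ones.

```python
def to_surrogates(cp: int) -> tuple:
    """Split a code point >= 2**16 into a (high, low) pair of 16-bit words."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError(f"no surrogate pair needed or possible: {cp:#x}")
    v = cp - 0x10000                    # 20-bit value
    high = 0xD800 | (v >> 10)           # 110110xx xxxxxxxx (top 10 bits)
    low = 0xDC00 | (v & 0x3FF)          # 110111xx xxxxxxxx (low 10 bits)
    return high, low


def from_surrogates(high: int, low: int) -> int:
    """Combine a (high, low) pair of 16-bit words back into a code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + (((high & 0x3FF) << 10) | (low & 0x3FF))


high, low = to_surrogates(0x1F600)
print(f"{high:04X} {low:04X}")          # the two 16-bit words
print(f"U+{from_surrogates(high, low):X}")
```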




On Tue, 11 Feb 2003, Terence Parr wrote:

>
> On Tuesday, February 11, 2003, at 06:29 AM, Anthony W. Youngman wrote:
>
> > IIRC (don't quote me on this) only the first byte in a Unicode char is
> > allowed to start with a binary 1. And this indicates the start of a
> > multi-byte sequence. So yes, you're right in saying 0xFFFF is not an
> > assigned Unicode codepoint. It can't be since the coding definition
> > prevents you from having two consecutive bytes starting with a 1.
>
> I think you are talking about a specific disk encoding called UTF-8
> (unicode to follow).  That means that either it's a single byte (for
> saving space) or it's multi byte and the high bit says to look for
> another...not sure though.
>
> >
> > eg to encode 10110111b, you need to split it into two bytes, the first
> > starts 10 to show it's a two-byte block, and then the character is
> > split 7 bits in the lower byte and 1 in the upper, namely 10000001
> > 00110111. I think I've got it right ...
>
> I think that is right for UTF-8, but the 16 bit memory unicode char is
> what we need to worry about I think.
>
> Ter
> --
> Co-founder, http://www.jguru.com
> Creator, ANTLR Parser Generator: http://www.antlr.org
> Lecturer in Comp. Sci., University of San Francisco
>
>
>
>
> Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/
>
>
>


 



