[antlr-interest] Re: Is there an ANTLR trick/hack to specify "NEWLINE or EOF" in Lexer
Terence Parr
parrt at jguru.com
Tue Feb 11 11:21:16 PST 2003
On Tuesday, February 11, 2003, at 06:29 AM, Anthony W. Youngman wrote:
> IIRC (don't quote me on this) only the first byte in a Unicode char is
> allowed to start with a binary 1. And this indicates the start of a
> multi-byte sequence. So yes, you're right in saying 0xFFFF is not an
> assigned Unicode codepoint. It can't be since the coding definition
> prevents you from having two consecutive bytes starting with a 1.
I think you are talking about a specific on-disk encoding called UTF-8
(Unicode Transformation Format, 8-bit). In that encoding a character is
either a single byte (to save space) or several bytes, with the high bit
saying to look for another... not sure though.
>
> e.g. to encode 10110111b, you need to split it into two bytes, the first
> starts 10 to show it's a two-byte block, and then the character is
> split 7 bits in the lower byte and 1 in the upper, namely 10000001
> 00110111. I think I've got it right ...
I think that is right for UTF-8, but the 16-bit in-memory Unicode char
is what we need to worry about.
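As a rough check on the bit layout discussed above, here is a minimal
Python sketch. Python is used only for illustration, and the helper names
utf8_two_byte and bits are made up for this example. It assumes standard
UTF-8, where a two-byte sequence has the form 110xxxxx 10xxxxxx, and
compares a hand-rolled encoding of 10110111b with Python's built-in
codecs for UTF-8 and for 16-bit UTF-16. Under that assumption the bytes
come out as 11000010 10110111, which is not the split quoted above, and
the second byte also starts with a binary 1.

```python
def utf8_two_byte(cp):
    """Hand-rolled UTF-8 encoding for code points in the two-byte range."""
    assert 0x80 <= cp <= 0x7FF, "only U+0080..U+07FF use two bytes"
    lead = 0b11000000 | (cp >> 6)          # 110xxxxx: top 5 bits
    cont = 0b10000000 | (cp & 0b00111111)  # 10xxxxxx: low 6 bits
    return bytes([lead, cont])


def bits(data):
    """Render a bytes object as space-separated 8-bit binary strings."""
    return " ".join(format(b, "08b") for b in data)


cp = 0b10110111                      # the value from the quoted example (0xB7)
utf8 = chr(cp).encode("utf-8")       # built-in UTF-8 codec
utf16 = chr(cp).encode("utf-16-be")  # one 16-bit unit, big-endian

print(bits(utf8_two_byte(cp)))  # 11000010 10110111
print(bits(utf8))               # 11000010 10110111
print(bits(utf16))              # 00000000 10110111
```

The 16-bit form keeps the value in a single unit, 00000000 10110111,
which is what a lexer working on 16-bit chars in memory would see.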
Ter
--
Co-founder, http://www.jguru.com
Creator, ANTLR Parser Generator: http://www.antlr.org
Lecturer in Comp. Sci., University of San Francisco