[antlr-interest] Re: Is there an ANTLR trick/hack to specify "NEWLINE or EOF" in Lexer

Terence Parr parrt at jguru.com
Tue Feb 11 11:21:16 PST 2003


On Tuesday, February 11, 2003, at 06:29 AM, Anthony W. Youngman wrote:

> IIRC (don't quote me on this) only the first byte in a Unicode char is 
> allowed to start with a binary 1. And this indicates the start of a 
> multi-byte sequence. So yes, you're right in saying 0xFFFF is not an 
> assigned Unicode codepoint. It can't be since the coding definition 
> prevents you from having two consecutive bytes starting with a 1.

I think you are talking about a specific on-disk encoding called UTF-8 
(Unicode Transformation Format, 8-bit).  That means a character is either 
a single byte (for saving space) or it's multi-byte, and the high bits of 
the first byte say how many more bytes to look for...not sure though.
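
For reference, a minimal sketch of how the high bits of a UTF-8 byte indicate its role. The function name classify_utf8_byte is made up for illustration, and the check is simplified: it looks only at the bit prefix and does not reject lead bytes such as 0xC0 or 0xF5 that strict UTF-8 forbids.

```python
def classify_utf8_byte(b: int) -> str:
    """Classify one byte (0..255) by its UTF-8 bit prefix (hypothetical helper)."""
    if b & 0x80 == 0x00:
        return "ascii"         # 0xxxxxxx: a complete one-byte character
    if b & 0xC0 == 0x80:
        return "continuation"  # 10xxxxxx: follows a lead byte
    if b & 0xE0 == 0xC0:
        return "lead-2"        # 110xxxxx: starts a two-byte sequence
    if b & 0xF0 == 0xE0:
        return "lead-3"        # 1110xxxx: starts a three-byte sequence
    if b & 0xF8 == 0xF0:
        return "lead-4"        # 11110xxx: starts a four-byte sequence
    return "invalid"           # 11111xxx never appears in UTF-8


# 'A' is one byte; U+00B7 takes a lead byte plus one continuation byte.
roles = [classify_utf8_byte(b) for b in "A\u00b7".encode("utf-8")]
print(roles)
```

Note that continuation bytes also have the high bit set, so consecutive bytes starting with a 1 are normal inside a multi-byte sequence.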

>
> eg to encode 10110111b, you need to split it into two bytes, the first 
> starts 10 to show it's a two-byte block, and then the character is 
> split 7 bits in the lower byte and 1 in the upper, namely 10000001 
> 00110111. I think I've got it right ...

I think that is the general idea for UTF-8, but the 16-bit in-memory 
Unicode char is what we need to worry about, I think.
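
A small sketch to make the distinction concrete, using Python's built-in codecs (not anything from ANTLR). It encodes the code point 10110111b (0xB7) as UTF-8 bytes and as a single big-endian 16-bit unit. In standard UTF-8 the two-byte form comes out as 11000010 10110111, which differs from the bit pattern quoted above; the 16-bit value is just the code point itself.

```python
ch = "\u00b7"  # code point 0xB7 == 0b10110111

utf8 = ch.encode("utf-8")       # byte-oriented encoding, as stored on disk
utf16 = ch.encode("utf-16-be")  # one 16-bit unit, big-endian, no byte-order mark

utf8_bits = " ".join(f"{b:08b}" for b in utf8)
utf16_bits = " ".join(f"{b:08b}" for b in utf16)

print(utf8_bits)   # 11000010 10110111
print(utf16_bits)  # 00000000 10110111

# 0xFFFF still fits in one 16-bit unit; as UTF-8 it takes three bytes.
ffff_utf8 = "\uffff".encode("utf-8")
print(ffff_utf8.hex())  # efbfbf
```

So a lexer working on 16-bit chars sees 0x00B7 or 0xFFFF as a single value and never has to think about lead or continuation bytes.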

Ter
--
Co-founder, http://www.jguru.com
Creator, ANTLR Parser Generator: http://www.antlr.org
Lecturer in Comp. Sci., University of San Francisco


 
