[antlr-interest] Re: Is there an ANTLR trick/hack to specify "NEWLINE or EOF" in Lexer

micheal_jor <open.zone at virgin.net> open.zone at virgin.net
Wed Feb 5 10:10:45 PST 2003


> > Incidentally, what's your opinion of point (2) below. You know, about
> > ANTLR supporting a "virtual EOF char" that Lexers can match in rules.
> >
> > NEWLINE
> > :  '\n'
> > |  '\r' ('\n')?
> > |  EOF                 // or $EOF or $eof
> > ;
> 
> Hmm....yeah, I'm not sure.  What character would it be?  We already use
> (char)-1 in Java, which I think is wrong since 0xFFFF is a valid char
> in some script.  Any unicode geniuses out there?

It depends on the datatype used for storing the current, buffered and 
LA "characters" within CharScanner/InputBuffer/CharBuffer et al.

Currently it's an 'int'. An int is 32 bits and a char is 16 bits [at
least for Java and C# it is ;-)]. So -1 stored in an int isn't actually
0xFFFF. It is 0xFFFFFFFF.
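
To make the distinction concrete, here is a minimal Java sketch. The
names EofSentinelDemo, EOF_CHAR and isEof are made up for illustration
and are not ANTLR's; the only assumption carried over from above is
that lookahead values are held in an int. It shows that a real 0xFFFF
char widened to int stays distinct from -1, and that the collision only
appears once the value is narrowed back to a char.

```java
public class EofSentinelDemo {
    // Hypothetical sentinel name; mirrors the "-1 held in an int" idea above.
    static final int EOF_CHAR = -1;

    // True only for the sentinel. A 16-bit char widened to int is always
    // in the range 0..0xFFFF, so it can never equal -1.
    static boolean isEof(int la) {
        return la == EOF_CHAR;
    }

    public static void main(String[] args) {
        char realChar = '\uFFFF';   // a legitimate 16-bit value
        int widened = realChar;     // widening zero-extends to 0x0000FFFF

        System.out.println(Integer.toHexString(widened));   // ffff
        System.out.println(Integer.toHexString(EOF_CHAR));  // ffffffff
        System.out.println(isEof(widened));                 // false
        // Narrowing -1 to char throws away the upper 16 bits, so the
        // sentinel and the real character become indistinguishable.
        System.out.println((char) -1 == realChar);          // true
    }
}
```

So the ambiguity only bites if the value passes through a char somewhere
along the way.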

This allows a few "virtual char" tricks with only a mild cost in
space (bounded by the size of the buffer/lookahead, not the whole
input stream). Most 32-bit+ CPUs are equally efficient at 16-bit and
32-bit operations.
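
As a rough illustration of one such trick, the hand-written Java sketch
below treats end of input as just another alternative of the NEWLINE
rule quoted above. It is not ANTLR-generated code and does not use any
ANTLR classes; NewlineOrEofDemo, matchNewlineOrEof and endsLine are
hypothetical names. It relies only on java.io.Reader.read() returning
-1 as an int at end of stream.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;

public class NewlineOrEofDemo {
    // Reader.read() returns -1 (as an int) at end of stream.
    static final int EOF_CHAR = -1;

    // Consumes '\n', '\r', "\r\n" or end of input at the current position.
    // Returns false and leaves the input untouched if none of them match.
    static boolean matchNewlineOrEof(PushbackReader in) throws IOException {
        int la = in.read();
        if (la == EOF_CHAR) return true;   // the "virtual EOF char" alternative
        if (la == '\n') return true;
        if (la == '\r') {
            int next = in.read();          // optional '\n' after '\r'
            if (next != '\n' && next != EOF_CHAR) in.unread(next);
            return true;
        }
        in.unread(la);                     // not a line end: put it back
        return false;
    }

    // Convenience wrapper: does the text start with NEWLINE or EOF?
    static boolean endsLine(String text) {
        try (PushbackReader in = new PushbackReader(new StringReader(text))) {
            return matchNewlineOrEof(in);
        } catch (IOException e) {
            throw new IllegalStateException(e); // not expected for a StringReader
        }
    }

    public static void main(String[] args) {
        System.out.println(endsLine(""));        // true (EOF)
        System.out.println(endsLine("\r\nabc")); // true
        System.out.println(endsLine("abc"));     // false
    }
}
```

Because the lookahead is an int, the EOF test is an ordinary comparison
and needs no reserved value inside the 16-bit char range.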

When[1] we decide to support UTF-32, we could move our internal
representation to 64 bits. The code will then run slower on 32-bit
CPUs, but we will all be running
Hammers/Itaniums/ExtremelyUltraSupraSparcs by then. In any case,
UTF-32 isn't important for a few weeks yet...  ;-)

Cheers,

Micheal

[1]  32-bit unicode support. Is that a tad presumptuous... ;-)



 
