[antlr-interest] Re: Is there an ANTLR trick/hack to specify "NEWLINE or EOF" in Lexer
micheal_jor <open.zone at virgin.net>
Wed Feb 5 10:10:45 PST 2003
> > Incidentally, what's your opinion of point (2) below. You know, about
> > ANTLR supporting a "virtual EOF char" that Lexers can match in rules.
> >
> > NEWLINE
> > : '\n'
> > | '\r' ('\n')?
> > | EOF // or $EOF or $eof
> > ;
>
> Hmm....yeah, I'm not sure. What character would it be? We already use
> (char)-1 in Java, which I think is wrong since 0xFFFF is a valid char
> in some script. Any unicode geniuses out there?
It depends on the datatype used for storing the current, buffered, and
lookahead (LA) "characters" within CharScanner/InputBuffer/CharBuffer et al.
Currently it's an 'int'. An int is 32 bits and a char is 16 bits [at least
in Java and C# ;-)], so a -1 held in an int isn't actually 0xFFFF. It is
0xFFFFFFFF.
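
To make the difference concrete, here is a minimal Java sketch. The class name EofSentinelDemo and the helper isEof are hypothetical and not part of ANTLR; the only assumption carried over from the text above is that lookahead values are stored as int.

```java
// Hypothetical demo class; not ANTLR code.
final class EofSentinelDemo {
    // EOF sentinel stored in an int, as the lookahead values are described above.
    static final int EOF_INT = -1;

    // True only for the int sentinel. No 16-bit char widens to -1,
    // because char is unsigned and widens to a value in 0..65535.
    static boolean isEof(int la) {
        return la == EOF_INT;
    }

    public static void main(String[] args) {
        char narrowed = (char) -1;   // narrowing keeps the low 16 bits: 0xFFFF
        int widened = narrowed;      // widens to 65535, not back to -1

        System.out.println(Integer.toHexString(EOF_INT));  // ffffffff
        System.out.println(Integer.toHexString(widened));  // ffff
        System.out.println(isEof(widened));                // false
        System.out.println(isEof(EOF_INT));                // true
    }
}
```

The sentinel only collides with the real character 0xFFFF if it is narrowed to char somewhere along the way. As long as it stays in an int, the two remain distinct.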
This allows a few "virtual char" tricks with only a mild cost in
space (bounded by the size of the buffer/lookahead, not the whole
input stream). Most 32-bit+ CPUs are equally efficient at 16- and
32-bit operations.
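
As a rough illustration of such a trick outside ANTLR, the sketch below hand-codes the "NEWLINE or EOF" rule from the quoted grammar against java.io.PushbackReader, whose read() also returns an int with -1 at end of stream. The class NewlineOrEofDemo and its methods are hypothetical names; this is not how ANTLR generates or would generate the rule.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical hand-written matcher; not generated by ANTLR.
final class NewlineOrEofDemo {
    // Reader.read() reports end of stream as the int -1.
    static final int EOF = -1;

    // Consumes "\n", "\r", "\r\n", or end of input and returns true.
    // Otherwise pushes the character back and returns false.
    static boolean matchNewlineOrEof(PushbackReader in) throws IOException {
        int c = in.read();
        if (c == EOF || c == '\n') {
            return true;
        }
        if (c == '\r') {
            int next = in.read();
            if (next != '\n' && next != EOF) {
                in.unread(next);  // lone '\r': leave the following char alone
            }
            return true;
        }
        in.unread(c);
        return false;
    }

    // Convenience wrapper for trying the matcher at the start of a string.
    static boolean matchesAtStart(String text) {
        try (PushbackReader in = new PushbackReader(new StringReader(text))) {
            return matchNewlineOrEof(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because EOF arrives as an ordinary int value in the same variable as real characters, it can be tested in the same branch as '\n', which is the effect the grammar alternative `| EOF` is after.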
When[1] we decide to support UTF-32, we could move our internal
representation to 64 bits. The code will then run slower on 32-bit
CPUs, but we will all be running
Hammers/Itaniums/ExtremelyUltraSupraSparcs by then. In any case,
UTF-32 isn't important for a few weeks yet... ;-)
Cheers,
Micheal
[1] 32-bit Unicode support. Is that a tad presumptuous... ;-)