[antlr-interest] proposal for 2.7.4, Unicode, and more...
Mark Lentczner
markl at glyphic.com
Sun May 2 21:54:11 PDT 2004
Here is my take on Unicode and Antlr. I realize that parts of this
have already be stated by other people in this list. I thought it
would be good to pull together all those ideas and present an approach
as a cohesive, if way too long, proposal.
0) Philosophy
-------------
There are two clear separations that should guide this design: First,
character set and character encoding are distinct concepts that must be
cleanly handled throughout. Second, the semantics of Antlr shouldn't
depend on the implementation of Antlr. This is especially true since
Antlr is partially re-implemented for different target languages (Java,
C++, C# etc...)
1) Structure
------------
I think a good case can be made for considering all parsing activity in
Antlr to be in Unicode. The a lexer parses streams of characters into
tokens. The grammar is described in terms of characters, not encoded
bytes. (C++ is still C++ even if encoded in EBCDIC). Since Unicode
encompasses virtually all known characters, defining the characters
that Antlr lexers read as Unicode covers all bases. (See notes below
on binary.)
Handling different character encodings can be left completely to the
input stream class. If a grammar is to only be applied to US-ASCII or
ISO-8860-3 characters, than the input stream can be limited to that,
and map them into Unicode presented to the generated lexer - there is
no need to make that distinction in the lexer grammar file. On the
other hand, by specifying the grammar over Unicode, then by simply
changing the input stream, one can lex the same grammar over US-ASCII,
ISO-8860-3, UTF-8, or Shift-JIS, etc.
2) Antlr Features
-----------------
The only semantic aspect of Antlr that actually depends on
charVocabulary is the concept of compliment (element and set). What
started this thread was Terrance's observation that it is a constant
source of pitfalls: Currently inversion means "of all the characters
used in the grammar, not these". Which means that if my grammar only
mentions 'A'..'Z', and '0'..'9', then "~('0'..'9')" only means
'A'..'Z'. What most people expect is that "~('0'..'9')" should mean
ANY character in the input stream except '0'..'9'. Rather than fix
this by changing the default charVocabulary, a better approach is to
just to directly change the meaning of compliment to mean what people
expect it to mean. (See notes below on set inversion).
Once complement is defined this way, then the charVocabulary option can
be removed.
A large range of Unicode based built in character classes has been
suggested to be added. I see nothing wrong with the proposed syntaxes,
but I question the utility of all the proposed options. I have yet to
see a grammar that has a need to exclude particular Unicode blocks, for
example. On the other hand, some of the Unicode character properties
are good candidates for inclusion. I think restraint should reign
here, and Antlr should only implement at first what people will
actually use.
3) Implementation
-----------------
Since Unicode is no longer limited to 16 bits (and hasn't been for
quite some time), internally, Antlr should avoid the whole morass of
surrogate pairs, and simply do all character operations with integers.
Furthermore, this is exactly what Java 1.5 is going to do, and it is
really the only viable option in C++ (wchar being what it is).
In either Java, C# or C++, as implemented on most modern processors,
there will be no performance difference manipulating 32 signed integers
vs. 8 unsigned chars in a lexer where they are dealt with one at a
time. Even the string operations wouldn't be seriously affected since
most literals in a lexer tend to be short words and will be about as
efficient as small integer array compares. This also allows all of
Antlr's internal state values (EOF, etc.) to be disjoint from all
characters (by using negative values)
The only major stumbling block to Antlr's use of Unicode internally are
its bit sets and the need for compliment. In the generated code, the
use of bit sets is very regular, and a slightly more powerful
representation could easily support Unicode with complemented sets
without them always being O(2^20) bits in size. Antlr's use of bit
sets during the analysis and generation, however, might need some more
sophisticated bit set class to handle things without simply resorting
to huge bit maps. I'd be happy to lend some coding effort to make this
work.
When Antlr is used to parse binary formats, there is no real harm in
the internal Unicode interpretation. The input source would only
happen to supply characters less than 256. That set complements would
include characters beyond 8 btis wouldn't matter: They'd never be
presented by the input souce. The only slight trick would be in proper
handling of 0, which isn't a valid Unicode character. But I don't
think this would pose much of a problem.
- Mark
Mark Lentczner
markl at wheatfarm.org
http://www.wheatfarm.org/
Yahoo! Groups Links
<*> To visit your group on the web, go to:
http://groups.yahoo.com/group/antlr-interest/
<*> To unsubscribe from this group, send an email to:
antlr-interest-unsubscribe at yahoogroups.com
<*> Your use of Yahoo! Groups is subject to:
http://docs.yahoo.com/info/terms/
More information about the antlr-interest
mailing list