[antlr-interest] More about unicode
Terence Parr
parrt at cs.usfca.edu
Sat May 1 15:49:29 PDT 2004
Guys, Chapman Flack (was at Purdue, might still be) gave me copious
notes about the right thing to do. Here is an interesting section from
his notes:
2. Predefined base sets to start from. Unicode provides tables of all
the defined character properties like letter or space separator and in
fact the tables are already built into Java's Character class. Instead
of all those ranges in the SableCC grammar, it should be possible to
say

ID : [?javaIdentifierStart] [?javaIdentifierPart]*

This is trivial to do if the set representation includes not just
ranges or bit sets but nodes that gen code to call the appropriate
Character method at run time. I would suggest adding three constructs
to the lexer spec syntax corresponding to the three types of
information available from the Character class. Using a made-up syntax
you can change to anything you like better:
[?Foo] yields test code Character.isFoo(LA(1))
       examples [?Defined] [?Digit] [?LowerCase]
[:Foo] yields test code Character.getType(LA(1)) == Character.Foo
       examples [:LINE_SEPARATOR] [:CURRENCY_SYMBOL]
[#Foo] (Java 1.2 only) yields test code
       Character.UnicodeBlock.of(LA(1)) == Character.UnicodeBlock.Foo
       examples [#ARABIC] [#CJK_SYMBOLS_AND_PUNCTUATION]

That's about 125 useful Unicode starting sets for next to zilch coding
effort.
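
To make the three constructs concrete, here is a rough sketch of the
kind of run-time test each one might expand to. The class and method
names (CharTests, isIdStart, isCurrency, isArabicBlock, matchId) are
made up for illustration and are not ANTLR API; only the
java.lang.Character calls are real. The char argument stands in for
LA(1).

```java
// Hypothetical helpers; only the java.lang.Character calls are real API.
final class CharTests {
    // [?JavaIdentifierStart] -> Character.isJavaIdentifierStart(c)
    static boolean isIdStart(char c) {
        return Character.isJavaIdentifierStart(c);
    }

    // [:CURRENCY_SYMBOL] -> Character.getType(c) == Character.CURRENCY_SYMBOL
    static boolean isCurrency(char c) {
        return Character.getType(c) == Character.CURRENCY_SYMBOL;
    }

    // [#ARABIC] -> Character.UnicodeBlock.of(c) == Character.UnicodeBlock.ARABIC
    static boolean isArabicBlock(char c) {
        return Character.UnicodeBlock.of(c) == Character.UnicodeBlock.ARABIC;
    }

    // Hand-written stand-in for the rule
    //   ID : [?javaIdentifierStart] [?javaIdentifierPart]*
    // Returns the index just past the identifier, or -1 if none starts here.
    static int matchId(String input, int start) {
        if (start >= input.length()
                || !Character.isJavaIdentifierStart(input.charAt(start))) {
            return -1;
        }
        int i = start + 1;
        while (i < input.length()
                && Character.isJavaIdentifierPart(input.charAt(i))) {
            i++;
        }
        return i;
    }
}
```

For example, matchId("foo42 bar", 0) stops at the space, while
matchId("9x", 0) fails because a digit cannot start a Java identifier.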
With these sets it's very easy to test for membership of a given char
at runtime, just make the call. But ANTLR during analysis probably
needs to determine if lookahead sets are disjoint. Checking these Java
methods for overlap means calling them for every possibility. Yuck.
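
For what it's worth, the "call them for every possibility" approach
might look something like the sketch below: evaluate each test once for
every 16-bit char value, store the result in a bit set, and intersect
the sets. LookaheadSets, CharTest, materialize and disjoint are
hypothetical names, not anything in ANTLR, and this is just one
brute-force way to do it under the assumption that the vocabulary is
the full 16-bit char range.

```java
import java.util.BitSet;

// Hypothetical sketch; not ANTLR's actual set representation.
final class LookaheadSets {
    // A membership test such as c -> Character.isDigit(c).
    interface CharTest {
        boolean matches(char c);
    }

    // Call the test for every 16-bit char value and record the hits.
    static BitSet materialize(CharTest test) {
        BitSet set = new BitSet(0x10000);
        for (int c = 0; c <= 0xFFFF; c++) {
            if (test.matches((char) c)) {
                set.set(c);
            }
        }
        return set;
    }

    // Two lookahead sets are disjoint if no char belongs to both.
    static boolean disjoint(BitSet a, BitSet b) {
        return !a.intersects(b);
    }
}
```

Each test costs 65536 calls, but only once per predefined set if the
resulting bit set is kept around; after that, disjointness is an
ordinary set intersection.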
So, perhaps we should start allowing references to predefined ranges
like BENGALI etc. Check out the definitions in:
http://java.sun.com/j2se/1.3/docs/api/java/lang/Character.UnicodeBlock.html
E.g.,
public static final Character.UnicodeBlock BENGALI;
Presumably, the charVocabulary could reference BENGALI. But if we also
allowed references like DIGIT, LOWERCASE, ..., would they become
context sensitive, instead of the user having to put tests for the
following in their lexer?
0x0030 through 0x0039 ISO-LATIN-1 digits ('0' through '9')
0x0660 through 0x0669 Arabic-Indic digits
0x06F0 through 0x06F9 Extended Arabic-Indic digits
0x0966 through 0x096F Devanagari digits
0x09E6 through 0x09EF Bengali digits
....
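
One way to read "context sensitive" here is that DIGIT would mean only
the digits that fall inside whatever the charVocabulary covers. The
sketch below illustrates that reading, with a Unicode block standing in
for the vocabulary. DigitsInVocabulary, digitsIn and allDigits are
hypothetical names; only Character.isDigit and Character.UnicodeBlock.of
are real API.

```java
// Hypothetical sketch of restricting DIGIT to a vocabulary block.
final class DigitsInVocabulary {
    // Scan the 16-bit char range and return {first, last, count} for the
    // chars that are digits AND lie in the given block; first and last
    // are -1 if the block contains no digits.
    static int[] digitsIn(Character.UnicodeBlock block) {
        int first = -1, last = -1, count = 0;
        for (int c = 0; c <= 0xFFFF; c++) {
            char ch = (char) c;
            if (Character.isDigit(ch)
                    && Character.UnicodeBlock.of(ch) == block) {
                if (first < 0) {
                    first = c;
                }
                last = c;
                count++;
            }
        }
        return new int[] { first, last, count };
    }

    // True if every char in [lo, hi] passes Character.isDigit.
    static boolean allDigits(int lo, int hi) {
        for (int c = lo; c <= hi; c++) {
            if (!Character.isDigit((char) c)) {
                return false;
            }
        }
        return true;
    }
}
```

With a vocabulary of Character.UnicodeBlock.BENGALI, digitsIn finds just
the Bengali digits from the list above, so the user would not have to
spell out that range by hand.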
As Chap says, however, the lookahead could get troublesome for things
like DIGIT...we'll see. "Calling all cars...calling all cars...anyone
seen Chap?"
Terence
--
Professor Comp. Sci., University of San Francisco
Creator, ANTLR Parser Generator, http://www.antlr.org
Cofounder, http://www.jguru.com
Cofounder, http://www.knowspam.net enjoy email again!
Cofounder, http://www.peerscope.com pure link sharing