[antlr-interest] More about unicode

Sat May 1 15:49:29 PDT 2004

Guys, Chapman Flack (who was at Purdue and might still be) gave me copious
notes about the right thing to do.  Here is an interesting section from
his notes:

   2.  Predefined base sets to start from.  Unicode provides tables of all the
       defined character properties like letter or space separator and in fact
       the tables are already built into Java's Character class.  Instead of
       all those ranges in the SableCC grammar, it should be possible to say
         ID : [?javaIdentifierStart] [?javaIdentifierPart]*

       This is trivial to do if the set representation includes not just
       ranges or bit sets but nodes that gen code to call the appropriate
       Character method at run time.  I would suggest adding three constructs
       to the lexer spec syntax corresponding to the three types of information
       available from the Character class.  Using a made-up syntax you can
       change to anything you like better:

       [?Foo]    yields test code   Character.isFoo( LA(1))
                 examples [?Defined]  [?Digit]  [?LowerCase]

       [:Foo]    yields test code   Character.getType( LA(1)) == Character.Foo
                 examples [:LINE_SEPARATOR]  [:CURRENCY_SYMBOL]

       [#Foo]    (Java 1.2 only) yields test code
                 Character.UnicodeBlock.of(LA(1)) == Character.UnicodeBlock.Foo
                 examples [#ARABIC]  [#CJK_SYMBOLS_AND_PUNCTUATION]

       That's about 125 useful Unicode starting sets for next to zilch coding
       effort.

       With these sets it's very easy to test for membership of a given char
       at runtime, just make the call.  But ANTLR during analysis probably
       needs to determine if lookahead sets are disjoint.  Checking these Java
       methods for overlap means calling them for every possibility.  Yuck.
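
To make the trade-off in these notes concrete, here is a minimal Java sketch of
the three kinds of test and of the brute-force overlap check. It is an
illustration only, not ANTLR code: the class name UnicodeSetSketch, the
constant names, and the disjoint helper are all hypothetical, a plain int
stands in for LA(1), and the scan is assumed to cover only the 16-bit char
range.

```java
import java.util.function.IntPredicate;

// Hypothetical sketch; none of these names come from ANTLR.
class UnicodeSetSketch {
    // [?Foo] style: call a boolean Character.isFoo method.
    static final IntPredicate DIGIT = c -> Character.isDigit(c);
    static final IntPredicate LOWER_CASE = c -> Character.isLowerCase(c);
    static final IntPredicate JAVA_ID_PART = c -> Character.isJavaIdentifierPart(c);

    // [:Foo] style: compare Character.getType against a category constant.
    static final IntPredicate CURRENCY_SYMBOL =
        c -> Character.getType(c) == Character.CURRENCY_SYMBOL;

    // [#Foo] style: compare the character's Unicode block.
    static final IntPredicate BENGALI =
        c -> Character.UnicodeBlock.of(c) == Character.UnicodeBlock.BENGALI;

    // Brute-force disjointness: try every 16-bit char value against both sets.
    // This is the "calling them for every possibility" cost from the notes.
    static boolean disjoint(IntPredicate a, IntPredicate b) {
        for (int c = Character.MIN_VALUE; c <= Character.MAX_VALUE; c++) {
            if (a.test(c) && b.test(c)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("DIGIT vs LOWER_CASE disjoint:   " + disjoint(DIGIT, LOWER_CASE));
        System.out.println("DIGIT vs JAVA_ID_PART disjoint: " + disjoint(DIGIT, JAVA_ID_PART));
        System.out.println("DIGIT vs BENGALI disjoint:      " + disjoint(DIGIT, BENGALI));
    }
}
```

Each membership test is a single call, but every pairwise comparison costs a
full pass over the character range. DIGIT and BENGALI overlap, for instance,
because the Bengali block contains its own decimal digits.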

So, perhaps we should start allowing references to predefined ranges
such as BENGALI.  Check out the definitions in:

http://java.sun.com/j2se/1.3/docs/api/java/lang/Character.UnicodeBlock.html

E.g.,

public static final Character.UnicodeBlock BENGALI;

Presumably, the charVocabulary could reference BENGALI.  But if we also
allowed references such as DIGIT, LOWERCASE, ..., would they become
context sensitive?  The alternative is for users to put tests for ranges
like the following in their lexers:

0x0030 through 0x0039 ISO-LATIN-1 digits ('0' through '9')
0x0660 through 0x0669 Arabic-Indic digits
0x06F0 through 0x06F9 Extended Arabic-Indic digits
0x0966 through 0x096F Devanagari digits
0x09E6 through 0x09EF Bengali digits
....
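
One way to turn a predicate like DIGIT back into explicit ranges of this kind
is to scan the character range once and collapse consecutive hits. The Java
sketch below does that with Character.isDigit. The class and method names are
hypothetical, the scan is assumed to stop at 0xFFFF, and the exact list
depends on the Unicode version bundled with the Java runtime, so a current
JVM will report more ranges than the excerpt above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: recover explicit ranges from Character.isDigit.
class DigitRanges {
    // Returns {first, last} pairs (inclusive) of consecutive digit code points.
    static List<int[]> digitRanges() {
        List<int[]> ranges = new ArrayList<>();
        int start = -1; // -1 means "not currently inside a run of digits"
        for (int c = 0; c <= 0xFFFF; c++) {
            boolean digit = Character.isDigit(c);
            if (digit && start < 0) {
                start = c;                           // a run begins
            } else if (!digit && start >= 0) {
                ranges.add(new int[] {start, c - 1}); // a run just ended
                start = -1;
            }
        }
        if (start >= 0) {
            ranges.add(new int[] {start, 0xFFFF});    // run reaches the end
        }
        return ranges;
    }

    public static void main(String[] args) {
        for (int[] r : digitRanges()) {
            System.out.printf("0x%04X through 0x%04X%n", r[0], r[1]);
        }
    }
}
```

Ranges computed this way could be compared for overlap directly, without
calling the Character methods again for each comparison.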

As Chap says, however, the lookahead could get troublesome for things
like DIGIT.  We'll see.  "Calling all cars...calling all cars...anyone
seen Chap?"

Terence
--
Professor Comp. Sci., University of San Francisco
Creator, ANTLR Parser Generator, http://www.antlr.org
Cofounder, http://www.jguru.com
Cofounder, http://www.knowspam.net enjoy email again!
Cofounder, http://www.peerscope.com pure link sharing
