[antlr-interest] More about unicode
lists at lischke-online.de
Sun May 2 01:20:41 PDT 2004
> Guys, Chapman Flack (was at Purdue might still be) gave me
> copious notes about the right thing to do. here is an
> interesting section from his notes:
This is what I had in mind too. Since I'm a beginner with Java I didn't know how far the Unicode integration already
> So, perhaps we should start allowing references to predefined
> ranges like BENGALI etc... Check out the definitions in:
That would be the easiest way, IMO. I don't know what others think, but I mainly need checks like isIdentifierStart,
isDigit, isLineTerminator etc.
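
A minimal sketch of such checks, assuming the lexer works on code points and delegates to java.lang.Character. The class name CharChecks and its three helper names are hypothetical. Character has no isLineTerminator method, so the set used here (LF, CR, NEL, LS, PS) is an assumption. The int overloads of the Character methods need Java 5 or later; older JDKs only offer the char versions.

```java
public class CharChecks {
    // Hypothetical helper: wraps the real Character.isJavaIdentifierStart(int).
    static boolean isIdentifierStart(int cp) {
        return Character.isJavaIdentifierStart(cp);
    }

    // Hypothetical helper: true for any Unicode decimal digit, not just 0-9.
    static boolean isDigit(int cp) {
        return Character.isDigit(cp);
    }

    // Assumed line-terminator set: LF, CR, NEL (U+0085), LS (U+2028), PS (U+2029).
    static boolean isLineTerminator(int cp) {
        return cp == '\n' || cp == '\r'
            || cp == 0x0085 || cp == 0x2028 || cp == 0x2029;
    }

    public static void main(String[] args) {
        System.out.println(isDigit(0x09E9));           // BENGALI DIGIT THREE
        System.out.println(isIdentifierStart(0x0995)); // BENGALI LETTER KA
        System.out.println(isLineTerminator(0x2028));  // LINE SEPARATOR
    }
}
```

Because the tests are defined on Unicode properties rather than on a particular script, the same three methods work whichever range the charVocabulary names.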
> Presumably, the charVocabulary could reference BENGALI, but
> then would DIGIT, LOWERCASE, ... references become context
> sensitive if we allowed them instead of the user having to
> put tests for the following in their lexer?
As long as you allow only one of those language ranges it should be easy. Just reject any input outside this range as
invalid before it reaches the next processing stage. That way all (valid) input is automatically "in context". It
becomes more difficult if you allow combinations of input ranges, but I wonder whether a given character ever has a
different meaning in different sets. If not (which is what I assume), then context sensitivity is not an issue at all.
Simply don't allow input that is not in the specified charVocabulary to go to the next processing stage.
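
A rough sketch of that filtering step, assuming the charVocabulary is a single Unicode block and that java.lang.Character.UnicodeBlock is an acceptable stand-in for the predefined range names. The class VocabularyFilter and its methods are hypothetical and not part of ANTLR; the int overloads used here need Java 5 or later.

```java
public class VocabularyFilter {
    // Hypothetical check: is this code point inside the one allowed block?
    // UnicodeBlock.of returns null for code points in no block, so == is safe.
    static boolean inVocabulary(int cp, Character.UnicodeBlock block) {
        return Character.UnicodeBlock.of(cp) == block;
    }

    // Returns the index of the first char outside the vocabulary, or -1 if
    // the whole input is valid and may go on to the next processing stage.
    static int firstInvalid(String input, Character.UnicodeBlock block) {
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            if (!inVocabulary(cp, block)) {
                return i;
            }
            i += Character.charCount(cp); // step over surrogate pairs correctly
        }
        return -1;
    }

    public static void main(String[] args) {
        Character.UnicodeBlock bengali = Character.UnicodeBlock.BENGALI;
        System.out.println(firstInvalid("\u0995\u09E9", bengali)); // all Bengali
        System.out.println(firstInvalid("\u0995a", bengali));      // Latin 'a' at index 1
    }
}
```

Note that a vocabulary restricted to one script block this strictly also rejects ASCII whitespace and punctuation, so a real grammar would probably need to allow a few ranges together.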
> As Chap says, however, the lookahead could get troublesome
> for things like DIGIT...we'll see.
Why? The isDigit method in Character lets Java check this for us, so any valid digit character will get through.
However, if you want to check *afterwards* whether a given input is valid according to the charVocabulary then you are in