[antlr-interest] Unicode XID_Start/XID_Continue? (And, other Unicode properties)

Joe l0calh05t at gmx.net
Sat Jul 5 15:47:47 PDT 2008


>> Are Unicode properties supported by Antlr in any way? It would be nice 
>> to be able to simply lex unicode identifiers as ID : XID_Start 
>> XID_Continue*
>> Or would I have to write a script that creates the appropriate lexer 
>> fragments from 
>> http://unicode.org/Public/UNIDATA/DerivedCoreProperties.txt ?
>>     
>
> Here's what I hacked up to do something like that, using ICU4J,
>
>   http://lists.badgers-in-foil.co.uk/pipermail/metaas-dev/attachments/20070307/abfef6e7/UnicodeIdentifierGenerator.java
>
> I think the ICU UCharacter[1] class would allow codepoints to be tested
> against the XID* properties[2] in the same way, if the script doesn't
> already do what you want.
>
> [1] http://www.icu-project.org/apiref/icu4j/com/ibm/icu/lang/UCharacter.html
> [2] http://www.icu-project.org/apiref/icu4j/com/ibm/icu/lang/UProperty.html#XID_CONTINUE
>
>
> ta,
> dave
>   
So Unicode properties are unsupported. And apparently UTF-16 isn't even 
really supported. Shouldn't this be fairly easy to implement? The Java 
version of LA already returns an int, so why not have it decode UTF-16 
into code points? And the properties could be implemented via ICU.
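
To make that concrete, here is a rough sketch of what I mean. It is not 
ANTLR code: the class CodePointLookahead and its isIdentifier helper are 
names I made up for illustration. LA combines surrogate pairs via the JDK's 
Character.codePointAt, and the identifier check uses 
Character.isUnicodeIdentifierStart/Part as a stand-in, which only 
approximates XID_Start/XID_Continue. With ICU4J on the classpath I'd assume 
one would swap in UCharacter.hasBinaryProperty(cp, UProperty.XID_START) 
and UProperty.XID_CONTINUE instead.

```java
// Hypothetical sketch, not part of ANTLR: a lookahead over a UTF-16
// buffer that hands out whole code points instead of single chars.
final class CodePointLookahead {
    static final int EOF = -1;

    private final char[] data;
    private int p; // index of the next unread UTF-16 code unit

    CodePointLookahead(String input) {
        this.data = input.toCharArray();
    }

    /** Returns the code point i positions ahead (1-based), or EOF. */
    int LA(int i) {
        int index = p;
        // Skip i-1 code points; each one is 1 or 2 UTF-16 code units.
        for (int k = 1; k < i; k++) {
            if (index >= data.length) return EOF;
            index += Character.charCount(Character.codePointAt(data, index));
        }
        if (index >= data.length) return EOF;
        // An unpaired surrogate is returned as-is by codePointAt.
        return Character.codePointAt(data, index);
    }

    /** Advances past one whole code point. */
    void consume() {
        if (p < data.length) {
            p += Character.charCount(Character.codePointAt(data, p));
        }
    }

    /**
     * Stand-in for ID : XID_Start XID_Continue*. The JDK methods used
     * here approximate the XID properties; they are an assumption, not
     * an exact match for DerivedCoreProperties.txt.
     */
    static boolean isIdentifier(String s) {
        CodePointLookahead in = new CodePointLookahead(s);
        int first = in.LA(1);
        if (first == EOF || !Character.isUnicodeIdentifierStart(first)) {
            return false;
        }
        in.consume();
        while (in.LA(1) != EOF) {
            if (!Character.isUnicodeIdentifierPart(in.LA(1))) return false;
            in.consume();
        }
        return true;
    }

    public static void main(String[] args) {
        // U+10400 is outside the BMP, so it takes two chars in UTF-16.
        CodePointLookahead in = new CodePointLookahead("a\uD801\uDC00b");
        for (int i = 1; in.LA(i) != EOF; i++) {
            System.out.println("LA(" + i + ") = U+"
                    + Integer.toHexString(in.LA(i)).toUpperCase());
        }
        System.out.println(isIdentifier("x\uD801\uDC00"));
    }
}
```

The point is only that the int return type already has room for code 
points above U+FFFF; how this would fit into the real input stream 
classes (index, mark/rewind and so on) is a separate question.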

-- 
Generally speaking, things have gone about as far
as they can possibly go, when things have gotten
about as bad as they can reasonably get.


