[antlr-interest] Recognizing 5-th hex digit

Kieran Beltran kieran.beltran at gmail.com
Wed Aug 26 12:35:46 PDT 2009


I am working on an ANTLR grammar to support the ISO Standard Z notation
(a specification language). The Z character set includes many non-ASCII
characters, so the lexer must recognize Unicode characters. For lexer token
definitions that use four-hex-digit Unicode escapes (\uxxxx), I believe
ANTLR works fine.

I have encountered a problem when attempting to recognize two required
Standard Z symbols which are "above" the four-hex set recognized by my
generated lexer. The two symbols are \u1D538 and \u1D53D.
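
To make "above the four-hex set" concrete, here is a minimal Python sketch
(an illustration only, not ANTLR code; the helper name utf16_surrogates is
made up for this example). It assumes the lexer's input is held as 16-bit
UTF-16 code units, as Java strings are, in which case each of these two
symbols arrives as a surrogate pair rather than as a single character.

```python
def utf16_surrogates(code_point):
    """Return the (high, low) UTF-16 surrogate pair for a supplementary code point."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


for cp in (0x1D538, 0x1D53D):
    high, low = utf16_surrogates(cp)
    print("U+%X -> %04X %04X" % (cp, high, low))
```

If that assumption holds for the target in use, matching the pair of
four-hex-digit code units may be an alternative to a five-digit escape.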

A review of the UCS documentation
(http://unicode.org/Public/UNIDATA/UnicodeData.txt) indicates that there is
indeed a fifth hex digit in public use, albeit infrequently, primarily for
mathematics, musical symbols and similar areas. I am not sure many people
are writing grammars that need to recognize such characters. Interestingly,
the fifth hex digit only needs to reach E, as the highest UCS symbol in
public use currently appears to be \uE01EF. F0000 and above appears to be
for private use only.
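
These ranges can be spot-checked against the Unicode database bundled with
Python. The sketch below is only an illustration: the helper name describe
is made up, and the results depend on the Unicode version shipped with the
interpreter. Note that code points run up to 10FFFF, so a sixth hex digit
exists as well, although that last plane is also private use.

```python
import unicodedata


def describe(code_point):
    """Return (name, general category) for a code point, per the local Unicode database."""
    ch = chr(code_point)
    # Private-use and unassigned code points have no name, so supply a default.
    return unicodedata.name(ch, "<no name>"), unicodedata.category(ch)


for cp in (0x1D538, 0x1D53D, 0xE01EF, 0xF0000, 0x10FFFD):
    name, category = describe(cp)
    print("U+%06X  %-2s  %s" % (cp, category, name))
```

Category "Co" in the output marks a private-use code point.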

Looking at the ESC fragment in the ANTLRv3.g grammar, I believe the
four-hex-digit Unicode escape is defined at line 495:
        'u' XDIGIT XDIGIT XDIGIT XDIGIT

Is the solution to recognize an optional fifth digit? Could I simply
replace line 495 with the following and add a new fragment rule?

'u' ZDIGIT? XDIGIT XDIGIT XDIGIT XDIGIT

fragment
ZDIGIT :
    '0' .. '9'
  | 'a' .. 'e'
  | 'A' .. 'E'
  ;
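
One way to explore the behaviour of such a rule outside ANTLR is to mimic
it with a regular expression. The Python sketch below is a stand-in under
stated assumptions, not the generated lexer: the names ESC and
decode_escape are hypothetical, and a regex engine's greedy matching may
not mirror how ANTLR predicts the optional digit.

```python
import re

# Hypothetical stand-in for the proposed rule:
#   'u' ZDIGIT? XDIGIT XDIGIT XDIGIT XDIGIT
ESC = re.compile(r"\\u([0-9A-Ea-e]?[0-9A-Fa-f]{4})")


def decode_escape(text):
    """Return the code point of the four- or five-digit escape at the start of text."""
    match = ESC.match(text)
    if match is None:
        raise ValueError("no unicode escape at start of input")
    return int(match.group(1), 16)


print(hex(decode_escape(r"\u1D538")))   # five-digit escape
print(hex(decode_escape(r"\u2119")))    # four-digit escape still accepted
print(hex(decode_escape(r"\u00410")))   # four digits followed by a literal '0'
```

The last call shows a possible consideration: in this regex stand-in, a
four-digit escape immediately followed by a hex digit is consumed as a
single five-digit escape, so existing input such as \u0041 followed by 0
would change meaning.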

Are there other implementation considerations I have overlooked?

Is this use case too narrow to be reported as an actual ANTLR bug? If so,
should I build my own customized ANTLR instead?

Thank you for considering this.

Kieran Beltran