[antlr-interest] unicode strings using supplemental char range

Terence Parr parrt at cs.usfca.edu
Sun Jun 27 09:25:45 PDT 2004


On Jun 26, 2004, at 8:48 PM, Mark Lentczner wrote:
> I think the problem is much simpler than all this gnashing of teeth is
> making things out to be.

Hi Mark,

Thanks a million for cutting this down to the essentials.  At the very 
least your suggestion is a fabulous starting place that would not 
preclude future enhancements.

> Ignoring for the moment Unicode specific extensions to Antlr, Antlr has
> very modest needs to be fully Unicode compliant:
>
> [ Terminology: "char" = Java type, 16-bits, "character" = Unicode
> character in range 0..0x10FFFF, "String" = Java type, "unicode string"
> = a sequence of zero or more characters. ]
>
> 	- the ability to read a stream of characters
> 	- the ability to take a String written in a grammar file (i.e.
> "abc\u20ac123") and produce a unicode string from it (i.e. [ 97, 98,
> 99, 8364, 49, 50, 51 ] - or if your mailer can handle it: [ 'a', 'b',
> 'c', '€', '1', '2', '3' ])

So, I need to make IntString or some such and then use it to represent
all string literals found in a grammar, all of which are assumed to be
UTF-16 unless they use "\u+..." character notation (but I would convert
to IntString after reading them in).  My dictionary of string literals
would become a dictionary of IntString.  When generating code, I would
convert my match(String) to match(IntString), and IntString would move
to the org.antlr.runtime package (or perhaps int[] is good enough at
that point).  All input symbols would be ints, as per my IntegerStream
now (that's the thing that sucks characters from a CharSource), which
has lookahead methods like:

int LA(int i);
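A minimal sketch of how those two pieces might fit together (IntString
and IntegerStream here are just the names from this discussion, not
actual ANTLR API):

```java
// Hypothetical sketch only: IntString and IntegerStream are the names
// floated in this discussion, not real ANTLR classes.
final class IntString {
    private final int[] codePoints;  // full 0..0x10FFFF range per element

    IntString(int[] codePoints) {
        this.codePoints = codePoints.clone();
    }

    int length() { return codePoints.length; }

    int codePointAt(int i) { return codePoints[i]; }
}

interface IntegerStream {
    int LA(int i);   // lookahead: the i-th symbol ahead, as an int code point
    void consume();  // advance past the current symbol
}
```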

For lexer errors and such, I would have to have support code to print
IntString back out character by character (full unicode), right?  But
what if the terminal can't handle it?  Won't I have to allow people to
specify the encoding?  Doesn't this start to get messy?  Can you go a
little further and tell me what runtime support you think we'd need?
match() will be easy, but how do I print: "expecting foo found bar"?
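For what it's worth, rendering full code points back into a Java String
is mostly mechanical; a sketch, assuming the symbols arrive as an int[]
of code points:

```java
final class CodePointPrinter {
    // Sketch: turn 0..0x10FFFF code points back into a Java String for
    // error messages; values above U+FFFF become surrogate pairs, and
    // whether the terminal can display them is left to the platform's
    // output encoding.
    static String render(int[] codePoints) {
        StringBuilder sb = new StringBuilder(codePoints.length);
        for (int cp : codePoints) {
            sb.appendCodePoint(cp);  // emits a surrogate pair when needed
        }
        return sb.toString();
    }
}
```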

> Note: The escape syntax for Antlr will probably need to be redesigned.
> "\u" followed by four hex digits doesn't cut it, though could be kept
> for backward compatibility.  It is probably best to bite the bullet and
> have a delimited escape sequence: "\U" followed by hex digits followed
> by ";".  Or if you want to look like the Unicode documentation
> standards, "\U+"...

Yeah, I'm leaning toward \uxxxx for a 16-bit char and the \u+10xxxx
format for full character range notation.  They can even appear in the
same string.
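Decoding that mixed notation is straightforward; a sketch, assuming
\uxxxx is exactly four hex digits and \u+... takes up to six:

```java
final class EscapeDecoder {
    // Sketch of the proposed notation (my assumption on the exact rules):
    // \uXXXX is exactly four hex digits (a 16-bit char), \u+XXXXXX is up
    // to six hex digits (full range).  Both may appear in one string.
    static int decode(String s, int pos) {
        if (s.charAt(pos) != '\\' || s.charAt(pos + 1) != 'u') {
            throw new IllegalArgumentException("not a \\u escape at " + pos);
        }
        if (s.charAt(pos + 2) == '+') {
            int end = pos + 3;
            while (end < s.length() && end < pos + 9
                    && Character.digit(s.charAt(end), 16) >= 0) {
                end++;
            }
            return Integer.parseInt(s.substring(pos + 3, end), 16);
        }
        return Integer.parseInt(s.substring(pos + 2, pos + 6), 16);
    }
}
```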

> Other features that have been discussed fall into two camps:
>
> Features that are really not logically part of a lexer/parser package:
> 	- transcoding the input from a some encoding byte stream into a stream
> of characters

I'll leave that to Java or a developer's own implementation of 
IntegerStream.
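As a sketch of what a developer's own implementation might look like,
pairing UTF-16 surrogates into code points on top of a java.io.Reader
(which already handles the byte-to-char transcoding):

```java
import java.io.IOException;
import java.io.Reader;

final class CodePointReader {
    // Sketch: adapt a Reader (UTF-16 chars) into full code points by
    // pairing surrogates; byte-level transcoding stays in the Reader
    // (e.g. an InputStreamReader constructed with a charset name).
    private final Reader in;

    CodePointReader(Reader in) { this.in = in; }

    // Returns the next code point, or -1 at EOF.  (A real implementation
    // would handle an unpaired high surrogate more carefully.)
    int next() {
        try {
            int hi = in.read();
            if (hi < 0) return -1;
            if (Character.isHighSurrogate((char) hi)) {
                int lo = in.read();
                if (lo >= 0 && Character.isLowSurrogate((char) lo)) {
                    return Character.toCodePoint((char) hi, (char) lo);
                }
            }
            return hi;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```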

> 	- character sequence normalization

yeah, no way ;)

> None of these should be part of Antlr (IMHO) and are easily handled as
> needed via re-implementing the streaming interface.
>
> Features that might be possible nice utilities to have in a
> lexer/parser package:
> 	- case folding

Well, I want to support a simple "ignore case" feature, which literally
just uppercases all chars in '...' and "..." before putting them in a
hash table or in a DFA.  Then, I'd have LA(i) return the uppercase
equivalent, while retaining the actual case to be put into the token
text, etc.  Does Character.toUpperCase() usually just do this correctly
across all languages, or is it encoding sensitive (ick)?
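For reference, Character.toUpperCase(int) is locale- and
encoding-independent (it follows the Unicode tables), but it only
performs 1:1 mappings, so cases like German 'ß' → "SS" are silently
left alone; a sketch of the folded-lookahead idea under that
limitation:

```java
final class CaseFold {
    // Sketch: fold a code point for case-insensitive matching.
    // Character.toUpperCase(int) uses Unicode data, not the locale or
    // any encoding, but it cannot apply 1:n mappings (e.g. 'ß' -> "SS"
    // requires String.toUpperCase), so such characters fold to
    // themselves here.
    static int fold(int codePoint) {
        return Character.toUpperCase(codePoint);
    }
}
```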

> 	- Unicode character classes as pre-defined (or algorithmically
> defined) lexer rules

Yep, I want stuff like LETTER and DIGIT.  I *hope* there is a
non-context-sensitive definition.  I have deleted the charVocabulary
option for 3.0 and simply made it all UNICODE. :)  Wildcard is just a 
big range of characters now.  So, even if you're doing LATIN, DIGIT 
will contain digits for other languages, but that should be ok, right?  
Can anybody think of a reason to limit the char vocabulary?  Certainly 
any bizarre character that comes in will get an error the minute it 
doesn't match anything (even when lexing plain old binary data files).

Wait, is DIGIT a predefined set or is it a "function" that must operate 
on the character?  Wouldn't that suck!  How would I translate DIGIT if 
it is a computation?  I need to compute lookahead on DIGIT (i.e., it 
needs to yield a set of char), which is pretty tough if it's a 
function.
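One way out: DIGIT doesn't have to stay a function, since the set it
denotes can be precomputed once by sweeping the code point range
through Character.isDigit; a sketch over the BMP:

```java
import java.util.BitSet;

final class CharClassSets {
    // Sketch: materialize the "DIGIT" predicate into a concrete set so
    // lookahead computation can stay purely set-based.  Only the BMP is
    // swept here; the supplementary range would need isDigit(int).
    static BitSet digitSet() {
        BitSet digits = new BitSet(0x10000);
        for (int cp = 0; cp < 0x10000; cp++) {
            if (Character.isDigit((char) cp)) {
                digits.set(cp);
            }
        }
        return digits;
    }
}
```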

> 	- Unicode character blocks as pre-defined (or algorithmically defined)
> lexer rules

I might as well allow stuff in Character.UnicodeBlock as it's just a 
simple lookup via reflection into the Character.UnicodeBlock class. :)
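A sketch of that reflective lookup (block names here follow the
Character.UnicodeBlock constant names, e.g. BASIC_LATIN or
HANGUL_SYLLABLES):

```java
import java.lang.reflect.Field;

final class BlockLookup {
    // Sketch: resolve a grammar-level block name to a UnicodeBlock by
    // reflecting on the public static fields of Character.UnicodeBlock.
    static Character.UnicodeBlock lookup(String name) {
        try {
            Field f = Character.UnicodeBlock.class.getField(name);
            return (Character.UnicodeBlock) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("unknown block: " + name, e);
        }
    }
}
```

(Later JDKs expose Character.UnicodeBlock.forName, which does
essentially this lookup without reflection.)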

OTOH, how does all this Java capability work when generating C or
LISP?  In other words, is it sufficient for me to allow

lexer grammar test(language=C);

a : "\u0020blort" ;
INT : (DIGIT)+ ;

ID : (KOREAN)+ 'LATIN CHAR SPEC WRITTEN ALL THE WAY OUT' ;

and then convert all of the strings, characters to IntString and the 
DIGIT, KOREAN to sets of characters?  Everything would be normalized at 
that point and ANTLR could proceed as if it were all just a bunch of 
ints and sets of ints as it does now.

> These may be nice, though Antlr has gotten along just fine until now
> without them.  I would heavily caution implementing these, or basing
> implementation issues on them until someone speaks up who would
> actually use them.  And even then, I caution adding large library needs
> to Antlr just to support optional features.

Yes, please speak up if you know about what you'd need for BENGALI and 
classes like DIGIT, TITLECASE, etc...

> [ Personally, I might like to see some of the Unicode character classes
> (though, by no means all of them).  The Unicode blocks are useless to
> me, as is case folding.  I suspect for real world language parsing
> needs, unless a format is actually defined in terms of the Unicode
> properties, a grammar writer might prefer to explicitly declare their
> character sets in their lexer anyway. ]

Well, judging from the SableCC grammar that defines all the JavaID sets
manually, I'd say that some predefined blocks would be useful, though I
wonder just how many people are doing Java parsers ;)

heh, we might actually be converging on a solution!  I'd like to decide 
soon so that I can refactor to have IntString rather than String in the 
right places.

Ter





 