[antlr-interest] unicode strings using supplemental char range
Terence Parr
parrt at cs.usfca.edu
Sun Jun 27 09:25:45 PDT 2004
On Jun 26, 2004, at 8:48 PM, Mark Lentczner wrote:
> I think the problem is much simpler than all this gnashing of teeth is
> making things out to be.
Hi Mark,
Thanks a million for cutting this down to the essentials. At the very
least your suggestion is a fabulous starting place that would not
preclude future enhancements.
> Ignoring for the moment Unicode specific extensions to Antlr, Antlr has
> very modest needs to be fully Unicode compliant:
>
> [ Terminology: "char" = Java type, 16-bits, "character" = Unicode
> character in range 0..0x10FFFF, "String" = Java type, "unicode string"
> = a sequence of zero or more characters. ]
>
> - the ability to read a stream of characters
> - the ability to take a String written in a grammar file (i.e.
> "abc\u20ac123") and produce a unicode string from it (i.e. [ 97, 98,
> 99, 8364, 49, 50, 51 ] - or if your mailer can handle it: [ 'a', 'b',
> 'c', '€', '1', '2', '3' ])
So, I need to make IntString or some such and then use it to represent
all string literals found in a grammar, all of which are assumed to be
UTF-16 unless they use "\u+..." character notation (but I would convert
to IntString after reading it in). My dictionary of string literals
would go to a dictionary of IntString. When generating code, I would
convert my match(String) to be match(IntString) and IntString would
move to org.antlr.runtime package (or perhaps int[] is good enough at
that point). All input symbols would be ints, as in my current
IntegerStream (the thing that pulls symbols from a CharSource), which
has lookahead methods like:
int LA(int i);
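A minimal sketch of what IntString and an int-based lookahead stream could look like. The class shapes here are assumptions for illustration, not ANTLR's actual runtime, and String.codePoints() requires Java 8 or later, which postdates this message.

```java
import java.util.Arrays;

/** Hypothetical immutable string of Unicode code points; not ANTLR's real class. */
final class IntString {
    private final int[] codePoints;

    private IntString(int[] codePoints) {
        this.codePoints = codePoints;
    }

    /** Decodes a UTF-16 Java String, joining surrogate pairs into single code points. */
    static IntString fromUtf16(String s) {
        return new IntString(s.codePoints().toArray());
    }

    int length() { return codePoints.length; }

    int get(int index) { return codePoints[index]; }

    int[] toArray() { return codePoints.clone(); }

    // Value equality so IntString can serve as a dictionary key for literals.
    @Override public boolean equals(Object o) {
        return o instanceof IntString
                && Arrays.equals(codePoints, ((IntString) o).codePoints);
    }

    @Override public int hashCode() { return Arrays.hashCode(codePoints); }
}

/** Hypothetical array-backed stream exposing the LA(i) lookahead method. */
final class ArrayIntegerStream {
    static final int EOF = -1;
    private final int[] data;
    private int p = 0; // index of the next symbol to consume

    ArrayIntegerStream(int[] data) { this.data = data.clone(); }

    /** Returns the i-th symbol of lookahead (1-based), or EOF past the end. */
    int LA(int i) {
        int index = p + i - 1;
        return (i >= 1 && index < data.length) ? data[index] : EOF;
    }

    void consume() { if (p < data.length) p++; }
}
```

With this shape, a generated match(IntString) would only compare ints from LA(i) against get(i - 1), with no knowledge of encodings.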
For lexer errors and such, I would have to have support code to print
IntString back out character by character (full unicode), right? But,
what if the terminal can't handle it? Won't I have to allow people to
specify the encoding? Doesn't this start to get messy? Can you go a
little further and tell me at runtime what support you think we'd need?
match() will be easy, but how do I print: "expecting foo found bar"?
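One possible way to print "expecting foo found bar" without knowing in advance what the terminal can display: let the caller name the output charset and escape anything it cannot encode. This is only a sketch; the ErrorText class and its escape format (four hex digits for 16-bit values, "+" and six hex digits above that) are assumptions for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

/** Hypothetical helper for rendering code-point strings in error messages. */
final class ErrorText {
    /** Renders code points, escaping any that the output charset cannot encode. */
    static String render(int[] codePoints, Charset terminal) {
        CharsetEncoder encoder = terminal.newEncoder();
        StringBuilder out = new StringBuilder();
        for (int cp : codePoints) {
            String s = new String(Character.toChars(cp));
            if (encoder.canEncode(s)) {
                out.append(s);
            } else if (cp <= 0xFFFF) {
                out.append(String.format("\\u%04X", cp));   // 16-bit escape form
            } else {
                out.append(String.format("\\u+%06X", cp));  // full-range escape form
            }
        }
        return out.toString();
    }

    /** Builds the mismatch message from expected and actual symbol sequences. */
    static String mismatch(int[] expected, int[] found, Charset terminal) {
        return "expecting " + render(expected, terminal)
                + " found " + render(found, terminal);
    }
}
```

Passing StandardCharsets.US_ASCII gives a lowest-common-denominator message, while StandardCharsets.UTF_8 leaves the characters untouched.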
> Note: The escape syntax for Antlr will probably need to be redesigned.
> "\u" followed by four hex digits doesn't cut it, though could be kept
> for backward compatibility. It is probably best to bite the bullet and
> have a delimited escape sequence: "\U" followed by hex digits followed
> by ";". Or if you want to look like the Unicode documentation
> standards, "\U+"...
Yeah, I'm leaning toward \uxxxx for 16-bit chars and the \u+10xxxx
format for the full character range. Both can even appear in the same
string.
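A rough sketch of decoding both escape forms from a grammar literal into code points. It assumes the short form is always exactly four hex digits and the "+" form always exactly six; the message does not pin down the digit counts, so treat that as a guess. EscapeDecoder is a hypothetical name, and other escapes such as \n are not handled.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical decoder for the two escape forms; digit counts are assumed. */
final class EscapeDecoder {
    /** Converts raw literal text (backslashes still present) into code points. */
    static int[] decode(String literal) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        while (i < literal.length()) {
            if (literal.startsWith("\\u+", i)) {
                out.add(hex(literal, i + 3, 6));   // full-range form, six digits
                i += 9;
            } else if (literal.startsWith("\\u", i)) {
                out.add(hex(literal, i + 2, 4));   // 16-bit form, four digits
                i += 6;
            } else {
                int cp = literal.codePointAt(i);   // ordinary character
                out.add(cp);
                i += Character.charCount(cp);
            }
        }
        return out.stream().mapToInt(Integer::intValue).toArray();
    }

    /** Parses a fixed number of hex digits starting at the given index. */
    private static int hex(String s, int start, int digits) {
        if (start + digits > s.length()) {
            throw new IllegalArgumentException("truncated escape at index " + start);
        }
        int value = 0;
        for (int k = start; k < start + digits; k++) {
            int d = Character.digit(s.charAt(k), 16);
            if (d < 0) {
                throw new IllegalArgumentException("bad hex digit at index " + k);
            }
            value = value * 16 + d;
        }
        if (!Character.isValidCodePoint(value)) {
            throw new IllegalArgumentException("code point out of range at index " + start);
        }
        return value;
    }
}
```

Checking for the "+" form first is what lets both notations share the same prefix and appear in one string.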
> Other features that have been discussed fall into two camps:
>
> Features that are really not logically part of a lexer/parser package:
> - transcoding the input from some encoded byte stream into a stream
> of characters
I'll leave that to Java or a developer's own implementation of
IntegerStream.
> - character sequence normalization
Yeah, no way ;)
> None of these should be part of Antlr (IMHO) and are easily handled as
> needed via re-implementing the streaming interface.
>
> Features that might possibly be nice utilities to have in a
> lexer/parser package:
> - case folding
Well, I want to support a simple "ignore case" feature, which literally
just uppercases all chars in '...' and "..." before putting them in a
hash table or in a DFA. Then, I'd have LA(i) return the uppercase
equivalent, while retaining the actual case to be put into the token
text, etc. Does Character.toUpperCase() usually just do this
correctly and across all languages, or is it encoding sensitive (ick)?
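A sketch of the "ignore case" lookahead described above, with CaseInsensitiveStream as a hypothetical name. As far as the Java API goes, Character.toUpperCase(int) operates on Unicode code points, so it depends on neither the input encoding nor the default locale. It applies only one-to-one mappings, though: 'ß' stays 'ß', whereas String.toUpperCase expands it to "SS".

```java
/** Hypothetical stream that matches on upper case but keeps the original text. */
final class CaseInsensitiveStream {
    static final int EOF = -1;
    private final int[] data;
    private int p = 0; // index of the next symbol to consume

    CaseInsensitiveStream(String input) {
        this.data = input.codePoints().toArray();
    }

    /** Lookahead used for matching: upper-cased, one code point in, one out. */
    int LA(int i) {
        int raw = rawLA(i);
        return raw == EOF ? EOF : Character.toUpperCase(raw);
    }

    /** Lookahead in the original case, for building the token text. */
    int rawLA(int i) {
        int index = p + i - 1;
        return (i >= 1 && index < data.length) ? data[index] : EOF;
    }

    void consume() { if (p < data.length) p++; }
}
```

Literals in the grammar would need the same per-code-point mapping applied before they go into the hash table or DFA, so both sides agree.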
> - Unicode character classes as pre-defined (or algorithmically
> defined) lexer rules
Yep, I want stuff like LETTER and DIGIT. I *hope* there is a
non-context-sensitive definition. I have deleted the charVocabulary
option for 3.0 and simply made it all UNICODE. :) Wildcard is just a
big range of characters now. So, even if you're doing LATIN, DIGIT
will contain digits for other languages, but that should be ok, right?
Can anybody think of a reason to limit the char vocabulary? Certainly
any bizarre character that comes in will get an error the minute it
doesn't match anything (even when lexing plain old binary data files).
Wait, is DIGIT a predefined set or is it a "function" that must operate
on the character? Wouldn't that suck! How would I translate DIGIT if
it is a computation? I need to compute lookahead on DIGIT (i.e., it
needs to yield a set of char), which is pretty tough if it's a
function.
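Even if DIGIT is only available as a function, the code point space is finite, so the function can be evaluated over every code point once and collapsed into ranges at grammar-analysis time. The sketch below does this for any predicate; CodePointSets is a hypothetical helper, and the resulting set depends on the Unicode version the running JDK implements.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

/** Hypothetical helper that turns a per-character test into explicit ranges. */
final class CodePointSets {
    /** Returns inclusive {lo, hi} ranges of all code points the predicate accepts. */
    static List<int[]> rangesOf(IntPredicate member) {
        List<int[]> ranges = new ArrayList<>();
        int start = -1; // start of the currently open range, or -1 if none
        // One extra iteration past MAX_CODE_POINT closes a range that runs to the end.
        for (int cp = 0; cp <= Character.MAX_CODE_POINT + 1; cp++) {
            boolean in = cp <= Character.MAX_CODE_POINT && member.test(cp);
            if (in && start < 0) {
                start = cp;
            } else if (!in && start >= 0) {
                ranges.add(new int[]{start, cp - 1});
                start = -1;
            }
        }
        return ranges;
    }

    /** Linear membership test over the ranges; enough for a sketch. */
    static boolean contains(List<int[]> ranges, int cp) {
        for (int[] r : ranges) {
            if (cp >= r[0] && cp <= r[1]) return true;
        }
        return false;
    }
}
```

For example, CodePointSets.rangesOf(Character::isDigit) yields a list whose first range is the ASCII digits '0' through '9', followed by the digit ranges of other scripts. Lookahead computation can then treat DIGIT like any other set of ints.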
> - Unicode character blocks as pre-defined (or algorithmically defined)
> lexer rules
I might as well allow stuff in Character.UnicodeBlock as it's just a
simple lookup via reflection into the Character.UnicodeBlock class. :)
OTOH, how does all this Java capability work when generating C or LISP?
In other words, is it sufficient for me to allow
lexer grammar test(language=C);
a : "\u0020blort" ;
INT : (DIGIT)+ ;
ID : (KOREAN)+ 'LATIN CHAR SPEC WRITTEN ALL THE WAY OUT' ;
and then convert all of the strings, characters to IntString and the
DIGIT, KOREAN to sets of characters? Everything would be normalized at
that point and ANTLR could proceed as if it were all just a bunch of
ints and sets of ints as it does now.
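For block names such as KOREAN, one possible approach is to resolve the name through Character.UnicodeBlock once, while the grammar is being analyzed, and keep only the resulting range of ints. The generated C or LISP then never needs the Java classes. BlockLookup is a hypothetical name, and Character.UnicodeBlock.forName requires Java 5 or later; the sketch relies on Unicode blocks being contiguous.

```java
/** Hypothetical helper that resolves a Unicode block name to a code point range. */
final class BlockLookup {
    /**
     * Returns {first, last} for the named block. forName throws
     * IllegalArgumentException if the name is not a known block.
     */
    static int[] rangeOf(String blockName) {
        Character.UnicodeBlock block = Character.UnicodeBlock.forName(blockName);
        int first = -1;
        int last = -1;
        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            // of() returns null for code points outside every block.
            if (Character.UnicodeBlock.of(cp) == block) {
                if (first < 0) first = cp;
                last = cp;
            }
        }
        return new int[]{first, last};
    }
}
```

A grammar name like KOREAN would still need a mapping to one or more real block names, such as HANGUL_SYLLABLES; that mapping is not something the Java API provides.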
> These may be nice, though Antlr has gotten along just fine until now
> without them. I would heavily caution implementing these, or basing
> implementation issues on them until someone speaks up who would
> actually use them. And even then, I caution adding large library needs
> to Antlr just to support optional features.
Yes, please speak up if you know about what you'd need for BENGALI and
classes like DIGIT, TITLECASE, etc...
> [ Personally, I might like to see some of the Unicode character classes
> (though, by no means all of them). The Unicode blocks are useless to
> me, as is case folding. I suspect for real world language parsing
> needs, unless a format is actually defined in terms of the Unicode
> properties, a grammar writer might prefer to explicitly declare their
> character sets in their lexer anyway. ]
Well, judging from the SableCC grammar that defines all the JavaID sets
manually, I'd say that some predefined blocks would be useful, though
just how many people are doing Java parsers ;)
Heh, we might actually be converging on a solution! I'd like to decide
soon so that I can refactor to have IntString rather than String in the
right places.
Ter