[antlr-interest] proposal for 2.7.4, Unicode, and more...

matthew ford Matthew.Ford at forward.com.au
Mon May 3 02:25:58 PDT 2004


I agree with all of this.
It seems a very clear set of proposals.
matthew

----- Original Message ----- 
From: "Mark Lentczner" <markl at glyphic.com>
To: <antlr-interest at yahoogroups.com>
Sent: Monday, May 03, 2004 2:54 PM
Subject: Re: [antlr-interest] proposal for 2.7.4, Unicode, and more...


> Here is my take on Unicode and Antlr.  I realize that parts of this 
> have already be stated by other people in this list.  I thought it 
> would be good to pull together all those ideas and present an approach 
> as a cohesive, if way too long, proposal.
> 
> 0) Philosophy
> -------------
> There are two clear separations that should guide this design: First, 
> character set and character encoding are distinct concepts that must be 
> cleanly handled throughout.  Second, the semantics of Antlr shouldn't 
> depend on the implementation of Antlr.  This is especially true since 
> Antlr is partially re-implemented for different target languages (Java, 
> C++, C# etc...)
> 
> 1) Structure
> ------------
> I think a good case can be made for considering all parsing activity in 
> Antlr to be in Unicode.  The a lexer parses streams of characters into 
> tokens.  The grammar is described in terms of characters, not encoded 
> bytes.  (C++ is still C++ even if encoded in EBCDIC).  Since Unicode 
> encompasses virtually all known characters, defining the characters 
> that Antlr lexers read as Unicode covers all bases.  (See notes below 
> on binary.)
> 
> Handling different character encodings can be left completely to the 
> input stream class.  If a grammar is to only be applied to US-ASCII or 
> ISO-8860-3 characters, than the input stream can be limited to that, 
> and map them into Unicode presented to the generated lexer - there is 
> no need to make that distinction in the lexer grammar file.  On the 
> other hand, by specifying the grammar over Unicode, then by simply 
> changing the input stream, one can lex the same grammar over US-ASCII, 
> ISO-8860-3, UTF-8, or Shift-JIS, etc.
> 
> 2) Antlr Features
> -----------------
> The only semantic aspect of Antlr that actually depends on 
> charVocabulary is the concept of compliment (element and set).  What 
> started this thread was Terrance's observation that it is a constant 
> source of pitfalls:  Currently inversion means "of all the characters 
> used in the grammar, not these".  Which means that if my grammar only 
> mentions 'A'..'Z', and '0'..'9', then "~('0'..'9')" only means 
> 'A'..'Z'.  What most people expect is that "~('0'..'9')" should mean 
> ANY character in the input stream except '0'..'9'.  Rather than fix 
> this by changing the default charVocabulary, a better approach is to 
> just to directly change the meaning of compliment to mean what people 
> expect it to mean.  (See notes below on set inversion).
> 
> Once complement is defined this way, then the charVocabulary option can 
> be removed.
> 
> A large range of Unicode based built in character classes has been 
> suggested to be added.  I see nothing wrong with the proposed syntaxes, 
> but I question the utility of all the proposed options.  I have yet to 
> see a grammar that has a need to exclude particular Unicode blocks, for 
> example.  On the other hand, some of the Unicode character properties 
> are good candidates for inclusion.  I think restraint should reign 
> here, and Antlr should only implement at first what people will 
> actually use.
> 
> 3) Implementation
> -----------------
> Since Unicode is no longer limited to 16 bits (and hasn't been for 
> quite some time), internally, Antlr should avoid the whole morass of 
> surrogate pairs, and simply do all character operations with integers.  
> Furthermore, this is exactly what Java 1.5 is going to do, and it is 
> really the only viable option in C++ (wchar being what it is).
> 
> In either Java, C# or C++, as implemented on most modern processors, 
> there will be no performance difference manipulating 32 signed integers 
> vs. 8 unsigned chars in a lexer where they are dealt with one at a 
> time.  Even the string operations wouldn't be seriously affected since 
> most literals in a lexer tend to be short words and will be about as 
> efficient as small integer array compares.  This also allows all of 
> Antlr's internal state values (EOF, etc.) to be disjoint from all 
> characters (by using negative values)
> 
> The only major stumbling block to Antlr's use of Unicode internally are 
> its bit sets and the need for compliment.  In the generated code, the 
> use of bit sets is very regular, and a slightly more powerful 
> representation could easily support Unicode with complemented sets 
> without them always being O(2^20) bits in size.  Antlr's use of bit 
> sets during the analysis and generation, however, might need some more 
> sophisticated bit set class to handle things without simply resorting 
> to huge bit maps.  I'd be happy to lend some coding effort to make this 
> work.
> 
> When Antlr is used to parse binary formats, there is no real harm in 
> the internal Unicode interpretation.  The input source would only 
> happen to supply characters less than 256.  That set complements would 
> include characters beyond 8 btis wouldn't matter: They'd never be 
> presented by the input souce.  The only slight trick would be in proper 
> handling of 0, which isn't a valid Unicode character.  But I don't 
> think this would pose much of a problem.
> 
> - Mark
> 
> 
> Mark Lentczner
> markl at wheatfarm.org
> http://www.wheatfarm.org/
> 
> 
> 
>  
> Yahoo! Groups Links
> 
> 
> 
>  
> 


 
Yahoo! Groups Links

<*> To visit your group on the web, go to:
     http://groups.yahoo.com/group/antlr-interest/

<*> To unsubscribe from this group, send an email to:
     antlr-interest-unsubscribe at yahoogroups.com

<*> Your use of Yahoo! Groups is subject to:
     http://docs.yahoo.com/info/terms/
 



More information about the antlr-interest mailing list