[antlr-interest] Support to Ter, questions

Randall R Schulz rschulz at sonic.net
Mon Jul 2 07:59:52 PDT 2007


On Monday 02 July 2007 07:41, Jim Idle wrote:
> ...
>
> > Questions (or remarks?):
> > ---------
> > To recap what we suspect are bugs in ANTLR-2:
> >
> > - using the "à" character (à) anywhere in
> > a xxx.g file, even if only in a comment, causes
> > ANTLR to abort.
>
> This is not a bug; it is just that Ter hates à. The grave is a
> freeloader on the whole latin-1 race... ;-) It does rather sound like
> a bug though.

I had an em-dash in a comment and thought it was a bug when ANTLR choked
on it thusly:

ANTLR Parser Generator  Version 3.0 (May 17, 2007)  1989-2007
error(10):  internal error: CLIF.g : CLIF.g:810:12: expecting '*', found ' '
org.antlr.tool.ANTLRLexer.nextToken(ANTLRLexer.java:321)
antlr.TokenStreamRewriteEngine.nextToken(TokenStreamRewriteEngine.java:161)


... but then I realized that the file may not have been a properly
formatted UTF-8 file and that's what confused ANTLR. So I didn't report
it until I could try a better experiment—(there! that's an em-dash—I like
them...) (If I could figure out how to get an ellipsis, I'd use them,
too. I think I like them even better than em-dashes.)


So, first things first, does ANTLR support grammars written in arbitrary
Unicode? Does it accept UTF-8-encoded files? I know the Java source
grammar is defined as accepting Unicode, not just, say, ISO-8859-1.
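
A quick way to rule out the "not a properly formatted UTF-8 file" possibility before blaming ANTLR is to decode the grammar file strictly and see whether the decoder complains. The following is only a sketch: the class name Utf8Check and the method isValidUtf8 are made up for illustration, and it assumes a JDK recent enough to have java.nio.charset.StandardCharsets (Java 7 or later). It says nothing about which encoding ANTLR itself uses when reading a grammar; it only tells you whether the bytes on disk are well-formed UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of ANTLR.
final class Utf8Check {

    /** Returns true if the bytes decode as strict UTF-8, false otherwise. */
    static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Usage: java Utf8Check path/to/grammar.g
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(args[0] + (isValidUtf8(bytes)
                ? ": well-formed UTF-8"
                : ": NOT well-formed UTF-8"));
    }
}
```

An em-dash saved by an editor using Windows-1252 is the single byte 0x97, and an "à" saved as ISO-8859-1 is the single byte 0xE0; neither is valid UTF-8 on its own, so a file containing either would be reported as not well-formed.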


> ...
>
> Jim


Randall Schulz

