[antlr-interest] No viable for alternative with ISO-LATIN-1 non-breaking space character
Guntis Ozols
guntiso at latnet.lv
Mon Feb 18 15:08:22 PST 2008
See:
http://java.sun.com/j2se/1.4.2/docs/api/java/nio/charset/Charset.html
"Every instance of the Java virtual machine has a default charset, which may or
may not be one of the standard charsets. The default charset is determined
during virtual-machine startup and typically depends upon the locale and
charset being used by the underlying operating system."
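
A minimal sketch of why this matters for the non-breaking space, using only the JDK (no ANTLR classes). The class name NbspDecoding and its decode helper are made up for illustration, and StandardCharsets assumes Java 7 or later, which is newer than the 1.4.2 docs linked above. The same byte 0xA0 decodes to U+00A0 when read as ISO-8859-1, but it is not a valid UTF-8 sequence on its own, so a UTF-8 decoder substitutes the replacement character U+FFFD:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class NbspDecoding {
    // Hypothetical helper: decode raw bytes with an explicitly named charset
    // instead of relying on whatever the JVM default happens to be.
    static String decode(byte[] raw, Charset cs) {
        return new String(raw, cs);
    }

    public static void main(String[] args) {
        // '-' followed by the single byte 0xA0, as in an ISO-8859-1 file.
        byte[] raw = { '-', (byte) 0xA0, '-' };

        // Platform dependent, so no particular value should be expected here.
        System.out.println("JVM default charset: " + Charset.defaultCharset());

        String latin1 = decode(raw, StandardCharsets.ISO_8859_1);
        String utf8 = decode(raw, StandardCharsets.UTF_8);

        // Prints U+00A0: the byte maps directly to the non-breaking space.
        System.out.printf("ISO-8859-1: U+%04X%n", (int) latin1.charAt(1));
        // Prints U+FFFD: a lone 0xA0 is malformed UTF-8 and gets replaced.
        System.out.printf("UTF-8:      U+%04X%n", (int) utf8.charAt(1));
    }
}
```

If the lexer sees U+FFFD rather than U+00A0, a rule that only lists '\u00a0' will not match it, which would be consistent with the "no viable alternative" errors quoted below.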
Guntis
> I had an issue earlier today with the Java version of the grammar I am
> working on not reading UTF-8 encoded text properly. I would also like to
> know what the default is.
>
> Thanks,
> Jamie Penney
>
> Darach Ennis wrote:
> > Hi Jim.
> >
> > Bingo! Thank you! You were very close:
> >
> > new ANTLRFileStream("/tmp/nbsp.txt", "ISO-8859-1")
> >
> > The non-breaking space is encoding-specific, and my input stream is
> > ISO-8859-1, so the encoding should be ISO-8859-1 in my case. What is
> > the default encoding in ANTLRInputStream? Is it UTF-8 or the system
> > encoding? The javadoc could mention what the default is.
> >
> > Regards,
> >
> > Darach.
> >
> > PS: I generally use the POSIX.1 od utility (od -H file.txt on
> > unix/linux) to verify characters in the input encoding.
> >
> > On Feb 18, 2008 8:53 PM, Jim Idle <jimi at temporal-wave.com> wrote:
> >
> > Are you sure that that is actually character 0xa0? Print the hex
> > value of it.
> >
> >
> >
> > However, I think that perhaps you need to add the "UTF8" encoding
> > option to your input stream?
> >
> >
> >
> > new ANTLRFileStream("/tmp/nbsp.txt", "UTF8")
> >
> >
> >
> > Jim
> >
> >
> >
> > *From:* Darach Ennis [mailto:darach at gmail.com]
> > *Sent:* Monday, February 18, 2008 8:59 AM
> > *To:* antlr-interest at antlr.org
> > *Subject:* [antlr-interest] No viable for alternative with
> > ISO-LATIN-1 non-breaking space character
> >
> >
> >
> > Hi guys,
> >
> > I'm not sure if this is a case of user error or a bug. I have
> > reproduced the issue in a test case as follows:
> >
> > grammar Test;
> >
> > @parser::header {
> >     import java.io.FileInputStream;
> > }
> >
> > @parser::members {
> >     public static void main(String args[]) throws Throwable {
> >         final ANTLRInputStream cs =
> >             new ANTLRInputStream(new FileInputStream("/tmp/nbsp.txt"));
> >         final TestLexer sl = new TestLexer(cs);
> >         final CommonTokenStream cts = new CommonTokenStream(sl);
> >         final TestParser sp = new TestParser(cts);
> >         sp.rules();
> >     }
> > }
> >
> > rules: anything+;
> > anything: Other | Directive ;
> > Other: '-' ( ('directive') => ('directive') { $type = Directive; }
> >            | /* empty */ );
> > WS : (' ' | '\t' | '\f' | '\r' | '\n' | '\u00a0') { $channel=HIDDEN; };
> >
> > Although the whitespace-hiding lexer rule 'WS' includes the
> > non-breaking space (ISO-Latin-1), test input containing this
> > character does not parse as expected. Here is some test input:
> >
> > -directive †-directive †-directive †-directive - - -
> >
> > Here is some example output:
> >
> > line 1:11 no viable alternative at character '†'
> > line 1:24 no viable alternative at character '†'
> > line 1:37 no viable alternative at character '†'
> >
> >
> > Given the above grammar I would have expected the non-breaking
> > space (\u00a0) to be ignored.
> >
> > Is this a bug or user error? If user error, can anyone suggest a
> > grammar fix?
> >
> > Regards,
> >
> > Darach.