[antlr-interest] No viable for alternative with ISO-LATIN-1 non-breaking space character

Mon Feb 18 15:08:22 PST 2008

See:
http://java.sun.com/j2se/1.4.2/docs/api/java/nio/charset/Charset.html

"Every instance of the Java virtual machine has a default charset, which may or
may not be one of the standard charsets. The default charset is determined
during virtual-machine startup and typically depends upon the locale and
charset being used by the underlying operating system."
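
To see what that default is on a particular machine, something like the following sketch may help. It assumes Java 5 or later, because Charset.defaultCharset() is not part of the 1.4.2 API linked above. The class name DefaultCharsetDemo is made up for illustration.

```java
import java.nio.charset.Charset;

class DefaultCharsetDemo {
    // Returns the canonical name of the charset this JVM uses when
    // no encoding is given explicitly (e.g. new InputStreamReader(in)).
    static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println("default charset: " + defaultCharsetName());
        // The related system property; it can be set at startup
        // with -Dfile.encoding=... on the java command line.
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));
    }
}
```

The printed value can differ between machines and operating systems, which is why the same grammar may behave differently on the same input file.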

Guntis

> I had an issue earlier today with the Java version of the grammar I am
> working on not reading UTF-8 encoded text properly. I would also like to
> know what the default is.
>
> Thanks,
> Jamie Penney
>
> Darach Ennis wrote:
> > Hi Jim.
> >
> > Bingo! Thank you! You were very close:
> >
> > new ANTLRFileStream("/tmp/nbsp.txt", "ISO-8859-1")
> >
> > The non-breaking space is encoding specific and my input stream is
> > ISO-8859-1, so this should be "ISO-8859-1" in my case. What is the
> > default encoding in ANTLRInputStream? Is it UTF-8 or the system
> > encoding? The javadoc could mention what the default is.
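
To illustrate why the encoding argument matters here, the sketch below decodes the single byte 0xA0 under two charsets using only the JDK (no ANTLR classes, and it makes no claim about what ANTLRInputStream does by default). The class and method names are hypothetical, and StandardCharsets assumes Java 7 or later.

```java
import java.nio.charset.StandardCharsets;

class NbspDecodeDemo {
    // The non-breaking space as it is stored in an ISO-8859-1 file: one byte.
    static final byte[] NBSP_LATIN1 = { (byte) 0xA0 };

    // Decoding as ISO-8859-1 maps byte 0xA0 straight to U+00A0.
    static int decodeAsLatin1() {
        return new String(NBSP_LATIN1, StandardCharsets.ISO_8859_1).codePointAt(0);
    }

    // A lone 0xA0 is not valid UTF-8, so the decoder substitutes U+FFFD.
    static int decodeAsUtf8() {
        return new String(NBSP_LATIN1, StandardCharsets.UTF_8).codePointAt(0);
    }

    // In a UTF-8 file the same character takes two bytes: 0xC2 0xA0.
    static byte[] nbspAsUtf8Bytes() {
        return "\u00a0".getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(decodeAsLatin1())); // a0
        System.out.println(Integer.toHexString(decodeAsUtf8()));   // fffd
        System.out.println(nbspAsUtf8Bytes().length);              // 2
    }
}
```

A lexer rule that matches '\u00a0' only sees that character if the bytes were decoded with the charset the file was written in, which is what the explicit "ISO-8859-1" argument above arranges.
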
> >
> > Regards,
> >
> > Darach.
> >
> > PS: I generally use the POSIX.1 od utility (od -H file.txt on
> > unix/linux) to verify characters in the input encoding.
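
Where od is not available, a rough equivalent can be sketched in plain Java. This is only an illustration: the HexDump class and its toHex helper are invented names, and the output layout is much simpler than od's.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

class HexDump {
    // Formats each byte as two lowercase hex digits separated by spaces.
    static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < data.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            // Mask with 0xff so negative bytes such as (byte) 0xA0 print as "a0".
            sb.append(String.format("%02x", data[i] & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java HexDump <file>");
            return;
        }
        // Reads the raw bytes, so no charset decoding is involved at this stage.
        System.out.println(toHex(Files.readAllBytes(Paths.get(args[0]))));
    }
}
```

Run against the test file, a byte printed as a0 on its own would suggest ISO-8859-1 (or a similar single-byte encoding), while the pair c2 a0 would suggest UTF-8.
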
> >
> > On Feb 18, 2008 8:53 PM, Jim Idle <jimi at temporal-wave.com> wrote:
> >
> >     Are you sure that it is actually character 0xa0? Print the hex
> >     value of it.
> >
> >
> >
> >     However, I think that perhaps you need to add the "UTF8" encoding
> >     option to your input stream?
> >
> >
> >
> >     new ANTLRFileStream("/tmp/nbsp.txt", "UTF8")
> >
> >
> >
> >     Jim
> >
> >
> >
> >     *From:* Darach Ennis [mailto:darach at gmail.com]
> >     *Sent:* Monday, February 18, 2008 8:59 AM
> >     *To:* antlr-interest at antlr.org
> >     *Subject:* [antlr-interest] No viable for alternative with
> >     ISO-LATIN-1 non-breaking space character
> >
> >
> >
> >     Hi guys,
> >
> >     I'm not sure if this is a case of user error or a bug. I have
> >     replicated the issue in a testcase as follows:
> >
> >     grammar Test;
> >
> >     @parser::header {
> >       import java.io.FileInputStream;
> >     }
> >
> >     @parser::members {
> >       public static void main(String args[]) throws Throwable {
> >         final ANTLRInputStream cs =
> >             new ANTLRInputStream(new FileInputStream("/tmp/nbsp.txt"));
> >         final TestLexer sl = new TestLexer(cs);
> >         final CommonTokenStream cts = new CommonTokenStream(sl);
> >         final TestParser sp = new TestParser(cts);
> >         sp.rules();
> >       }
> >     }
> >
> >     rules:    anything+;
> >     anything: Other | Directive ;
> >     Other:   '-' ( ('directive') => ('directive') { $type = Directive; }
> >                  | /* empty */
> >                  );
> >     WS    :    (' ' | '\t' | '\f' | '\r' | '\n' | '\u00a0')
> >                { $channel=HIDDEN; };
> >
> >     Although the non-breaking space (ISO-Latin-1, \u00a0) is listed in
> >     the hidden-channel whitespace lexer rule 'WS', test input containing
> >     this character fails to parse as expected. Here is some test input:
> >
> >     -directive †-directive †-directive †-directive - - -
> >
> >     Here is some example output:
> >
> >     line 1:11 no viable alternative at character '†'
> >     line 1:24 no viable alternative at character '†'
> >     line 1:37 no viable alternative at character '†'
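
One possible reading of the '†' in these messages, which the thread does not confirm, is that byte 0xA0 was decoded with a charset in which that byte means U+2020 (dagger); MacRoman is one such charset. The sketch below tries this, assuming the JVM ships the optional "x-MacRoman" charset. DaggerDemo and decodeByte are hypothetical names.

```java
import java.nio.charset.Charset;

class DaggerDemo {
    // Decodes a single byte with the named charset and returns the resulting
    // code point, or -1 if this JVM does not provide that charset.
    static int decodeByte(int b, String charsetName) {
        if (!Charset.isSupported(charsetName)) {
            return -1;
        }
        byte[] one = { (byte) b };
        return new String(one, Charset.forName(charsetName)).codePointAt(0);
    }

    public static void main(String[] args) {
        // x-MacRoman is an extended charset and may be missing on some JVMs.
        for (String name : new String[] { "ISO-8859-1", "x-MacRoman" }) {
            int cp = decodeByte(0xA0, name);
            System.out.println(name + ": "
                + (cp < 0 ? "unsupported" : "U+" + Integer.toHexString(cp)));
        }
    }
}
```

If the second line prints U+2020, the lexer would have been handed a dagger rather than '\u00a0', and a dagger is not covered by the WS rule.
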
> >
> >
> >     Given the above grammar I would have expected the non-breaking
> >     space (\u00a0) to be ignored.
> >
> >     Is this a bug or user error? If user error, can anyone suggest a
> >     grammar fix?
> >
> >     Regards,
> >
> >     Darach.