[antlr-interest] No viable for alternative with ISO-LATIN-1 non-breaking space character
Jamie Penney
jpen054 at ec.auckland.ac.nz
Mon Feb 18 14:41:41 PST 2008
I had an issue earlier today with the Java version of the grammar I am
working on not reading UTF-8 encoded text properly. I would also like to
know what the default encoding is.
Thanks,
Jamie Penney
Darach Ennis wrote:
> Hi Jim.
>
> Bingo! Thank you! You were very close:
>
> new ANTLRFileStream("/tmp/nbsp.txt", "ISO-8859-1")
>
> The non-breaking space is encoding-specific and my input stream is
> ISO-8859-1, so this should be ISO-8859-1 in my case. What is the
> default encoding in ANTLRInputStream? Is it UTF-8 or the system
> encoding? The javadoc could mention what the default is.
>
> Regards,
>
> Darach.
>
> PS: I generally use the POSIX.1 od utility (od -H file.txt on
> unix/linux) to verify characters in the input encoding.
>
> On Feb 18, 2008 8:53 PM, Jim Idle <jimi at temporal-wave.com> wrote:
>
> Are you sure that that is actually character 0xa0? Print the hex
> value of it.
>
>
>
> However, I think that perhaps you need to add the "UTF8" encoding
> option to your input stream?
>
>
>
> new ANTLRFileStream("/tmp/nbsp.txt", "UTF8")
>
>
>
> Jim
>
>
>
> *From:* Darach Ennis [mailto:darach at gmail.com]
> *Sent:* Monday, February 18, 2008 8:59 AM
> *To:* antlr-interest at antlr.org
> *Subject:* [antlr-interest] No viable for alternative with
> ISO-LATIN-1 non-breaking space character
>
>
>
> Hi guys,
>
> I'm not sure if this is a case of user error or a bug. I have
> replicated the issue in a testcase as follows:
>
> grammar Test;
>
> @parser::header {
> import java.io.FileInputStream;
> }
>
> @parser::members {
> public static void main(String args[]) throws Throwable {
> final ANTLRInputStream cs = new ANTLRInputStream(new
> FileInputStream("/tmp/nbsp.txt"));
> final TestLexer sl = new TestLexer(cs);
> final CommonTokenStream cts = new CommonTokenStream(sl);
> final TestParser sp = new TestParser(cts);
> sp.rules();
> }
> }
>
> rules: anything+;
> anything: Other | Directive ;
> Other: '-' ( ('directive') => ('directive') { $type = Directive; } | /* empty */ );
> WS : (' ' | '\t' | '\f' | '\r' | '\n' | '\u00a0') { $channel=HIDDEN; };
>
> Despite defining a non-breaking space (ISO-Latin-1) within the
> whitespace-hiding lexer rule 'WS', test input containing this
> character fails to parse as expected. Here is some test input:
>
> -directive †-directive †-directive †-directive - - -
>
> Here is some example output:
>
> line 1:11 no viable alternative at character '†'
> line 1:24 no viable alternative at character '†'
> line 1:37 no viable alternative at character '†'
>
>
> Given the above grammar I would have expected the non-breaking
> space (\u00a0) to be ignored.
>
> Is this a bug or user error? If user error, can anyone suggest a
> grammar fix?
>
> Regards,
>
> Darach.
>
>
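
For anyone landing on this thread later: the failure comes down to which charset turns the file's bytes into characters before the lexer ever sees them. The sketch below uses only the JDK (no ANTLR classes) to show that the single byte 0xA0 decodes to '\u00a0' under ISO-8859-1 but to the replacement character U+FFFD under UTF-8, which the WS rule does not match. The class and method names are made up for illustration. My recollection is that ANTLR 3's ANTLRInputStream and ANTLRFileStream fall back to the platform default charset when no encoding is passed, but treat that as an assumption and check the source for your version.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only; not part of ANTLR.
class NbspDecoding {
    // Decode raw bytes with an explicit charset.
    // Malformed input is replaced with U+FFFD rather than throwing.
    static String decode(byte[] bytes, Charset cs) {
        return new String(bytes, cs);
    }

    // Render each char as a 4-digit hex code unit, similar in spirit
    // to checking the raw input with od.
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("%04x", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // '-' followed by a non-breaking space as stored in an ISO-8859-1 file.
        byte[] latin1 = { '-', (byte) 0xA0 };
        System.out.println("platform default: " + Charset.defaultCharset());
        System.out.println("as ISO-8859-1:    " + hex(decode(latin1, StandardCharsets.ISO_8859_1)));
        System.out.println("as UTF-8:         " + hex(decode(latin1, StandardCharsets.UTF_8)));
    }
}
```

Passing the encoding explicitly, as Darach ended up doing with ANTLRFileStream, avoids depending on whatever the platform default happens to be on a given machine.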