[antlr-interest] No viable for alternative with ISO-LATIN-1 non-breaking space character

Mon Feb 18 14:41:41 PST 2008

I had an issue earlier today with the Java version of the grammar I am 
working on not reading UTF-8 encoded text properly, so I would also like 
to know what the default encoding is.

Thanks,
Jamie Penney

Darach Ennis wrote:
> Hi Jim.
>
> Bingo! Thank you! You were very close:
>
> new ANTLRFileStream("/tmp/nbsp.txt", "ISO-8859-1")
>
> The non-breaking space is encoding-specific, and my input stream is
> ISO-8859-1, so the encoding should be ISO-8859-1 in my case. What is
> the default encoding in ANTLRInputStream? Is it UTF-8 or the system
> encoding? The javadoc could mention what the default is.
>
> Regards,
>
> Darach.
>
> PS: I generally use the POSIX.1 od utility (od -H file.txt on
> Unix/Linux) to verify characters in the input encoding.
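
The encoding dependence described above can be checked with the JDK alone.
The following is a minimal sketch, not part of the original thread: the class
and method names are made up for illustration, and it assumes Java 7 or later
for StandardCharsets. It decodes the single byte 0xA0 (the non-breaking space
as stored in an ISO-8859-1 file) under two charsets and prints the platform
default, which is what java.io.InputStreamReader falls back to when no
encoding is given.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class NbspDecodingDemo {
    // Hypothetical helper: decode raw bytes with the given charset and
    // return the code point of the first decoded character.
    static int firstCodePoint(byte[] raw, Charset cs) {
        return new String(raw, cs).codePointAt(0);
    }

    public static void main(String[] args) {
        byte[] latin1Nbsp = { (byte) 0xA0 };             // NBSP in an ISO-8859-1 file
        byte[] utf8Nbsp = { (byte) 0xC2, (byte) 0xA0 };  // NBSP in a UTF-8 file

        // The charset used when a reader is created without an explicit encoding.
        System.out.println("platform default: " + Charset.defaultCharset());

        // Decoded with the matching charset, the byte becomes U+00A0.
        System.out.printf("0xA0 as ISO-8859-1 -> U+%04X%n",
                firstCodePoint(latin1Nbsp, StandardCharsets.ISO_8859_1));

        // A lone 0xA0 is malformed UTF-8, so the decoder substitutes U+FFFD.
        System.out.printf("0xA0 as UTF-8      -> U+%04X%n",
                firstCodePoint(latin1Nbsp, StandardCharsets.UTF_8));

        // The two-byte UTF-8 form decodes to U+00A0 again.
        System.out.printf("0xC2 0xA0 as UTF-8 -> U+%04X%n",
                firstCodePoint(utf8Nbsp, StandardCharsets.UTF_8));
    }
}
```

If the lexer receives anything other than U+00A0 for that byte, a rule that
matches '\u00a0' cannot match it, which is consistent with the "no viable
alternative" errors reported in the original message.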
>
> On Feb 18, 2008 8:53 PM, Jim Idle <jimi at temporal-wave.com> wrote:
>
>     Are you sure that it is actually character 0xa0? Print the hex
>     value of it.
>
>      
>
>     However, I think that perhaps you need to add the "UTF8" encoding
>     option to your input stream?
>
>      
>
>     new ANTLRFileStream("/tmp/nbsp.txt", "UTF8")
>
>      
>
>     Jim
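
Printing the hex value, as suggested above, can also be done from Java on the
text after it has been decoded. The following is a small sketch under stated
assumptions: the class and method names are hypothetical, and the sample
string stands in for whatever the input stream actually delivered to the
lexer.

```java
public class CodePointDump {
    // Hypothetical helper: render each code point of the text as four or
    // more uppercase hex digits, separated by single spaces.
    static String hexCodePoints(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", cp));
            i += Character.charCount(cp);  // step over surrogate pairs correctly
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Sample input: hyphen, non-breaking space, hyphen.
        System.out.println(hexCodePoints("-\u00a0-"));
    }
}
```

A correctly decoded non-breaking space shows up as 00A0; a value such as FFFD
suggests the bytes were decoded with the wrong charset.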
>
>      
>
>     *From:* Darach Ennis [mailto:darach at gmail.com]
>     *Sent:* Monday, February 18, 2008 8:59 AM
>     *To:* antlr-interest at antlr.org
>     *Subject:* [antlr-interest] No viable for alternative with
>     ISO-LATIN-1 non-breaking space character
>
>      
>
>     Hi guys,
>
>     I'm not sure if this is a case of user error or a bug. I have
>     replicated the issue in a testcase as follows:
>
>     grammar Test;
>
>     @parser::header {
>       import java.io.FileInputStream;
>     }
>
>     @parser::members {
>       public static void main(String args[]) throws Throwable {
>         final ANTLRInputStream cs = new ANTLRInputStream(new FileInputStream("/tmp/nbsp.txt"));
>         final TestLexer sl = new TestLexer(cs);
>         final CommonTokenStream cts = new CommonTokenStream(sl);
>         final TestParser sp = new TestParser(cts);
>         sp.rules();
>       }
>     }
>
>     rules:    anything+;
>     anything: Other | Directive ;
>     Other:   '-' ( ('directive') => ('directive') { $type = Directive; } | /* empty */ );
>     WS    :    (' ' | '\t' | '\f' | '\r' | '\n' | '\u00a0') { $channel=HIDDEN; };
>
>     Despite defining a non-breaking space (ISO-Latin-1) within the
>     whitespace-hiding lexer rule 'WS', test input containing this
>     character fails to parse as expected. Here is some test input:
>
>     -directive †-directive †-directive †-directive - - -
>
>     Here is some example output:
>
>     line 1:11 no viable alternative at character '†'
>     line 1:24 no viable alternative at character '†'
>     line 1:37 no viable alternative at character '†'
>
>
>     Given the above grammar I would have expected the non-breaking
>     space (\u00a0) to be ignored.
>
>     Is this a bug or user error? If user error, can anyone suggest a
>     grammar fix?
>
>     Regards,
>
>     Darach.
>
>