[antlr-interest] Whitespace matching

Fri Apr 13 09:31:30 PDT 2012

You have a lexer rule called WHITESPACE, but you ALSO have the literal
string '\r\n' in your parser rules (in the comment rule). ANTLR generates
its own token (T__NN) for that literal, so for the input "\r\n" the lexer
produces that token and not your WHITESPACE token. Hence the mismatched
token.
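
To make the clash concrete, here is a toy model in plain Java (no ANTLR
involved). It assumes the lexer takes the longest match and that a tie goes
to the rule listed first, with the rule for the literal listed ahead of
WHITESPACE. The names LITERAL_CRLF and LITERAL_L are made up for the
illustration, and ANTLR's real lexer is DFA-driven and more involved than
this sketch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LiteralClashDemo {
    // Toy tokenizer (an assumption, not ANTLR's algorithm): at each position
    // take the longest match; on a tie the rule listed first wins.
    static List<String> tokenize(Map<String, Pattern> rules, String input) {
        List<String> types = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestType = null;
            int bestLen = 0;
            for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // Strictly longer only, so an earlier rule keeps a tie.
                if (m.lookingAt() && m.end() - pos > bestLen) {
                    bestLen = m.end() - pos;
                    bestType = rule.getKey();
                }
            }
            if (bestType == null) {
                throw new IllegalStateException("no rule matches at index " + pos);
            }
            types.add(bestType);
            pos += bestLen;
        }
        return types;
    }

    // Hypothetical rule set: the '\r\n' literal from a parser rule is modelled
    // as an extra lexer rule placed before the named WHITESPACE rule.
    static Map<String, Pattern> rules(boolean withLiteral) {
        Map<String, Pattern> rules = new LinkedHashMap<>();
        if (withLiteral) {
            rules.put("LITERAL_CRLF", Pattern.compile("\r\n"));
        }
        rules.put("LITERAL_L", Pattern.compile("L"));
        rules.put("WHITESPACE", Pattern.compile("[ \t\r\n]+"));
        return rules;
    }

    public static void main(String[] args) {
        String input = "\r\nL\r\n";
        System.out.println(tokenize(rules(true), input));
        System.out.println(tokenize(rules(false), input));
    }
}
```

With the literal rule present the sketch prints
[LITERAL_CRLF, LITERAL_L, LITERAL_CRLF]; without it the result is
[WHITESPACE, LITERAL_L, WHITESPACE], which is the token sequence the
program rule expects.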

Do not use 'literals' in your parser rules, as it gets you into trouble
when you are starting out. *

Have you been through the getting started posts in the wiki?

Jim

* I feel like I must have written this line about 32,768 times ;)

> -----Original Message-----
> From: antlr-interest-bounces at antlr.org
> [mailto:antlr-interest-bounces at antlr.org] On Behalf Of Jason Jones
> Sent: Friday, April 13, 2012 6:03 AM
> To: Bart Kiers
> Cc: antlr-interest at antlr.org interest
> Subject: Re: [antlr-interest] Whitespace matching
>
> Ah, I see. I think I get what's been happening (whether I understand it
> is a different matter): there must be something else in my prolog
> grammar that's changing the behaviour of the lexer/parser. I assumed
> that if I just added the rules you have, it would work the same as
> yours, but apparently not. Here's the full grammar that I've been
> playing with:
>
> //TODO: Add grammar for operators
> //TODO: Add grammar for lists - DONE
> //TODO: Add grammar for comments - DONE
> //TODO: Add grammar for whitespace
>
> grammar prolog;
>
> //options {
> //output=template;
> //rewrite=true;
> //}
>
> start : program EOF;
> program : WHITESPACE line+ WHITESPACE (query WHITESPACE)*;
> line    :    'L';
> query    :    'Q';
> //line : clause | comment ;
> comment : '% ' string '\r\n' | '/*' string '*/' ;
> //Doesn't allow commas, parentheses, square brackets, etc. in comments.
> //Consider fixing!
> //Another issue is how the single-line comment is ended: is it
> //determined by the newline character?
> clause : predicate ('.' | ':-' predicate_list '.') ;
> predicate : atom | atom '(' term_list ')' ;
> predicate_list : predicate (',' predicate)* ;
> list : '[' term_list ('|' term)? ']' ;
>
> structure : atom '(' term_list ')' ;
> term_list : term (',' term)* ;
>
> //query : '?-' predicate_list '.' ;
>
> term : numeral | atom | variable | structure | list ;
> atom : small_atom | '\'' string '\'' ;
> small_atom : LOWERCASE_LETTER character* ;
> variable : UPPERCASE_LETTER character* ;
> numeral : DIGIT+ ;
> character : LOWERCASE_LETTER | UPPERCASE_LETTER | DIGIT | SPECIAL ;
> string : character+ (WHITESPACE+ character+)* ;
>
> WHITESPACE  : (' ' | '\t' | '\r' | '\n')+ ; //currently only used in string
> //NEWLINE : '\r\n' | '\n' ;
> LOWERCASE_LETTER : 'a' .. 'z' ;
> UPPERCASE_LETTER : 'A' .. 'Z' | '_' ;
> DIGIT : '0' .. '9' ;
> SPECIAL : '+' | '-' | '*' | '/' | '\\' | '^' | '~' | ':' | '.' | '?' | '@'
>         | '#' | '$' | '&' ;
>
> So when I create a grammar containing just the rules you've suggested,
> it works fine. Why does it not work when I use the same rules in this
> grammar?
>
> Jason.
>
> On 13 April 2012 12:39, Bart Kiers <bkiers at gmail.com> wrote:
>
> > You must be doing something wrong/different. Perhaps you're running
> > an old .class file?
> > I copied your prolog.g grammar and Main.java file and did this:
> >
> > wget http://www.antlr.org/download/antlr-3.4-complete.jar
> > java -cp antlr-3.4-complete.jar org.antlr.Tool prolog.g
> > javac -cp antlr-3.4-complete.jar *.java
> > java -cp .:antlr-3.4-complete.jar Main
> >
> > which didn't produce any error or warning.
> >
> > Regards,
> >
> > Bart.
> >
> >
> >
> > On Fri, Apr 13, 2012 at 1:06 PM, Jason Jones <jmjones5 at gmail.com> wrote:
> >
> >> Stranger... Okay, well, I've done a manual test using this class:
> >>
> >> import org.antlr.runtime.*;
> >>
> >>
> >> public class Main {
> >>     public static void main(String[] args) throws Exception {
> >>         prologLexer lexer = new prologLexer(
> >>                 new ANTLRStringStream("\r\nL\r\n"));
> >>         prologParser parser = new prologParser(
> >>                 new CommonTokenStream(lexer));
> >>         parser.start();
> >>     }
> >> }
> >>
> >> After running it like so:
> >>
> >> $ java -cp .:/usr/local/antlr-3.4/lib/antlr-3.4-complete.jar Main
> >> line 1:0 mismatched input '\r\n' expecting WHITESPACE
> >>
> >> I still seem to be getting the same issue ^. Here's the current
> >> grammar that I used to create the parser and lexer:
> >>
> >>
> >> start : program EOF;
> >> program : WHITESPACE line+ WHITESPACE (query WHITESPACE)*;
> >> line    :       'L';
> >> query   :       'Q';
> >>
> >> WHITESPACE  : (' ' | '\t' | '\r' | '\n')+ ;
> >>
> >> Jason.
> >>
> >>
> >> On 13 April 2012 07:12, Bart Kiers <bkiers at gmail.com> wrote:
> >>
> >>> Both the interpreter and the debugger from ANTLRWorks (1.4.3) parse
> >>> the input just fine.
> >>>
> >>> I'm assuming you're not entering "\r" and "\n" as literals, but are
> >>> actually entering line breaks in the text areas of ANTLRWorks'
> >>> interpreter... Perhaps you've selected ANTLRWorks to start parsing
> >>> with a different rule than the `start` rule? Anyway, forget about
> >>> ANTLRWorks for a moment and whip up a manual test:
> >>>
> >>> import org.antlr.runtime.*;
> >>>
> >>> public class Main {
> >>>   public static void main(String[] args) throws Exception {
> >>>     TLexer lexer = new TLexer(new ANTLRStringStream("\r\nL\r\n"));
> >>>     TParser parser = new TParser(new CommonTokenStream(lexer));
> >>>     parser.start();
> >>>   }
> >>> }
> >>>
> >>>
> >>> Bart.
> >>>
> >>>
> >>> On Fri, Apr 13, 2012 at 12:09 AM, Jason Jones <jmjones5 at gmail.com> wrote:
> >>>
> >>>> Hi Bart,
> >>>>
> >>>> I think we're using different versions of ANTLR (or something along
> >>>> those lines), as with your grammar I get a MismatchedTokenException
> >>>> on the input you used, "\r\nL\r\n". I'm currently using
> >>>> ANTLRWorks version 1.4.3; could this be the reason why it works at
> >>>> your end and not at mine?
> >>>>
> >>>> Jason.
> >>>>
> >>>>
> >>>> On 12 April 2012 22:06, Bart Kiers <bkiers at gmail.com> wrote:
> >>>>
> >>>>> Hi Jason,
> >>>>>
> >>>>> Then there's something other than what you've posted going wrong,
> >>>>> since the parser generated from:
> >>>>>
> >>>>> start      : program EOF;
> >>>>> program    : WHITESPACE line+ WHITESPACE (query WHITESPACE)*;
> >>>>> line       : 'L';
> >>>>> query      : 'Q';
> >>>>> WHITESPACE : (' ' | '\t' | '\r' | '\n')+;
> >>>>>
> >>>>> parses the input "\r\nL\r\n" just fine.
> >>>>>
> >>>>> Regards,
> >>>>>
> >>>>> Bart.
> >>>>>
> >>>>>
> >>>>> On Thu, Apr 12, 2012 at 10:48 PM, Jason Jones <jmjones5 at gmail.com> wrote:
> >>>>>
> >>>>>> Hi Bart,
> >>>>>>
> >>>>>> Thanks for the suggestion, although it doesn't work either... The
> >>>>>> skip option does work, but since I'll be doing something with the
> >>>>>> whitespace later I don't want to take this option. Is there
> >>>>>> something else we're missing?
> >>>>>>
> >>>>>> Jason.
> >>>>>>
> >>>>>>
> >>>>>> On 12 April 2012 19:10, Bart Kiers <bkiers at gmail.com> wrote:
> >>>>>>
> >>>>>>> Hi Jason,
> >>>>>>>
> >>>>>>> On Thu, Apr 12, 2012 at 6:43 PM, Jason Jones <jmjones5 at gmail.com> wrote:
> >>>>>>>
> >>>>>>>> ...
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> start : program ;
> >>>>>>>> program : WHITESPACE line+ WHITESPACE (query WHITESPACE)*;
> >>>>>>>>
> >>>>>>>> WHITESPACE  : (' ' | '\t' | '\r' | '\n')* ; //currently only used in string
> >>>>>>>>
> >>>>>>>>
> >>>>>>> A lexer rule must always match at least one character: if it can
> >>>>>>> match zero characters, the lexer can go into an infinite loop.
> >>>>>>>
> >>>>>>> Do something like this:
> >>>>>>>
> >>>>>>> start : program ;
> >>>>>>> program : WHITESPACE? line+ WHITESPACE? (query WHITESPACE?)*;
> >>>>>>> WHITESPACE  : (' ' | '\t' | '\r' | '\n')+ ;
> >>>>>>>
> >>>>>>> or simply skip spaces like this:
> >>>>>>>
> >>>>>>> start : program ;
> >>>>>>> program : line+ query*;
> >>>>>>> WHITESPACE  : (' ' | '\t' | '\r' | '\n')+ {skip();} ;
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>>
> >>>>>>> Bart.
> >>>>>>>
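
On Bart's point earlier in the thread that a lexer rule must not be able to
match zero characters: the following is a rough Java sketch of why, using a
hypothetical one-rule tokenizer loop rather than ANTLR itself. The class
name, the countTokens helper, and the token cap are all assumptions made for
the illustration. A rule written with * can succeed without consuming
anything, so the position never advances and only the cap stops the loop.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmptyMatchDemo {
    // Runs a one-rule toy lexer and returns how many tokens it emitted,
    // giving up once maxTokens is reached (a safety cap for the demo).
    static int countTokens(Pattern rule, String input, int maxTokens) {
        int pos = 0;
        int emitted = 0;
        while (pos < input.length() && emitted < maxTokens) {
            Matcher m = rule.matcher(input);
            m.region(pos, input.length());
            if (!m.lookingAt()) {
                break; // nothing matches here; a real lexer would report an error
            }
            emitted++;
            pos = m.end(); // stays put when the match was empty
        }
        return emitted;
    }

    public static void main(String[] args) {
        Pattern star = Pattern.compile("[ \t\r\n]*"); // may match zero characters
        Pattern plus = Pattern.compile("[ \t\r\n]+"); // always consumes at least one
        System.out.println(countTokens(star, "\r\nL", 1000));
        System.out.println(countTokens(plus, "\r\nL", 1000));
    }
}
```

With the * rule the loop keeps emitting empty tokens at the 'L' until it hits
the cap of 1000; with the + rule it emits one token and then stops at the
'L'.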