[antlr-interest] Whitespace matching

Sat Apr 14 05:46:20 PDT 2012

Hi Jim,

Didn't see it before, sorry. It seems there's a lot of hidden complexity in
ANTLR that I've got to understand. I haven't read the starting wiki; is there
anything in particular I should be looking at? I've just been working from
the ANTLR reference book.

I think what I'm really missing is how the grammar actually maps to the Java
output of ANTLR.
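
To make that question concrete, here is a rough hand-written sketch of the
general shape of that mapping for the small test grammar quoted below. It is
not what ANTLR actually generates. It only assumes that the Java target emits
roughly one method per rule, with the lexer turning characters into tokens
and each parser rule turning into a method that matches tokens in order. The
names RuleMappingSketch, Type, tokenize and parses are made up for
illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hand-written illustration only; NOT ANTLR-generated code.
public class RuleMappingSketch {
    // One token type per lexer rule or parser-rule literal ('L', 'Q').
    enum Type { WHITESPACE, L, Q, EOF }

    static boolean isWs(char c) {
        return c == ' ' || c == '\t' || c == '\r' || c == '\n';
    }

    // Lexer side. WHITESPACE : (' ' | '\t' | '\r' | '\n')+ ; becomes a loop
    // that consumes at least one character before emitting a token.
    public static List<Type> tokenize(String input) {
        List<Type> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (isWs(c)) {
                while (i < input.length() && isWs(input.charAt(i))) {
                    i++;
                }
                tokens.add(Type.WHITESPACE);
            } else if (c == 'L') {
                tokens.add(Type.L);
                i++;
            } else if (c == 'Q') {
                tokens.add(Type.Q);
                i++;
            } else {
                throw new IllegalArgumentException("no viable token at index " + i);
            }
        }
        tokens.add(Type.EOF);
        return tokens;
    }

    // Parser side: one method per parser rule, working on the token list.
    private final List<Type> tokens;
    private int pos = 0;

    RuleMappingSketch(List<Type> tokens) {
        this.tokens = tokens;
    }

    private void match(Type expected) {
        Type actual = tokens.get(pos);
        if (actual != expected) {
            throw new IllegalStateException(
                "mismatched input " + actual + " expecting " + expected);
        }
        pos++;
    }

    // start : program EOF ;
    void start() { program(); match(Type.EOF); }

    // program : WHITESPACE line+ WHITESPACE (query WHITESPACE)* ;
    void program() {
        match(Type.WHITESPACE);
        do { line(); } while (tokens.get(pos) == Type.L);
        match(Type.WHITESPACE);
        while (tokens.get(pos) == Type.Q) {
            query();
            match(Type.WHITESPACE);
        }
    }

    // line : 'L' ;
    void line() { match(Type.L); }

    // query : 'Q' ;
    void query() { match(Type.Q); }

    // Convenience wrapper: true if the whole input matches the start rule.
    public static boolean parses(String input) {
        try {
            new RuleMappingSketch(tokenize(input)).start();
            return true;
        } catch (RuntimeException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(parses("\r\nL\r\n")); // leading and trailing whitespace present
        System.out.println(parses("L"));         // no leading WHITESPACE token
    }
}
```

In this sketch the parser only ever sees token types, so which lexer rule
claims a given piece of input decides whether a parser rule can match it.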

Jason.

On 14 April 2012 01:10, Jim Idle <jimi at temporal-wave.com> wrote:

> Did you read my reply?
>
> On Apr 13, 2012, at 3:55 PM, Jason Jones <jmjones5 at gmail.com> wrote:
>
> > Yeah thanks, looks a bit better and definitely makes more sense, but still
> > having the weird whitespace mismatch issue... :S
> >
> > On 13 April 2012 14:34, Charles Daniels <cjdaniels4 at gmail.com> wrote:
> >
> >> Try the following changes (note that some of your parser rules become
> >> lexer rules):
> >>
> >> atom : SMALL_ATOM | STRING;
> >>
> >> COMMENT : '% ' ~('\n'|'\r')* '\r'? '\n'
> >>         | '/*' ( options {greedy=false;} : . )* '*/' ;
> >> SMALL_ATOM : LOWERCASE_LETTER CHARACTER* ;
> >> VARIABLE : UPPERCASE_LETTER CHARACTER* ;
> >> NUMERAL : DIGIT+ ;
> >> STRING : '"' (CHARACTER | WHITESPACE)* '"' ;
> >>
> >> fragment CHARACTER : LOWERCASE_LETTER | UPPERCASE_LETTER | DIGIT | SPECIAL ;
> >> fragment LOWERCASE_LETTER : 'a' .. 'z' ;
> >> fragment UPPERCASE_LETTER : 'A' .. 'Z' | '_' ;
> >> fragment DIGIT : '0' .. '9' ;
> >> fragment SPECIAL : '+' | '-' | '*' | '/' | '\\' | '^' | '~' | ':' | '.' | '?'
> >>                  | '@' | '#' | '$' | '&' ;
> >>
> >>
> >> I haven't tested this, but it should get you closer to what you need, if
> >> it doesn't completely address the issue.
> >>
> >> Regards,
> >> Chuck
> >>
> >> On Fri, Apr 13, 2012 at 9:03 AM, Jason Jones <jmjones5 at gmail.com> wrote:
> >>
> >>> Ah, I see. I think I get what's been happening (whether I understand it
> >>> is a different matter): there must be something else in my Prolog grammar
> >>> that's changing the behaviour of the lexer/parser. I assumed that if I
> >>> just added the rules you have, it would work the same as yours, but
> >>> apparently not. Here's the full grammar that I've been playing with:
> >>>
> >>> //TODO: Add grammar for operators
> >>> //TODO: Add grammar for lists - DONE
> >>> //TODO: Add grammar for comments - DONE
> >>> //TODO: Add grammar for whitespace
> >>>
> >>> grammar prolog;
> >>>
> >>> //options {
> >>> //output=template;
> >>> //rewrite=true;
> >>> //}
> >>>
> >>> start : program EOF;
> >>> program : WHITESPACE line+ WHITESPACE (query WHITESPACE)*;
> >>> line    :    'L';
> >>> query    :    'Q';
> >>> //line : clause | comment ;
> >>> comment : '% ' string '\r\n' | '/*' string '*/' ;
> >>> //Doesn't allow commas, parentheses, square brackets, etc. in comments.
> >>> //Consider fixing!
> >>> //Another issue is how the single-line comment is ended: is it determined
> >>> //by the newline character?
> >>> clause : predicate ('.' | ':-' predicate_list '.') ;
> >>> predicate : atom | atom '(' term_list ')' ;
> >>> predicate_list : predicate (',' predicate)* ;
> >>> list : '[' term_list ('|' term)? ']' ;
> >>>
> >>> structure : atom '(' term_list ')' ;
> >>> term_list : term (',' term)* ;
> >>>
> >>> //query : '?-' predicate_list '.' ;
> >>>
> >>> term : numeral | atom | variable | structure | list ;
> >>> atom : small_atom | '\'' string '\'';
> >>> small_atom : LOWERCASE_LETTER character*;
> >>> variable : UPPERCASE_LETTER character* ;
> >>> numeral : DIGIT+ ;
> >>> character : LOWERCASE_LETTER | UPPERCASE_LETTER | DIGIT | SPECIAL ;
> >>> string : character+ (WHITESPACE+ character+)* ;
> >>>
> >>> WHITESPACE  : (' ' | '\t' | '\r' | '\n')+ ; //currently only used in string
> >>> //NEWLINE : '\r\n' | '\n' ;
> >>> LOWERCASE_LETTER : 'a' .. 'z' ;
> >>> UPPERCASE_LETTER : 'A' .. 'Z' | '_' ;
> >>> DIGIT : '0' .. '9' ;
> >>> SPECIAL : '+' | '-' | '*' | '/' | '\\' | '^' | '~' | ':' | '.' | '?' | '@'
> >>>         | '#' | '$' | '&' ;
> >>>
> >>> So when I create a grammar containing just the rules you've suggested, it
> >>> works fine, but why does it not work when I use the same rules in this
> >>> grammar?
> >>>
> >>> Jason.
> >>>
> >>> On 13 April 2012 12:39, Bart Kiers <bkiers at gmail.com> wrote:
> >>>
> >>>> You must be doing something wrong/different. Perhaps you're running an
> >>>> old .class file?
> >>>> I copied your prolog.g grammar and Main.java file and did this:
> >>>>
> >>>> wget http://www.antlr.org/download/antlr-3.4-complete.jar
> >>>> java -cp antlr-3.4-complete.jar org.antlr.Tool prolog.g
> >>>> javac -cp antlr-3.4-complete.jar *.java
> >>>> java -cp .:antlr-3.4-complete.jar Main
> >>>>
> >>>> which didn't produce any error or warning.
> >>>>
> >>>> Regards,
> >>>>
> >>>> Bart.
> >>>>
> >>>>
> >>>>
> >>>> On Fri, Apr 13, 2012 at 1:06 PM, Jason Jones <jmjones5 at gmail.com> wrote:
> >>>>
> >>>>> Stranger... Okay, well, I've done a manual test using this class:
> >>>>>
> >>>>> import org.antlr.runtime.*;
> >>>>>
> >>>>>
> >>>>> public class Main {
> >>>>>     public static void main(String[] args) throws Exception {
> >>>>>         prologLexer lexer = new prologLexer(new ANTLRStringStream("\r\nL\r\n"));
> >>>>>         prologParser parser = new prologParser(new CommonTokenStream(lexer));
> >>>>>         parser.start();
> >>>>>     }
> >>>>> }
> >>>>>
> >>>>> After running it like so:
> >>>>>
> >>>>> $ java -cp .:/usr/local/antlr-3.4/lib/antlr-3.4-complete.jar Main
> >>>>> line 1:0 mismatched input '\r\n' expecting WHITESPACE
> >>>>>
> >>>>> I still seem to be getting the same issue ^. Here's the current grammar
> >>>>> that I used to create the parser and lexer:
> >>>>>
> >>>>>
> >>>>> start : program EOF;
> >>>>> program : WHITESPACE line+ WHITESPACE (query WHITESPACE)*;
> >>>>> line    :       'L';
> >>>>> query   :       'Q';
> >>>>>
> >>>>> WHITESPACE  : (' ' | '\t' | '\r' | '\n')+ ;
> >>>>>
> >>>>> Jason.
> >>>>>
> >>>>>
> >>>>> On 13 April 2012 07:12, Bart Kiers <bkiers at gmail.com> wrote:
> >>>>>
> >>>>>> Both the interpreter and the debugger from ANTLRWorks (1.4.3) parse the
> >>>>>> input just fine.
> >>>>>>
> >>>>>> I'm assuming you're not entering "\r" and "\n" as literals, but are
> >>>>>> actually entering line breaks in the text areas of ANTLRWorks'
> >>>>>> interpreter... Perhaps you've selected ANTLRWorks to start parsing with a
> >>>>>> different rule than the `start` rule? Anyway, forget about ANTLRWorks for a
> >>>>>> moment and whip up a manual test:
> >>>>>>
> >>>>>> public class Main {
> >>>>>>  public static void main(String[] args) throws Exception {
> >>>>>>    TLexer lexer = new TLexer(new ANTLRStringStream("\r\nL\r\n"));
> >>>>>>    TParser parser = new TParser(new CommonTokenStream(lexer));
> >>>>>>    parser.start();
> >>>>>>  }
> >>>>>> }
> >>>>>>
> >>>>>>
> >>>>>> Bart.
> >>>>>>
> >>>>>>
> >>>>>> On Fri, Apr 13, 2012 at 12:09 AM, Jason Jones <jmjones5 at gmail.com> wrote:
> >>>>>>
> >>>>>>> Hi Bart,
> >>>>>>>
> >>>>>>> I think we're using different versions of ANTLR (or something along
> >>>>>>> those lines), as using your grammar I get a MismatchedTokenException with
> >>>>>>> the input you've used, "\r\nL\r\n". I'm currently using ANTLRWorks version
> >>>>>>> 1.4.3; could this be the reason why your end seems to be working and mine
> >>>>>>> not?
> >>>>>>>
> >>>>>>> Jason.
> >>>>>>>
> >>>>>>>
> >>>>>>> On 12 April 2012 22:06, Bart Kiers <bkiers at gmail.com> wrote:
> >>>>>>>
> >>>>>>>> Hi Jason,
> >>>>>>>>
> >>>>>>>> Then there's something other than what you've posted going wrong,
> >>>>>>>> since the parser generated from:
> >>>>>>>>
> >>>>>>>> start      : program EOF;
> >>>>>>>> program    : WHITESPACE line+ WHITESPACE (query WHITESPACE)*;
> >>>>>>>> line       : 'L';
> >>>>>>>> query      : 'Q';
> >>>>>>>> WHITESPACE : (' ' | '\t' | '\r' | '\n')+;
> >>>>>>>>
> >>>>>>>> parses the input "\r\nL\r\n" just fine.
> >>>>>>>>
> >>>>>>>> Regards,
> >>>>>>>>
> >>>>>>>> Bart.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Thu, Apr 12, 2012 at 10:48 PM, Jason Jones <jmjones5 at gmail.com> wrote:
> >>>>>>>>
> >>>>>>>>> Hi Bart,
> >>>>>>>>>
> >>>>>>>>> Thanks for the suggestion, although it doesn't work either... The
> >>>>>>>>> skip option does work, but since I'll be doing something with the
> >>>>>>>>> whitespace later I don't want to take this option. Is there something
> >>>>>>>>> else we're missing?
> >>>>>>>>>
> >>>>>>>>> Jason.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On 12 April 2012 19:10, Bart Kiers <bkiers at gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi Jason,
> >>>>>>>>>>
> >>>>>>>>>> On Thu, Apr 12, 2012 at 6:43 PM, Jason Jones <jmjones5 at gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> ...
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> start : program ;
> >>>>>>>>>>> program : WHITESPACE line+ WHITESPACE (query WHITESPACE)*;
> >>>>>>>>>>>
> >>>>>>>>>>> WHITESPACE  : (' ' | '\t' | '\r' | '\n')* ; //currently only used in string
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>> A lexer rule must always match something: if it can match zero
> >>>>>>>>>> chars, it can/will go in an infinite loop.
> >>>>>>>>>>
> >>>>>>>>>> Do something like this:
> >>>>>>>>>>
> >>>>>>>>>> start : program ;
> >>>>>>>>>> program : WHITESPACE? line+ WHITESPACE? (query WHITESPACE?)*;
> >>>>>>>>>> WHITESPACE  : (' ' | '\t' | '\r' | '\n')+ ;
> >>>>>>>>>>
> >>>>>>>>>> or simply skip spaces like this:
> >>>>>>>>>>
> >>>>>>>>>> start : program ;
> >>>>>>>>>> program : line+ query*;
> >>>>>>>>>> WHITESPACE  : (' ' | '\t' | '\r' | '\n')+ {skip();} ;
> >>>>>>>>>>
> >>>>>>>>>> Regards,
> >>>>>>>>>>
> >>>>>>>>>> Bart.
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>> List: http://www.antlr.org/mailman/listinfo/antlr-interest
> >>> Unsubscribe:
> >>> http://www.antlr.org/mailman/options/antlr-interest/your-email-address
> >>>
> >>
> >>
> >
>