[antlr-interest] Matching multiple occurrences of quoted text joined by 'and' (i.e. "a" and "b" and "c")

Wed Nov 3 18:29:49 PDT 2010

Greetings!

On Wed, 2010-11-03 at 22:40 +0000, Colin Yates wrote:
> Hmmm - I think I might be running into a bug - either in the code or my
> understanding (almost certainly my understanding!).

unfortunately i do believe you have misunderstood ANTLR lexers...

> I have created a simple grammar which demonstrates the problem (I am testing
> the first parser rule called 'rule1') :
> 
> --- start
> grammar QuotedText;
> 
> @parser::header {
> package examples.aandb;
> }
> 
> @lexer::header {
> package examples.aandb;
> }
> 
> rule1
> : a=QUOTED_TEXT 'and' b=QUOTED_TEXT 'and' c=QUOTED_TEXT
> { System.out.println("rule1: " + a.getText() + ", " + b.getText() + "," +
> c.getText());}
> ;
> ruleThatShouldBeIgnored
> : 'and whose' 'external'? 'resource is' theResource=('this' | QUOTED_TEXT)
> { System.out.println("taskResource: " + $theResource);}
> ;
> 

please recall that ANTLR Lexers are greedy *and* Lexers do *not*
back-track.

so when the lexer sees the input string 'and ' in the '"a" and "b"'
sub-phrase of your input, it commits itself to recognizing the
'and whose' keyword and only the 'and whose' keyword. of course this
particular input string does not contain the 'and whose' keyword, so the
lexer generates the error you are observing (i.e. we are at the opening
" of the "b" sub-phrase and not at the 'w' that must follow any instance
of 'and ' in order to recognize the 'and whose' keyword).
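
to make the commit-and-fail behaviour concrete, here is a rough java
sketch. it is a hand-written toy, not the code ANTLR generates: the
class name CommittedLexerSketch and the "error@index" strings are made
up for illustration, and unlike a real ANTLR lexer it stops at the first
error instead of recovering.

```java
import java.util.ArrayList;
import java.util.List;

// Hand-written toy (an assumption for illustration, NOT ANTLR output):
// it picks a keyword from one character of lookahead, then commits to it.
class CommittedLexerSketch {

    // Returns the token texts; on failure appends "error@<index>: ..." and stops.
    static List<String> tokenize(String in) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < in.length()) {
            char c = in.charAt(i);
            if (c == ' ' || c == '\t' || c == '\n' || c == '\r') { i++; continue; } // WS is skipped
            if (c == '"') { // QUOTED_TEXT : '"' (~'"')* '"'
                int end = in.indexOf('"', i + 1);
                if (end < 0) { out.add("error@" + in.length() + ": expecting '\"'"); return out; }
                out.add(in.substring(i, end + 1));
                i = end + 1;
                continue;
            }
            if (in.startsWith("and", i)) {
                // A blank right after "and" can only belong to 'and whose', so choose it.
                boolean blankFollows = i + 3 < in.length() && in.charAt(i + 3) == ' ';
                String literal = blankFollows ? "and whose" : "and";
                // Committed: match the chosen literal; never retry with the shorter keyword.
                for (int k = 0; k < literal.length(); k++) {
                    int pos = i + k;
                    if (pos >= in.length() || in.charAt(pos) != literal.charAt(k)) {
                        out.add("error@" + pos + ": expecting '" + literal.charAt(k) + "'");
                        return out;
                    }
                }
                out.add(literal);
                i += literal.length();
                continue;
            }
            out.add("error@" + i + ": no viable alternative");
            return out;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("\"a\" and \"b\""));
        System.out.println(tokenize("\"a\" and whose"));
    }
}
```

for the input "a" and "b" the toy reports its error at index 8, the
opening quote of "b", which is the same position as the first message in
your output (line 1:8, expecting 'w').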

as an aside, you will have similar problems with your 'resource is'
keyword.

you can fix this particular problem by splitting the 'and whose' keyword
(and likewise 'resource is') into 2 keywords: 'and' 'whose'.  this also
has the benefit of accepting more than 1 blank (or tab) between the two
words. as a single literal, 'and     whose' != 'and whose',
but as two tokens, 'and'       'whose' == 'and' 'whose'
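
the whitespace point can be sketched in java as well, using the
'and whose' keyword from the posted grammar. again this is only a
hand-written toy under stated assumptions: SplitKeywordSketch is a
made-up name, and the toy treats every run of non-blank characters as a
token rather than checking it against a keyword list.

```java
import java.util.ArrayList;
import java.util.List;

// Hand-written toy (an assumption for illustration, NOT ANTLR output):
// single-word tokens with whitespace skipped between them.
class SplitKeywordSketch {

    // Same character set as the WS rule in the posted grammar.
    static boolean isWs(char c) {
        return c == ' ' || c == '\t' || c == '\n' || c == '\r';
    }

    // Splits the input into quoted-text tokens and bare words, dropping whitespace.
    static List<String> tokenize(String in) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < in.length()) {
            if (isWs(in.charAt(i))) { i++; continue; }
            int start = i;
            if (in.charAt(i) == '"') {
                int end = in.indexOf('"', i + 1);
                if (end < 0) throw new IllegalArgumentException("unterminated quote at " + start);
                i = end + 1;
            } else {
                while (i < in.length() && !isWs(in.charAt(i)) && in.charAt(i) != '"') i++;
            }
            out.add(in.substring(start, i));
        }
        return out;
    }

    public static void main(String[] args) {
        // The token lists are equal no matter how many blanks separate the words.
        System.out.println(tokenize("and     whose").equals(tokenize("and whose")));
        System.out.println(tokenize("\"a\" and \"b\" and \"c\""));
    }
}
```

because the blanks never become part of a token, the two spellings yield
the same two tokens, and a plain 'and' between quoted strings is no
longer mistaken for the start of a longer keyword.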

as a larger suggestion, you should avoid any 'quoted' keyword in your
parser rules. rather you should specify explicit lexer rules for them.
by following this suggestion you will be more aware of your lexer's
activities and can deal with common left prefix issues such as the above
problem.

if you really want to have an 'and whose' token in addition to an 'and'
token you can do something like the following by using explicit lexer
rules:

tokens { AND_WHOSE; } // at top of the .g file, after options{}

AND : 'and' ( WS 'whose' { $type = AND_WHOSE; } )? ;

and use AND everywhere in the parser instead of 'and' and use AND_WHOSE
everywhere in the parser instead of 'and whose'.

note: the use of WS above isn't really quite the best. better would be
to rework your WS rule as something like these:

fragment WS : (' '|'\t'|'\n'|'\r')+ ;
WHITESPACE : WS {skip();} ;

this latter alternative avoids token creation overhead in usage of the
WS rule (because it is now a fragment).
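
the retagging idea in the AND rule above, applied to the grammar's
actual 'and whose' keyword, can be pictured with one more java toy. the
names RetagSketch, matchAnd and the token type numbers are hypothetical,
and the sketch only shows the intent (match 'and', then optionally
absorb blanks plus the second word and change the token type). it does
not show how ANTLR's generated lexer decides the optional part.

```java
// Hand-written toy (an assumption for illustration, NOT ANTLR output).
class RetagSketch {
    static final int AND = 1;        // hypothetical token type numbers
    static final int AND_WHOSE = 2;

    // Same character set as the WS rule in the posted grammar.
    static boolean isWs(char c) {
        return c == ' ' || c == '\t' || c == '\n' || c == '\r';
    }

    // Expects "and" at index start. Returns { tokenType, indexAfterToken }.
    static int[] matchAnd(String in, int start) {
        if (!in.startsWith("and", start)) {
            throw new IllegalArgumentException("no 'and' at " + start);
        }
        int afterAnd = start + 3;
        int j = afterAnd;
        while (j < in.length() && isWs(in.charAt(j))) j++; // look past blanks
        if (j > afterAnd && in.startsWith("whose", j)) {
            // Optional part matched: absorb it and retag the token.
            return new int[] { AND_WHOSE, j + "whose".length() };
        }
        // Optional part absent: the token is just 'and'.
        return new int[] { AND, afterAnd };
    }

    public static void main(String[] args) {
        int[] plain = matchAnd("and \"b\"", 0);
        int[] retagged = matchAnd("and   whose x", 0);
        System.out.println(plain[0] + " ends at " + plain[1]);
        System.out.println(retagged[0] + " ends at " + retagged[1]);
    }
}
```

the parser then sees either an AND token or an AND_WHOSE token, never a
failed attempt at the longer keyword.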

> QUOTED_TEXT : '"' (~'"')* '"';
> WS
>     : (' '|'\t'|'\n'|'\r')+ {skip();}
>     ;
> --- end
> 
> My test case is as follows:
> 
> --- start
> package examples.aandb;
> 
> import org.junit.Test;
> import org.antlr.runtime.CommonTokenStream;
> import org.antlr.runtime.CharStream;
> import org.antlr.runtime.ANTLRStringStream;
> import org.antlr.runtime.RecognitionException;
> 
> import java.io.IOException;
> 
> public class TestCase {
> 
>     @Test
>     public void happyPath() throws IOException, RecognitionException {
>         String dsl = "\"a\" and \"b\" and \"c\"";
>         createParser(dsl).rule1();
>     }
> 
>     private QuotedTextParser createParser(String testString) throws
> IOException {
>         QuotedTextLexer lexer = createLexer(testString);
>         CommonTokenStream tokens = new CommonTokenStream(lexer);
>         return new QuotedTextParser(tokens);
>     }
> 
>     private QuotedTextLexer createLexer(String testString) throws
> IOException {
>         CharStream stream = new ANTLRStringStream(testString);
>         return new QuotedTextLexer(stream);
>     }
> }
> 
> --- end
> 
> If I run that (in IDEA 8 using latest antlrworks and antlr 3.2) then I get
> the following output:
> 
> --- start
> line 1:8 mismatched character '"' expecting 'w'
> line 1:9 no viable alternative at character 'b'
> line 1:17 no viable alternative at character 'c'
> line 1:19 mismatched character '<EOF>' expecting '"'
> line 1:10 missing 'and' at '" and "'
> line 0:-1 mismatched input '<EOF>' expecting 'and'
> --- end
> 
> if however, I comment out the second rule ('ruleThatShouldBeIgnored') then
> everything works as expected.  The output is:
> 
> --- start
> rule1: "a", "b","c"
> --- end
> 
> I don't understand this behaviour - I don't see why
> 'ruleThatShouldBeIgnored' is having any influence.
> 
> Any ideas?
> 
> Thanks,
> 
> Col
> 
> On 3 November 2010 19:37, Colin Yates <colin.yates at gmail.com> wrote:
> 
> > Thanks Gordon,
> >
> > That doesn't work either.  I think I need to separate out just this
> > fragment into its own grammar to ensure that the rest of the grammar
> > isn't having any unexpected side effects.
> >
> > I will report back once I have isolated these two rules... Thanks!
> >
> > Sent from my iPad
> >
> > On 3 Nov 2010, at 19:25, Gordon Tyler <Gordon.Tyler at quest.com> wrote:
> >
> > >> QUOTED_TEXT : '\"' ( options {greedy=false;} : .)* '\"'
> > >
> > > Try this:
> > >
> > > QUOTED_TEXT : '"' (~'"')* '"'
> > >
> > > In English: Match '"', then match zero or more characters which are not
> > > '"', then match '"'.
> > >
> > > Ciao,
> > > Gordon
> > >
> >

Hope this helps.....
   -jbb