[antlr-interest] Help with a parser

Tue Aug 2 16:50:47 PDT 2011

Going by inspection only (I didn't run the grammar through ANTLR),
I'm pretty sure this is your problem:

 FIELDNAME
     :    KEYWORD ':' ;

FIELDNAME is a lexer rule, and it references KEYWORD, which is also a
lexer rule.  I'm pretty sure that is at least one of the problems.
For me, every time ANTLR does this, the easiest thing to do is make
sure the lexer is generating _exactly_ the tokens I think it is.
Generally, if I did something obviously dumb in the parser, I'll spot
it inside of 60 seconds.  The ones that take a long time to spot are
when I'm sure I know what tokens are being generated, and I'm wrong.

My guess here is that your assumption about which tokens are being
generated is wrong.  If you'd given the actual error message output,
I'd be more confident in my analysis.  I'm not sure what the best way
to go about fixing this is.  You could create a fragment named
KEYWORD_STUB and use it in both the FIELDNAME and KEYWORD lexer
rules, or you could figure out how to accomplish this some other way.
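
A rough, untested sketch of the fragment option, in ANTLR 3 syntax.
KEYWORD_STUB is just the hypothetical name suggested above; LETTER and
NUM_CHAR are the fragments already in your grammar.

```antlr
// Shared body for both token rules. As a fragment, it never produces
// a token on its own (KEYWORD_STUB is a made-up name).
fragment KEYWORD_STUB
    :    LETTER (LETTER | NUM_CHAR | '_')*
    ;

// A keyword immediately followed by a colon, e.g. "harry:"
FIELDNAME
    :    KEYWORD_STUB ':'
    ;

// A bare keyword, e.g. "fred"
KEYWORD
    :    KEYWORD_STUB
    ;
```

The lexer still has to look ahead to the end of the word to choose
between FIELDNAME and KEYWORD, so it is worth dumping the token stream
to confirm you get the tokens you expect.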

I'm 99% sure this is a case where you are doing too much in the
lexer.  You might consider making ':' a first-class token and
eliminating the difference between STRING and KEYWORD.  As Jim Idle
always points out, lexer errors are useless to humans.  Parser errors
are better, but the tree walker is the best place to generate error
messages.  If you have one token type for STRING and KEYWORD, and a
separate token for ':', pretty much everything you put in there will
lex.  Then at least when you get to the parser, you'll be able to
generate a decent error message, rather than saying "Unexpected
character '1' at offset: 4" for an input like 'A 12:XYZ' (because it
appears keywords aren't allowed to start with numbers).  You'd much
rather tell them '"12" is not a valid keyword at offset: 4'.  You can
only do that if it lexes.  The parser will be able to kick out such an
error message (and likely the tree walker also).
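
Here is a rough, untested sketch of what I mean, in ANTLR 3 syntax.
COLON and TEXT are names I made up for illustration; NONCONTROL_CHAR,
LETTER and NUM_CHAR are the fragments from your grammar, and the
filter and filter_expr rules would stay as they are.

```antlr
// ':' becomes its own token instead of being glued onto the keyword.
COLON
    :    ':'
    ;

// One permissive token type for quoted strings and bare words
// (TEXT is a hypothetical name). It deliberately accepts words that
// start with a digit, such as "12", so that they reach the parser.
TEXT
    :    '"' NONCONTROL_CHAR* '"'
    |    (LETTER | NUM_CHAR | '_')+
    ;

// The "name:" prefix is now recognized in the parser.
fieldname
    :    (TEXT COLON)? term
    ;

term
    :    TEXT
    |    '(' filter_expr ')'
    ;
```

The check that the TEXT in front of a COLON is a legal field name is
not shown; it would go in a parser action or in the tree walker, which
is where the friendlier error message can be produced.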

Best of luck, I use SOLR and Lucene, so I look forward to better parsers!

Kirby

On Tue, Aug 2, 2011 at 6:11 PM, Scott Smith <ssmith at mainstreamdata.com> wrote:
> I assume this is the proper place to put this.  I'm trying to build a parser for filters generated by SOLR (lucene.apache.org).
>
> Examples of valid "sentences" the parser should parse are:
>
> fq = fred
>
> fq = (fred OR bill)
>
> fq = harry:(fred OR bill)
>
> fq = (harry:fred OR jack:bill)
>
> fq = ((harry:fred OR bill) AND (jane OR marry:sally))
>
> terms can be nested to arbitrary levels.  The colon really binds to the word before it (e.g., "harry:").
>
> I've listed the parser below (which doesn't work).  Can someone suggest what I can do?  It seems like a simple problem, but so far I haven't cracked it.  I will admit that I've only been playing with Antlr the last week or so.  I did play all of Scott Stanchfield's excellent videos on vimeo.  But, still I'm confused.
>
> When I run the parser in antlrworks with example 3 ("fq = harry:(fred OR bill)" - no quote marks), it finds "fq = harry:(fred" and then it wants the right paren instead of expanding out the filter_expr rule.  What am I missing?
>
> Thanks
>
> Scott
>
> Here's the parser.
>
> grammar testGrammer;
>
> options {
>  language = Java;
> }
>
> @header {
>  package a.b.c;
> }
>
> @lexer::header {
>  package a.b.c;
> }
>
> filter:
>  'fq' '=' filter_expr EOF
>  ;
>
> term
>  : KEYWORD
>  | STRING
>  | '(' filter_expr ')'
>  ;
>
> fieldname
>     :    FIELDNAME? term
>     ;
>
>
> filter_expr:
>  fieldname (((AND | OR | NOT))? fieldname)*
>  ;
>
> FIELDNAME
>     :    KEYWORD ':' ;
>
> AND : 'AND' | '&&' ;
> OR  : 'OR' | '||' ;
> NOT : 'NOT' | '!' ;
> KEYWORD : LETTER (LETTER | NUM_CHAR | '_')*;
> STRING : '"' NONCONTROL_CHAR* '"' ;
> WS  : ' ' | '\t' | '\n' | '\r' | '\u3000' {$channel=HIDDEN; } ;
>
> fragment NONCONTROL_CHAR: LETTER | NUM_CHAR | SPACE | SYMBOL;
> fragment SYMBOL:  ' '..'!' | '#'..'/' | ':'..'@' | '['..'`' | '{'..'~';
> fragment LETTER: LOWER | UPPER;
> fragment LOWER: 'a'..'z';
> fragment UPPER: 'A'..'Z';
> fragment NUM_CHAR: '0'..'9';
> fragment SPACE: ' ' | '\t';