[antlr-interest] Handling lexical nondeterminism in Tokens

Gabriel Radu gabriel.adrian.radu at googlemail.com
Fri Feb 10 09:52:03 PST 2006


Dear Mark,

Sorry for the delayed reply. With a little help from the Oracle
ORACLE-7-SQL grammar from the antler home page, I think that I have
solved the problem.

The attached grammar file works with your test cases. Now the next
step will be to build the AST and write a parser for it.

Best regards,
Gabriel



On 07/02/06, Mark R. Diggory <mdiggory at latte.harvard.edu> wrote:
> I've included both an attached jpeg showing the non-determiinism in
> Antlr Studio and the Eclipse project as a zip archive. There is a
> Main.java file with examples I'm testing in it.
>
> To handle all the cases where parenth can occur, I'm doing the following.
>
> > class SearchQueryParser extends Parser;
> >
> > options{
> >     k=3;
> >     exportVocab=SearchQuery;
> >     buildAST = true; // uses CommonAST by default
> > }
> >
> > statement
> >     : LEFT_PAREN expr RIGHT_PAREN EOST
> >     | expr EOST
> >     ;
> >
> > expr
> >     : mexpr ( bool_relations mexpr)*
> >     | LEFT_PAREN mexpr ( bool_relations  mexpr)* RIGHT_PAREN
> >     ;
> >
> > mexpr
> >     : atom ( bool_relations atom )*
> >     | LEFT_PAREN atom ( bool_relations atom )*    RIGHT_PAREN
> >     ;
> >
> > atom
> >     : IDENTIFIER equivalence_relation LITERAL
> >     ;
> >
> > bool_relations : AND|OR|NOT ;
> >
> > equivalence_relation : EQUALS|NOT_EQUALS|LT|LTE|GT|GTE ;
>
>
> thanks,
> -Mark
>
> Gabriel Radu wrote:
>
> >Do you have any test cases which work? And what is the generated
> >parser complaining about when you run the test cases mentioned in your
> >previous email?
> >
> >Best regards,
> >Gabriel
> >
> >
> >
> >On 07/02/06, Mark R. Diggory <mdiggory at latte.harvard.edu> wrote:
> >
> >
> >> Thanks, that is very close to what I need.
> >>
> >> Seems not to work on the following cases, which should all be valid.
> >>
> >> "((foo='bar')AND bim='bam');"
> >> "(foo='bar' AND bim='bam');"
> >> "foo='bar' AND bim='bam';"
> >>
> >> I've tried different combinations of defining the parethises and the white
> >>space. Along the lines of the following
> >>
> >> : LEFT_PAREN! expr RIGHT_PAREN!
> >> | expr
> >> But can't seem to get it right.
> >>
> >> -Mark
> >>
> >>
> >> Gabriel Radu wrote:
> >> Dear Mark,
> >>
> >>What about:
> >>
> >>
> >>
> >> class SearchQueryParser extends Parser;
> >> options
> >> {
> >> k=3;
> >> exportVocab=SearchQuery;
> >> buildAST = true; // uses CommonAST by default
> >>
> >> }
> >>
> >>statement
> >> : LEFT_PAREN! expr RIGHT_PAREN! EOST
> >> | expr EOST
> >> ;
> >>
> >>expr
> >> : mexpr ( bool_relations LEFT_PAREN! expr RIGHT_PAREN! )*
> >> ;
> >>
> >>mexpr
> >> : atom ( bool_relations atom )*
> >> ;
> >>
> >>atom
> >> : IDENTIFIER equivalence_relation LITERAL
> >> ;
> >>
> >>bool_relations : AND|OR|NOT ;
> >>
> >>equivalence_relation : EQUALS|NOT_EQUALS|LT|LTE|GT|GTE ;
> >>
> >> where EOST is a token which marks the end of a statement. It can be
> >>something like end of line or semicolon. Don't forget to add it to the
> >>lexer as well.
> >>
> >>Let me know how you are getting on.
> >>
> >>
> >>Best regards,
> >>Gabriel
> >>
> >>
> >>
> >>
> >>On 06/02/06, Mark R. Diggory <mdiggory at latte.harvard.edu> wrote:
> >>
> >>
> >> Thanks, I think I understand how this will help in the Lexer. I'm
> >>currently having problems how to capture how to properly represent the
> >>syntax in the Parser
> >>
> >>Here's a clarification of what I should be able to do with the query
> >>language:
> >>
> >>The smallest "atom" is a LITERAL string, currently this can be in quotes
> >>or not in quotes. For example:
> >>
> >>
> >>
> >> United States
> >>"Untied States"
> >>
> >> Alternatively a "atom" can be an equivalence relation. For instance:
> >>
> >>
> >>
> >> title="Untied States"
> >>title<>"Untied States"
> >>date>=2006
> >>
> >> Each atom can have parentheses around it.
> >>
> >>
> >>
> >> (title="Untied States")
> >>
> >> equivalence relations can be joined using boolean relations
> >>
> >>
> >>
> >> title="Untied States" AND date>=2006
> >>title="Untied States" OR date>=2006
> >>title="Untied States" NOT date>=2006
> >>
> >> boolean relations can be wrapped in parentheses to control precedence.
> >>
> >>
> >>
> >> title="Untied States" AND (date>=2006 OR author=Steven King)
> >>
> >> I've been trying to capture this using the following parser, but its
> >>clear now that I'm missing the mark:
> >>
> >>
> >>
> >> class SearchQueryParser extends Parser;
> >> options
> >> {
> >> k=3;
> >> exportVocab=SearchQuery;
> >> buildAST = true; // uses CommonAST by default
> >>
> >> }
> >>
> >>expr
> >> :
> >> mexpr ((AND|OR|NOT) mexpr)*
> >> ;
> >>
> >>mexpr
> >> :
> >> LITERAL | (IDENTIFIER (EQUALS|NOT_EQUALS|LT|LTE|GT|GTE)
> >>LITERAL)+
> >> ;
> >>
> >>atom
> >> :
> >> LEFT_PAREN! mexpr RIGHT_PAREN! | LEFT_PAREN! expr RIGHT_PAREN!
> >> ;
> >>
> >>
> >> thanks again for your advice,
> >>Mark
> >>
> >>Gabriel Radu wrote:
> >>
> >>
> >>
> >> Dear Mark,
> >>
> >>I suggest using syntactic predicates. Also increasing the lexers look
> >>ahead to 2 (k=2) for example may sort out the ambiguity between LT and
> >>LTE, and GT and GTE. However, if you use syntactic predicates for all
> >>tokens, increasing the look ahead may not be necessary.
> >>
> >>An example of using syntactic predicates for your grammar is following:
> >>
> >>
> >>class SearchQueryLexer extends Lexer;
> >> options
> >> {
> >> charVocabulary='\3'..'\377';
> >> }
> >>
> >>
> >>MAIN_LEXER_RULE
> >> : ( LITERAL ) => ( LITERAL { $setType( LITERAL ); } )
> >>
> >> | ( NOT_EQUALS ) => ( NOT_EQUALS { $setType( NOT_EQUALS ); } )
> >> | ( LTE ) => ( LTE { $setType( LTE ); } )
> >> | ( GTE ) => ( GTE { $setType( GTE ); } )
> >>
> >> | ( LT ) => ( LT { $setType( LT ); } )
> >> | ( GT ) => ( GT { $setType( GT ); } )
> >>
> >> | ( NOT ) => ( NOT { $setType( NOT ); } )
> >> | ( AND ) => ( AND { $setType( AND ); } )
> >> | ( OR ) => ( OR { $setType( OR ); } )
> >>
> >> | ( LEFT_PAREN ) => ( LEFT_PAREN { $setType( LEFT_PAREN ); } )
> >> | ( RIGHT_PAREN ) => ( RIGHT_PAREN { $setType( RIGHT_PAREN ); } )
> >>
> >> | ( EQUALS ) => ( EQUALS { $setType( EQUALS ); } )
> >>
> >> | ( IDENTIFIER ) => ( IDENTIFIER { $setType( IDENTIFIER ); } )
> >>
> >> | ( WS ) => WS
> >>
> >> ;
> >>
> >>protected
> >>WS
> >> :
> >> ('\n' | ' ' | '\t' | '\r')+
> >> {
> >> $setType(Token.SKIP);
> >> }
> >> ;
> >>
> >>
> >>protected
> >>SINGLE_QUOTE_STRING
> >> :
> >> '\''! (~('\''))* '\''!
> >> ;
> >>
> >>protected
> >>DOUBLE_QUOTE_STRING
> >> :
> >> '"'! (~('"'))* '"'!
> >> ;
> >>
> >>protected
> >>LITERAL
> >> :
> >> SINGLE_QUOTE_STRING | DOUBLE_QUOTE_STRING
> >> ;
> >>
> >>protected
> >>IDENTIFIER
> >>
> >> options
> >> {
> >> testLiterals=true;
> >> }
> >>
> >> :
> >> ('\241'..'\377'|'a'..'z'|'A'..'Z'|'_')
> >>('\241'..'\377'|'a'..'z'|'A'..'Z'|'-'|'_'|'0'..'9'|'.')*
> >> ;
> >>
> >>protected
> >>LEFT_PAREN
> >> : '(' ;
> >>
> >>protected
> >>RIGHT_PAREN
> >> : ')' ;
> >>
> >>protected
> >>NOT
> >> : ("NOT"|"not") ;
> >>
> >>protected
> >>AND
> >> : ("AND"|"and") ;
> >>
> >>protected
> >>OR
> >> : ("OR"|"or") ;
> >>
> >>protected
> >>EQUALS
> >> : '=' ;
> >>
> >>protected
> >>NOT_EQUALS
> >> : "<>" ;
> >>
> >>protected
> >>LT
> >> : '<' ;
> >>
> >>protected
> >>LTE
> >> : "<=" ;
> >>
> >>protected
> >>GT
> >> : '>' ;
> >>
> >>protected
> >>GTE
> >> : ">=" ;
> >>
> >>
> >>The syntactic predicates are in MAIN_LEXER_RULE. The order of
> >>productions (alternative rules) in MAIN_LEXER_RULE is important,
> >>because the lexer will try to match them in the order they are
> >>declared and will stop as soon as it finds a match. So for example LTE
> >>must be above LT because other ways the lexer will match the LT and
> >>then an EQUALS in stead of LTE.
> >>
> >>Let me know if this has solved your problem.
> >>
> >>
> >>Best regards,
> >>Gabriel
> >>
> >>
> >>
> >>
> >>On 05/02/06, Mark R. Diggory <mdiggory at latte.harvard.edu> wrote:
> >>
> >>
> >>
> >>
> >> I'm still working on building a Parser for our query syntax. I've
> >>encountered an issue with nondeterminism. I've included my grammar file:
> >>
> >>My question is how can I assure that the boolean predicate AND not the
> >>quoted string literal "you AND I" do not collide? I'd be very thankful
> >>to anyone with comments about obvious problems with my grammar file.
> >>
> >>thanks,
> >>Mark
> >>
> >>
> >>
> >>
> >>
> >> class SearchQueryParser extends Parser;
> >> options
> >> {
> >> k=3;
> >> exportVocab=SearchQuery;
> >> buildAST = true; // uses CommonAST by default
> >>
> >> }
> >>
> >>
> >>expr
> >> :
> >> mexpr ((AND|OR|NOT) mexpr)*
> >> ;
> >>
> >>mexpr
> >> :
> >> LITERAL^ | IDENTIFIER^ ((EQUALS|NOT_EQUALS|LT|LTE|GT|GTE)
> >>LITERAL^)+
> >> ;
> >>
> >>
> >>atom
> >> :
> >> IDENTIFIER | LEFT_PAREN! expr RIGHT_PAREN!
> >> ;
> >>
> >>class SearchQueryLexer extends Lexer;
> >> options
> >> {
> >> charVocabulary='\3'..'\377';
> >> }
> >>
> >>WS
> >> :
> >> ('\n' | ' ' | '\t' | '\r')+
> >> {
> >> $setType(Token.SKIP);
> >> }
> >> ;
> >>
> >>
> >>protected
> >>SINGLE_QUOTE_STRING
> >> :
> >> '\''! (~('\''))* '\''!
> >> ;
> >>
> >>protected
> >>DOUBLE_QUOTE_STRING
> >> :
> >> '"'! (~('"'))* '"'!
> >> ;
> >>
> >>LITERAL
> >> :
> >> SINGLE_QUOTE_STRING | DOUBLE_QUOTE_STRING
> >> ;
> >>
> >>IDENTIFIER
> >>
> >> options
> >> {
> >> testLiterals=true;
> >> }
> >>
> >> :
> >> ('\241'..'\377'|'a'..'z'|'A'..'Z'|'_')
> >>('\241'..'\377'|'a'..'z'|'A'..'Z'|'-'|'_'|'0'..'9'|'.')*
> >> ;
> >>
> >>LEFT_PAREN
> >> : '(' ;
> >>
> >>RIGHT_PAREN
> >> : ')' ;
> >>
> >>NOT
> >> : ("NOT"|"not") ;
> >>
> >>AND
> >> : ("AND"|"and") ;
> >>
> >>OR
> >> : ("OR"|"or") ;
> >>
> >>EQUALS
> >> : '=' ;
> >>
> >>NOT_EQUALS
> >> : "<>" ;
> >>
> >>LT
> >> : '<' ;
> >>
> >>LTE
> >> : "<=" ;
> >>
> >>GT
> >> : '>' ;
> >>
> >>GTE
> >> : ">=" ;
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
>
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: query.g
Type: application/octet-stream
Size: 2425 bytes
Desc: not available
Url : http://www.antlr.org/pipermail/antlr-interest/attachments/20060210/2a8ba710/query-0001.obj


More information about the antlr-interest mailing list