[antlr-interest] Handling lexical nondeterminism in Tokens

Mon Feb 6 08:38:11 PST 2006

Dear Mark,

What about:

> class SearchQueryParser extends Parser;
>     options
>     {
>         k=3;
>         exportVocab=SearchQuery;
>         buildAST = true;   // uses CommonAST by default
>
>     }
>
> statement
>     : LEFT_PAREN! expr RIGHT_PAREN! EOST
>     | expr EOST
>     ;
>
> expr
>     : mexpr ( bool_relations LEFT_PAREN! expr RIGHT_PAREN! )*
>     ;
>
> mexpr
>     : atom ( bool_relations atom )*
>     ;
>
> atom
>     : IDENTIFIER equivalence_relation LITERAL
>     ;
>
> bool_relations : AND|OR|NOT ;
>
> equivalence_relation : EQUALS|NOT_EQUALS|LT|LTE|GT|GTE ;

where EOST is a token which marks the end of a statement. It can be
something like end of line or semicolon. Don't forget to add it to the
lexer as well.

Let me know how you are getting on.

Best regards,
Gabriel

On 06/02/06, Mark R. Diggory <mdiggory at latte.harvard.edu> wrote:
> Thanks, I think I understand how this will help in the Lexer. I'm
> currently having problems how to capture how to properly represent the
> syntax in the Parser
>
> Here's a clarification of what I should be able to do with the query
> language:
>
> The smallest "atom" is a LITERAL string, currently this can be in quotes
> or not in quotes. For example:
>
> > United States
> > "Untied States"
>
> Alternatively a "atom" can be an equivalence relation. For instance:
>
> > title="Untied States"
> > title<>"Untied States"
> > date>=2006
>
> Each atom can have parentheses around it.
>
> > (title="Untied States")
>
> equivalence relations can be joined using boolean relations
>
> > title="Untied States" AND date>=2006
> > title="Untied States" OR date>=2006
> > title="Untied States" NOT date>=2006
>
> boolean relations can be wrapped in parentheses to control precedence.
>
> > title="Untied States" AND (date>=2006 OR author=Steven King)
>
> I've been trying to capture this using the following parser, but its
> clear now that I'm missing the mark:
>
> >class SearchQueryParser extends Parser;
> >    options
> >    {
> >          k=3;
> >        exportVocab=SearchQuery;
> >        buildAST = true;   // uses CommonAST by default
> >
> >    }
> >
> >expr
> >    :
> >        mexpr ((AND|OR|NOT) mexpr)*
> >    ;
> >
> >mexpr
> >    :
> >        LITERAL | (IDENTIFIER (EQUALS|NOT_EQUALS|LT|LTE|GT|GTE) LITERAL)+
> >    ;
> >
> >atom
> >      :
> >          LEFT_PAREN! mexpr RIGHT_PAREN! | LEFT_PAREN! expr RIGHT_PAREN!
> >    ;
> >
> thanks again for your advice,
> Mark
>
> Gabriel Radu wrote:
>
> >Dear Mark,
> >
> >I suggest using syntactic predicates. Also increasing the lexers look
> >ahead to 2 (k=2) for example may sort out the ambiguity between LT and
> >LTE, and GT and GTE. However, if you use syntactic predicates for all
> >tokens, increasing the look ahead may not be necessary.
> >
> >An example of using syntactic predicates for your grammar is following:
> >
> >
> >class SearchQueryLexer extends Lexer;
> >    options
> >    {
> >        charVocabulary='\3'..'\377';
> >    }
> >
> >
> >MAIN_LEXER_RULE
> >  : ( LITERAL ) => ( LITERAL { $setType( LITERAL ); } )
> >
> >  | ( NOT_EQUALS ) => ( NOT_EQUALS { $setType( NOT_EQUALS ); } )
> >  | ( LTE ) => ( LTE { $setType( LTE ); } )
> >  | ( GTE ) => ( GTE { $setType( GTE ); } )
> >
> >  | ( LT ) => ( LT { $setType( LT ); } )
> >  | ( GT ) => ( GT { $setType( GT ); } )
> >
> >  | ( NOT ) => ( NOT { $setType( NOT ); } )
> >  | ( AND ) => ( AND { $setType( AND ); } )
> >  | ( OR ) => ( OR { $setType( OR ); } )
> >
> >  | ( LEFT_PAREN ) => ( LEFT_PAREN { $setType( LEFT_PAREN ); } )
> >  | ( RIGHT_PAREN ) => ( RIGHT_PAREN { $setType( RIGHT_PAREN ); } )
> >
> >  | ( EQUALS ) => ( EQUALS { $setType( EQUALS ); } )
> >
> >  | ( IDENTIFIER ) => ( IDENTIFIER { $setType( IDENTIFIER ); } )
> >
> >  | ( WS ) => WS
> >
> >  ;
> >
> >protected
> >WS
> >    :
> >        ('\n' | ' ' | '\t' | '\r')+
> >        {
> >            $setType(Token.SKIP);
> >        }
> >    ;
> >
> >
> >protected
> >SINGLE_QUOTE_STRING
> >    :
> >        '\''! (~('\''))* '\''!
> >    ;
> >
> >protected
> >DOUBLE_QUOTE_STRING
> >    :
> >        '"'! (~('"'))* '"'!
> >    ;
> >
> >protected
> >LITERAL
> >    :
> >        SINGLE_QUOTE_STRING | DOUBLE_QUOTE_STRING
> >    ;
> >
> >protected
> >IDENTIFIER
> >
> >    options
> >    {
> >        testLiterals=true;
> >    }
> >
> >    :
> >        ('\241'..'\377'|'a'..'z'|'A'..'Z'|'_')
> >('\241'..'\377'|'a'..'z'|'A'..'Z'|'-'|'_'|'0'..'9'|'.')*
> >    ;
> >
> >protected
> >LEFT_PAREN
> >    :    '('        ;
> >
> >protected
> >RIGHT_PAREN
> >    :    ')'        ;
> >
> >protected
> >NOT
> >    :    ("NOT"|"not")    ;
> >
> >protected
> >AND
> >    :    ("AND"|"and")    ;
> >
> >protected
> >OR
> >    :    ("OR"|"or")        ;
> >
> >protected
> >EQUALS
> >    :    '='        ;
> >
> >protected
> >NOT_EQUALS
> >    :    "<>"    ;
> >
> >protected
> >LT
> >    :    '<'        ;
> >
> >protected
> >LTE
> >    :    "<="    ;
> >
> >protected
> >GT
> >    :    '>'        ;
> >
> >protected
> >GTE
> >    :    ">="    ;
> >
> >
> >The syntactic predicates are in MAIN_LEXER_RULE. The order of
> >productions (alternative rules) in MAIN_LEXER_RULE is important,
> >because the lexer will try to match them in the order they are
> >declared and will stop as soon as it finds a match. So for example LTE
> >must be above LT because other ways the lexer will match the LT and
> >then an EQUALS in stead of LTE.
> >
> >Let me know if this has solved your problem.
> >
> >
> >Best regards,
> >Gabriel
> >
> >
> >
> >
> >On 05/02/06, Mark R. Diggory <mdiggory at latte.harvard.edu> wrote:
> >
> >
> >>I'm still working on building a Parser for our query syntax. I've
> >>encountered an issue with nondeterminism. I've included my grammar file:
> >>
> >>My question is how can I assure that the boolean predicate AND not the
> >>quoted string literal "you AND I" do not collide? I'd be very thankful
> >>to anyone with comments about obvious problems with my grammar file.
> >>
> >>thanks,
> >>Mark
> >>
> >>
> >>
> >>>class SearchQueryParser extends Parser;
> >>>    options
> >>>    {
> >>>          k=3;
> >>>        exportVocab=SearchQuery;
> >>>        buildAST = true;   // uses CommonAST by default
> >>>
> >>>    }
> >>>
> >>>
> >>>expr
> >>>    :
> >>>        mexpr ((AND|OR|NOT) mexpr)*
> >>>    ;
> >>>
> >>>mexpr
> >>>    :
> >>>        LITERAL^ | IDENTIFIER^ ((EQUALS|NOT_EQUALS|LT|LTE|GT|GTE)
> >>>LITERAL^)+
> >>>    ;
> >>>
> >>>
> >>>atom
> >>>      :
> >>>          IDENTIFIER | LEFT_PAREN! expr RIGHT_PAREN!
> >>>    ;
> >>>
> >>>class SearchQueryLexer extends Lexer;
> >>>    options
> >>>    {
> >>>        charVocabulary='\3'..'\377';
> >>>    }
> >>>
> >>>WS
> >>>    :
> >>>        ('\n' | ' ' | '\t' | '\r')+
> >>>        {
> >>>            $setType(Token.SKIP);
> >>>        }
> >>>    ;
> >>>
> >>>
> >>>protected
> >>>SINGLE_QUOTE_STRING
> >>>    :
> >>>        '\''! (~('\''))* '\''!
> >>>    ;
> >>>
> >>>protected
> >>>DOUBLE_QUOTE_STRING
> >>>    :
> >>>        '"'! (~('"'))* '"'!
> >>>    ;
> >>>
> >>>LITERAL
> >>>    :
> >>>        SINGLE_QUOTE_STRING | DOUBLE_QUOTE_STRING
> >>>    ;
> >>>
> >>>IDENTIFIER
> >>>
> >>>    options
> >>>    {
> >>>        testLiterals=true;
> >>>    }
> >>>
> >>>    :
> >>>        ('\241'..'\377'|'a'..'z'|'A'..'Z'|'_')
> >>>('\241'..'\377'|'a'..'z'|'A'..'Z'|'-'|'_'|'0'..'9'|'.')*
> >>>    ;
> >>>
> >>>LEFT_PAREN
> >>>    :    '('        ;
> >>>
> >>>RIGHT_PAREN
> >>>    :    ')'        ;
> >>>
> >>>NOT
> >>>    :    ("NOT"|"not")    ;
> >>>
> >>>AND
> >>>    :    ("AND"|"and")    ;
> >>>
> >>>OR
> >>>    :    ("OR"|"or")        ;
> >>>
> >>>EQUALS
> >>>    :    '='        ;
> >>>
> >>>NOT_EQUALS
> >>>    :    "<>"    ;
> >>>
> >>>LT
> >>>    :    '<'        ;
> >>>
> >>>LTE
> >>>    :    "<="    ;
> >>>
> >>>GT
> >>>    :    '>'        ;
> >>>
> >>>GTE
> >>>    :    ">="    ;
> >>>
> >>>
> >>
> >>
> >>
>