[antlr-interest] Handling lexical nondeterminism in Tokens

Mark R. Diggory mdiggory at latte.harvard.edu
Tue Feb 7 08:45:14 PST 2006


Thanks, that is very close to what I need.

Seems not to work on the following cases, which should all be valid.

"((foo='bar')AND bim='bam');"
"(foo='bar' AND bim='bam');"
"foo='bar' AND bim='bam';"

I've tried different combinations of defining the parethises and the 
white space. Along the lines of the following

    : LEFT_PAREN! expr RIGHT_PAREN!
    | expr

But can't seem to get it right.

-Mark

Gabriel Radu wrote:

>Dear Mark,
>
>What about:
>
>  
>
>>class SearchQueryParser extends Parser;
>>    options
>>    {
>>        k=3;
>>        exportVocab=SearchQuery;
>>        buildAST = true;   // uses CommonAST by default
>>
>>    }
>>
>>statement
>>    : LEFT_PAREN! expr RIGHT_PAREN! EOST
>>    | expr EOST
>>    ;
>>
>>expr
>>    : mexpr ( bool_relations LEFT_PAREN! expr RIGHT_PAREN! )*
>>    ;
>>
>>mexpr
>>    : atom ( bool_relations atom )*
>>    ;
>>
>>atom
>>    : IDENTIFIER equivalence_relation LITERAL
>>    ;
>>
>>bool_relations : AND|OR|NOT ;
>>
>>equivalence_relation : EQUALS|NOT_EQUALS|LT|LTE|GT|GTE ;
>>    
>>
>
>where EOST is a token which marks the end of a statement. It can be
>something like end of line or semicolon. Don't forget to add it to the
>lexer as well.
>
>Let me know how you are getting on.
>
>
>Best regards,
>Gabriel
>
>
>
>
>On 06/02/06, Mark R. Diggory <mdiggory at latte.harvard.edu> wrote:
>  
>
>>Thanks, I think I understand how this will help in the Lexer. I'm
>>currently having problems how to capture how to properly represent the
>>syntax in the Parser
>>
>>Here's a clarification of what I should be able to do with the query
>>language:
>>
>>The smallest "atom" is a LITERAL string, currently this can be in quotes
>>or not in quotes. For example:
>>
>>    
>>
>>>United States
>>>"Untied States"
>>>      
>>>
>>Alternatively a "atom" can be an equivalence relation. For instance:
>>
>>    
>>
>>>title="Untied States"
>>>title<>"Untied States"
>>>date>=2006
>>>      
>>>
>>Each atom can have parentheses around it.
>>
>>    
>>
>>>(title="Untied States")
>>>      
>>>
>>equivalence relations can be joined using boolean relations
>>
>>    
>>
>>>title="Untied States" AND date>=2006
>>>title="Untied States" OR date>=2006
>>>title="Untied States" NOT date>=2006
>>>      
>>>
>>boolean relations can be wrapped in parentheses to control precedence.
>>
>>    
>>
>>>title="Untied States" AND (date>=2006 OR author=Steven King)
>>>      
>>>
>>I've been trying to capture this using the following parser, but its
>>clear now that I'm missing the mark:
>>
>>    
>>
>>>class SearchQueryParser extends Parser;
>>>   options
>>>   {
>>>         k=3;
>>>       exportVocab=SearchQuery;
>>>       buildAST = true;   // uses CommonAST by default
>>>
>>>   }
>>>
>>>expr
>>>   :
>>>       mexpr ((AND|OR|NOT) mexpr)*
>>>   ;
>>>
>>>mexpr
>>>   :
>>>       LITERAL | (IDENTIFIER (EQUALS|NOT_EQUALS|LT|LTE|GT|GTE) LITERAL)+
>>>   ;
>>>
>>>atom
>>>     :
>>>         LEFT_PAREN! mexpr RIGHT_PAREN! | LEFT_PAREN! expr RIGHT_PAREN!
>>>   ;
>>>
>>>      
>>>
>>thanks again for your advice,
>>Mark
>>
>>Gabriel Radu wrote:
>>
>>    
>>
>>>Dear Mark,
>>>
>>>I suggest using syntactic predicates. Also increasing the lexers look
>>>ahead to 2 (k=2) for example may sort out the ambiguity between LT and
>>>LTE, and GT and GTE. However, if you use syntactic predicates for all
>>>tokens, increasing the look ahead may not be necessary.
>>>
>>>An example of using syntactic predicates for your grammar is following:
>>>
>>>
>>>class SearchQueryLexer extends Lexer;
>>>   options
>>>   {
>>>       charVocabulary='\3'..'\377';
>>>   }
>>>
>>>
>>>MAIN_LEXER_RULE
>>> : ( LITERAL ) => ( LITERAL { $setType( LITERAL ); } )
>>>
>>> | ( NOT_EQUALS ) => ( NOT_EQUALS { $setType( NOT_EQUALS ); } )
>>> | ( LTE ) => ( LTE { $setType( LTE ); } )
>>> | ( GTE ) => ( GTE { $setType( GTE ); } )
>>>
>>> | ( LT ) => ( LT { $setType( LT ); } )
>>> | ( GT ) => ( GT { $setType( GT ); } )
>>>
>>> | ( NOT ) => ( NOT { $setType( NOT ); } )
>>> | ( AND ) => ( AND { $setType( AND ); } )
>>> | ( OR ) => ( OR { $setType( OR ); } )
>>>
>>> | ( LEFT_PAREN ) => ( LEFT_PAREN { $setType( LEFT_PAREN ); } )
>>> | ( RIGHT_PAREN ) => ( RIGHT_PAREN { $setType( RIGHT_PAREN ); } )
>>>
>>> | ( EQUALS ) => ( EQUALS { $setType( EQUALS ); } )
>>>
>>> | ( IDENTIFIER ) => ( IDENTIFIER { $setType( IDENTIFIER ); } )
>>>
>>> | ( WS ) => WS
>>>
>>> ;
>>>
>>>protected
>>>WS
>>>   :
>>>       ('\n' | ' ' | '\t' | '\r')+
>>>       {
>>>           $setType(Token.SKIP);
>>>       }
>>>   ;
>>>
>>>
>>>protected
>>>SINGLE_QUOTE_STRING
>>>   :
>>>       '\''! (~('\''))* '\''!
>>>   ;
>>>
>>>protected
>>>DOUBLE_QUOTE_STRING
>>>   :
>>>       '"'! (~('"'))* '"'!
>>>   ;
>>>
>>>protected
>>>LITERAL
>>>   :
>>>       SINGLE_QUOTE_STRING | DOUBLE_QUOTE_STRING
>>>   ;
>>>
>>>protected
>>>IDENTIFIER
>>>
>>>   options
>>>   {
>>>       testLiterals=true;
>>>   }
>>>
>>>   :
>>>       ('\241'..'\377'|'a'..'z'|'A'..'Z'|'_')
>>>('\241'..'\377'|'a'..'z'|'A'..'Z'|'-'|'_'|'0'..'9'|'.')*
>>>   ;
>>>
>>>protected
>>>LEFT_PAREN
>>>   :    '('        ;
>>>
>>>protected
>>>RIGHT_PAREN
>>>   :    ')'        ;
>>>
>>>protected
>>>NOT
>>>   :    ("NOT"|"not")    ;
>>>
>>>protected
>>>AND
>>>   :    ("AND"|"and")    ;
>>>
>>>protected
>>>OR
>>>   :    ("OR"|"or")        ;
>>>
>>>protected
>>>EQUALS
>>>   :    '='        ;
>>>
>>>protected
>>>NOT_EQUALS
>>>   :    "<>"    ;
>>>
>>>protected
>>>LT
>>>   :    '<'        ;
>>>
>>>protected
>>>LTE
>>>   :    "<="    ;
>>>
>>>protected
>>>GT
>>>   :    '>'        ;
>>>
>>>protected
>>>GTE
>>>   :    ">="    ;
>>>
>>>
>>>The syntactic predicates are in MAIN_LEXER_RULE. The order of
>>>productions (alternative rules) in MAIN_LEXER_RULE is important,
>>>because the lexer will try to match them in the order they are
>>>declared and will stop as soon as it finds a match. So for example LTE
>>>must be above LT because other ways the lexer will match the LT and
>>>then an EQUALS in stead of LTE.
>>>
>>>Let me know if this has solved your problem.
>>>
>>>
>>>Best regards,
>>>Gabriel
>>>
>>>
>>>
>>>
>>>On 05/02/06, Mark R. Diggory <mdiggory at latte.harvard.edu> wrote:
>>>
>>>
>>>      
>>>
>>>>I'm still working on building a Parser for our query syntax. I've
>>>>encountered an issue with nondeterminism. I've included my grammar file:
>>>>
>>>>My question is how can I assure that the boolean predicate AND not the
>>>>quoted string literal "you AND I" do not collide? I'd be very thankful
>>>>to anyone with comments about obvious problems with my grammar file.
>>>>
>>>>thanks,
>>>>Mark
>>>>
>>>>
>>>>
>>>>        
>>>>
>>>>>class SearchQueryParser extends Parser;
>>>>>   options
>>>>>   {
>>>>>         k=3;
>>>>>       exportVocab=SearchQuery;
>>>>>       buildAST = true;   // uses CommonAST by default
>>>>>
>>>>>   }
>>>>>
>>>>>
>>>>>expr
>>>>>   :
>>>>>       mexpr ((AND|OR|NOT) mexpr)*
>>>>>   ;
>>>>>
>>>>>mexpr
>>>>>   :
>>>>>       LITERAL^ | IDENTIFIER^ ((EQUALS|NOT_EQUALS|LT|LTE|GT|GTE)
>>>>>LITERAL^)+
>>>>>   ;
>>>>>
>>>>>
>>>>>atom
>>>>>     :
>>>>>         IDENTIFIER | LEFT_PAREN! expr RIGHT_PAREN!
>>>>>   ;
>>>>>
>>>>>class SearchQueryLexer extends Lexer;
>>>>>   options
>>>>>   {
>>>>>       charVocabulary='\3'..'\377';
>>>>>   }
>>>>>
>>>>>WS
>>>>>   :
>>>>>       ('\n' | ' ' | '\t' | '\r')+
>>>>>       {
>>>>>           $setType(Token.SKIP);
>>>>>       }
>>>>>   ;
>>>>>
>>>>>
>>>>>protected
>>>>>SINGLE_QUOTE_STRING
>>>>>   :
>>>>>       '\''! (~('\''))* '\''!
>>>>>   ;
>>>>>
>>>>>protected
>>>>>DOUBLE_QUOTE_STRING
>>>>>   :
>>>>>       '"'! (~('"'))* '"'!
>>>>>   ;
>>>>>
>>>>>LITERAL
>>>>>   :
>>>>>       SINGLE_QUOTE_STRING | DOUBLE_QUOTE_STRING
>>>>>   ;
>>>>>
>>>>>IDENTIFIER
>>>>>
>>>>>   options
>>>>>   {
>>>>>       testLiterals=true;
>>>>>   }
>>>>>
>>>>>   :
>>>>>       ('\241'..'\377'|'a'..'z'|'A'..'Z'|'_')
>>>>>('\241'..'\377'|'a'..'z'|'A'..'Z'|'-'|'_'|'0'..'9'|'.')*
>>>>>   ;
>>>>>
>>>>>LEFT_PAREN
>>>>>   :    '('        ;
>>>>>
>>>>>RIGHT_PAREN
>>>>>   :    ')'        ;
>>>>>
>>>>>NOT
>>>>>   :    ("NOT"|"not")    ;
>>>>>
>>>>>AND
>>>>>   :    ("AND"|"and")    ;
>>>>>
>>>>>OR
>>>>>   :    ("OR"|"or")        ;
>>>>>
>>>>>EQUALS
>>>>>   :    '='        ;
>>>>>
>>>>>NOT_EQUALS
>>>>>   :    "<>"    ;
>>>>>
>>>>>LT
>>>>>   :    '<'        ;
>>>>>
>>>>>LTE
>>>>>   :    "<="    ;
>>>>>
>>>>>GT
>>>>>   :    '>'        ;
>>>>>
>>>>>GTE
>>>>>   :    ">="    ;
>>>>>
>>>>>
>>>>>          
>>>>>
>>>>
>>>>        
>>>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.antlr.org/pipermail/antlr-interest/attachments/20060207/fa427fd3/attachment-0001.html


More information about the antlr-interest mailing list