[antlr-interest] Handling lexical nondeterminism in Tokens
Mark R. Diggory
mdiggory at latte.harvard.edu
Tue Feb 7 08:45:14 PST 2006
Thanks, that is very close to what I need.
Seems not to work on the following cases, which should all be valid.
"((foo='bar')AND bim='bam');"
"(foo='bar' AND bim='bam');"
"foo='bar' AND bim='bam';"
I've tried different combinations of defining the parethises and the
white space. Along the lines of the following
: LEFT_PAREN! expr RIGHT_PAREN!
| expr
But can't seem to get it right.
-Mark
Gabriel Radu wrote:
>Dear Mark,
>
>What about:
>
>
>
>>class SearchQueryParser extends Parser;
>> options
>> {
>> k=3;
>> exportVocab=SearchQuery;
>> buildAST = true; // uses CommonAST by default
>>
>> }
>>
>>statement
>> : LEFT_PAREN! expr RIGHT_PAREN! EOST
>> | expr EOST
>> ;
>>
>>expr
>> : mexpr ( bool_relations LEFT_PAREN! expr RIGHT_PAREN! )*
>> ;
>>
>>mexpr
>> : atom ( bool_relations atom )*
>> ;
>>
>>atom
>> : IDENTIFIER equivalence_relation LITERAL
>> ;
>>
>>bool_relations : AND|OR|NOT ;
>>
>>equivalence_relation : EQUALS|NOT_EQUALS|LT|LTE|GT|GTE ;
>>
>>
>
>where EOST is a token which marks the end of a statement. It can be
>something like end of line or semicolon. Don't forget to add it to the
>lexer as well.
>
>Let me know how you are getting on.
>
>
>Best regards,
>Gabriel
>
>
>
>
>On 06/02/06, Mark R. Diggory <mdiggory at latte.harvard.edu> wrote:
>
>
>>Thanks, I think I understand how this will help in the Lexer. I'm
>>currently having problems how to capture how to properly represent the
>>syntax in the Parser
>>
>>Here's a clarification of what I should be able to do with the query
>>language:
>>
>>The smallest "atom" is a LITERAL string, currently this can be in quotes
>>or not in quotes. For example:
>>
>>
>>
>>>United States
>>>"Untied States"
>>>
>>>
>>Alternatively a "atom" can be an equivalence relation. For instance:
>>
>>
>>
>>>title="Untied States"
>>>title<>"Untied States"
>>>date>=2006
>>>
>>>
>>Each atom can have parentheses around it.
>>
>>
>>
>>>(title="Untied States")
>>>
>>>
>>equivalence relations can be joined using boolean relations
>>
>>
>>
>>>title="Untied States" AND date>=2006
>>>title="Untied States" OR date>=2006
>>>title="Untied States" NOT date>=2006
>>>
>>>
>>boolean relations can be wrapped in parentheses to control precedence.
>>
>>
>>
>>>title="Untied States" AND (date>=2006 OR author=Steven King)
>>>
>>>
>>I've been trying to capture this using the following parser, but its
>>clear now that I'm missing the mark:
>>
>>
>>
>>>class SearchQueryParser extends Parser;
>>> options
>>> {
>>> k=3;
>>> exportVocab=SearchQuery;
>>> buildAST = true; // uses CommonAST by default
>>>
>>> }
>>>
>>>expr
>>> :
>>> mexpr ((AND|OR|NOT) mexpr)*
>>> ;
>>>
>>>mexpr
>>> :
>>> LITERAL | (IDENTIFIER (EQUALS|NOT_EQUALS|LT|LTE|GT|GTE) LITERAL)+
>>> ;
>>>
>>>atom
>>> :
>>> LEFT_PAREN! mexpr RIGHT_PAREN! | LEFT_PAREN! expr RIGHT_PAREN!
>>> ;
>>>
>>>
>>>
>>thanks again for your advice,
>>Mark
>>
>>Gabriel Radu wrote:
>>
>>
>>
>>>Dear Mark,
>>>
>>>I suggest using syntactic predicates. Also increasing the lexers look
>>>ahead to 2 (k=2) for example may sort out the ambiguity between LT and
>>>LTE, and GT and GTE. However, if you use syntactic predicates for all
>>>tokens, increasing the look ahead may not be necessary.
>>>
>>>An example of using syntactic predicates for your grammar is following:
>>>
>>>
>>>class SearchQueryLexer extends Lexer;
>>> options
>>> {
>>> charVocabulary='\3'..'\377';
>>> }
>>>
>>>
>>>MAIN_LEXER_RULE
>>> : ( LITERAL ) => ( LITERAL { $setType( LITERAL ); } )
>>>
>>> | ( NOT_EQUALS ) => ( NOT_EQUALS { $setType( NOT_EQUALS ); } )
>>> | ( LTE ) => ( LTE { $setType( LTE ); } )
>>> | ( GTE ) => ( GTE { $setType( GTE ); } )
>>>
>>> | ( LT ) => ( LT { $setType( LT ); } )
>>> | ( GT ) => ( GT { $setType( GT ); } )
>>>
>>> | ( NOT ) => ( NOT { $setType( NOT ); } )
>>> | ( AND ) => ( AND { $setType( AND ); } )
>>> | ( OR ) => ( OR { $setType( OR ); } )
>>>
>>> | ( LEFT_PAREN ) => ( LEFT_PAREN { $setType( LEFT_PAREN ); } )
>>> | ( RIGHT_PAREN ) => ( RIGHT_PAREN { $setType( RIGHT_PAREN ); } )
>>>
>>> | ( EQUALS ) => ( EQUALS { $setType( EQUALS ); } )
>>>
>>> | ( IDENTIFIER ) => ( IDENTIFIER { $setType( IDENTIFIER ); } )
>>>
>>> | ( WS ) => WS
>>>
>>> ;
>>>
>>>protected
>>>WS
>>> :
>>> ('\n' | ' ' | '\t' | '\r')+
>>> {
>>> $setType(Token.SKIP);
>>> }
>>> ;
>>>
>>>
>>>protected
>>>SINGLE_QUOTE_STRING
>>> :
>>> '\''! (~('\''))* '\''!
>>> ;
>>>
>>>protected
>>>DOUBLE_QUOTE_STRING
>>> :
>>> '"'! (~('"'))* '"'!
>>> ;
>>>
>>>protected
>>>LITERAL
>>> :
>>> SINGLE_QUOTE_STRING | DOUBLE_QUOTE_STRING
>>> ;
>>>
>>>protected
>>>IDENTIFIER
>>>
>>> options
>>> {
>>> testLiterals=true;
>>> }
>>>
>>> :
>>> ('\241'..'\377'|'a'..'z'|'A'..'Z'|'_')
>>>('\241'..'\377'|'a'..'z'|'A'..'Z'|'-'|'_'|'0'..'9'|'.')*
>>> ;
>>>
>>>protected
>>>LEFT_PAREN
>>> : '(' ;
>>>
>>>protected
>>>RIGHT_PAREN
>>> : ')' ;
>>>
>>>protected
>>>NOT
>>> : ("NOT"|"not") ;
>>>
>>>protected
>>>AND
>>> : ("AND"|"and") ;
>>>
>>>protected
>>>OR
>>> : ("OR"|"or") ;
>>>
>>>protected
>>>EQUALS
>>> : '=' ;
>>>
>>>protected
>>>NOT_EQUALS
>>> : "<>" ;
>>>
>>>protected
>>>LT
>>> : '<' ;
>>>
>>>protected
>>>LTE
>>> : "<=" ;
>>>
>>>protected
>>>GT
>>> : '>' ;
>>>
>>>protected
>>>GTE
>>> : ">=" ;
>>>
>>>
>>>The syntactic predicates are in MAIN_LEXER_RULE. The order of
>>>productions (alternative rules) in MAIN_LEXER_RULE is important,
>>>because the lexer will try to match them in the order they are
>>>declared and will stop as soon as it finds a match. So for example LTE
>>>must be above LT because other ways the lexer will match the LT and
>>>then an EQUALS in stead of LTE.
>>>
>>>Let me know if this has solved your problem.
>>>
>>>
>>>Best regards,
>>>Gabriel
>>>
>>>
>>>
>>>
>>>On 05/02/06, Mark R. Diggory <mdiggory at latte.harvard.edu> wrote:
>>>
>>>
>>>
>>>
>>>>I'm still working on building a Parser for our query syntax. I've
>>>>encountered an issue with nondeterminism. I've included my grammar file:
>>>>
>>>>My question is how can I assure that the boolean predicate AND not the
>>>>quoted string literal "you AND I" do not collide? I'd be very thankful
>>>>to anyone with comments about obvious problems with my grammar file.
>>>>
>>>>thanks,
>>>>Mark
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>>class SearchQueryParser extends Parser;
>>>>> options
>>>>> {
>>>>> k=3;
>>>>> exportVocab=SearchQuery;
>>>>> buildAST = true; // uses CommonAST by default
>>>>>
>>>>> }
>>>>>
>>>>>
>>>>>expr
>>>>> :
>>>>> mexpr ((AND|OR|NOT) mexpr)*
>>>>> ;
>>>>>
>>>>>mexpr
>>>>> :
>>>>> LITERAL^ | IDENTIFIER^ ((EQUALS|NOT_EQUALS|LT|LTE|GT|GTE)
>>>>>LITERAL^)+
>>>>> ;
>>>>>
>>>>>
>>>>>atom
>>>>> :
>>>>> IDENTIFIER | LEFT_PAREN! expr RIGHT_PAREN!
>>>>> ;
>>>>>
>>>>>class SearchQueryLexer extends Lexer;
>>>>> options
>>>>> {
>>>>> charVocabulary='\3'..'\377';
>>>>> }
>>>>>
>>>>>WS
>>>>> :
>>>>> ('\n' | ' ' | '\t' | '\r')+
>>>>> {
>>>>> $setType(Token.SKIP);
>>>>> }
>>>>> ;
>>>>>
>>>>>
>>>>>protected
>>>>>SINGLE_QUOTE_STRING
>>>>> :
>>>>> '\''! (~('\''))* '\''!
>>>>> ;
>>>>>
>>>>>protected
>>>>>DOUBLE_QUOTE_STRING
>>>>> :
>>>>> '"'! (~('"'))* '"'!
>>>>> ;
>>>>>
>>>>>LITERAL
>>>>> :
>>>>> SINGLE_QUOTE_STRING | DOUBLE_QUOTE_STRING
>>>>> ;
>>>>>
>>>>>IDENTIFIER
>>>>>
>>>>> options
>>>>> {
>>>>> testLiterals=true;
>>>>> }
>>>>>
>>>>> :
>>>>> ('\241'..'\377'|'a'..'z'|'A'..'Z'|'_')
>>>>>('\241'..'\377'|'a'..'z'|'A'..'Z'|'-'|'_'|'0'..'9'|'.')*
>>>>> ;
>>>>>
>>>>>LEFT_PAREN
>>>>> : '(' ;
>>>>>
>>>>>RIGHT_PAREN
>>>>> : ')' ;
>>>>>
>>>>>NOT
>>>>> : ("NOT"|"not") ;
>>>>>
>>>>>AND
>>>>> : ("AND"|"and") ;
>>>>>
>>>>>OR
>>>>> : ("OR"|"or") ;
>>>>>
>>>>>EQUALS
>>>>> : '=' ;
>>>>>
>>>>>NOT_EQUALS
>>>>> : "<>" ;
>>>>>
>>>>>LT
>>>>> : '<' ;
>>>>>
>>>>>LTE
>>>>> : "<=" ;
>>>>>
>>>>>GT
>>>>> : '>' ;
>>>>>
>>>>>GTE
>>>>> : ">=" ;
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.antlr.org/pipermail/antlr-interest/attachments/20060207/fa427fd3/attachment-0001.html
More information about the antlr-interest
mailing list