[antlr-interest] Handling lexical nondeterminism in Tokens
Mark R. Diggory
mdiggory at latte.harvard.edu
Mon Feb 6 07:25:43 PST 2006
Thanks, I think I understand how this will help in the Lexer. I'm
currently having problems how to capture how to properly represent the
syntax in the Parser
Here's a clarification of what I should be able to do with the query
language:
The smallest "atom" is a LITERAL string, currently this can be in quotes
or not in quotes. For example:
> United States
> "Untied States"
Alternatively a "atom" can be an equivalence relation. For instance:
> title="Untied States"
> title<>"Untied States"
> date>=2006
Each atom can have parentheses around it.
> (title="Untied States")
equivalence relations can be joined using boolean relations
> title="Untied States" AND date>=2006
> title="Untied States" OR date>=2006
> title="Untied States" NOT date>=2006
boolean relations can be wrapped in parentheses to control precedence.
> title="Untied States" AND (date>=2006 OR author=Steven King)
I've been trying to capture this using the following parser, but its
clear now that I'm missing the mark:
>class SearchQueryParser extends Parser;
> options
> {
> k=3;
> exportVocab=SearchQuery;
> buildAST = true; // uses CommonAST by default
>
> }
>
>expr
> :
> mexpr ((AND|OR|NOT) mexpr)*
> ;
>
>mexpr
> :
> LITERAL | (IDENTIFIER (EQUALS|NOT_EQUALS|LT|LTE|GT|GTE) LITERAL)+
> ;
>
>atom
> :
> LEFT_PAREN! mexpr RIGHT_PAREN! | LEFT_PAREN! expr RIGHT_PAREN!
> ;
>
thanks again for your advice,
Mark
Gabriel Radu wrote:
>Dear Mark,
>
>I suggest using syntactic predicates. Also increasing the lexers look
>ahead to 2 (k=2) for example may sort out the ambiguity between LT and
>LTE, and GT and GTE. However, if you use syntactic predicates for all
>tokens, increasing the look ahead may not be necessary.
>
>An example of using syntactic predicates for your grammar is following:
>
>
>class SearchQueryLexer extends Lexer;
> options
> {
> charVocabulary='\3'..'\377';
> }
>
>
>MAIN_LEXER_RULE
> : ( LITERAL ) => ( LITERAL { $setType( LITERAL ); } )
>
> | ( NOT_EQUALS ) => ( NOT_EQUALS { $setType( NOT_EQUALS ); } )
> | ( LTE ) => ( LTE { $setType( LTE ); } )
> | ( GTE ) => ( GTE { $setType( GTE ); } )
>
> | ( LT ) => ( LT { $setType( LT ); } )
> | ( GT ) => ( GT { $setType( GT ); } )
>
> | ( NOT ) => ( NOT { $setType( NOT ); } )
> | ( AND ) => ( AND { $setType( AND ); } )
> | ( OR ) => ( OR { $setType( OR ); } )
>
> | ( LEFT_PAREN ) => ( LEFT_PAREN { $setType( LEFT_PAREN ); } )
> | ( RIGHT_PAREN ) => ( RIGHT_PAREN { $setType( RIGHT_PAREN ); } )
>
> | ( EQUALS ) => ( EQUALS { $setType( EQUALS ); } )
>
> | ( IDENTIFIER ) => ( IDENTIFIER { $setType( IDENTIFIER ); } )
>
> | ( WS ) => WS
>
> ;
>
>protected
>WS
> :
> ('\n' | ' ' | '\t' | '\r')+
> {
> $setType(Token.SKIP);
> }
> ;
>
>
>protected
>SINGLE_QUOTE_STRING
> :
> '\''! (~('\''))* '\''!
> ;
>
>protected
>DOUBLE_QUOTE_STRING
> :
> '"'! (~('"'))* '"'!
> ;
>
>protected
>LITERAL
> :
> SINGLE_QUOTE_STRING | DOUBLE_QUOTE_STRING
> ;
>
>protected
>IDENTIFIER
>
> options
> {
> testLiterals=true;
> }
>
> :
> ('\241'..'\377'|'a'..'z'|'A'..'Z'|'_')
>('\241'..'\377'|'a'..'z'|'A'..'Z'|'-'|'_'|'0'..'9'|'.')*
> ;
>
>protected
>LEFT_PAREN
> : '(' ;
>
>protected
>RIGHT_PAREN
> : ')' ;
>
>protected
>NOT
> : ("NOT"|"not") ;
>
>protected
>AND
> : ("AND"|"and") ;
>
>protected
>OR
> : ("OR"|"or") ;
>
>protected
>EQUALS
> : '=' ;
>
>protected
>NOT_EQUALS
> : "<>" ;
>
>protected
>LT
> : '<' ;
>
>protected
>LTE
> : "<=" ;
>
>protected
>GT
> : '>' ;
>
>protected
>GTE
> : ">=" ;
>
>
>The syntactic predicates are in MAIN_LEXER_RULE. The order of
>productions (alternative rules) in MAIN_LEXER_RULE is important,
>because the lexer will try to match them in the order they are
>declared and will stop as soon as it finds a match. So for example LTE
>must be above LT because other ways the lexer will match the LT and
>then an EQUALS in stead of LTE.
>
>Let me know if this has solved your problem.
>
>
>Best regards,
>Gabriel
>
>
>
>
>On 05/02/06, Mark R. Diggory <mdiggory at latte.harvard.edu> wrote:
>
>
>>I'm still working on building a Parser for our query syntax. I've
>>encountered an issue with nondeterminism. I've included my grammar file:
>>
>>My question is how can I assure that the boolean predicate AND not the
>>quoted string literal "you AND I" do not collide? I'd be very thankful
>>to anyone with comments about obvious problems with my grammar file.
>>
>>thanks,
>>Mark
>>
>>
>>
>>>class SearchQueryParser extends Parser;
>>> options
>>> {
>>> k=3;
>>> exportVocab=SearchQuery;
>>> buildAST = true; // uses CommonAST by default
>>>
>>> }
>>>
>>>
>>>expr
>>> :
>>> mexpr ((AND|OR|NOT) mexpr)*
>>> ;
>>>
>>>mexpr
>>> :
>>> LITERAL^ | IDENTIFIER^ ((EQUALS|NOT_EQUALS|LT|LTE|GT|GTE)
>>>LITERAL^)+
>>> ;
>>>
>>>
>>>atom
>>> :
>>> IDENTIFIER | LEFT_PAREN! expr RIGHT_PAREN!
>>> ;
>>>
>>>class SearchQueryLexer extends Lexer;
>>> options
>>> {
>>> charVocabulary='\3'..'\377';
>>> }
>>>
>>>WS
>>> :
>>> ('\n' | ' ' | '\t' | '\r')+
>>> {
>>> $setType(Token.SKIP);
>>> }
>>> ;
>>>
>>>
>>>protected
>>>SINGLE_QUOTE_STRING
>>> :
>>> '\''! (~('\''))* '\''!
>>> ;
>>>
>>>protected
>>>DOUBLE_QUOTE_STRING
>>> :
>>> '"'! (~('"'))* '"'!
>>> ;
>>>
>>>LITERAL
>>> :
>>> SINGLE_QUOTE_STRING | DOUBLE_QUOTE_STRING
>>> ;
>>>
>>>IDENTIFIER
>>>
>>> options
>>> {
>>> testLiterals=true;
>>> }
>>>
>>> :
>>> ('\241'..'\377'|'a'..'z'|'A'..'Z'|'_')
>>>('\241'..'\377'|'a'..'z'|'A'..'Z'|'-'|'_'|'0'..'9'|'.')*
>>> ;
>>>
>>>LEFT_PAREN
>>> : '(' ;
>>>
>>>RIGHT_PAREN
>>> : ')' ;
>>>
>>>NOT
>>> : ("NOT"|"not") ;
>>>
>>>AND
>>> : ("AND"|"and") ;
>>>
>>>OR
>>> : ("OR"|"or") ;
>>>
>>>EQUALS
>>> : '=' ;
>>>
>>>NOT_EQUALS
>>> : "<>" ;
>>>
>>>LT
>>> : '<' ;
>>>
>>>LTE
>>> : "<=" ;
>>>
>>>GT
>>> : '>' ;
>>>
>>>GTE
>>> : ">=" ;
>>>
>>>
>>
>>
>>
More information about the antlr-interest
mailing list