[antlr-interest] Handling lexical nondeterminism in Tokens

Mon Feb 6 07:25:43 PST 2006

Thanks, I think I understand how this will help in the Lexer. I'm 
currently having problems how to capture how to properly represent the 
syntax in the Parser

Here's a clarification of what I should be able to do with the query 
language:

The smallest "atom" is a LITERAL string, currently this can be in quotes 
or not in quotes. For example:

> United States
> "Untied States"

Alternatively a "atom" can be an equivalence relation. For instance:

> title="Untied States"
> title<>"Untied States"
> date>=2006

Each atom can have parentheses around it.

> (title="Untied States")

equivalence relations can be joined using boolean relations

> title="Untied States" AND date>=2006
> title="Untied States" OR date>=2006
> title="Untied States" NOT date>=2006

boolean relations can be wrapped in parentheses to control precedence.

> title="Untied States" AND (date>=2006 OR author=Steven King)

I've been trying to capture this using the following parser, but its 
clear now that I'm missing the mark:

>class SearchQueryParser extends Parser;
>    options
>    {
>          k=3;
>        exportVocab=SearchQuery;
>        buildAST = true;   // uses CommonAST by default
>
>    }
>
>expr
>    :
>        mexpr ((AND|OR|NOT) mexpr)*
>    ;
>
>mexpr
>    :
>        LITERAL | (IDENTIFIER (EQUALS|NOT_EQUALS|LT|LTE|GT|GTE) LITERAL)+ 
>    ;
>
>atom
>      :
>          LEFT_PAREN! mexpr RIGHT_PAREN! | LEFT_PAREN! expr RIGHT_PAREN!
>    ;
>
thanks again for your advice,
Mark

Gabriel Radu wrote:

>Dear Mark,
>
>I suggest using syntactic predicates. Also increasing the lexers look
>ahead to 2 (k=2) for example may sort out the ambiguity between LT and
>LTE, and GT and GTE. However, if you use syntactic predicates for all
>tokens, increasing the look ahead may not be necessary.
>
>An example of using syntactic predicates for your grammar is following:
>
>
>class SearchQueryLexer extends Lexer;
>    options
>    {
>        charVocabulary='\3'..'\377';
>    }
>
>
>MAIN_LEXER_RULE
>  : ( LITERAL ) => ( LITERAL { $setType( LITERAL ); } )
>
>  | ( NOT_EQUALS ) => ( NOT_EQUALS { $setType( NOT_EQUALS ); } )
>  | ( LTE ) => ( LTE { $setType( LTE ); } )
>  | ( GTE ) => ( GTE { $setType( GTE ); } )
>
>  | ( LT ) => ( LT { $setType( LT ); } )
>  | ( GT ) => ( GT { $setType( GT ); } )
>
>  | ( NOT ) => ( NOT { $setType( NOT ); } )
>  | ( AND ) => ( AND { $setType( AND ); } )
>  | ( OR ) => ( OR { $setType( OR ); } )
>
>  | ( LEFT_PAREN ) => ( LEFT_PAREN { $setType( LEFT_PAREN ); } )
>  | ( RIGHT_PAREN ) => ( RIGHT_PAREN { $setType( RIGHT_PAREN ); } )
>
>  | ( EQUALS ) => ( EQUALS { $setType( EQUALS ); } )
>
>  | ( IDENTIFIER ) => ( IDENTIFIER { $setType( IDENTIFIER ); } )
>
>  | ( WS ) => WS
>
>  ;
>
>protected
>WS
>    :
>        ('\n' | ' ' | '\t' | '\r')+
>        {
>            $setType(Token.SKIP);
>        }
>    ;
>
>
>protected
>SINGLE_QUOTE_STRING
>    :
>        '\''! (~('\''))* '\''!
>    ;
>
>protected
>DOUBLE_QUOTE_STRING
>    :
>        '"'! (~('"'))* '"'!
>    ;
>
>protected
>LITERAL
>    :
>        SINGLE_QUOTE_STRING | DOUBLE_QUOTE_STRING
>    ;
>
>protected
>IDENTIFIER
>
>    options
>    {
>        testLiterals=true;
>    }
>
>    :
>        ('\241'..'\377'|'a'..'z'|'A'..'Z'|'_')
>('\241'..'\377'|'a'..'z'|'A'..'Z'|'-'|'_'|'0'..'9'|'.')*
>    ;
>
>protected
>LEFT_PAREN
>    :    '('        ;
>
>protected
>RIGHT_PAREN
>    :    ')'        ;
>
>protected
>NOT
>    :    ("NOT"|"not")    ;
>
>protected
>AND
>    :    ("AND"|"and")    ;
>
>protected
>OR
>    :    ("OR"|"or")        ;
>
>protected
>EQUALS
>    :    '='        ;
>
>protected
>NOT_EQUALS
>    :    "<>"    ;
>
>protected
>LT
>    :    '<'        ;
>
>protected
>LTE
>    :    "<="    ;
>
>protected
>GT
>    :    '>'        ;
>
>protected
>GTE
>    :    ">="    ;
>
>
>The syntactic predicates are in MAIN_LEXER_RULE. The order of
>productions (alternative rules) in MAIN_LEXER_RULE is important,
>because the lexer will try to match them in the order they are
>declared and will stop as soon as it finds a match. So for example LTE
>must be above LT because other ways the lexer will match the LT and
>then an EQUALS in stead of LTE.
>
>Let me know if this has solved your problem.
>
>
>Best regards,
>Gabriel
>
>
>
>
>On 05/02/06, Mark R. Diggory <mdiggory at latte.harvard.edu> wrote:
>  
>
>>I'm still working on building a Parser for our query syntax. I've
>>encountered an issue with nondeterminism. I've included my grammar file:
>>
>>My question is how can I assure that the boolean predicate AND not the
>>quoted string literal "you AND I" do not collide? I'd be very thankful
>>to anyone with comments about obvious problems with my grammar file.
>>
>>thanks,
>>Mark
>>
>>    
>>
>>>class SearchQueryParser extends Parser;
>>>    options
>>>    {
>>>          k=3;
>>>        exportVocab=SearchQuery;
>>>        buildAST = true;   // uses CommonAST by default
>>>
>>>    }
>>>
>>>
>>>expr
>>>    :
>>>        mexpr ((AND|OR|NOT) mexpr)*
>>>    ;
>>>
>>>mexpr
>>>    :
>>>        LITERAL^ | IDENTIFIER^ ((EQUALS|NOT_EQUALS|LT|LTE|GT|GTE)
>>>LITERAL^)+
>>>    ;
>>>
>>>
>>>atom
>>>      :
>>>          IDENTIFIER | LEFT_PAREN! expr RIGHT_PAREN!
>>>    ;
>>>
>>>class SearchQueryLexer extends Lexer;
>>>    options
>>>    {
>>>        charVocabulary='\3'..'\377';
>>>    }
>>>
>>>WS
>>>    :
>>>        ('\n' | ' ' | '\t' | '\r')+
>>>        {
>>>            $setType(Token.SKIP);
>>>        }
>>>    ;
>>>
>>>
>>>protected
>>>SINGLE_QUOTE_STRING
>>>    :
>>>        '\''! (~('\''))* '\''!
>>>    ;
>>>
>>>protected
>>>DOUBLE_QUOTE_STRING
>>>    :
>>>        '"'! (~('"'))* '"'!
>>>    ;
>>>
>>>LITERAL
>>>    :
>>>        SINGLE_QUOTE_STRING | DOUBLE_QUOTE_STRING
>>>    ;
>>>
>>>IDENTIFIER
>>>
>>>    options
>>>    {
>>>        testLiterals=true;
>>>    }
>>>
>>>    :
>>>        ('\241'..'\377'|'a'..'z'|'A'..'Z'|'_')
>>>('\241'..'\377'|'a'..'z'|'A'..'Z'|'-'|'_'|'0'..'9'|'.')*
>>>    ;
>>>
>>>LEFT_PAREN
>>>    :    '('        ;
>>>
>>>RIGHT_PAREN
>>>    :    ')'        ;
>>>
>>>NOT
>>>    :    ("NOT"|"not")    ;
>>>
>>>AND
>>>    :    ("AND"|"and")    ;
>>>
>>>OR
>>>    :    ("OR"|"or")        ;
>>>
>>>EQUALS
>>>    :    '='        ;
>>>
>>>NOT_EQUALS
>>>    :    "<>"    ;
>>>
>>>LT
>>>    :    '<'        ;
>>>
>>>LTE
>>>    :    "<="    ;
>>>
>>>GT
>>>    :    '>'        ;
>>>
>>>GTE
>>>    :    ">="    ;
>>>      
>>>
>>
>>    
>>