[antlr-interest] Handling lexical nondeterminism in Tokens

Gabriel Radu gabriel.adrian.radu at googlemail.com
Tue Feb 7 10:06:09 PST 2006


Do you have any test cases which work? And what is the generated
parser complaining about when you run the test cases mentioned in your
previous email?

Best regards,
Gabriel



On 07/02/06, Mark R. Diggory <mdiggory at latte.harvard.edu> wrote:
>  Thanks, that is very close to what I need.
>
>  Seems not to work on the following cases, which should all be valid.
>
>  "((foo='bar')AND bim='bam');"
>  "(foo='bar' AND bim='bam');"
>  "foo='bar' AND bim='bam';"
>
>  I've tried different combinations of defining the parethises and the white
> space. Along the lines of the following
>
>  : LEFT_PAREN! expr RIGHT_PAREN!
>  | expr
>  But can't seem to get it right.
>
>  -Mark
>
>
>  Gabriel Radu wrote:
>  Dear Mark,
>
> What about:
>
>
>
>  class SearchQueryParser extends Parser;
>  options
>  {
>  k=3;
>  exportVocab=SearchQuery;
>  buildAST = true; // uses CommonAST by default
>
>  }
>
> statement
>  : LEFT_PAREN! expr RIGHT_PAREN! EOST
>  | expr EOST
>  ;
>
> expr
>  : mexpr ( bool_relations LEFT_PAREN! expr RIGHT_PAREN! )*
>  ;
>
> mexpr
>  : atom ( bool_relations atom )*
>  ;
>
> atom
>  : IDENTIFIER equivalence_relation LITERAL
>  ;
>
> bool_relations : AND|OR|NOT ;
>
> equivalence_relation : EQUALS|NOT_EQUALS|LT|LTE|GT|GTE ;
>
>  where EOST is a token which marks the end of a statement. It can be
> something like end of line or semicolon. Don't forget to add it to the
> lexer as well.
>
> Let me know how you are getting on.
>
>
> Best regards,
> Gabriel
>
>
>
>
> On 06/02/06, Mark R. Diggory <mdiggory at latte.harvard.edu> wrote:
>
>
>  Thanks, I think I understand how this will help in the Lexer. I'm
> currently having problems how to capture how to properly represent the
> syntax in the Parser
>
> Here's a clarification of what I should be able to do with the query
> language:
>
> The smallest "atom" is a LITERAL string, currently this can be in quotes
> or not in quotes. For example:
>
>
>
>  United States
> "Untied States"
>
>  Alternatively a "atom" can be an equivalence relation. For instance:
>
>
>
>  title="Untied States"
> title<>"Untied States"
> date>=2006
>
>  Each atom can have parentheses around it.
>
>
>
>  (title="Untied States")
>
>  equivalence relations can be joined using boolean relations
>
>
>
>  title="Untied States" AND date>=2006
> title="Untied States" OR date>=2006
> title="Untied States" NOT date>=2006
>
>  boolean relations can be wrapped in parentheses to control precedence.
>
>
>
>  title="Untied States" AND (date>=2006 OR author=Steven King)
>
>  I've been trying to capture this using the following parser, but its
> clear now that I'm missing the mark:
>
>
>
>  class SearchQueryParser extends Parser;
>  options
>  {
>  k=3;
>  exportVocab=SearchQuery;
>  buildAST = true; // uses CommonAST by default
>
>  }
>
> expr
>  :
>  mexpr ((AND|OR|NOT) mexpr)*
>  ;
>
> mexpr
>  :
>  LITERAL | (IDENTIFIER (EQUALS|NOT_EQUALS|LT|LTE|GT|GTE)
> LITERAL)+
>  ;
>
> atom
>  :
>  LEFT_PAREN! mexpr RIGHT_PAREN! | LEFT_PAREN! expr RIGHT_PAREN!
>  ;
>
>
>  thanks again for your advice,
> Mark
>
> Gabriel Radu wrote:
>
>
>
>  Dear Mark,
>
> I suggest using syntactic predicates. Also increasing the lexers look
> ahead to 2 (k=2) for example may sort out the ambiguity between LT and
> LTE, and GT and GTE. However, if you use syntactic predicates for all
> tokens, increasing the look ahead may not be necessary.
>
> An example of using syntactic predicates for your grammar is following:
>
>
> class SearchQueryLexer extends Lexer;
>  options
>  {
>  charVocabulary='\3'..'\377';
>  }
>
>
> MAIN_LEXER_RULE
>  : ( LITERAL ) => ( LITERAL { $setType( LITERAL ); } )
>
>  | ( NOT_EQUALS ) => ( NOT_EQUALS { $setType( NOT_EQUALS ); } )
>  | ( LTE ) => ( LTE { $setType( LTE ); } )
>  | ( GTE ) => ( GTE { $setType( GTE ); } )
>
>  | ( LT ) => ( LT { $setType( LT ); } )
>  | ( GT ) => ( GT { $setType( GT ); } )
>
>  | ( NOT ) => ( NOT { $setType( NOT ); } )
>  | ( AND ) => ( AND { $setType( AND ); } )
>  | ( OR ) => ( OR { $setType( OR ); } )
>
>  | ( LEFT_PAREN ) => ( LEFT_PAREN { $setType( LEFT_PAREN ); } )
>  | ( RIGHT_PAREN ) => ( RIGHT_PAREN { $setType( RIGHT_PAREN ); } )
>
>  | ( EQUALS ) => ( EQUALS { $setType( EQUALS ); } )
>
>  | ( IDENTIFIER ) => ( IDENTIFIER { $setType( IDENTIFIER ); } )
>
>  | ( WS ) => WS
>
>  ;
>
> protected
> WS
>  :
>  ('\n' | ' ' | '\t' | '\r')+
>  {
>  $setType(Token.SKIP);
>  }
>  ;
>
>
> protected
> SINGLE_QUOTE_STRING
>  :
>  '\''! (~('\''))* '\''!
>  ;
>
> protected
> DOUBLE_QUOTE_STRING
>  :
>  '"'! (~('"'))* '"'!
>  ;
>
> protected
> LITERAL
>  :
>  SINGLE_QUOTE_STRING | DOUBLE_QUOTE_STRING
>  ;
>
> protected
> IDENTIFIER
>
>  options
>  {
>  testLiterals=true;
>  }
>
>  :
>  ('\241'..'\377'|'a'..'z'|'A'..'Z'|'_')
> ('\241'..'\377'|'a'..'z'|'A'..'Z'|'-'|'_'|'0'..'9'|'.')*
>  ;
>
> protected
> LEFT_PAREN
>  : '(' ;
>
> protected
> RIGHT_PAREN
>  : ')' ;
>
> protected
> NOT
>  : ("NOT"|"not") ;
>
> protected
> AND
>  : ("AND"|"and") ;
>
> protected
> OR
>  : ("OR"|"or") ;
>
> protected
> EQUALS
>  : '=' ;
>
> protected
> NOT_EQUALS
>  : "<>" ;
>
> protected
> LT
>  : '<' ;
>
> protected
> LTE
>  : "<=" ;
>
> protected
> GT
>  : '>' ;
>
> protected
> GTE
>  : ">=" ;
>
>
> The syntactic predicates are in MAIN_LEXER_RULE. The order of
> productions (alternative rules) in MAIN_LEXER_RULE is important,
> because the lexer will try to match them in the order they are
> declared and will stop as soon as it finds a match. So for example LTE
> must be above LT because other ways the lexer will match the LT and
> then an EQUALS in stead of LTE.
>
> Let me know if this has solved your problem.
>
>
> Best regards,
> Gabriel
>
>
>
>
> On 05/02/06, Mark R. Diggory <mdiggory at latte.harvard.edu> wrote:
>
>
>
>
>  I'm still working on building a Parser for our query syntax. I've
> encountered an issue with nondeterminism. I've included my grammar file:
>
> My question is how can I assure that the boolean predicate AND not the
> quoted string literal "you AND I" do not collide? I'd be very thankful
> to anyone with comments about obvious problems with my grammar file.
>
> thanks,
> Mark
>
>
>
>
>
>  class SearchQueryParser extends Parser;
>  options
>  {
>  k=3;
>  exportVocab=SearchQuery;
>  buildAST = true; // uses CommonAST by default
>
>  }
>
>
> expr
>  :
>  mexpr ((AND|OR|NOT) mexpr)*
>  ;
>
> mexpr
>  :
>  LITERAL^ | IDENTIFIER^ ((EQUALS|NOT_EQUALS|LT|LTE|GT|GTE)
> LITERAL^)+
>  ;
>
>
> atom
>  :
>  IDENTIFIER | LEFT_PAREN! expr RIGHT_PAREN!
>  ;
>
> class SearchQueryLexer extends Lexer;
>  options
>  {
>  charVocabulary='\3'..'\377';
>  }
>
> WS
>  :
>  ('\n' | ' ' | '\t' | '\r')+
>  {
>  $setType(Token.SKIP);
>  }
>  ;
>
>
> protected
> SINGLE_QUOTE_STRING
>  :
>  '\''! (~('\''))* '\''!
>  ;
>
> protected
> DOUBLE_QUOTE_STRING
>  :
>  '"'! (~('"'))* '"'!
>  ;
>
> LITERAL
>  :
>  SINGLE_QUOTE_STRING | DOUBLE_QUOTE_STRING
>  ;
>
> IDENTIFIER
>
>  options
>  {
>  testLiterals=true;
>  }
>
>  :
>  ('\241'..'\377'|'a'..'z'|'A'..'Z'|'_')
> ('\241'..'\377'|'a'..'z'|'A'..'Z'|'-'|'_'|'0'..'9'|'.')*
>  ;
>
> LEFT_PAREN
>  : '(' ;
>
> RIGHT_PAREN
>  : ')' ;
>
> NOT
>  : ("NOT"|"not") ;
>
> AND
>  : ("AND"|"and") ;
>
> OR
>  : ("OR"|"or") ;
>
> EQUALS
>  : '=' ;
>
> NOT_EQUALS
>  : "<>" ;
>
> LT
>  : '<' ;
>
> LTE
>  : "<=" ;
>
> GT
>  : '>' ;
>
> GTE
>  : ">=" ;
>
>
>
>
>
>


More information about the antlr-interest mailing list