[antlr-interest] Handling lexical nondeterminism in Tokens

Mon Feb 6 03:40:47 PST 2006

Dear Mark,

I suggest using syntactic predicates. Increasing the lexer's lookahead
to 2 (k=2), for example, may also resolve the ambiguity between LT and
LTE, and between GT and GTE. However, if you use syntactic predicates
for all tokens, increasing the lookahead may not be necessary.
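
As a rough sketch of the k=2 alternative (untested against your full
grammar, and it only addresses the operator pairs, not the keywords
versus IDENTIFIER), the lexer options and the operator rules might look
like this, with the rules left non-protected as in your original file:

```antlr
class SearchQueryLexer extends Lexer;
    options
    {
        k=2;    // two characters of lookahead instead of the default one
        charVocabulary='\3'..'\377';
    }

// With k=2 the lexer can look at the character after '<' or '>'
// before it commits to one of these rules.
NOT_EQUALS : "<>" ;
LTE        : "<=" ;
LT         : '<'  ;
GTE        : ">=" ;
GT         : '>'  ;
```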

An example of using syntactic predicates for your grammar follows:

class SearchQueryLexer extends Lexer;
    options
    {
        charVocabulary='\3'..'\377';
    }

MAIN_LEXER_RULE
  : ( LITERAL ) => ( LITERAL { $setType( LITERAL ); } )

  | ( NOT_EQUALS ) => ( NOT_EQUALS { $setType( NOT_EQUALS ); } )
  | ( LTE ) => ( LTE { $setType( LTE ); } )
  | ( GTE ) => ( GTE { $setType( GTE ); } )

  | ( LT ) => ( LT { $setType( LT ); } )
  | ( GT ) => ( GT { $setType( GT ); } )

  | ( NOT ) => ( NOT { $setType( NOT ); } )
  | ( AND ) => ( AND { $setType( AND ); } )
  | ( OR ) => ( OR { $setType( OR ); } )

  | ( LEFT_PAREN ) => ( LEFT_PAREN { $setType( LEFT_PAREN ); } )
  | ( RIGHT_PAREN ) => ( RIGHT_PAREN { $setType( RIGHT_PAREN ); } )

  | ( EQUALS ) => ( EQUALS { $setType( EQUALS ); } )

  | ( IDENTIFIER ) => ( IDENTIFIER { $setType( IDENTIFIER ); } )

  | ( WS ) => ( WS { $setType( Token.SKIP ); } )

  ;

protected
WS
    :
        ('\n' | ' ' | '\t' | '\r')+
        {
            $setType(Token.SKIP);
        }
    ;

protected
SINGLE_QUOTE_STRING
    :
        '\''! (~('\''))* '\''!
    ;

protected
DOUBLE_QUOTE_STRING
    :
        '"'! (~('"'))* '"'!
    ;

protected
LITERAL
    :
        SINGLE_QUOTE_STRING | DOUBLE_QUOTE_STRING
    ;

protected
IDENTIFIER

    options
    {
        testLiterals=true;
    }

    :
        ('\241'..'\377'|'a'..'z'|'A'..'Z'|'_')
        ('\241'..'\377'|'a'..'z'|'A'..'Z'|'-'|'_'|'0'..'9'|'.')*
    ;

protected
LEFT_PAREN
    :    '('        ;

protected
RIGHT_PAREN
    :    ')'        ;

protected
NOT
    :    ("NOT"|"not")    ;

protected
AND
    :    ("AND"|"and")    ;

protected
OR
    :    ("OR"|"or")        ;

protected
EQUALS
    :    '='        ;

protected
NOT_EQUALS
    :    "<>"    ;

protected
LT
    :    '<'        ;

protected
LTE
    :    "<="    ;

protected
GT
    :    '>'        ;

protected
GTE
    :    ">="    ;

The syntactic predicates are in MAIN_LEXER_RULE. The order of the
alternatives in MAIN_LEXER_RULE is important, because the lexer tries
them in the order they are declared and stops at the first one whose
predicate matches. For example, LTE must come before LT, because
otherwise the lexer will match LT and then EQUALS instead of LTE.
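
To illustrate the ordering point, here is a hypothetical fragment with
the two alternatives reversed. The rule name BAD_ORDER is made up for
this example and is not part of the grammar above; LT and LTE are the
protected rules already shown.

```antlr
// Hypothetical, for illustration only: LT is listed before LTE.
// On input "<=" the ( LT ) predicate already succeeds on '<',
// so the LTE alternative is never reached.
BAD_ORDER
  : ( LT )  => ( LT  { $setType( LT );  } )
  | ( LTE ) => ( LTE { $setType( LTE ); } )
  ;
```

With the order used in MAIN_LEXER_RULE, "<=" is tried against LTE first
and should come out as a single LTE token.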

Let me know if this has solved your problem.

Best regards,
Gabriel

On 05/02/06, Mark R. Diggory <mdiggory at latte.harvard.edu> wrote:
> I'm still working on building a Parser for our query syntax. I've
> encountered an issue with nondeterminism. I've included my grammar file:
>
> My question is how can I assure that the boolean predicate AND not the
> quoted string literal "you AND I" do not collide? I'd be very thankful
> to anyone with comments about obvious problems with my grammar file.
>
> thanks,
> Mark
>
> > class SearchQueryParser extends Parser;
> >     options
> >     {
> >           k=3;
> >         exportVocab=SearchQuery;
> >         buildAST = true;   // uses CommonAST by default
> >
> >     }
> >
> >
> > expr
> >     :
> >         mexpr ((AND|OR|NOT) mexpr)*
> >     ;
> >
> > mexpr
> >     :
> >         LITERAL^ | IDENTIFIER^ ((EQUALS|NOT_EQUALS|LT|LTE|GT|GTE)
> > LITERAL^)+
> >     ;
> >
> >
> > atom
> >       :
> >           IDENTIFIER | LEFT_PAREN! expr RIGHT_PAREN!
> >     ;
> >
> > class SearchQueryLexer extends Lexer;
> >     options
> >     {
> >         charVocabulary='\3'..'\377';
> >     }
> >
> > WS
> >     :
> >         ('\n' | ' ' | '\t' | '\r')+
> >         {
> >             $setType(Token.SKIP);
> >         }
> >     ;
> >
> >
> > protected
> > SINGLE_QUOTE_STRING
> >     :
> >         '\''! (~('\''))* '\''!
> >     ;
> >
> > protected
> > DOUBLE_QUOTE_STRING
> >     :
> >         '"'! (~('"'))* '"'!
> >     ;
> >
> > LITERAL
> >     :
> >         SINGLE_QUOTE_STRING | DOUBLE_QUOTE_STRING
> >     ;
> >
> > IDENTIFIER
> >
> >     options
> >     {
> >         testLiterals=true;
> >     }
> >
> >     :
> >         ('\241'..'\377'|'a'..'z'|'A'..'Z'|'_')
> > ('\241'..'\377'|'a'..'z'|'A'..'Z'|'-'|'_'|'0'..'9'|'.')*
> >     ;
> >
> > LEFT_PAREN
> >     :    '('        ;
> >
> > RIGHT_PAREN
> >     :    ')'        ;
> >
> > NOT
> >     :    ("NOT"|"not")    ;
> >
> > AND
> >     :    ("AND"|"and")    ;
> >
> > OR
> >     :    ("OR"|"or")        ;
> >
> > EQUALS
> >     :    '='        ;
> >
> > NOT_EQUALS
> >     :    "<>"    ;
> >
> > LT
> >     :    '<'        ;
> >
> > LTE
> >     :    "<="    ;
> >
> > GT
> >     :    '>'        ;
> >
> > GTE
> >     :    ">="    ;
>
>
>