[antlr-interest] DMQL Grammar - ANTLR Eats Characters

Fri Mar 20 06:23:43 PDT 2009

Unfortunately, ANTLR's lexer implementation doesn't backtrack (perhaps 
for efficiency reasons). Backtracking and taking an alternative can be 
done only in the parser. This was already discussed on the list; see:
http://www.antlr.org/pipermail/antlr-interest/2009-February/032981.html
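
To make the effect concrete, here is a minimal sketch in Python of a lexer that commits to a keyword from a short prefix and never rewinds. It is a toy model, not ANTLR's generated code: the function name, the commit prefixes, and the recovery rule (drop the consumed prefix plus the offending character) are assumptions chosen to mirror the behaviour reported in this thread.

```python
# Toy model (not ANTLR itself): a lexer that picks a rule from a short
# prefix, then never rewinds when the rest of the literal fails to match.

# Keyword -> prefix at which the toy lexer commits. These prefixes are
# illustrative assumptions, not read from ANTLR's generated DFA.
COMMIT_PREFIX = {"TODAY": "TO", "AND": "AN", "OR": "OR", "NOT": "NOT", "NOW": "NOW"}


def lex_without_backtracking(text):
    """Return (tokens, dropped) for a prefix-committed, non-rewinding lexer."""
    tokens, dropped = [], []
    pos = 0
    while pos < len(text):
        keyword = next(
            (kw for kw, prefix in COMMIT_PREFIX.items() if text.startswith(prefix, pos)),
            None,
        )
        if keyword is None:
            # No keyword prefix ahead: emit a single-character token.
            tokens.append(text[pos])
            pos += 1
            continue
        # Committed: consume the literal one character at a time.
        matched = 0
        while (matched < len(keyword) and pos + matched < len(text)
               and text[pos + matched] == keyword[matched]):
            matched += 1
        if matched == len(keyword):
            tokens.append(keyword)
            pos += matched
        else:
            # Mismatch: the consumed prefix and the offending character are
            # discarded, and lexing resumes after them with no rewind.
            end = min(pos + matched + 1, len(text))
            dropped.append(text[pos:end])
            pos = end
    return tokens, dropped


tokens, dropped = lex_without_backtracking("(f=I?TORO)")
print("".join(tokens))  # (f=I?O)
print(dropped)          # ['TOR']
```

Under these assumptions the model commits to TODAY on seeing "TO", fails at "R", and loses "TOR", which matches the (f=I?O) result from the original report.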

- Indhu

Mihai Danila wrote:
>
> Thanks Indhu,
>
> In the link you sent, you troubleshoot a slightly different problem, but the 
> post did help.
>
> In my scenario, the lexer chooses a rule based on a prefix and fails 
> to fall back to try a collection of shorter tokens. The lexer doesn't 
> go as far as TOR before deciding simply because by the time a TO is 
> read there is no alternative to TO in lexer scope (except there would 
> be if it wasn't greedy as per my note below). Your indication about 
> the longest possible token policy has cleared it up for me. The only 
> alternative to TODAY by the time TO has been read is to create an 
> alphanumeric out of alphanumericTokens, and of course that is a parser 
> rule and is therefore outside of the lexer's horizon. This must be 
> the problem.
>
> A question still remains. If the lexer cannot create a valid token 
> without dropping characters, shouldn't it fall back and try to produce 
> smaller tokens (which my grammar allows for, the smaller tokens being 
> D and A) to give a chance to the parser? Apparently, the lexer is 
> prematurely moving into an error state without noticing that a 
> different token arrangement would keep it in the green.
>
>
> Mihai
>
> On Tue, Mar 10, 2009 at 3:48 AM, Indhu Bharathi 
> <indhu.b at s7software.com> wrote:
>
>     Try this:
>
>     Today: ( (Today_) => 'TODAY' ) ;
>     fragment Today_
>         :    'TODAY'
>         ;
>
>     However, I'm not sure if this is the most elegant way to fix it.
>
>     Read the following thread to understand more on why exactly this
>     happens:
>     http://www.antlr.org/pipermail/antlr-interest/2009-February/032959.html
>
>     - Indhu
>
>
>     ----- Original Message -----
>     From: Mihai Danila <viridium at gmail.com>
>     To: antlr-interest at antlr.org
>     Sent: Tuesday, March 10, 2009 6:30:43 AM GMT+0530 Asia/Calcutta
>     Subject: [antlr-interest] DMQL Grammar - ANTLR Eats Characters
>
>
>     Hi,
>
>     I thought I had my DMQL grammar nailed after several months of no
>     issues, until recently a query failed. I've already massaged the
>     grammar in a few ways so I'm a bit at a loss as to what the
>     problem is this time. Do I have to enumerate all the possible
>     token prefixes (including TO, TOD, TODA, N, NO, A, AN, O) in the
>     alphanumericToken rule to fix this one? Am I missing something?
>
>     Here's the query:
>     (f=I?TORO)
>
>     If I debug this, here's what ANTLR parses:
>     (f=I?O)
>
>     Here's the grammar:
>     grammar Dmql;
>
>     options {
>     output=AST;
>     }
>
>     tokens {
>     Or; And; Not;
>     FieldCriteria;
>     LookupAnd; LookupNot; LookupOr; LookupAny;
>     StringList; StringEquals; StringStartsWith;
>     StringContains; StringChar; EmptyString;
>     RangeList; RangeBetween; RangeGreater; RangeLower;
>     ConstantValue;
>     }
>
>     @header { package com.stratusdata.dmql.parser.antlr; }
>     @lexer::header { package com.stratusdata.dmql.parser.antlr; }
>
>     @rulecatch {
>       catch (RecognitionException re) {
>         throw re;
>       }
>     }
>
>     dmql: searchCondition;
>     searchCondition: queryClause (('|' | BoolOr) queryClause)* -> ^(Or
>     queryClause+);
>     queryClause: booleanElement ((',' | BoolAnd) booleanElement)* ->
>     ^(And booleanElement+);
>     booleanElement: queryElement | ('~' | BoolNot) queryElement ->
>     ^(Not queryElement);
>     queryElement: '('! (fieldCriteria | searchCondition) ')'!;
>
>     fieldCriteria: field '=' fieldValue -> ^(FieldCriteria field
>     fieldValue);
>     field: ('_' | alphanumericToken)+ -> ConstantValue[$field.text];
>     fieldValue: lookupList | stringList | rangeList | nonInteger |
>     period | stringLiteral | empty;
>     stringLiteral: StringLiteral;
>     empty: '.EMPTY.' -> EmptyString;
>
>     lookupList: lookupOr | lookupAnd | lookupNot | lookupAny;
>     lookupOr: '|' lookup (',' lookup)* -> ^(LookupOr lookup+);
>     lookupAnd: '+' lookup (',' lookup)* -> ^(LookupAnd lookup+);
>     lookupNot: '~' lookup (',' lookup)* -> ^(LookupNot lookup+);
>     lookupAny: '.ANY.' -> LookupAny;
>     lookup: alphanumeric | stringLiteral;
>
>     stringList: string (',' string)* -> ^(StringList string+);
>     string: stringEq | stringStart | stringContains | stringChar;
>     stringEq: alphanumeric -> ^(StringEquals alphanumeric);
>     stringStart: alphanumeric '*'  -> ^(StringStartsWith alphanumeric);
>     stringContains: '*' alphanumeric '*' -> ^(StringContains
>     alphanumeric);
>     stringChar: alphanumeric? ('?' alphanumeric?)+ -> ^(StringChar
>     ConstantValue[$stringChar.text]);
>
>     rangeList: dateTimeRangeList | dateRangeList | timeRangeList |
>     numericRangeList;
>     dateTimeRangeList: dateTimeRange (',' dateTimeRange)* ->
>     ^(RangeList dateTimeRange+);
>     dateRangeList: dateRange (',' dateRange)* -> ^(RangeList dateRange+);
>     timeRangeList: timeRange (',' timeRange)* -> ^(RangeList timeRange+);
>     numericRangeList: numericRange (',' numericRange)* -> ^(RangeList
>     numericRange+);
>     dateTimeRange: x=dateTime '-' y=dateTime -> ^(RangeBetween $x $y)
>     | x=dateTime '-' -> ^(RangeLower $x)
>     | x=dateTime '+' -> ^(RangeGreater $x);
>     dateRange: x=date '-' y=date -> ^(RangeBetween $x $y)
>     | x=date '-' -> ^(RangeLower $x)
>     | x=date '+' -> ^(RangeGreater $x);
>     timeRange: x=time '-' y=time -> ^(RangeBetween $x $y)
>     | x=time '-' -> ^(RangeLower $x)
>     | x=time '+' -> ^(RangeGreater $x);
>     numericRange: x=number '-' y=number -> ^(RangeBetween $x $y)
>     | x=number '-' -> ^(RangeLower $x)
>     | x=number '+' -> ^(RangeGreater $x);
>     period: (isoDateTime | isoDate | isoTime) ->
>     ConstantValue[$period.text];
>     dateTime: (isoDateTime | Now) -> ConstantValue[$dateTime.text];
>     date: (isoDate | Today) -> ConstantValue[$date.text];
>     time: isoTime -> ConstantValue[$time.text];
>     number: integer | nonInteger;
>     integer: D+ -> ConstantValue[$integer.text];
>     nonInteger: (negativeNumber | positiveDecimal) ->
>     ConstantValue[$nonInteger.text];
>     negativeNumber: '-' D+ ('.' D+)?;
>     positiveDecimal: D+ '.' D+;
>
>     timeZoneOffset: ('+' | '-') D D ':' D D;
>     isoDate: D D D D '-' D D '-' D D;
>     isoTime: D D ':' D D ':' D D ('.' D (D D?)?)?;
>     isoDateTime: isoDate 'T' isoTime ('Z' | timeZoneOffset)?;
>
>     alphanumeric: alphanumericToken+ -> ConstantValue[$alphanumeric.text];
>     alphanumericToken: (D | A | BoolNot | BoolAnd | BoolOr | Now |
>     Today | 'T' | 'Z');
>
>     BoolNot: 'NOT';
>     BoolAnd: 'AND';
>     BoolOr: 'OR';
>     Now: 'NOW';
>     Today: 'TODAY';
>     StringLiteral: ('"' (~('\u0000'..'\u001F' | '\u007F' | '"') |
>     ('""'))* '"');
>     A: (('A'..'Z') | ('a'..'z'));
>     D: ('0'..'9');
>     Whitespace: (' ' | '\t' | '\n') { $channel = HIDDEN; };
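
The syntactic-predicate workaround suggested in this thread amounts to checking that the whole keyword literal is ahead before committing to it, and otherwise falling back to a shorter token. The Python sketch below models that idea only; it is not ANTLR's implementation, and the function name and keyword table are assumptions made for illustration.

```python
# Toy model of "check the full literal before committing": a keyword is
# chosen only when the entire literal is present at the current position;
# otherwise the lexer falls back to a single-character token.

# Keyword literals taken from the grammar's lexer rules.
KEYWORDS = ("TODAY", "NOT", "AND", "NOW", "OR")


def lex_with_full_lookahead(text):
    """Tokenize text without ever discarding a character."""
    tokens = []
    pos = 0
    while pos < len(text):
        # Full-literal lookahead plays the role of the predicate.
        keyword = next((kw for kw in KEYWORDS if text.startswith(kw, pos)), None)
        token = keyword if keyword is not None else text[pos]
        tokens.append(token)
        pos += len(token)
    return tokens


print(lex_with_full_lookahead("(f=I?TORO)"))
# ['(', 'f', '=', 'I', '?', 'T', 'OR', 'O', ')']
```

In this model "TORO" becomes the tokens T, OR, O, all of which the alphanumericToken parser rule accepts, so no input is lost and the parser can reassemble the value.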
>
>