[antlr-interest] DMQL Grammar - ANTLR Eats Characters

Mihai Danila viridium at gmail.com
Fri Mar 20 14:28:06 PDT 2009


I forgot to CC the list on this one.

On Fri, Mar 20, 2009 at 11:22 AM, Mihai Danila <viridium at gmail.com> wrote:

>
> Cool; thanks for the links. Your way of fixing it is more elegant than
> enumerating all the token prefixes in the alphanumericToken rule. Maybe
> I'll demote alphanumeric and alphanumericToken back to lexer rules. That
> will create problems with specifying ISO dates, problems which can be
> fixed with constructs like the one you gave me two mails ago. Since I
> can't escape these constructs, I might as well have the lexer build the
> bigger tokens (I suspect it'll be more efficient than having the parser
> assemble alphanumerics).
>
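> For concreteness, a rough sketch of the demoted rule (hypothetical and
> untested; the keyword rules and an ISO-date lexer rule would have to
> take precedence over it, which is where the predicate constructs you
> showed me come in):
>
>     Alphanumeric: (('A'..'Z') | ('a'..'z') | ('0'..'9'))+ ;
>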
> I'm looking forward to the day when the generator UI will be able to
> take on some of these tasks and warn about problematic productions,
> such as token prefixes that would cause the lexer to err out.
>
> Also, I find it dangerous that the auto-recovery feature is enabled by
> default. With my grammar, this meant that my software was able to
> produce valid queries inconsistent with the semantics of the query
> string being parsed. Of course, all is solved in time as I learn more
> and more about ANTLR, but I'd rather it erred on the safe side. In
> fact, I would think most use cases out there prefer fail-fast parsing.
> You certainly don't see any compilers out there ignoring a few
> characters and still trying to compile.
>
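> For what it's worth, the usual ANTLR 3 workaround seems to be to
> override the recovery hooks so the recognizer throws instead of
> resynchronizing. A sketch along these lines (version-dependent and
> untested on my side), combined with a @rulecatch that rethrows like
> the one already in my grammar below:
>
>     @parser::members {
>         // throw instead of attempting single-token recovery
>         protected void mismatch(IntStream input, int ttype, BitSet follow)
>             throws RecognitionException {
>             throw new MismatchedTokenException(ttype, input);
>         }
>         public Object recoverFromMismatchedSet(IntStream input,
>                 RecognitionException e, BitSet follow)
>             throws RecognitionException {
>             throw e;
>         }
>     }
>     @lexer::members {
>         // fail fast instead of silently dropping characters
>         public void reportError(RecognitionException e) {
>             throw new RuntimeException(e);
>         }
>     }
>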
> Again, thanks; this has cleared it up.
>
>
> Mihai
>
> On Fri, Mar 20, 2009 at 9:23 AM, Indhu Bharathi <indhu.b at s7software.com> wrote:
>
>> Unfortunately, ANTLR's implementation of the lexer doesn't backtrack
>> (presumably for efficiency reasons). Backtracking and taking an
>> alternative can be done only in the parser. This was already discussed
>> on the list. See
>> http://www.antlr.org/pipermail/antlr-interest/2009-February/032981.html
>>
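>> To illustrate with the two rules from your grammar:
>>
>>     Today: 'TODAY';
>>     A: (('A'..'Z') | ('a'..'z'));
>>
>> once the lexer has consumed 'TO' it is committed to Today; it will
>> not rewind and re-lex those same characters as two A tokens.
>>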
>> - Indhu
>>
>> Mihai Danila wrote:
>>
>>>
>>> Thanks Indhu,
>>>
>>> In the link you sent, you troubleshoot a slightly different problem,
>>> but the post did help.
>>>
>>> In my scenario, the lexer chooses a rule based on a prefix and fails
>>> to fall back and try a sequence of shorter tokens. The lexer doesn't
>>> get as far as TOR before deciding, simply because by the time TO has
>>> been read there is no alternative to TO in lexer scope (though there
>>> would be if it weren't greedy, as per my note below). Your point about
>>> the longest-possible-token policy has cleared it up for me. The only
>>> alternative to TODAY by the time TO has been read is to build an
>>> alphanumeric out of alphanumericTokens, and of course that is a parser
>>> rule and therefore outside of the lexer's horizon. This must be the
>>> problem.
>>>
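>>> To spell out what I believe happens on the failing input (my
>>> reconstruction, assuming the default one-character recovery):
>>>
>>>    input: ( f = I ? T O R O )
>>>    1. after consuming TO, the only live lexer rule is Today ('TODAY')
>>>    2. R mismatches the expected D, raising a RecognitionException
>>>    3. default recovery drops characters and resumes, so TOR is lost
>>>    4. lexing restarts at the final O, which matches A
>>>    => the parser only ever sees (f=I?O)
>>>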
>>> A question still remains. If the lexer cannot create a valid token
>>> without dropping characters, shouldn't it fall back and try to
>>> produce smaller tokens (which my grammar allows for, the smaller
>>> tokens being D and A) to give the parser a chance? Apparently, the
>>> lexer moves prematurely into an error state without noticing that a
>>> different token arrangement would keep it in the green.
>>>
>>>
>>> Mihai
>>>
>>> On Tue, Mar 10, 2009 at 3:48 AM, Indhu Bharathi
>>> <indhu.b at s7software.com> wrote:
>>>
>>>    Try this:
>>>
>>>    Today: ( (Today_) => 'TODAY' ) ;
>>>    fragment Today_
>>>        :    'TODAY'
>>>        ;
>>>
>>>    However, I'm not sure if this is the most elegant way to fix it.
>>>
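>>>    Presumably the same pattern would be needed for the other keyword
>>>    tokens that share a prefix with A, for example (untested):
>>>
>>>    Now: ( (Now_) => 'NOW' ) ;
>>>    fragment Now_ : 'NOW' ;
>>>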
>>>    Read the following thread to understand more on why exactly this
>>>    happens:
>>>
>>> http://www.antlr.org/pipermail/antlr-interest/2009-February/032959.html
>>>
>>>    - Indhu
>>>
>>>
>>>    ----- Original Message -----
>>>    From: Mihai Danila <viridium at gmail.com>
>>>    To: antlr-interest at antlr.org
>>>    Sent: Tuesday, March 10, 2009 6:30:43 AM GMT+0530 Asia/Calcutta
>>>    Subject: [antlr-interest] DMQL Grammar - ANTLR Eats Characters
>>>
>>>
>>>    Hi,
>>>
>>>    I thought I had my DMQL grammar nailed after several months of no
>>>    issues, until a query recently failed. I've already massaged the
>>>    grammar in a few ways, so I'm a bit at a loss as to what the
>>>    problem is this time. Do I have to enumerate all the possible
>>>    token prefixes (including TO, TOD, TODA, N, NO, A, AN, O) in the
>>>    alphanumericToken rule to fix this one? Am I missing something?
>>>
>>>    Here's the query:
>>>    (f=I?TORO)
>>>
>>>    If I debug this, here's what ANTLR parses:
>>>    (f=I?O)
>>>
>>>    Here's the grammar:
>>>    grammar Dmql;
>>>
>>>    options {
>>>    output=AST;
>>>    }
>>>
>>>    tokens {
>>>    Or; And; Not;
>>>    FieldCriteria;
>>>    LookupAnd; LookupNot; LookupOr; LookupAny;
>>>    StringList; StringEquals; StringStartsWith;
>>>    StringContains; StringChar; EmptyString;
>>>    RangeList; RangeBetween; RangeGreater; RangeLower;
>>>    ConstantValue;
>>>    }
>>>
>>>    @header { package com.stratusdata.dmql.parser.antlr; }
>>>    @lexer::header { package com.stratusdata.dmql.parser.antlr; }
>>>
>>>    @rulecatch {
>>>      catch (RecognitionException re) {
>>>        throw re;
>>>      }
>>>    }
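>>>    // note: the rulecatch above applies to generated parser rules only;
>>>    // it does not prevent the lexer's own character-level recovery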
>>>
>>>    dmql: searchCondition;
>>>    searchCondition: queryClause (('|' | BoolOr) queryClause)* -> ^(Or queryClause+);
>>>    queryClause: booleanElement ((',' | BoolAnd) booleanElement)* -> ^(And booleanElement+);
>>>    booleanElement: queryElement | ('~' | BoolNot) queryElement -> ^(Not queryElement);
>>>    queryElement: '('! (fieldCriteria | searchCondition) ')'!;
>>>
>>>    fieldCriteria: field '=' fieldValue -> ^(FieldCriteria field fieldValue);
>>>    field: ('_' | alphanumericToken)+ -> ConstantValue[$field.text];
>>>    fieldValue: lookupList | stringList | rangeList | nonInteger | period | stringLiteral | empty;
>>>    stringLiteral: StringLiteral;
>>>    empty: '.EMPTY.' -> EmptyString;
>>>
>>>    lookupList: lookupOr | lookupAnd | lookupNot | lookupAny;
>>>    lookupOr: '|' lookup (',' lookup)* -> ^(LookupOr lookup+);
>>>    lookupAnd: '+' lookup (',' lookup)* -> ^(LookupAnd lookup+);
>>>    lookupNot: '~' lookup (',' lookup)* -> ^(LookupNot lookup+);
>>>    lookupAny: '.ANY.' -> LookupAny;
>>>    lookup: alphanumeric | stringLiteral;
>>>
>>>    stringList: string (',' string)* -> ^(StringList string+);
>>>    string: stringEq | stringStart | stringContains | stringChar;
>>>    stringEq: alphanumeric -> ^(StringEquals alphanumeric);
>>>    stringStart: alphanumeric '*'  -> ^(StringStartsWith alphanumeric);
>>>    stringContains: '*' alphanumeric '*' -> ^(StringContains alphanumeric);
>>>    stringChar: alphanumeric? ('?' alphanumeric?)+ -> ^(StringChar ConstantValue[$stringChar.text]);
>>>
>>>    rangeList: dateTimeRangeList | dateRangeList | timeRangeList | numericRangeList;
>>>    dateTimeRangeList: dateTimeRange (',' dateTimeRange)* -> ^(RangeList dateTimeRange+);
>>>    dateRangeList: dateRange (',' dateRange)* -> ^(RangeList dateRange+);
>>>    timeRangeList: timeRange (',' timeRange)* -> ^(RangeList timeRange+);
>>>    numericRangeList: numericRange (',' numericRange)* -> ^(RangeList numericRange+);
>>>    dateTimeRange: x=dateTime '-' y=dateTime -> ^(RangeBetween $x $y)
>>>    | x=dateTime '-' -> ^(RangeLower $x)
>>>    | x=dateTime '+' -> ^(RangeGreater $x);
>>>    dateRange: x=date '-' y=date -> ^(RangeBetween $x $y)
>>>    | x=date '-' -> ^(RangeLower $x)
>>>    | x=date '+' -> ^(RangeGreater $x);
>>>    timeRange: x=time '-' y=time -> ^(RangeBetween $x $y)
>>>    | x=time '-' -> ^(RangeLower $x)
>>>    | x=time '+' -> ^(RangeGreater $x);
>>>    numericRange: x=number '-' y=number -> ^(RangeBetween $x $y)
>>>    | x=number '-' -> ^(RangeLower $x)
>>>    | x=number '+' -> ^(RangeGreater $x);
>>>    period: (isoDateTime | isoDate | isoTime) -> ConstantValue[$period.text];
>>>    dateTime: (isoDateTime | Now) -> ConstantValue[$dateTime.text];
>>>    date: (isoDate | Today) -> ConstantValue[$date.text];
>>>    time: isoTime -> ConstantValue[$time.text];
>>>    number: integer | nonInteger;
>>>    integer: D+ -> ConstantValue[$integer.text];
>>>    nonInteger: (negativeNumber | positiveDecimal) -> ConstantValue[$nonInteger.text];
>>>    negativeNumber: '-' D+ ('.' D+)?;
>>>    positiveDecimal: D+ '.' D+;
>>>
>>>    timeZoneOffset: ('+' | '-') D D ':' D D;
>>>    isoDate: D D D D '-' D D '-' D D;
>>>    isoTime: D D ':' D D ':' D D ('.' D (D D?)?)?;
>>>    isoDateTime: isoDate 'T' isoTime ('Z' | timeZoneOffset)?;
>>>
>>>    alphanumeric: alphanumericToken+ -> ConstantValue[$alphanumeric.text];
>>>    alphanumericToken: (D | A | BoolNot | BoolAnd | BoolOr | Now | Today | 'T' | 'Z');
>>>
>>>    BoolNot: 'NOT';
>>>    BoolAnd: 'AND';
>>>    BoolOr: 'OR';
>>>    Now: 'NOW';
>>>    Today: 'TODAY';
>>>    StringLiteral: ('"' (~('\u0000'..'\u001F' | '\u007F' | '"') | ('""'))* '"');
>>>    A: (('A'..'Z') | ('a'..'z'));
>>>    D: ('0'..'9');
>>>    Whitespace: (' ' | '\t' | '\n') { $channel = HIDDEN; };
>>>
>>>
>>>
>>
>

