[antlr-interest] DMQL Grammar - ANTLR Eats Characters
Mihai Danila
viridium at gmail.com
Fri Mar 20 14:28:06 PDT 2009
I forgot to CC the list on this one.
On Fri, Mar 20, 2009 at 11:22 AM, Mihai Danila <viridium at gmail.com> wrote:
>
> Cool; thanks for the links. Your way of fixing it is more elegant than
> enumerating all the token prefixes in the alphanumericToken rule. Maybe
> I'll demote alphanumeric and alphanumericToken back to lexer rules. This
> will create problems with specifying ISO dates, problems which can be fixed
> with constructs like the one you gave me two mails ago. Since I can't escape
> these constructs, I might as well have the lexer build the bigger tokens (I
> suspect that will be more efficient than having the parser assemble
> alphanumerics).
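>
> (For the record, the lexer-level rule I have in mind would look roughly like
> this; the name Alphanumeric and the shape of the rule are just a sketch, not
> the final version:
>
> Alphanumeric : (('A'..'Z') | ('a'..'z') | ('0'..'9'))+ ;
>
> The catch is that digits would then be swallowed into Alphanumeric, so ISO
> dates like 2009-03-20 could no longer be built out of D tokens, which is
> where the predicate constructs come in.)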
>
> I'm looking forward to the day when the generator UI will be able to take
> on some of these tasks and warn about problematic productions, namely
> productions, such as token prefixes, that would cause the lexer to err out.
>
> Also, I find it dangerous that the auto-recovery feature is enabled by
> default. With my grammar, this meant that my software produced valid
> queries that were inconsistent with the semantics of the query string being
> parsed. Of course, all is solved in time as I learn more and more about
> ANTLR, but I'd rather it erred on the safe side. In fact, I would think most
> use cases out there prefer fail-fast parsing. You certainly don't see any
> compilers out there ignoring a few characters and still trying to compile.
>
> Again, thanks; this has cleared it up.
>
>
> Mihai
>
> On Fri, Mar 20, 2009 at 9:23 AM, Indhu Bharathi <indhu.b at s7software.com> wrote:
>
>> Unfortunately, ANTLR's lexer implementation doesn't backtrack (perhaps for
>> efficiency reasons). Backtracking and taking an alternative can be done
>> only in the parser. This was already discussed on the list. See -
>> http://www.antlr.org/pipermail/antlr-interest/2009-February/032981.html.
>>
>> - Indhu
>>
>> Mihai Danila wrote:
>>
>>>
>>> Thanks Indhu,
>>>
>>> In the link you sent, you troubleshoot a slightly different issue, but the
>>> post did help.
>>>
>>> In my scenario, the lexer chooses a rule based on a prefix and fails to
>>> fall back and try a collection of shorter tokens. The lexer doesn't get as
>>> far as TOR before deciding, simply because by the time TO has been read
>>> there is no alternative to TO in lexer scope (though there would be if the
>>> lexer weren't greedy, as per my note below). Your point about the
>>> longest-possible-token policy has cleared it up for me. The only
>>> alternative to TODAY by the time TO has been read is to build an
>>> alphanumeric out of alphanumericTokens, and of course that is a parser
>>> rule and is therefore outside the lexer's horizon. This must be the
>>> problem.
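>>>
>>> (A minimal version of the effect, as I understand it: given just the
>>> lexer rules
>>>
>>> Today : 'TODAY' ;
>>> A : 'A'..'Z' ;
>>>
>>> the input TORO makes the lexer commit to Today once it has read TO; when
>>> R arrives and DAY cannot follow, it doesn't fall back to emitting T and O
>>> as two A tokens, and characters get dropped.)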
>>>
>>> A question still remains. If the lexer cannot create a valid token
>>> without dropping characters, shouldn't it fall back and try to produce
>>> smaller tokens (which my grammar allows for, the smaller tokens being D
>>> and A) to give the parser a chance? Apparently, the lexer is prematurely
>>> moving into an error state without noticing that a different token
>>> arrangement would keep it in the green.
>>>
>>>
>>> Mihai
>>>
>>> On Tue, Mar 10, 2009 at 3:48 AM, Indhu Bharathi <indhu.b at s7software.com> wrote:
>>>
>>> Try this:
>>>
Today : ( (Today_) => 'Today' ) ;
fragment Today_ : 'Today' ;
>>>
>>> However, I'm not sure if this is the most elegant way to fix it.
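>>>
>>> (I'd expect the same pattern to carry over to the other keyword tokens if
>>> they show the same symptom, e.g.:
>>>
>>> Now : ( (Now_) => 'NOW' ) ;
>>> fragment Now_ : 'NOW' ;
>>> )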
>>>
>>> Read the following thread to understand more on why exactly this
>>> happens:
>>>
>>> http://www.antlr.org/pipermail/antlr-interest/2009-February/032959.html
>>>
>>> - Indhu
>>>
>>>
>>> ----- Original Message -----
>>> From: Mihai Danila <viridium at gmail.com>
>>> To: antlr-interest at antlr.org
>>> Sent: Tuesday, March 10, 2009 6:30:43 AM GMT+0530 Asia/Calcutta
>>> Subject: [antlr-interest] DMQL Grammar - ANTLR Eats Characters
>>>
>>>
>>> Hi,
>>>
>>> I thought I had my DMQL grammar nailed after several months of no
>>> issues, until recently a query failed. I've already massaged the
>>> grammar in a few ways so I'm a bit at a loss as to what the
>>> problem is this time. Do I have to enumerate all the possible
>>> token prefixes (including TO, TOD, TODA, N, NO, A, AN, O) in the
>>> alphanumericToken rule to fix this one? Am I missing something?
>>>
>>> Here's the query:
>>> (f=I?TORO)
>>>
>>> If I debug this, here's what ANTLR parses:
>>> (f=I?O)
>>>
>>> Here's the grammar:
>>> grammar Dmql;
>>>
>>> options {
>>> output=AST;
>>> }
>>>
>>> tokens {
>>> Or; And; Not;
>>> FieldCriteria;
>>> LookupAnd; LookupNot; LookupOr; LookupAny;
>>> StringList; StringEquals; StringStartsWith;
>>> StringContains; StringChar; EmptyString;
>>> RangeList; RangeBetween; RangeGreater; RangeLower;
>>> ConstantValue;
>>> }
>>>
>>> @header { package com.stratusdata.dmql.parser.antlr; }
>>> @lexer::header { package com.stratusdata.dmql.parser.antlr; }
>>>
>>> @rulecatch {
>>> catch (RecognitionException re) {
>>> throw re;
>>> }
>>> }
>>>
>>> dmql: searchCondition;
>>> searchCondition: queryClause (('|' | BoolOr) queryClause)* -> ^(Or queryClause+);
>>> queryClause: booleanElement ((',' | BoolAnd) booleanElement)* -> ^(And booleanElement+);
>>> booleanElement: queryElement | ('~' | BoolNot) queryElement -> ^(Not queryElement);
>>> queryElement: '('! (fieldCriteria | searchCondition) ')'!;
>>>
>>> fieldCriteria: field '=' fieldValue -> ^(FieldCriteria field fieldValue);
>>> field: ('_' | alphanumericToken)+ -> ConstantValue[$field.text];
>>> fieldValue: lookupList | stringList | rangeList | nonInteger | period | stringLiteral | empty;
>>> stringLiteral: StringLiteral;
>>> empty: '.EMPTY.' -> EmptyString;
>>>
>>> lookupList: lookupOr | lookupAnd | lookupNot | lookupAny;
>>> lookupOr: '|' lookup (',' lookup)* -> ^(LookupOr lookup+);
>>> lookupAnd: '+' lookup (',' lookup)* -> ^(LookupAnd lookup+);
>>> lookupNot: '~' lookup (',' lookup)* -> ^(LookupNot lookup+);
>>> lookupAny: '.ANY.' -> LookupAny;
>>> lookup: alphanumeric | stringLiteral;
>>>
>>> stringList: string (',' string)* -> ^(StringList string+);
>>> string: stringEq | stringStart | stringContains | stringChar;
>>> stringEq: alphanumeric -> ^(StringEquals alphanumeric);
>>> stringStart: alphanumeric '*' -> ^(StringStartsWith alphanumeric);
>>> stringContains: '*' alphanumeric '*' -> ^(StringContains alphanumeric);
>>> stringChar: alphanumeric? ('?' alphanumeric?)+ -> ^(StringChar ConstantValue[$stringChar.text]);
>>>
>>> rangeList: dateTimeRangeList | dateRangeList | timeRangeList | numericRangeList;
>>> dateTimeRangeList: dateTimeRange (',' dateTimeRange)* -> ^(RangeList dateTimeRange+);
>>> dateRangeList: dateRange (',' dateRange)* -> ^(RangeList dateRange+);
>>> timeRangeList: timeRange (',' timeRange)* -> ^(RangeList timeRange+);
>>> numericRangeList: numericRange (',' numericRange)* -> ^(RangeList numericRange+);
>>> dateTimeRange: x=dateTime '-' y=dateTime -> ^(RangeBetween $x $y)
>>>     | x=dateTime '-' -> ^(RangeLower $x)
>>>     | x=dateTime '+' -> ^(RangeGreater $x);
>>> dateRange: x=date '-' y=date -> ^(RangeBetween $x $y)
>>>     | x=date '-' -> ^(RangeLower $x)
>>>     | x=date '+' -> ^(RangeGreater $x);
>>> timeRange: x=time '-' y=time -> ^(RangeBetween $x $y)
>>>     | x=time '-' -> ^(RangeLower $x)
>>>     | x=time '+' -> ^(RangeGreater $x);
>>> numericRange: x=number '-' y=number -> ^(RangeBetween $x $y)
>>>     | x=number '-' -> ^(RangeLower $x)
>>>     | x=number '+' -> ^(RangeGreater $x);
>>> period: (isoDateTime | isoDate | isoTime) -> ConstantValue[$period.text];
>>> dateTime: (isoDateTime | Now) -> ConstantValue[$dateTime.text];
>>> date: (isoDate | Today) -> ConstantValue[$date.text];
>>> time: isoTime -> ConstantValue[$time.text];
>>> number: integer | nonInteger;
>>> integer: D+ -> ConstantValue[$integer.text];
>>> nonInteger: (negativeNumber | positiveDecimal) -> ConstantValue[$nonInteger.text];
>>> negativeNumber: '-' D+ ('.' D+)?;
>>> positiveDecimal: D+ '.' D+;
>>>
>>> timeZoneOffset: ('+' | '-') D D ':' D D;
>>> isoDate: D D D D '-' D D '-' D D;
>>> isoTime: D D ':' D D ':' D D ('.' D (D D?)?)?;
>>> isoDateTime: isoDate 'T' isoTime ('Z' | timeZoneOffset)?;
>>>
>>> alphanumeric: alphanumericToken+ -> ConstantValue[$alphanumeric.text];
>>> alphanumericToken: (D | A | BoolNot | BoolAnd | BoolOr | Now | Today | 'T' | 'Z');
>>>
>>> BoolNot: 'NOT';
>>> BoolAnd: 'AND';
>>> BoolOr: 'OR';
>>> Now: 'NOW';
>>> Today: 'TODAY';
>>> StringLiteral: ('"' (~('\u0000'..'\u001F' | '\u007F' | '"') | ('""'))* '"');
>>> A: (('A'..'Z') | ('a'..'z'));
>>> D: ('0'..'9');
>>> Whitespace: (' ' | '\t' | '\n') { $channel = HIDDEN; };
>>>
>>>
>>>
>>
>
More information about the antlr-interest
mailing list