[antlr-interest] How can I selectively avoid tokenization?

Wed Jul 1 04:47:02 PDT 2009

Hi Jim,

I have tried to use gated semantic predicates but I still can't find a
solution.

First, I added a flag 'in_expression' to the lexer and set it to true on '$<'
and to false on '>'.

Then, I use a predicate for the string token as in:

STRING :  {in_expression}?=> QUOTE ~QUOTE* QUOTE ;
fragment QUOTE : '"' ;

Next, I need to make sure that anything present in the text to be ignored
can be tokenized.
How do I do that? If I add a token like:

OTHER : . ;

it matches everything, including the letters of an identifier! (and I get no
ambiguity warning from ANTLR?)
If I add a predicate as in:

OTHER :            {!in_expression}?=> . ;

things get worse, because now I get 'No viable alternative' errors on any
identifier in an expression.

Here is a simplified grammar that I am using to experiment:

@lexer::members {
    private boolean in_expression = false;
}

top : token* EOF;

token
  : BEGIN_EXPRESSION   { System.err.println("BEGIN_EXPRESSION"); }
  | ID                 { System.err.println("ID: " + $ID.text); }
  | STRING             { System.err.println("STRING: " + $STRING.text); }
  | END_EXPRESSION     { System.err.println("END_EXPRESSION"); }
  | OTHER              { System.err.println("OTHER: " + $OTHER.text); }
  ;

BEGIN_EXPRESSION : '$<' { in_expression = true; };
ID :               ('A'..'Z' | 'a'..'z')+ ;
STRING :           {in_expression}?=> '"' ~'"'* '"';
END_EXPRESSION :   '>'   { in_expression = false; };
OTHER :            {!in_expression}?=> . ;
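
The behaviour the grammar above is aiming for can be sketched as a hand-written two-mode scanner in plain Java. This is not ANTLR-generated code and not a fix for the grammar; it only illustrates the intended token stream. The class name ExpressionScanner, the "TYPE:text" token format, and the choice to silently skip unrecognized characters inside an expression are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hand-written sketch of a scanner that only tokenizes inside $<...>. */
final class ExpressionScanner {

    /** Returns tokens as "TYPE:text" strings (a hypothetical format for this sketch). */
    static List<String> scan(String input) {
        List<String> tokens = new ArrayList<>();
        boolean inExpression = false; // plays the role of the in_expression flag
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (!inExpression) {
                if (input.startsWith("$<", i)) {
                    tokens.add("BEGIN_EXPRESSION:$<");
                    inExpression = true;
                    i += 2;
                } else {
                    // Outside an expression every character is plain text,
                    // including quotes and letters.
                    tokens.add("OTHER:" + c);
                    i++;
                }
            } else if (c == '>') {
                tokens.add("END_EXPRESSION:>");
                inExpression = false;
                i++;
            } else if (c == '"') {
                // A quoted string runs to the next quote, so it may contain '>'.
                int close = input.indexOf('"', i + 1);
                if (close < 0) {
                    throw new IllegalArgumentException("unterminated string at " + i);
                }
                tokens.add("STRING:" + input.substring(i, close + 1));
                i = close + 1;
            } else if (isLetter(c)) {
                int start = i;
                while (i < input.length() && isLetter(input.charAt(i))) {
                    i++;
                }
                tokens.add("ID:" + input.substring(start, i));
            } else {
                // Assumption: anything else inside an expression (e.g. spaces) is skipped.
                i++;
            }
        }
        return tokens;
    }

    private static boolean isLetter(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
    }
}
```

The point of the sketch is that the decision "is this a string, an identifier, or plain text?" depends on the current mode before any token rule is tried, which is what the gated predicates in the grammar are meant to express.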

On Fri, Jun 26, 2009 at 2:23 PM, Jim Idle <jimi at temporal-wave.com> wrote:

> Gated semantic predicates on the string literals.
>
> Jim
>
>
>
> On Jun 26, 2009, at 2:18 AM, Johan Cockx <johan at sikanda.be> wrote:
>
> Hi Jim,
>
> Thanks for the suggestion.
>
> However, it doesn't solve the problem.
>
> One token in my expressions is a quoted string. The text to be ignored may
> also contain quotes, but these should be ignored, so that '$<' after an
> ignored quote is recognized.
>
> If I let the lexer tokenize the text to be ignored and then just ignore
> these tokens, it will recognize the quote as the start of a quoted string
> token and happily add the '$<' to the quoted string, instead of recognizing
> it as a token.
>
> I think I need to completely avoid tokenizing the text to be ignored, but
> I don't know how to do this.
>
> All suggestions are very welcome.
>
> Regards, Johan
>
> On Thu, Jun 25, 2009 at 6:05 PM, Jim Idle <jimi at temporal-wave.com> wrote:
>
>> Create a Boolean member variable and default it to off. Turn it on when you
>> hit $< and off when you get >
>>
>> In your tokens call skip() if the variable is false. Make sure you cater
>> for locations where $< is not significant.
>>
>> Jim
>>
>>
>> On Jun 25, 2009, at 2:19 AM, Johan Cockx <johan at sikanda.be> wrote:
>>
>>> Hi,
>>>
>>> In my current project, I need to recognize and parse expressions marked
>>> by $<...> (where '...' represents the omitted expression) in an arbitrary
>>> text (not containing $<).
>>>
>>> How can I tell the (ANTLR-generated) lexer to ignore text outside the
>>> $<...> marks?
>>>
>>> I have tried to recognize each chunk of text between the marks as a
>>> single (large) token, but that causes all sorts of problems and doesn't
>>> seem to be the right way to go.
>>>
>>> Can anyone point me in a better direction?
>>>
>>> Thanks,
>>> Johan
>>>
>>> List: http://www.antlr.org/mailman/listinfo/antlr-interest
>>> Unsubscribe:
>>> http://www.antlr.org/mailman/options/antlr-interest/your-email-address
>>>
>>
>