[antlr-interest] Forcing the lexer to never error

A Z asicaddress at gmail.com
Sat Jun 16 07:35:42 PDT 2012


Thanks for the response.

I tried this but it doesn't give the expected behavior. The lexer
still generates exceptions for certain character sequences.
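
To make the intended behavior concrete, here is a minimal hand-written
sketch in C of a tokenizer that never errors. It is not ANTLR output and
does not use the ANTLR C runtime; the names Tok, TokType and next_tok are
hypothetical and exist only for this illustration. It assumes a single
real token, '*::*', and turns every character that does not start a
complete '*::*' into a one-character INVALID token.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical token types for this sketch; not ANTLR-generated values. */
typedef enum { TOK_EOF, TOK_ASCOLCOLAS, TOK_INVALID } TokType;

typedef struct {
    TokType type;
    size_t  start;  /* offset of the first character of the token */
    size_t  len;    /* number of characters consumed */
} Tok;

/* Return the token starting at *pos and advance *pos past it.
 * The function has no error path: it checks for the whole '*::*'
 * sequence before committing to it, and otherwise falls back to a
 * one-character INVALID token. */
static Tok next_tok(const char *src, size_t n, size_t *pos)
{
    Tok t;
    t.start = *pos;

    if (*pos >= n) {                 /* end of input */
        t.type = TOK_EOF;
        t.len  = 0;
        return t;
    }

    if (n - *pos >= 4 && memcmp(src + *pos, "*::*", 4) == 0) {
        t.type = TOK_ASCOLCOLAS;     /* complete match */
        t.len  = 4;
    } else {
        t.type = TOK_INVALID;        /* anything else: one character */
        t.len  = 1;
    }

    *pos += t.len;
    return t;
}
```

With this sketch the input '*::' yields three INVALID tokens rather than
an error, because the '*::*' alternative is only chosen once all four
characters have been seen.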

On 6/16/12, Jim Idle <jimi at temporal-wave.com> wrote:
> You just want one rule as the last rule:
>
> INVALID : . ;
>
> Jim
>
> -----Original Message-----
> From: antlr-interest-bounces at antlr.org
> [mailto:antlr-interest-bounces at antlr.org] On Behalf Of A Z
> Sent: Saturday, June 16, 2012 6:14 AM
> To: antlr-interest at antlr.org
> Subject: [antlr-interest] Forcing the lexer to never error
>
> Hello all,
>
>   This is all using ANTLR 3.4 with the C target. I'm trying to modify my
> lexer grammar to never trigger a lexer error but instead emit a special
> token, INVALID. So far I've done this by adding all invalid sequences of
> characters to a special rule INVALID.
>
>
> ASCOLCOLAS                 : '*::*';
>
> INVALID :
>  ...
>  | '*:'
>  | '*::';
>
> This works but it gets tedious for certain complex lexer rules. For
> instance the rule for a line directive is as follows:
>
> DIR_LINE :
>   'line'  SLSpace+ DecDigits SLSpace+ StrChars SLSpace+ DecDigits SLSpace* '\n';
>
> To handle this I'd have to add a fairly complex alternative to the INVALID
> rule:
>
> INVALID:
>   ...
>  | 'line'
>   (
>     ~SLSpace
>   | SLSpace+
>    (
>      ~DecDigits
>    | DecDigits ...
>    )
>   )
>
> I also tried adding alternatives to the DIR_LINE rule instead.
> Unfortunately ANTLR sometimes fails to generate the code in this case,
> even after letting it run for several minutes. I also don't have a way to
> set the token type to INVALID. ANTLR places the token type assignment
> after any lexer rule actions, overriding my changes.
>
> DIR_LINE :
>   'line'
>   (
>     SLSpace+
>     (
>       DecDigits
>       (
>        ...
>       |
>       )
>     | ~DecDigits {LEXSTATE->type = INVALID;} // This gets ignored in the C code
>     )
>   | ~SLSpace {ctx->lineError();}
>   )
>
>
>
> My first question is, are there performance issues caused by adding the
> separate INVALID rule, as opposed to adding alternatives to existing rules?
> My understanding is yes, since lookahead is needed to determine whether
> REALNUM or INVALID should be entered, for instance.
>
> Secondly, is there a way to force the token type based on a rule action?
>
> Thanks
>
> List: http://www.antlr.org/mailman/listinfo/antlr-interest
> Unsubscribe:
> http://www.antlr.org/mailman/options/antlr-interest/your-email-address
>

