[antlr-interest] Lexer: strings that are starting sub-strings of another

Jim Idle jimi at temporal-wave.com
Sat Jul 21 08:22:18 PDT 2012


This language sounds too verbose to me, and having tokens that span
whitespace is going to bite you later. What about tabs, more than one
space, line breaks and so on? You are better off tokenizing the individual
words and constructing the sentences in the parser.
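
A minimal sketch of that word-level approach, assuming ANTLR 3 with the
default Java-style action syntax (the C# targets may spell the hidden-channel
action differently). The grammar name, rule names and token names here
(Condition, condition, relOp, operand, IS, GREATER, ...) are illustrative
and not taken from the original grammar:

```
grammar Condition;

// Parser rules: phrases are assembled from single-word tokens, so any
// mix of spaces, tabs or newlines between the words is accepted.
condition
    : IF LPAREN operand relOp operand RPAREN
    ;

// 'is greater than', optionally followed by 'or equal to'.
relOp
    : IS GREATER THAN (OR EQUAL TO)?
    ;

operand
    : ID
    ;

// Lexer rules: one token per word, keywords listed before ID.
IF      : 'if' ;
IS      : 'is' ;
GREATER : 'greater' ;
THAN    : 'than' ;
OR      : 'or' ;
EQUAL   : 'equal' ;
TO      : 'to' ;
LPAREN  : '(' ;
RPAREN  : ')' ;

ID      : ('a'..'z' | 'A'..'Z' | '_')
          ('a'..'z' | 'A'..'Z' | '0'..'9' | '_')*
        ;

// Whitespace never reaches the parser (adjust the action for your target).
WS      : (' ' | '\t' | '\r' | '\n')+ { $channel = HIDDEN; } ;
```

One trade-off of this design is that words such as 'is', 'or' and 'to'
become reserved and can no longer be used as identifiers.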

However, if you do want to keep the multi-word tokens, you can do this:

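// Empty fragment rule: declares the OP_GE token type without matching input.
fragment OP_GE : ;

// Match the common prefix once, then switch the token type to OP_GE
// if the longer phrase follows.
GT    : 'is greater than'
         (   ' or equal to' { $type = OP_GE; }
           |
         )
      ;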


Again though, I think that you may want to step back and consider whether
such verbose expression syntax is really a benefit or not.

Jim


> -----Original Message-----
> From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-
> bounces at antlr.org] On Behalf Of Krishnan Subramanian
> Sent: Saturday, July 21, 2012 3:31 AM
> To: antlr-interest at antlr.org
> Subject: [antlr-interest] Lexer: strings that are starting sub-strings
> of another
>
> Hi all,
>
> I've been exploring ANTLR for creating a custom DSL for a scripting
> language with the intention being to generate a parser and lexer in C#.
>
> I've started by writing a lexer grammar and a parser
> grammar. This mostly works fine.
>
> However, I've run into a lexer case where my language can contain words
> that are [starting] sub-strings of another and should be treated
> differently.
>
> For example, the script is English-like, so I can have:
>
>     if (someVar is greater than anotherVar)
>     // someVar > anotherVar, where GT is defined as 'is greater than'
>
>     if (someVar is greater than or equal to anotherVar)
>     // someVar >= anotherVar, where OP_GE is defined as
>     // 'is greater than or equal to'
>
> In my lexer grammar, I have two definitions:
>
> GT          :               'is greater than';
> OP_GE  :               'is greater than or equal to';
>
> The generated (C#) lexer barfs at runtime with a NoViableAltException
> and then mangles GT when it encounters it in a test case, truncating a
> few characters and erroneously reporting it as an identifier. This
> obviously works with GT being defined as '>' and OP_GE being
> defined as '>='.
>
> Questions:
> =========
>
> I'm not that familiar with ANTLR yet, and I suspect this might have
> something to do with lookaheads (1 or 2), but I don't know what to do.
> Relative ordering within the lexer grammar has no effect.
>
> I've tried using syntactic predicates; but that did not change anything
> with respect to runtime behavior. I probably did something wrong in
> terms of specifying it for a lexer grammar.
>
> And I don't know if this is a general ANTLR issue or a generated C#
> thing, but maybe someone has pointers? Specifying a custom lookahead?
> Could be a solution if it works, but seems fragile. Or is there some
> solution I'm missing?
>
> Thanks,
>
> -krish

