[antlr-interest] Lexer: strings that are starting sub-strings of another

Benjamin S Wolf jokeserver at gmail.com
Sat Jul 21 19:29:03 PDT 2012


Lexers aren't that great at distinguishing tokens using multiple
lookahead, but my current grammar gets around similar issues by
splitting tokens at word boundaries, and using low level parser rules
as "macro" tokens. E.g., if I wanted "aaa bbb" to be a token and "aaa"
were a token, then I would add a token "bbb" and have a parser rule
"aaa_bbb : AAA BBB". Of course, this means that the whitespace (or
whatever you put on channel HIDDEN) between aaa and bbb is irrelevant.
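
A rough sketch of that "macro" token idea in ANTLR 3 syntax might look
like the following. The grammar name MacroTokens and the WS rule are
hypothetical, added only to make the fragment complete, and the action
that sets the channel uses the Java-target spelling, which may need to
be adjusted for other targets such as C#.

```antlr
grammar MacroTokens;   // hypothetical grammar name

// Parser rule acting as a "macro" token for the phrase "aaa bbb".
aaa_bbb : AAA BBB ;

// Each word is its own lexer token.
AAA : 'aaa' ;
BBB : 'bbb' ;

// Whitespace goes to the hidden channel, so the parser never sees it
// between AAA and BBB. Java-target spelling; other targets may differ.
WS  : (' ' | '\t' | '\r' | '\n')+ { $channel = HIDDEN; } ;
```

Because WS is hidden, "aaa bbb", "aaa   bbb", and "aaa" followed by a
newline and "bbb" would all match aaa_bbb in this sketch.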

In your particular case, I'd recommend something slightly different:
split "is", "greater than", "or", and "equal to" into separate tokens,
and encode the ordering of said tokens for comparisons like "is
greater than" and "is equal to" and "is greater than or equal to" as
parser rules.
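
One way this could be sketched in ANTLR 3 syntax is shown below. It goes
a step further than the four tokens named above and splits every word
into its own token, which is an assumption made here to avoid embedding
a literal space inside a token. The grammar name, the rule names
condition and comparison_op, and the ID and WS rules are all
hypothetical, and the channel action uses the Java-target spelling.

```antlr
grammar Comparison;   // hypothetical grammar name

// Hypothetical top-level rule: operand, comparison phrase, operand.
condition : ID comparison_op ID ;

// The shared prefix "is greater than" is left-factored, so the parser
// only has to check for OR to tell ">" apart from ">=".
comparison_op
    : IS GREATER THAN (OR EQUAL TO)?   // "is greater than [or equal to]"
    | IS EQUAL TO                      // "is equal to"
    ;

// Keyword tokens are listed before ID so they take precedence over it.
IS      : 'is' ;
GREATER : 'greater' ;
THAN    : 'than' ;
OR      : 'or' ;
EQUAL   : 'equal' ;
TO      : 'to' ;

ID      : ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '0'..'9' | '_')* ;

// Hidden whitespace; Java-target spelling, other targets may differ.
WS      : (' ' | '\t' | '\r' | '\n')+ { $channel = HIDDEN; } ;
```

With this layout the lexer never has to choose between two long literals
that share a prefix; the decision moves into the parser, which looks at
whole tokens rather than characters.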

On Sat, Jul 21, 2012 at 3:31 AM, Krishnan Subramanian
<krishsub at microsoft.com> wrote:
> Hi all,
>
> I've been exploring ANTLR for creating a custom DSL for a scripting language, with the intention of generating a parser and lexer in C#.
>
> I've started by writing a lexer grammar and a parser grammar. This mostly works fine.
>
> However, I've run into a lexer case where my language can contain words that are [starting] sub-strings of another and should be treated differently.
>
> E.g., the script is ~English, where I can have:
>
>     if (someVar is greater than anotherVar)              // someVar > anotherVar, where GT is defined as 'is greater than'
>     if (someVar is greater than or equal to anotherVar)  // someVar >= anotherVar, where OP_GE is defined as 'is greater than or equal to'
>
> In my lexer grammar, I have two definitions:
>
> GT    : 'is greater than';
> OP_GE : 'is greater than or equal to';
>
> The generated (C#) lexer barfs at runtime with a NoViableAltException and then mangles GT when it encounters it in a test case, truncating a few characters and erroneously reporting it as an identifier. This obviously works with GT defined as '>' and OP_GE defined as '>='.
>
> Questions:
> =========
>
> I'm not that familiar with ANTLR yet, and I suspect this might have something to do with lookaheads (1 or 2), but I don't know what to do. Relative ordering within the lexer grammar has no effect.
>
> I've tried using syntactic predicates; but that did not change anything with respect to runtime behavior. I probably did something wrong in terms of specifying it for a lexer grammar.
>
> And I don't know if this is a general ANTLR issue or a generated C# thing, but maybe someone has pointers? Specifying a custom lookahead? Could be a solution if it works, but seems fragile. Or is there some solution I'm missing?
>
> Thanks,
>
> -krish
