[antlr-interest] Lexer: strings that are starting sub-strings of another

Sat Jul 21 03:31:20 PDT 2012

Hi all,

I've been exploring ANTLR to build a custom DSL for a scripting language, with the intention of generating the parser and lexer in C#.

I've started by writing a lexer grammar and a parser grammar. This mostly works fine.

However, I've run into a lexer case where my language can contain phrases that are leading sub-strings (prefixes) of other phrases and should be tokenized differently.

For example, the script is English-like, so I can have:

    if (someVar is greater than anotherVar)              // someVar > anotherVar, where GT is defined as 'is greater than'
    if (someVar is greater than or equal to anotherVar)  // someVar >= anotherVar, where OP_GE is defined as 'is greater than or equal to'

In my lexer grammar, I have two definitions:

GT    : 'is greater than';
OP_GE : 'is greater than or equal to';

The generated (C#) lexer fails at runtime with a NoViableAltException. It then mangles GT when it encounters it in a test case, truncating a few characters and erroneously reporting the result as an identifier. Everything works as expected when GT is defined as '>' and OP_GE as '>='.
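
To make the behavior I'm after concrete, here is a minimal Python sketch of longest-match tokenization for the two phrases. It is not ANTLR output and says nothing about how ANTLR works internally. Only the names GT and OP_GE come from my grammar. The tokenize function, the ID pattern, and the catch-all CHAR token are assumptions made up for this illustration.

```python
import re

# The two phrase tokens from the lexer grammar above.
PHRASES = [
    ("GT", "is greater than"),
    ("OP_GE", "is greater than or equal to"),
]

# Assumed identifier shape; the real grammar may define identifiers differently.
ID = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_word_char(ch):
    """True for characters that could continue an identifier."""
    return ch.isalnum() or ch == "_"


def tokenize(text):
    """Return (type, lexeme) pairs, preferring the longest phrase at each position."""
    # Try longer phrases first so a phrase never loses to its own prefix.
    phrases = sorted(PHRASES, key=lambda p: len(p[1]), reverse=True)
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1  # whitespace is skipped, not emitted
            continue
        for name, literal in phrases:
            end = pos + len(literal)
            # The phrase must end at a word boundary to count as a match.
            at_boundary = end >= len(text) or not is_word_char(text[end])
            if text.startswith(literal, pos) and at_boundary:
                tokens.append((name, literal))
                pos = end
                break
        else:
            # No phrase matched here: fall back to an identifier or a single character.
            match = ID.match(text, pos)
            if match:
                tokens.append(("ID", match.group()))
                pos = match.end()
            else:
                tokens.append(("CHAR", text[pos]))
                pos += 1
    return tokens
```

With this sketch, tokenize("someVar is greater than anotherVar") yields an ID, a GT, and an ID, while the "or equal to" variant yields an ID, an OP_GE, and an ID. That is the result I'd like from the generated lexer.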

Questions:
=========

I'm not that familiar with ANTLR yet. I suspect this has something to do with lookahead (1 or 2), but I don't know what to do about it. The relative ordering of the two rules within the lexer grammar has no effect.

I've tried using syntactic predicates, but that did not change the runtime behavior. I probably did something wrong when specifying them in a lexer grammar.

I don't know whether this is a general ANTLR issue or something specific to the generated C# code. Does anyone have pointers? Specifying a custom lookahead could be a solution if it works, but it seems fragile. Or is there some solution I'm missing?

Thanks,

-krish