[antlr-interest] Lexer: strings that are starting sub-strings of another

Sun Jul 22 03:23:28 PDT 2012

Hello Benjamin, Jim:

Your responses are appreciated.

Maybe a bit of context is relevant here - which also makes my post long, but given that this is a relatively new area for me (language design, lexers, parsers), I'd more than appreciate thoughts of others more experienced in this area.

The context here is that there is an existing language that was built by a large public sector organization - essentially a homebrew solution developed over a number of years. This homebrew solution has a verbose language and an editor environment where a number of business specifications are defined (think 10s of thousands of lines of specifications). From the specifications, target code solutions are generated (C#/Java/etc.). The homebrew solution is built using a platform that is considered a business risk; and the intention is to use more standardized platforms and tools (say Eclipse or Visual Studio) in order to be able to write/maintain the specifications while still maintaining the code generation targets.

The challenge is that the homebrew solution specification language has no formal grammar defined.

So my initial approach was to define a lexer grammar and a parser grammar for the existing language for business specifications that would enable me to plug in the language into a development environment (say Visual Studio).

And without having to change key language elements in the existing specification language.

So if the specification language contained a keyword 'is greater than'; then treating it as a contextual keyword vs. an actual keyword (in my thinking) also meant the difference between a lexer recognizing it immediately (and being able to flag as such immediately in an editor that gives you individual lines instead of the whole text block) vs. a parser later on in the cycle recognizing it as one after seeing sequences of tokens and interpreting it as a keyword given certain rules - say the tokens appear in a particular order.

Thoughts?

-krish

-----Original Message-----
From: Benjamin S Wolf [mailto:jokeserver at gmail.com] 
Sent: Sunday, July 22, 2012 4:29 AM
To: Krishnan Subramanian
Cc: antlr-interest at antlr.org
Subject: Re: [antlr-interest] Lexer: strings that are starting sub-strings of another

Lexers aren't that great at distinguishing tokens using multiple lookahead, but my current grammar gets around similar issues by splitting tokens at word boundaries, and using low level parser rules as "macro" tokens. Eg. if I wanted "aaa bbb" to be a token and "aaa"
were a token, then I would add a token "bbb" and have a parser rule "aaa_bbb : AAA BBB". Of course, this means that the whitespace (or whatever you put on channel HIDDEN) between aaa and bbb is irrelevant.

In your particular case, I'd recommend something slightly different:
split "is", "greater than", "or", and "equal to" into separate tokens, and encode the ordering of said tokens for comparisons like "is greater than" and "is equal to" and "is greater than or equal to" as parser rules.

On Sat, Jul 21, 2012 at 3:31 AM, Krishnan Subramanian <krishsub at microsoft.com> wrote:
> Hi all,
>
> I've been exploring ANTLR for creating a custom DSL for a scripting language with the intention being to generate a parser and lexer in C#.
>
> I've started by generating writing a lexer grammar and a parser grammar. This mostly works fine.
>
> However, I've run into a lexer case where my language can contain words that are [starting] sub-strings of another and should be treated differently.
>
> For e.g. the script is ~English where I can have:
>
>                 if (someVar is greater than anotherVar)                                                // someVar > anotherVar where GT is defined as 'is greater than'
>                 if (somevar is greater than or equal to anotherVar)          // someVar >= anotherVar where OP_GE is defined as 'is greater than or equal to'
>
> In my lexer grammar, I have two definitions:
>
> GT          :               'is greater than';
> OP_GE  :               'is greater than or equal to';
>
> The generated (C#) lexer barfs at runtime with an NoViableAltException and then mangles GT when it encounters it in a test case truncating a few characters and erroneously reporting it as an identifier. This obviously works with GT being defined as a '>' and a OP_GE being defined as a '>='.
>
> Questions:
> =========
>
> I'm not that familiar with ANTLR yet, and I suspect this might have something to do with lookaheads (1 or 2), but I don't know what to do. Relative ordering within the lexer grammar has no effect.
>
> I've tried using syntactic predicates; but that did not change anything with respect to runtime behavior. I probably did something wrong in terms of specifying it for a lexer grammar.
>
> And I don't know if this is a general ANTLR issue or a generated C# thing, but maybe someone has pointers? Specifying a custom lookahead? Could be a solution if it works, but seems fragile. Or is there some solution I'm missing?
>
> Thanks,
>
> -krish
>
> List: http://www.antlr.org/mailman/listinfo/antlr-interest
> Unsubscribe: 
> http://www.antlr.org/mailman/options/antlr-interest/your-email-address