[antlr-interest] Stupid languages, and parsing them

Sam Harwell sharwell at pixelminegames.com
Sat Apr 11 12:31:50 PDT 2009


Here's one way you can handle the keyword scoping problem straight from
the parser:

In your parser, instead of referencing IDENTIFIER directly, create two
rules like this:

identifier : IDENTIFIER;
withSyntaxIdentifier : IDENTIFIER | KEYWORD1 | KEYWORD2 ;

And reference these two as appropriate from the other parser rules.
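To make that concrete, here is a hypothetical sketch of how the two rules
might be used (rule and token names beyond the ones above are placeholders,
not from the original post):

```antlr
// Outside the WITH SYNTAX block, use the strict rule;
// inside it, use the permissive one, so former keywords
// parse as ordinary names.
withSyntaxBlock
    : WITH SYNTAX '{' withSyntaxIdentifier* '}'
    ;

identifier           : IDENTIFIER ;
withSyntaxIdentifier : IDENTIFIER | KEYWORD1 | KEYWORD2 ;
```

The lexer keeps emitting KEYWORD1 and KEYWORD2 as usual; only the parser
decides, per context, whether they count as identifiers.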

-----Original Message-----
From: antlr-interest-bounces at antlr.org
[mailto:antlr-interest-bounces at antlr.org] On Behalf Of Sam
Barnett-Cormack
Sent: Saturday, April 11, 2009 1:45 PM
To: ANTLR Interest Mailing List
Subject: [antlr-interest] Stupid languages, and parsing them

Hi all,

In my ongoing project, I need to parse a really crazy structure that 
wants to change the lexing rules depending on syntactic context. I hate 
this.

Within the thing I'm talking about, whitespace and comments are handled 
as they are the rest of the time (thankfully). Alphanumeric tokens are 
all one type, commas are allowed, and '[' and '{' (and their closing 
counterparts) have special meaning. Then there are things of the form 
&whatever ('&' followed by an alphabetic character followed by any 
number of alphanumerics). Those are already distinct types. However, 
once inside this weird 'zone', most keywords aren't keywords anymore and 
must be treated as alphanumeric tokens.
Now, this state is entered by 'WITH SYNTAX {' (and exited with '}').

The problem is that the specification considers the opener to be three 
tokens, with any amount of whitespace and comments allowed between each. 
I can easily see that gated predicates would let me switch between two 
lexer "modes"; the question is how to drive them. I can see two broad 
solutions:

1) Use member variables to track whether the most recent non-WS, 
non-comment tokens were WITH, SYNTAX, and '{' (a sort of look-behind, 
implemented kludgily either by putting an action in *every* rule or by 
overriding the emit machinery to keep track of the last two tokens on 
the DEFAULT channel), and use these to switch into crazy-mode, where 
much is different.

2) Make the parser just accept *everything* within the syntax definition 
and deal with it in some other way (????) later. It has to be that bad, 
because the "normal" lexer sees '[[' as a single token, while the 
"weird" version has to see it as two '[' tokens.
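For what it's worth, option 1's mode switch might look something like the
following ANTLR 3 sketch. This is only an illustration under the assumption
that something (a parser action, or the look-behind described above) flips
the flag; the names are made up:

```antlr
// Hypothetical sketch of option 1. 'inWithSyntax' would be set to true
// after WITH SYNTAX '{' is seen and back to false at the matching '}'.
@lexer::members {
    boolean inWithSyntax = false;
}

// Outside the zone, '[[' is one token; inside it, the gated predicate
// disables this rule, so two separate '[' tokens come out instead.
DOUBLE_LBRACKET : {!inWithSyntax}?=> '[[' ;
LBRACKET        : '[' ;
```

The gated predicate ({...}?=>) keeps the '[[' alternative out of the
lexer's prediction entirely while the flag is set, which is exactly the
two-mode behaviour described above.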

Anyone got any thoughts? Any ideas on which would be less pain? Is there 
already some way of tracking the most recently emitted token on a 
specific channel?
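On that last question: there is no built-in per-channel history that I know
of, but the emit-override variant of option 1 could be sketched roughly like
this in ANTLR 3 (field names are invented for illustration):

```antlr
// Hypothetical sketch: override the lexer's emit() to remember the two
// most recent tokens sent on the DEFAULT channel, so a rule action can
// check for the WITH SYNTAX '{' sequence without touching every rule.
@lexer::members {
    Token prev1, prev2;   // two most recent default-channel tokens

    public Token emit() {
        Token t = super.emit();
        if (t.getChannel() == Token.DEFAULT_CHANNEL) {
            prev2 = prev1;
            prev1 = t;
        }
        return t;
    }
}
```

A '{' rule could then inspect prev1 and prev2 (were they SYNTAX and WITH?)
and flip the mode flag in one place, since whitespace and comment tokens on
other channels never overwrite the history.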

Thanks,

Sam

List: http://www.antlr.org/mailman/listinfo/antlr-interest