[antlr-interest] Stupid languages, and parsing them

Sam Barnett-Cormack s.barnett-cormack at lancaster.ac.uk
Sat Apr 11 11:45:28 PDT 2009


Hi all,

In my ongoing project, I need to parse a really crazy structure that 
wants to change the lexing rules depending on syntactic context. I hate 
this.

Within the construct I'm talking about, whitespace and comments are handled 
as they are the rest of the time (thankfully). Alphanumeric tokens are 
all one type, commas are allowed, and '[' and '{' (and their closing 
counterparts) have special meaning. Then there are things of the form 
&whatever ('&' followed by an alphabetic character followed by any number 
of alphanumerics). Those are already distinct types. However, once inside 
this weird 'zone', most keywords aren't keywords anymore and must be 
treated as plain alphanumeric tokens.

Now, this state is entered by 'WITH SYNTAX {' (and exited with '}').

The problem is that the specification considers the starter to be three 
separate tokens, with any amount of whitespace and comments allowed 
between them. I can easily see that gated predicates could switch 
between two lexer "modes"; that's one mechanism. Broadly, I can see two 
solutions:

1) Use member variables to track whether the most recent non-WS, 
non-comment tokens were WITH, SYNTAX, and '{' (a sort of look-behind, 
implemented kludgily either by putting an action in *every* rule or by 
overriding the emit machinery to keep track of the last two tokens on the 
DEFAULT channel), and use these to switch into crazy-mode, where much is 
different.

2) Make the parser just accept *everything* within the syntax 
definition, and deal with it in some other way (????) later. It has to be 
that bad, because the "normal" lexer sees '[[' as a single token, while 
the "weird" version has to see it as two '[' tokens.
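For what it's worth, option 1 can be sketched outside ANTLR as a toy
hand-written tokenizer. Everything here is invented for illustration (the
tokenize() function, the token set, the '--' comment syntax, the keyword
list); in a real ANTLR 3 lexer the equivalent would be overriding emit()
to remember the last two DEFAULT-channel tokens and gating rules on a
member flag. Nesting of braces inside the zone is ignored for brevity.

```python
import re

# Toy sketch, not ANTLR: remember the last two tokens seen on the default
# channel and flip a mode flag when they are WITH, SYNTAX followed by '{'.

PATTERNS = [
    ("WS",      r"\s+"),                    # hidden channel
    ("COMMENT", r"--[^\n]*"),               # hidden channel (illustrative)
    ("LLBRACK", r"\[\["),                   # '[[' is one token in normal mode
    ("LBRACK",  r"\["),
    ("RBRACK",  r"\]"),
    ("LBRACE",  r"\{"),
    ("RBRACE",  r"\}"),
    ("COMMA",   r","),
    ("AMPWORD", r"&[A-Za-z][A-Za-z0-9]*"),  # the '&whatever' tokens
    ("WORD",    r"[A-Za-z0-9]+"),
]
MASTER = re.compile("|".join("(?P<%s>%s)" % p for p in PATTERNS))
KEYWORDS = {"WITH", "SYNTAX"}               # illustrative subset

def tokenize(text):
    in_syntax = False   # inside WITH SYNTAX { ... }?  (nesting ignored)
    last_two = []       # texts of the last two default-channel tokens
    for m in MASTER.finditer(text):
        kind, value = m.lastgroup, m.group()
        if kind in ("WS", "COMMENT"):
            # skipping hidden tokens keeps the look-behind working across
            # any amount of whitespace/comments between the three tokens
            continue
        if in_syntax and kind == "LLBRACK":
            # the "weird" mode must see '[[' as two '[' tokens
            yield ("LBRACK", "[")
            yield ("LBRACK", "[")
            continue
        if kind == "WORD" and value in KEYWORDS and not in_syntax:
            kind = value        # keywords are only keywords outside the zone
        if not in_syntax and kind == "LBRACE" and last_two == ["WITH", "SYNTAX"]:
            in_syntax = True    # the look-behind fires: enter crazy-mode
        elif in_syntax and kind == "RBRACE":
            in_syntax = False   # '}' exits the zone
        last_two = (last_two + [value])[-2:]
        yield (kind, value)
```

On input `WITH SYNTAX { WITH [[ } [[` this emits WITH and SYNTAX as
keywords, a plain WORD for the inner WITH, two LBRACK tokens for the '[['
inside the zone, and a single LLBRACK for the one outside it.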

Anyone got any thoughts? Any idea which would be less painful? Is there 
already some way of tracking the most recently emitted tokens on a 
specific channel?

Thanks,

Sam
