[antlr-interest] Stupid languages, and parsing them
Sam Harwell
sharwell at pixelminegames.com
Sat Apr 11 12:31:50 PDT 2009
Here's one way you can handle the keyword scoping problems straight from
the parser:
In your parser, instead of referencing IDENTIFIER directly, create two
rules like this:
identifier : IDENTIFIER;
withSyntaxIdentifier : IDENTIFIER | KEYWORD1 | KEYWORD2 ;
And reference these two as appropriate from the other parser rules.
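For example (a sketch; the rule and token names other than the two above are hypothetical):

```
// Outside the WITH SYNTAX block, reference the strict rule so
// keywords stay reserved:
typeAssignment : identifier ASSIGN type ;

// Inside it, reference the permissive rule so KEYWORD1/KEYWORD2
// are accepted as ordinary identifiers:
withSyntaxBody : LBRACE withSyntaxIdentifier* RBRACE ;
```

The lexer never changes; only the parser decides, per context, which token types count as an identifier.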
-----Original Message-----
From: antlr-interest-bounces at antlr.org
[mailto:antlr-interest-bounces at antlr.org] On Behalf Of Sam
Barnett-Cormack
Sent: Saturday, April 11, 2009 1:45 PM
To: ANTLR Interest Mailing List
Subject: [antlr-interest] Stupid languages, and parsing them
Hi all,
In my ongoing project, I need to parse a really crazy structure that
wants to change the lexing rules depending on syntactic context. I hate
this.
Within the thing I'm talking about, whitespace and comments are handled
as they are the rest of the time (thankfully). Alphanumeric tokens are
all one type, commas are allowed, and '[' and '{' (and their closing
counterparts) have special meaning. Then there are things of the form
&whatever ('&' followed by an alphabetic character followed by any
number of alphanumerics). Those are already distinct types. However,
once inside this weird 'zone', most keywords aren't keywords anymore and
must be treated as alphanumeric tokens.
Now, this state is entered by 'WITH SYNTAX {' (and exited with '}').
The problem is that the specification considers the starter to be three
tokens, with any amount of whitespace and comments allowed between each.
I can easily see that I could use gated predicates to switch between two
lexer "modes"; that's one approach. Broadly, I can see two solutions:
1) Use member variables to track whether the most recent non-WS,
non-comment tokens were WITH, SYNTAX, and '{' (a sort of look-behind,
implemented kludgily either by putting an action in *every* rule or by
overriding the emit machinery to keep track of the last two tokens on
the DEFAULT channel), and use these to switch into crazy-mode where much
is different.
2) Make the parser just accept *everything* within the syntax
definition, and deal with it in some other way (????) later. It has to
be that bad, because the "normal" lexer sees '[[' as one token, while
the "weird" version has to see it as two '[' tokens.
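For what it's worth, option 1's look-behind logic can be prototyped outside ANTLR first. A minimal, ANTLR-independent Java sketch (all names here are hypothetical; in a real ANTLR 3 lexer this logic would live in an overridden emit() in the @members section, fed only tokens on the DEFAULT channel):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Tracks the last two default-channel token texts and flips a
// "syntax mode" flag when they spell WITH SYNTAX followed by '{'.
class SyntaxModeTracker {
    private final Deque<String> lastTwo = new ArrayDeque<>();
    boolean inSyntaxMode = false;

    // Call for every token on the default channel (i.e. skip WS
    // and comments, which are on a hidden channel).
    void onDefaultChannelToken(String text) {
        if (inSyntaxMode) {
            // Inside the zone, only '}' matters: it ends the zone.
            if (text.equals("}")) inSyntaxMode = false;
            return;
        }
        if (text.equals("{")
                && lastTwo.size() == 2
                && "WITH".equals(lastTwo.peekFirst())
                && "SYNTAX".equals(lastTwo.peekLast())) {
            inSyntaxMode = true;
        }
        // Keep only the two most recent default-channel tokens.
        if (lastTwo.size() == 2) lastTwo.removeFirst();
        lastTwo.addLast(text);
    }
}
```

Because WS and comment tokens never reach the tracker, any amount of whitespace between the three starter tokens is handled for free.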
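The gated-predicate route can also handle the '[[' problem directly: gate the two-character token on a mode flag. A sketch, assuming a hypothetical boolean inSyntaxMode member maintained in the lexer's @members:

```
// ANTLR 3 sketch: only outside the WITH SYNTAX zone does '[['
// lex as a single token; inside the zone the predicate fails,
// so the input matches LBRACKET twice instead.
DOUBLE_LBRACKET : {!inSyntaxMode}?=> '[[' ;
LBRACKET        : '[' ;
```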
Anyone got any thoughts? Any idea which would be less pain? Is there
already some way of tracking the most recently emitted tokens on a
specific channel?
Thanks,
Sam