[antlr-interest] Stupid languages, and parsing them

Sam Barnett-Cormack s.barnett-cormack at lancaster.ac.uk
Sat Apr 11 12:46:56 PDT 2009


Sam Harwell wrote:
> Here's one way you can handle the keyword scoping problems straight from
> the parser:
> 
> In your parser, instead of referencing IDENTIFIER, create two rules
> like this:
> 
> identifier : IDENTIFIER;
> withSyntaxIdentifier : IDENTIFIER | KEYWORD1 | KEYWORD2 ;
> 
> And reference these two as appropriate from the other parser rules.

Ah, yes, but I have quite a lot of keywords... about 83. Not so handy a 
way to do it then.

I'll have a look at the Island Grammar stuff Thomas Brandon suggested, I 
think.

> -----Original Message-----
> From: antlr-interest-bounces at antlr.org
> [mailto:antlr-interest-bounces at antlr.org] On Behalf Of Sam
> Barnett-Cormack
> Sent: Saturday, April 11, 2009 1:45 PM
> To: ANTLR Interest Mailing List
> Subject: [antlr-interest] Stupid languages, and parsing them
> 
> Hi all,
> 
> In my ongoing project, I need to parse a really crazy structure that 
> wants to change the lexing rules dependent on syntactic factors. I hate 
> this.
> 
> Within the thing I'm talking about, whitespace and comments are handled 
> as they are the rest of the time (thankfully). Alphanumeric tokens are 
> all one type, and commas are allowed, and '[' and '{' (and closing 
> versions of such) have special meaning. Then there are things of the
> form &whatever ('&' followed by an alphabetic character followed by any
> number of alphanumerics). Those are already distinct types. However,
> once into this weird 'zone', most keywords aren't keywords anymore and
> must be treated as alphanumeric tokens.
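>
> In lexer terms, those &whatever things would be something like the
> following (made-up rule name, and assuming plain ASCII letters and
> digits):
>
> AMPERSAND_WORD : '&' ('a'..'z'|'A'..'Z') ('a'..'z'|'A'..'Z'|'0'..'9')* ;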
> 
> Now, this state is entered by 'WITH SYNTAX {' (and exited with '}').
> 
> The problem is that the specification considers the starter to be three
> tokens, and any amount of whitespace and comments is allowed between
> each. I can easily see that I could use gated predicates to switch
> between two lexer "modes"; that's one solution. Broadly, I can see two
> approaches:
> 
> 1) Use member variables to track whether the most recent non-WS,
> non-comment tokens were WITH, SYNTAX, and { (a sort of look-behind,
> implemented kludgily by putting an action in *every* rule, or by
> overriding the emit machinery to keep track of the last two tokens on
> the DEFAULT channel), and use these to switch into crazy-mode where much
> is different.
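>
> A rough sketch of what the emit-override variant might look like in
> ANTLR 3, assuming a combined grammar; WITH_KW, SYNTAX_KW, LBRACE, RBRACE
> and the flag are placeholder names, and nested '{' '}' inside the block
> isn't handled:
>
> @lexer::members {
>     private Token prev1, prev2;    // last two DEFAULT-channel tokens seen
>     boolean inWithSyntax = false;  // true while inside WITH SYNTAX { ... }
>
>     @Override
>     public void emit(Token t) {
>         super.emit(t);
>         if (t.getChannel() != Token.DEFAULT_CHANNEL) return; // skip WS/comments
>         if (!inWithSyntax
>                 && t.getType() == LBRACE
>                 && prev1 != null && prev1.getType() == SYNTAX_KW
>                 && prev2 != null && prev2.getType() == WITH_KW) {
>             inWithSyntax = true;   // just saw WITH SYNTAX {, so enter crazy-mode
>         } else if (inWithSyntax && t.getType() == RBRACE) {
>             inWithSyntax = false;  // closing }, so back to normal lexing
>         }
>         prev2 = prev1;
>         prev1 = t;
>     }
> }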
> 
> 2) Make the parser just accept *everything* within the definition of
> syntax, and deal with it in some other way (????) later. It has to be
> that bad, as the "normal" lexer sees '[[' as one token, and the "weird"
> version has to see it as two '[' tokens.
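>
> For the '[[' case specifically, gated predicates driven by a flag like
> the one sketched above might look roughly like this (rule names made
> up):
>
> LDOUBLEBRACKET : {!inWithSyntax}?=> '[[' ; // one token in normal mode
> LBRACKET       : '[' ; // inside WITH SYNTAX, '[[' lexes as two of these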
> 
> Anyone got any thoughts? Any ideas which would be less pain? Is there
> already some way of tracking the most recently emitted tokens on a
> specific channel?
> 
> Thanks,
> 
> Sam
> 


