[antlr-interest] Stupid languages, and parsing them
Sam Barnett-Cormack
s.barnett-cormack at lancaster.ac.uk
Sat Apr 11 12:46:56 PDT 2009
Sam Harwell wrote:
> Here's one way you can handle the keyword scoping problems straight from
> the parser:
>
> In your parser, instead of referencing IDENTIFIER, create two rules
> like this:
>
> identifier : IDENTIFIER;
> withSyntaxIdentifier : IDENTIFIER | KEYWORD1 | KEYWORD2 ;
>
> And reference these two as appropriate from the other parser rules.
Ah, yes, but I have quite a lot of keywords... about 83. Not such a
handy way to do it, then.
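(Thinking aloud: rather than hand-writing 83 alternatives, the rule could be generated from the keyword list. A rough Python sketch; the token names below are invented examples, not the real grammar's vocabulary:)

```python
# Sketch: generate the withSyntaxIdentifier rule text from a keyword
# list instead of typing ~83 alternatives by hand. The KW_* names are
# made-up placeholders for the grammar's real keyword token types.
keywords = ["KW_ABSENT", "KW_ALL", "KW_APPLICATION"]  # ...plus ~80 more

def make_rule(name, keyword_tokens):
    # IDENTIFIER first, then every keyword as an extra alternative
    alts = " | ".join(["IDENTIFIER"] + keyword_tokens)
    return "%s : %s ;" % (name, alts)

print(make_rule("withSyntaxIdentifier", keywords))
```

The output could be pasted (or build-time-included) into the grammar, so the keyword list only lives in one place.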
I'll have a look at the Island Grammar stuff Thomas Brandon suggested, I
think.
> -----Original Message-----
> From: antlr-interest-bounces at antlr.org
> [mailto:antlr-interest-bounces at antlr.org] On Behalf Of Sam
> Barnett-Cormack
> Sent: Saturday, April 11, 2009 1:45 PM
> To: ANTLR Interest Mailing List
> Subject: [antlr-interest] Stupid languages, and parsing them
>
> Hi all,
>
> In my ongoing project, I need to parse a really crazy structure that
> wants to change the lexing rules depending on syntactic context. I hate
> this.
>
> Within the thing I'm talking about, whitespace and comments are handled
> as they are the rest of the time (thankfully). Alphanumeric tokens are
> all one type, commas are allowed, and '[' and '{' (and their closing
> counterparts) have special meaning. Then there are things of the form
> &whatever ('&' followed by an alphabetic character, followed by any
> number of alphanumerics). Those are already distinct types. However,
> once into this weird 'zone', most keywords aren't keywords anymore and
> must be treated as alphanumeric tokens.
>
> Now, this state is entered by 'WITH SYNTAX {' (and exited with '}').
>
> The problem is the specification considers the starter to be three
> tokens, and any amount of whitespace and comments is allowed between
> each. I can easily see that I could use gated predicates to switch
> between two lexer "modes". That's one solution. I can see two broad
> solutions:
>
> 1) Use member variables to track whether the most recent non-WS,
> non-comment tokens were WITH, SYNTAX, and { (a sort of look-behind,
> implemented kludgily by putting an action in *every* rule, or by
> overriding the emit machinery to keep track of the last 2 tokens on the
> DEFAULT channel), and use these to switch into crazy-mode where much is
> different.
>
> 2) Make the parser just accept *everything* within the syntax
> definition, and deal with it in some other way (????) later. It has to
> be that bad, as the "normal" lexer sees '[[' as one token, and the
> "weird" version has to see it as two '[' tokens.
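(For what it's worth, option 1 can be prototyped outside ANTLR as a post-lex token filter: remember the last two default-channel tokens, enter "syntax mode" after WITH SYNTAX {, leave on }, and re-split '[[' while inside. A Python sketch; the (type, text) tuples and token names are an invented stand-in, not the ANTLR runtime API:)

```python
# Sketch of option 1 as a token-stream filter: track the last two
# default-channel tokens (ignoring WS/comments, so any amount of either
# may sit between WITH, SYNTAX, and '{'), flip into "syntax mode" on
# that three-token sequence, and flip back out on '}'. While in the
# mode, '[[' is re-split into two '[' tokens and keywords are
# downgraded to plain alphanumeric WORD tokens.
def filter_tokens(tokens, keywords):
    out, last2, in_syntax = [], [], False
    for typ, text in tokens:
        if typ in ("WS", "COMMENT"):       # hidden channel: pass through, untracked
            out.append((typ, text))
            continue
        if in_syntax:
            if typ == "RBRACE":
                in_syntax = False          # '}' exits the weird zone
            elif typ == "DOUBLE_LBRACKET":
                # the "weird" lexing must see '[[' as two '[' tokens
                out.extend([("LBRACKET", "["), ("LBRACKET", "[")])
                continue
            elif typ in keywords:
                typ = "WORD"               # keywords lose their meaning here
        elif typ == "LBRACE" and last2 == ["WITH", "SYNTAX"]:
            in_syntax = True               # saw WITH SYNTAX { (WS/comments skipped)
        last2 = (last2 + [typ])[-2:]       # look-behind of the last 2 default tokens
        out.append((typ, text))
    return out
```

The same bookkeeping should translate to an overridden emit in the real lexer, since all it needs is the last two tokens seen on the default channel.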
>
> Anyone got any thoughts? Any ideas which would be less pain? Is there
> already some way of tracking the most recently emitted token on a
> specific channel?
>
> Thanks,
>
> Sam
>
> List: http://www.antlr.org/mailman/listinfo/antlr-interest
> Unsubscribe:
> http://www.antlr.org/mailman/options/antlr-interest/your-email-address