[antlr-interest] Stupid languages, and parsing them

Sat Apr 11 12:08:32 PDT 2009

On Sun, Apr 12, 2009 at 4:45 AM, Sam Barnett-Cormack
<s.barnett-cormack at lancaster.ac.uk> wrote:
> Hi all,
>
> In my ongoing project, I need to parse a really crazy structure that
> wants to change the lexing rules dependent on syntactic factors. I hate
> this.
>
> Within the thing I'm talking about, whitespace and comments are handled
> as they are the rest of the time (thankfully). Alphanumeric tokens are
> all one type, and commas are allowed, and '[' and '{' (and closing
> versions of such) have special meaning. Then there's things that are
> &whatever ('&' followed by alphabetic followed by any number of
> alphanumeric). Those are already distinct types. However, once into this
> weird 'zone', most keywords aren't keywords anymore and must be treated
> as alphanumeric tokens.
>
> Now, this state is entered by 'WITH SYNTAX {' (and exited with '}')
>
> The problem is the specification considers the starter to be three
> tokens, and any amount of whitespace and comments is allowed between
> each. I can easily see that I could use gated predicates to switch
> between two lexer "modes". Beyond that, I can see two broad
> solutions:
>
> 1) Use member variables to track whether the most recent non-WS,
> non-comment tokens were WITH, SYNTAX, and { (a sort of look-behind
> implemented kludgily by putting an action in *every* rule, or by
> overriding emit to keep track of the last 2 things on the DEFAULT
> channel), and use these to switch into crazy-mode where much is different.
>
> 2) Make the parser just accept *everything* within the definition of
> syntax, and deal with it in some other way (????) later. It has to be that
> bad, as the "normal" lexer sees '[[' as a token, and the "weird" version
> has to see it as two '[' tokens.
>
> Anyone got any thoughts? Any ideas which would be less pain? Is there
> already some way of tracking recently-emitted tokens on a specific channel?
You probably want to look at the island grammar example in the
examples pack. Here you switch to an alternate lexer to parse the
block. This is likely easier and more efficient than using predicates.
That example switches lexers under lexer control, so you will have to
deal with the whitespace/comments in your start sequence. You can have
it under parser control
(http://www.antlr.org/wiki/display/ANTLR3/Island+Grammars+Under+Parser+Control)
though I think the start sequence is simple enough that you are better
off having it under lexer control. I would think something like:
    WITH_SYNTAX
        : 'WITH' (WS|COMMENT)+ 'SYNTAX' (WS|COMMENT)+ '{' { enterWithSyntax(); }
        ;
would be easier than your lookback idea. If you really want three
separate tokens then you could override emit to allow multiple tokens.
This is still likely simpler than the alternative.
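
To make the idea concrete, here is a rough stand-alone sketch in plain
Java rather than ANTLR: a toy lexer that matches the whole start sequence
in one go, emits it as three tokens, and flips a mode flag until the
closing '}'. Everything in it is an assumption for illustration only: the
class name ModeLexerSketch, the tokenize method, the "--" line-comment
syntax, the OPTIONAL keyword and the KW:/WORD:/REF: labels. It does not use
the ANTLR runtime, and it lets '{' follow SYNTAX directly, which the rule
above would not.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical toy lexer; none of these names come from ANTLR.
class ModeLexerSketch {
    // Whitespace or a "--" line comment (the comment syntax is an assumption).
    private static final String SKIP = "(?:\\s|--[^\\n]*)";
    private static final Pattern SKIPPED = Pattern.compile(SKIP + "+");
    // The whole start sequence, with whitespace/comments between the parts.
    private static final Pattern START =
        Pattern.compile("WITH" + SKIP + "+SYNTAX" + SKIP + "*\\{");
    // Alphanumeric words, optionally prefixed with '&'.
    private static final Pattern WORD = Pattern.compile("&?[A-Za-z][A-Za-z0-9]*");
    // Illustrative keyword set for the "normal" mode.
    private static final Set<String> KEYWORDS = Set.of("WITH", "SYNTAX", "OPTIONAL");

    static List<String> tokenize(String input) {
        List<String> out = new ArrayList<>();
        boolean inSyntax = false; // true between 'WITH SYNTAX {' and '}'
        int pos = 0;
        while (pos < input.length()) {
            // Whitespace and comments are skipped the same way in both modes.
            Matcher m = SKIPPED.matcher(input).region(pos, input.length());
            if (m.lookingAt()) { pos = m.end(); continue; }

            // One match for the start sequence, but three tokens emitted.
            m = START.matcher(input).region(pos, input.length());
            if (!inSyntax && m.lookingAt()) {
                out.add("KW:WITH");
                out.add("KW:SYNTAX");
                out.add("{");
                inSyntax = true;
                pos = m.end();
                continue;
            }

            m = WORD.matcher(input).region(pos, input.length());
            if (m.lookingAt()) {
                String text = m.group();
                // Inside the block, keywords degrade to plain words.
                String kind = text.startsWith("&") ? "REF"
                    : (!inSyntax && KEYWORDS.contains(text)) ? "KW" : "WORD";
                out.add(kind + ":" + text);
                pos = m.end();
                continue;
            }

            // '[[' is a single token only in normal mode.
            if (!inSyntax && input.startsWith("[[", pos)) {
                out.add("[[");
                pos += 2;
                continue;
            }

            // Anything else is a one-character token; '}' leaves the block.
            char c = input.charAt(pos++);
            if (inSyntax && c == '}') inSyntax = false;
            out.add(String.valueOf(c));
        }
        return out;
    }
}
```

With this sketch, ModeLexerSketch.tokenize("[[ WITH SYNTAX { [[ }") gives
[[, KW:WITH, KW:SYNTAX, {, [, [, } so the same '[[' is one token outside
the block and two inside it. In a real ANTLR 3 lexer the mode flag would
become the lexer switch from the island grammar example, and the three
add calls would become the overridden emit.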

Tom.

>
> Thanks,
>
> Sam