[antlr-interest] Stupid languages, and parsing them

Sam Barnett-Cormack s.barnett-cormack at lancaster.ac.uk
Sat Apr 11 13:01:22 PDT 2009


Thomas Brandon wrote:
> On Sun, Apr 12, 2009 at 4:45 AM, Sam Barnett-Cormack
> <s.barnett-cormack at lancaster.ac.uk> wrote:
>> Hi all,
>>
>> In my ongoing project, I need to parse a really crazy structure that
>> wants to change the lexing rules dependent on syntactic factors. I hate
>> this.
>>
>> Within the thing I'm talking about, whitespace and comments are handled
>> as they are the rest of the time (thankfully). Alphanumeric tokens are
>> all one type, and commas are allowed, and '[' and '{' (and closing
>> versions of such) have special meaning. Then there are things of the
>> form &whatever ('&' followed by an alphabetic character, then any
>> number of alphanumerics). Those are already distinct token types.
>> However, once into this
>> weird 'zone', most keywords aren't keywords anymore and must be treated
>> as alphanumeric tokens.
>>
>> Now, this state is entered by 'WITH SYNTAX {' (and exited with '}')
>>
>> The problem is the specification considers the starter to be three
>> tokens, and any amount of whitespace and comments is allowed between
>> each. I can easily see that I could use gated predicates to switch
>> between two lexer "modes". That's one solution. I can see two broad
>> solutions:
>>
>> 1) Use member variables to track whether the most recent non-WS,
>> non-comment tokens were WITH, SYNTAX, and { (a sort of look-behind,
>> implemented kludgily either by putting an action in *every* rule or by
>> overriding the emit machinery to keep track of the last two tokens on
>> the DEFAULT channel), and use these to switch into crazy-mode where
>> much is different.
>>
>> 2) Make the parser just accept *everything* within the syntax
>> definition, and deal with it in some other way (????) later. It has to
>> be that bad, as the "normal" lexer sees '[[' as one token, and the
>> "weird" version has to see it as two '[' tokens.
>>
>> Anyone got any thoughts? Any ideas which would be less painful? Is
>> there already some way of tracking the most recently emitted token on
>> a specific channel?
> You probably want to look at the island grammar example in the
> examples pack. Here you switch to an alternate lexer to parse the
> block. This is likely easier and more efficient than using predicates.
> That has the lexer switching under lexer control, so you will have to
> deal with the whitespace/comments in your start sequence. You can have
> it under parser control
> (http://www.antlr.org/wiki/display/ANTLR3/Island+Grammars+Under+Parser+Control),
> though I think the start sequence is simple enough that you are better
> off having it under lexer control. I would think something like:
> WITH_SYNTAX: 'WITH' (WS|COMMENT)+ 'SYNTAX' (WS|COMMENT)+ '{' {
> enterWithSyntax(); };
> would be easier than your look-behind idea. If you really want three
> separate tokens then you could override emit to allow multiple tokens.
> This is still likely simpler than the alternative.
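[For reference, the usual shape of the "override emit to allow multiple tokens" trick in ANTLR 3 is to buffer tokens in a queue and have nextToken() drain the queue before scanning further input. Below is a minimal pure-Java sketch of just that queue idea, not the actual ANTLR runtime API; the class, token representation, and toy scan rule are all made up for illustration. It also happens to demonstrate the '[['-as-two-'['-tokens case.]

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Illustrative stand-in for the queue trick behind "override emit()
// to allow multiple tokens": a rule invocation may emit several
// tokens; nextToken() serves queued tokens before scanning more input.
public class MultiEmitSketch {
    private final Deque<String> pending = new ArrayDeque<>();
    private final String input;
    private int pos = 0;

    public MultiEmitSketch(String input) { this.input = input; }

    // Analogue of emit(Token): queue rather than overwrite.
    private void emit(String token) { pending.addLast(token); }

    // Analogue of nextToken(): drain the queue first.
    public String nextToken() {
        if (pending.isEmpty()) scan();
        return pending.isEmpty() ? "<EOF>" : pending.removeFirst();
    }

    // Toy rule: in "weird mode", '[[' is emitted as two '[' tokens.
    private void scan() {
        if (pos >= input.length()) return;
        char c = input.charAt(pos);
        if (c == '[' && pos + 1 < input.length()
                && input.charAt(pos + 1) == '[') {
            pos += 2;
            emit("[");   // one rule invocation, two tokens queued
            emit("[");
        } else {
            pos += 1;
            emit(String.valueOf(c));
        }
    }

    public static void main(String[] args) {
        MultiEmitSketch lexer = new MultiEmitSketch("a[[b");
        List<String> tokens = new ArrayList<>();
        String t;
        while (!(t = lexer.nextToken()).equals("<EOF>")) tokens.add(t);
        System.out.println(tokens); // [a, [, [, b]
    }
}
```

[In the real thing the same two-method pattern is applied to the generated lexer's emit()/nextToken(), with Token objects instead of strings.]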

I'm not sure an island grammar would work, as I need the eventual AST of 
the "WITH SYNTAX" block to be included in the final AST of the master 
grammar.

Unless, that is, I can invoke a full lexer/parser combination, get the 
tree out of it, and somehow have the lexer pass that tree into the token 
stream (which sounds wacky) and have the parser pull in the whole tree. 
That would be, perhaps, painful. Or, I suppose, with a custom token type 
it might be possible to wrap up the whole token stream from the inner 
lexer in a single token, and use a parse-only island grammar from the 
parser to handle that and accept the resulting AST and integrate it. 
I've just no idea how to start doing either of those things. I'll do 
some reading and prodding, but if anyone can give pointers I'd be 
grateful - being able to do at least the lexing separately (parsing 
isn't a bother to do in the main parser) would be good, and the code to 
emit multiple tokens looks scary. That said, I guess I could use an 
island lexer, and use multiple token emit to emit all of the tokens from 
the island in order. I just have to make sure that the two share token 
definitions, so I'd probably have to do something odd... and I have no 
idea how to make two lexers share a portion of token vocabulary without 
sharing the rules for those tokens.
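[On that last point: ANTLR 3 writes a grammar's token-name-to-type mapping into a generated .tokens file, and another grammar can import it with the tokenVocab option, so an island lexer can reuse the outer grammar's type numbers without sharing any rules. A sketch, with grammar names made up (Asn1 is assumed to be the main grammar):]

```antlr
// Hypothetical island lexer for the WITH SYNTAX block. tokenVocab
// pulls the token type numbers from the main grammar's generated
// .tokens file (assumed here to be Asn1.tokens), so same-named tokens
// get the same type in both lexers, while each keeps its own rules.
lexer grammar WithSyntaxIsland;

options { tokenVocab = Asn1; }

LBRACKET : '[' ;   // the main lexer may also have a '[[' rule;
                   // this lexer never will, so '[[' comes out as
                   // two LBRACKET tokens inside the island
RBRACKET : ']' ;
```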

Wow, that was rambling... if anyone manages to fight through that and 
then come up with some useful advice (kudos to you if you can), it'd be 
appreciated.
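[For what it's worth, the look-behind half of option 1 doesn't need an action in every rule if token emission is funneled through one place: keep a small history of the last few DEFAULT-channel tokens and check it after each emit. A self-contained sketch of just that bookkeeping - not ANTLR API, and the class and method names are invented:]

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;

// Sketch of tracking the most recent non-whitespace, non-comment
// tokens so a lexer can notice the sequence WITH SYNTAX { and flip
// into "crazy mode". Channel filtering and token text are simplified.
public class DefaultChannelHistory {
    private final Deque<String> last = new ArrayDeque<>();
    private boolean weirdMode = false;

    // Call once per emitted token; hiddenChannel marks WS/comments.
    public void onToken(String text, boolean hiddenChannel) {
        if (hiddenChannel) return;   // ignore WS and comments
        if (weirdMode && text.equals("}")) {
            weirdMode = false;       // '}' exits the weird zone
            return;
        }
        last.addLast(text);
        if (last.size() > 3) last.removeFirst();
        if (last.size() == 3) {
            Iterator<String> it = last.iterator();
            if (it.next().equals("WITH") && it.next().equals("SYNTAX")
                    && it.next().equals("{")) {
                weirdMode = true;
                last.clear();
            }
        }
    }

    public boolean inWeirdMode() { return weirdMode; }

    public static void main(String[] args) {
        DefaultChannelHistory h = new DefaultChannelHistory();
        h.onToken("WITH", false);
        h.onToken(" ", true);        // whitespace: ignored
        h.onToken("SYNTAX", false);
        h.onToken("{", false);
        System.out.println(h.inWeirdMode()); // true
        h.onToken("}", false);
        System.out.println(h.inWeirdMode()); // false
    }
}
```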

Sam (BC)


