[antlr-interest] Stupid languages, and parsing them

Sun Apr 12 06:29:16 PDT 2009

Thomas Brandon wrote:
> On Sun, Apr 12, 2009 at 6:01 AM, Sam Barnett-Cormack
> <s.barnett-cormack at lancaster.ac.uk> wrote:
>> I'm not sure an island grammar would work, as I need the eventual AST of the
>> "WITH SYNTAX" block to be included in the final AST of the master grammar.
>>
>> Unless, that is, I can invoke a full lexer/parser combination, get the tree
>> out of it, and somehow have the lexer pass that tree into the token stream
>> (which sounds wacky) and have the parser pull in the whole tree. That would
>> be, perhaps, painful. Or, I suppose, with a custom token type it might be
>> possible to wrap up the whole token stream from the inner lexer in a single
>> token, and use a parse-only island grammar from the parser to handle that
>> and accept the resulting AST and integrate it. I've just no idea how to
>> start doing either of those things. I'll do some reading and prodding, but
>> if anyone can give pointers I'd be grateful - being able to do at least the
>> lexing separately (parsing isn't a bother to do in the main parser) would be
>> good, and the code to emit multiple tokens looks scary. That said, I guess I
>> could use an island lexer, and use multiple token emit to emit all of the
>> tokens from the island in order. I just have to make sure that the two share
>> token definitions, so I'd probably have to do something odd... and I have no
>> idea how to make two lexers share a portion of token vocabulary without
>> sharing the rules for those tokens.
>>
>> Wow, that was rambling... if anyone manages to fight through that and then
>> come up with some useful advice (kudos to you if you can), it'd be
>> appreciated.
>>
> Ah yes, the island-grammar example doesn't show integration of the
> ASTs. Options would be:
> 1) Use two lexers and a single parser. If you have your main lexer
> import the tokenVocab of your sub-lexer and the parser import the
> tokenVocab of the main lexer then you should be able to have a
> combined token vocabulary. You could have the WITH_SYNTAX block emit
> all the tokens of the sub-block using multiple emits, but it's
> probably better to override nextToken() and have it handle the
> sub-lexer.
> So you would have something like (untested):
> @members {
>   WithSyntaxLexer withSyntaxLexer = null;
>   boolean inWithSyntaxBlock = false;
>   void enterWithSyntaxBlock() {
>       if ( withSyntaxLexer == null )
>         withSyntaxLexer = new WithSyntaxLexer(input, state);
>       inWithSyntaxBlock = true;
>   }
> 
>   public Token nextToken() {
>     while (true) {
>       state.token = null;
>       state.channel = Token.DEFAULT_CHANNEL;
>       state.tokenStartCharIndex = input.index();
>       state.tokenStartCharPositionInLine = input.getCharPositionInLine();
>       state.tokenStartLine = input.getLine();
>       state.text = null;
>       if ( input.LA(1)==CharStream.EOF ) {
>         return Token.EOF_TOKEN;
>       }
>       try {
>         // CHANGES HERE
>         if ( inWithSyntaxBlock ) {
>           // Lex from withSyntaxLexer instead
>           withSyntaxLexer.mTokens();
>           if ( state.type == WITH_SYNTAX_END )
>             inWithSyntaxBlock = false; // Switch back to main lexer for next token
>         } else
>           mTokens();
>         // END CHANGES
>         if ( state.token==null ) {
>           emit();
>         }
>         else if ( state.token==Token.SKIP_TOKEN ) {
>           continue;
>         }
>         return state.token;
>       }
>       catch (NoViableAltException nva) {
>         reportError(nva);
>         recover(nva); // throw out current char and try again
>       }
>       catch (RecognitionException re) {
>         reportError(re);
>         // match() routine has already called recover()
>       }
>     }
>   }
> }
> Where WITH_SYNTAX_END is the token for the final '}' in the with
> syntax block. By sharing the input stream and state object the two
> lexers should keep in step and it's just a matter of calling mTokens
> from whichever lexer should be in control.
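
The mode-switching idea in option 1 can be tried out without the ANTLR runtime. The following is only a toy sketch under stated assumptions: the Cursor class, the two scan methods and the token names (WORD, NUM, OPEN, CLOSE) are all hypothetical and stand in for the shared input stream/state, the two generated lexers and the real token vocabulary. It shows one driver loop picking which scanner consumes the next token from a single shared position.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: none of these names come from ANTLR or from the grammars
// discussed in this thread.
class TwoLexerSketch {
    /** Shared position; plays the role of the shared input stream and state. */
    static final class Cursor {
        final String text;
        int pos = 0;
        Cursor(String text) { this.text = text; }
    }

    /** Main scanner: runs of letters are WORD tokens, '{' opens the block. */
    static String scanOuter(Cursor c) {
        char ch = c.text.charAt(c.pos);
        if (ch == '{') { c.pos++; return "OPEN"; }
        if (Character.isLetter(ch)) {
            int start = c.pos;
            while (c.pos < c.text.length() && Character.isLetter(c.text.charAt(c.pos))) c.pos++;
            return "WORD(" + c.text.substring(start, c.pos) + ")";
        }
        c.pos++;     // anything else is skipped, like a SKIP token
        return null;
    }

    /** Block scanner: runs of digits are NUM tokens, '}' closes the block. */
    static String scanInner(Cursor c) {
        char ch = c.text.charAt(c.pos);
        if (ch == '}') { c.pos++; return "CLOSE"; }
        if (Character.isDigit(ch)) {
            int start = c.pos;
            while (c.pos < c.text.length() && Character.isDigit(c.text.charAt(c.pos))) c.pos++;
            return "NUM(" + c.text.substring(start, c.pos) + ")";
        }
        c.pos++;     // anything else is skipped
        return null;
    }

    /** Driver: chooses the scanner per token, as the overridden nextToken() would. */
    static List<String> tokenize(String text) {
        Cursor c = new Cursor(text);
        List<String> out = new ArrayList<>();
        boolean inBlock = false;
        while (c.pos < text.length()) {
            String tok = inBlock ? scanInner(c) : scanOuter(c);
            if (tok == null) continue;                      // skipped input, try again
            if (tok.equals("OPEN")) inBlock = true;         // hand over to block scanner
            else if (tok.equals("CLOSE")) inBlock = false;  // hand back to main scanner
            out.add(tok);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("ab {12 3} cd"));
    }
}
```

Because both scanners advance the same Cursor, neither needs to know where the other stopped, which is the property the shared input and state object are meant to provide in the real lexers.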

That's pretty interesting. I think I'll have to go along with that sort 
of thing, but testing is going to be a pain - my test files don't use 
this language feature, but do use an obsolete language feature that I'll 
be mapping to the same outcome as this later in the AST. I guess I'll 
just concoct a suitable test file.

> 2) Have two lexers and two parsers. Here the main lexer would return
> a single token for the whole "WITH SYNTAX { ... }" block and then have
> the parser invoke a new lexer/parser on the text of that token and
> insert the resulting tree into the main tree. Assuming the end marker
> of the block is easy enough to detect this shouldn't get too messy.
> You will probably need to handle nested "{...}" blocks (if these are
> allowed in the block) and the presence of '}' in strings and comments.
> See the handling of action code blocks in the ANTLR grammar for an
> example of this. The main rules are:

Sounds more complicated, really.
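
For what it's worth, the end-marker detection that option 2 depends on might look roughly like the sketch below. It is not the rule set from the ANTLR grammar referred to above, and its lexical conventions are assumptions made purely for illustration: double-quoted strings with backslash escapes and "//" comments running to end of line. The class and method names are hypothetical too.

```java
// Illustrative sketch only; the string and comment conventions are assumed,
// not taken from the language being parsed in this thread.
class BlockEndSketch {
    /**
     * Returns the index of the '}' matching the '{' at openIndex,
     * or -1 if the block is never closed.
     */
    static int findBlockEnd(String src, int openIndex) {
        if (openIndex < 0 || openIndex >= src.length() || src.charAt(openIndex) != '{') {
            throw new IllegalArgumentException("expected '{' at openIndex");
        }
        int depth = 0;
        int i = openIndex;
        while (i < src.length()) {
            char ch = src.charAt(i);
            if (ch == '"') {
                // Skip a string literal so braces inside it are ignored.
                i++;
                while (i < src.length() && src.charAt(i) != '"') {
                    if (src.charAt(i) == '\\') i++;   // skip the escaped character
                    i++;
                }
                i++;                                   // step past the closing quote
            } else if (ch == '/' && i + 1 < src.length() && src.charAt(i + 1) == '/') {
                // Skip a line comment so braces inside it are ignored.
                while (i < src.length() && src.charAt(i) != '\n') i++;
            } else {
                if (ch == '{') {
                    depth++;
                } else if (ch == '}') {
                    depth--;
                    if (depth == 0) return i;          // matching close brace found
                }
                i++;
            }
        }
        return -1;                                     // unterminated block
    }

    public static void main(String[] args) {
        String src = "WITH SYNTAX { a { b } \"}\" }";
        System.out.println(findBlockEnd(src, src.indexOf('{')));
    }
}
```

The text between the two indices is what would be handed to the second lexer/parser pair, so any mismatch between these skipping rules and the inner lexer's own rules would cut the block in the wrong place.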

Back to the first one - my current lexer/parser is a combined grammar. 
How do I make it import a tokenVocab but still add things to the final 
token vocabulary itself?

Sam (BC)