[antlr-interest] Stupid languages, and parsing them

Sat Apr 11 20:39:33 PDT 2009

On Sun, Apr 12, 2009 at 6:01 AM, Sam Barnett-Cormack
<s.barnett-cormack at lancaster.ac.uk> wrote:
> I'm not sure an island grammar would work, as I need the eventual AST of the
> "WITH SYNTAX" block to be included in the final AST of the master grammar.
>
> Unless, that is, I can invoke a full lexer/parser combination, get the tree
> out of it, and somehow have the lexer pass that tree into the token stream
> (which sounds wacky) and have the parser pull in the whole tree. That would
> be, perhaps, painful. Or, I suppose, with a custom token type it might be
> possible to wrap up the whole token stream from the inner lexer in a single
> token, and use a parse-only island grammar from the parser to handle that
> and accept the resulting AST and integrate it. I've just no idea how to
> start doing either of those things. I'll do some reading and prodding, but
> if anyone can give pointers I'd be grateful - being able to do at least the
> lexing separately (parsing isn't a bother to do in the main parser) would be
> good, and the code to emit multiple tokens looks scary. That said, I guess I
> could use an island lexer, and use multiple token emit to emit all of the
> tokens from the island in order. I just have to make sure that the two share
> token definitions, so I'd probably have to do something odd... and I have no
> idea how to make two lexers share a portion of token vocabulary without
> sharing the rules for those tokens.
>
> Wow, that was rambling... if anyone manages to fight through that and then
> come up with some useful advice (kudos to you if you can), it'd be
> appreciated.
>
Ah yes, the island-grammar example doesn't show integration of the
ASTs. Options would be:
1) Use two lexers and a single parser. If your main lexer imports the
tokenVocab of your sub-lexer and the parser imports the tokenVocab of
the main lexer, then you should end up with a combined token
vocabulary. You could have the WITH_SYNTAX block emit all the tokens
of the sub-block using multiple emits, but it's probably better to
override nextToken() and have it delegate to the sub-lexer while
inside the block.
So you would have something like (untested):
@members {
  WithSyntaxLexer withSyntaxLexer = null;
  boolean inWithSyntaxBlock = false;
  void enterWithSyntaxBlock() {
      if ( withSyntaxLexer == null )
        withSyntaxLexer = new WithSyntaxLexer(input, state);
      inWithSyntaxBlock = true;
  }

  public Token nextToken() {
    while (true) {
      state.token = null;
      state.channel = Token.DEFAULT_CHANNEL;
      state.tokenStartCharIndex = input.index();
      state.tokenStartCharPositionInLine = input.getCharPositionInLine();
      state.tokenStartLine = input.getLine();
      state.text = null;
      if ( input.LA(1)==CharStream.EOF ) {
        return Token.EOF_TOKEN;
      }
      try {
        // CHANGES HERE
        if ( inWithSyntaxBlock ) {
          // Lex from withSyntaxLexer instead
          withSyntaxLexer.mTokens();
          if ( state.type == WITH_SYNTAX_END )
            inWithSyntaxBlock = false; // Switch back to main lexer for next token
        } else
          mTokens();
        // END CHANGES
        if ( state.token==null ) {
          emit();
        }
        else if ( state.token==Token.SKIP_TOKEN ) {
          continue;
        }
        return state.token;
      }
      catch (NoViableAltException nva) {
        reportError(nva);
        recover(nva); // throw out current char and try again
      }
      catch (RecognitionException re) {
        reportError(re);
        // match() routine has already called recover()
      }
    }
  }
}
Where WITH_SYNTAX_END is the token for the final '}' in the with
syntax block. By sharing the input stream and state object the two
lexers should keep in step and it's just a matter of calling mTokens
from whichever lexer should be in control.
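
The hand-over itself can be tried out without ANTLR. The following is
only a toy model in plain Java, not generated lexer code: the Cursor
class stands in for the shared input stream and state object, and all
names in it (ModeSwitchDemo, Cursor, mainToken, islandToken, the BEGIN
and END token names) are made up for the illustration. It shows why
two sets of rules reading through one shared position stay in step.

```java
import java.util.ArrayList;
import java.util.List;

final class ModeSwitchDemo {
    /** Shared cursor, standing in for the shared input stream and state object. */
    static final class Cursor {
        final String text;
        int pos = 0;
        Cursor(String text) { this.text = text; }
    }

    private final Cursor in;
    private boolean inIsland = false;

    ModeSwitchDemo(String text) { this.in = new Cursor(text); }

    /** Returns the next token as a string, or null at end of input. */
    String nextToken() {
        while (in.pos < in.text.length()
                && Character.isWhitespace(in.text.charAt(in.pos))) {
            in.pos++;
        }
        if (in.pos >= in.text.length()) return null;
        if (inIsland) {
            String t = islandToken(in);
            if (t.equals("END")) inIsland = false; // main rules serve the next call
            return t;
        }
        String t = mainToken(in);
        if (t.equals("BEGIN")) inIsland = true; // island rules serve the next call
        return t;
    }

    /** Main rules: '{' opens the island, anything else is a word. */
    static String mainToken(Cursor c) {
        if (c.text.charAt(c.pos) == '{') { c.pos++; return "BEGIN"; }
        return "MAIN:" + word(c);
    }

    /** Island rules: '}' closes the island, anything else is a word. */
    static String islandToken(Cursor c) {
        if (c.text.charAt(c.pos) == '}') { c.pos++; return "END"; }
        return "ISLAND:" + word(c);
    }

    /** Consumes at least one character, then runs up to whitespace or a brace. */
    static String word(Cursor c) {
        int start = c.pos;
        c.pos++;
        while (c.pos < c.text.length()) {
            char ch = c.text.charAt(c.pos);
            if (Character.isWhitespace(ch) || ch == '{' || ch == '}') break;
            c.pos++;
        }
        return c.text.substring(start, c.pos);
    }

    /** Collects every token of the input into a list. */
    static List<String> tokenize(String text) {
        ModeSwitchDemo lexer = new ModeSwitchDemo(text);
        List<String> out = new ArrayList<>();
        for (String t = lexer.nextToken(); t != null; t = lexer.nextToken()) {
            out.add(t);
        }
        return out;
    }
}
```

Calling ModeSwitchDemo.tokenize("a { b c } d") yields the main-rule
tokens for a and d and the island-rule tokens for b and c, with the
switch happening on the braces. Neither set of rules needs to know
where the other stopped, because both advance the same cursor.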

2) Have two lexers and two parsers. Here the main lexer would return
a single token for the whole "WITH SYNTAX { ... }" block, and the
parser would then invoke a new lexer/parser on the text of that token
and insert the resulting tree into the main tree. Assuming the end
marker of the block is easy enough to detect, this shouldn't get too
messy.
You will probably need to handle nested "{...}" blocks (if these are
allowed in the block) and the presence of '}' in strings and comments.
See the handling of action code blocks in the ANTLR grammar for an
example of this. The main rules are:
ACTION
	:	NESTED_ACTION ( '?' {$type = SEMPRED;} )?
	;

fragment
NESTED_ACTION :
	'{'
	(	options {greedy=false; k=3;}
	:	NESTED_ACTION
	|	SL_COMMENT
	|	ML_COMMENT
	|	ACTION_STRING_LITERAL
	|	ACTION_CHAR_LITERAL
	|	.
	)*
	'}'
   ;
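
To see what those rules have to get right, the same scan can be
written by hand. This is a rough plain-Java sketch, not ANTLR output:
the names BlockScanner, findBlockEnd and skipQuoted are invented here,
and it assumes // and /* */ comments and backslash escapes inside
quotes, as in ANTLR action code. The language you are actually
parsing may well use different comment and string syntax.

```java
final class BlockScanner {
    /**
     * Returns the index just past the '}' that closes the '{' at start,
     * or -1 if the block is not terminated. Braces inside comments and
     * quoted literals are ignored; nested blocks are counted.
     */
    static int findBlockEnd(String src, int start) {
        if (start >= src.length() || src.charAt(start) != '{') {
            throw new IllegalArgumentException("expected '{' at " + start);
        }
        int depth = 0;
        int i = start;
        while (i < src.length()) {
            char c = src.charAt(i);
            boolean hasNext = i + 1 < src.length();
            if (c == '/' && hasNext && src.charAt(i + 1) == '/') {
                // Single-line comment: skip to the end of the line.
                int nl = src.indexOf('\n', i);
                if (nl < 0) return -1;
                i = nl + 1;
            } else if (c == '/' && hasNext && src.charAt(i + 1) == '*') {
                // Multi-line comment: skip past the closing "*/".
                int end = src.indexOf("*/", i + 2);
                if (end < 0) return -1;
                i = end + 2;
            } else if (c == '"' || c == '\'') {
                // String or character literal: skip past the closing quote.
                i = skipQuoted(src, i);
                if (i < 0) return -1;
            } else {
                if (c == '{') {
                    depth++;
                } else if (c == '}') {
                    depth--;
                    if (depth == 0) return i + 1;
                }
                i++;
            }
        }
        return -1;
    }

    /** Returns the index just past the closing quote, or -1 if unterminated. */
    static int skipQuoted(String src, int open) {
        char quote = src.charAt(open);
        int i = open + 1;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (c == '\\') {
                i += 2; // skip the escaped character
            } else if (c == quote) {
                return i + 1;
            } else {
                i++;
            }
        }
        return -1;
    }
}
```

Given the index of the opening brace, src.substring(start,
findBlockEnd(src, start)) is then the text that would go into the
single block token and be handed to the second lexer/parser.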

Tom.

> Sam (BC)
>
>