[antlr-interest] Stupid languages, and parsing them
Sam Barnett-Cormack
s.barnett-cormack at lancaster.ac.uk
Sun Apr 12 06:29:16 PDT 2009
Thomas Brandon wrote:
> On Sun, Apr 12, 2009 at 6:01 AM, Sam Barnett-Cormack
> <s.barnett-cormack at lancaster.ac.uk> wrote:
>> I'm not sure an island grammar would work, as I need the eventual AST of the
>> "WITH SYNTAX" block to be included in the final AST of the master grammar.
>>
>> Unless, that is, I can invoke a full lexer/parser combination, get the tree
>> out of it, and somehow have the lexer pass that tree into the token stream
>> (which sounds wacky) and have the parser pull in the whole tree. That would
>> be, perhaps, painful. Or, I suppose, with a custom token type it might be
>> possible to wrap up the whole token stream from the inner lexer in a single
>> token, and use a parse-only island grammar from the parser to handle that
>> and accept the resulting AST and integrate it. I've just no idea how to
>> start doing either of those things. I'll do some reading and prodding, but
>> if anyone can give pointers I'd be grateful - being able to do at least the
>> lexing separately (parsing isn't a bother to do in the main parser) would be
>> good, and the code to emit multiple tokens looks scary. That said, I guess I
>> could use an island lexer, and use multiple token emit to emit all of the
>> tokens from the island in order. I just have to make sure that the two share
>> token definitions, so I'd probably have to do something odd... and I have no
>> idea how to make two lexers share a portion of token vocabulary without
>> sharing the rules for those tokens.
>>
>> Wow, that was rambling... if anyone manages to fight through that and then
>> come up with some useful advice (kudos to you if you can), it'd be
>> appreciated.
>>
> Ah yes, the island-grammar example doesn't show integration of the
> ASTs. Options would be:
> 1) Use two lexers and a single parser. If you have your main lexer
> import the tokenVocab of your sub-lexer and the parser import the
> tokenVocab of the main lexer then you should be able to have a
> combined token vocabulary. You could have the WITH_SYNTAX block emit
> all the tokens of the sub-block using multiple emits, but it's
> probably better to override nextToken() and have it delegate to the
> sub-lexer.
> So you would have something like (untested):
> @members {
>     WithSyntaxLexer withSyntaxLexer = null;
>     boolean inWithSyntaxBlock = false;
>
>     // Call this from the action of the rule that matches the start of the block
>     void enterWithSyntaxBlock() {
>         if ( withSyntaxLexer == null )
>             withSyntaxLexer = new WithSyntaxLexer(input, state);
>         inWithSyntaxBlock = true;
>     }
>
>     public Token nextToken() {
>         while (true) {
>             state.token = null;
>             state.channel = Token.DEFAULT_CHANNEL;
>             state.tokenStartCharIndex = input.index();
>             state.tokenStartCharPositionInLine = input.getCharPositionInLine();
>             state.tokenStartLine = input.getLine();
>             state.text = null;
>             if ( input.LA(1)==CharStream.EOF ) {
>                 return Token.EOF_TOKEN;
>             }
>             try {
>                 // CHANGES HERE
>                 if ( inWithSyntaxBlock ) {
>                     // Lex from withSyntaxLexer instead
>                     withSyntaxLexer.mTokens();
>                     if ( state.type == WITH_SYNTAX_END )
>                         inWithSyntaxBlock = false; // Switch back to main lexer for next token
>                 } else
>                     mTokens();
>                 // END CHANGES
>                 if ( state.token==null ) {
>                     emit();
>                 }
>                 else if ( state.token==Token.SKIP_TOKEN ) {
>                     continue;
>                 }
>                 return state.token;
>             }
>             catch (NoViableAltException nva) {
>                 reportError(nva);
>                 recover(nva); // throw out current char and try again
>             }
>             catch (RecognitionException re) {
>                 reportError(re);
>                 // match() routine has already called recover()
>             }
>         }
>     }
> }
> Where WITH_SYNTAX_END is the token for the final '}' in the with
> syntax block. By sharing the input stream and state object the two
> lexers should keep in step and it's just a matter of calling mTokens
> from whichever lexer should be in control.
That's pretty interesting. I think I'll have to go along with that sort
of thing, but testing is going to be a pain - my test files don't use
this language feature, but do use an obsolete language feature that I'll
be mapping to the same outcome as this later in the AST. I guess I'll
just concoct a suitable test file.
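To convince myself the hand-off works, here's a toy version in plain Java
with no ANTLR classes at all. Everything in it (ModeSwitchSketch, Cursor,
the token names) is invented for illustration, and it switches on a bare
'{' rather than on WITH SYNTAX. The only point is that both sub-lexers
advance one shared position, much as the two real lexers would share input
and state.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration only: hypothetical names, not ANTLR API.
final class ModeSwitchSketch {

    // Shared cursor: both sub-lexers read from and advance the same position.
    static final class Cursor {
        final String text;
        int pos = 0;
        Cursor(String text) { this.text = text; }
        boolean atEnd() { return pos >= text.length(); }
    }

    // Picks which sub-lexer produces the next token, based on a mode flag.
    static List<String> tokenize(String text) {
        Cursor c = new Cursor(text);
        List<String> out = new ArrayList<>();
        boolean inBlock = false;
        while (true) {
            while (!c.atEnd() && Character.isWhitespace(c.text.charAt(c.pos))) c.pos++;
            if (c.atEnd()) break;
            if (inBlock) {
                String tok = nextBlockToken(c);
                out.add(tok);
                if (tok.equals("BLOCK_END")) inBlock = false; // back to main lexer
            } else {
                String tok = nextMainToken(c);
                out.add(tok);
                if (tok.equals("BLOCK_START")) inBlock = true; // hand over to block lexer
            }
        }
        return out;
    }

    // "Main" lexer: knows about '{' and plain words.
    static String nextMainToken(Cursor c) {
        if (c.text.charAt(c.pos) == '{') { c.pos++; return "BLOCK_START"; }
        return "WORD(" + readRun(c) + ")";
    }

    // "Block" lexer: same input, different rules ('}', &references, literals).
    static String nextBlockToken(Cursor c) {
        char ch = c.text.charAt(c.pos);
        if (ch == '}') { c.pos++; return "BLOCK_END"; }
        if (ch == '&') { c.pos++; return "REF(" + readRun(c) + ")"; }
        return "LITERAL(" + readRun(c) + ")";
    }

    // Consumes at least one character (if any is left), then the rest of the run.
    static String readRun(Cursor c) {
        int start = c.pos;
        if (!c.atEnd()) c.pos++;
        while (!c.atEnd()
                && !Character.isWhitespace(c.text.charAt(c.pos))
                && "{}".indexOf(c.text.charAt(c.pos)) < 0) {
            c.pos++;
        }
        return c.text.substring(start, c.pos);
    }
}
```

Feeding it "X WITH SYNTAX { ID &id } Y" gives three WORD tokens, then
BLOCK_START, LITERAL(ID), REF(id), BLOCK_END, and finally WORD(Y) from the
main rules again.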
> 2) Have two lexers and two parsers. Here the main lexer would return
> a single token for the whole "WITH SYNTAX { ... }" block and then have
> the parser invoke a new lexer/parser on the text of that token and
> insert the resulting tree into the main tree. Assuming the end marker
> of the block is easy enough to detect this shouldn't get too messy.
> You will probably need to handle nested "{...}" blocks (if these are
> allowed in the block) and the presence of '}' in strings and comments.
> See the handling of action code blocks in the ANTLR grammar for an
> example of this. The main rules are:
Sounds more complicated, really.
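That said, finding the end of the block on its own doesn't look too bad.
A minimal sketch in plain Java, nothing ANTLR-specific: findBlockEnd and
its helpers are names I've made up, and I'm assuming ASN.1-style lexical
rules, i.e. double-quoted strings where "" is an embedded quote, and
comments that start with -- and end at the next -- or at end of line.

```java
// Hypothetical helper, not part of ANTLR: locates the '}' that closes a block.
final class BlockEndSketch {

    /** Returns the index of the '}' matching the '{' at openIndex, or -1 if none. */
    static int findBlockEnd(String s, int openIndex) {
        if (openIndex < 0 || openIndex >= s.length() || s.charAt(openIndex) != '{') {
            throw new IllegalArgumentException("openIndex must point at '{'");
        }
        int depth = 0;
        int i = openIndex;
        while (i < s.length()) {
            char ch = s.charAt(i);
            if (ch == '"') {
                i = skipString(s, i);            // braces inside strings don't count
            } else if (ch == '-' && i + 1 < s.length() && s.charAt(i + 1) == '-') {
                i = skipComment(s, i);           // nor do braces inside comments
            } else {
                if (ch == '{') {
                    depth++;
                } else if (ch == '}' && --depth == 0) {
                    return i;                    // closed the outermost block
                }
                i++;
            }
        }
        return -1;                               // ran out of input: unbalanced
    }

    // Returns the index just past the closing quote (or s.length() if unterminated).
    static int skipString(String s, int i) {
        i++; // opening quote
        while (i < s.length()) {
            if (s.charAt(i) == '"') {
                if (i + 1 < s.length() && s.charAt(i + 1) == '"') {
                    i += 2;                      // doubled quote is an embedded quote
                    continue;
                }
                return i + 1;
            }
            i++;
        }
        return i;
    }

    // Returns the index just past the comment terminator (newline or second "--").
    static int skipComment(String s, int i) {
        i += 2; // opening "--"
        while (i < s.length()) {
            char ch = s.charAt(i);
            if (ch == '\n') return i + 1;
            if (ch == '-' && i + 1 < s.length() && s.charAt(i + 1) == '-') return i + 2;
            i++;
        }
        return i;
    }
}
```

The main lexer would then emit everything from the '{' up to the returned
index as one token, and the inner lexer/parser would run over that text.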
Back to the first one - my current lexer/parser is a combined grammar.
How do I make it import a tokenVocab but still add its own tokens to the
final token vocabulary?
Sam (BC)