[antlr-interest] ENHANCEMENT - Have "lexer grammar" generate recognition for string literals in tokenVocab
Austin Hastings
Austin_Hastings at Yahoo.com
Wed Oct 10 04:34:05 PDT 2007
Gavin,
Let's try this again. I'm pretty sure there's no cycle.
Let's say my DESIRED lexer looks like this:
lexer grammar Lex;
options {tokenVocab = Parser;}
ID : ('a'..'z')+ ;
WS: (' '|'\t'|'\n')+ {skip();} ;
Further, my DESIRED parser looks like this:
parser grammar Parser;
options {output=AST;} // Note: no tokenVocab here - this is the source of tokens
file: (decl | call )*;
decl: 'int' ID ';' ;
call : ID '(' args? ')' ';'
;
args: ID ( ',' ID )* ;
My DESIRED AST parser looks like this:
tree grammar AST;
options {tokenVocab = Parser;}
tokens { CALL; DECLARE; ARGS; }
file: (decl|call)* ;
decl: 'int' ID ';'
-> ^(DECLARE ID)
;
call: ID '(' args? ')' ';'
-> ^(CALL ID args)
;
args: a+=ID (',' a+=ID)*
-> ^(ARGS $a+)
;
Notice that the lexer takes tokens from the parser. The AST parser takes
tokens from the parser, then adds a few of its own. The parser takes
tokens from nobody. So the tsort is going to be Parser first, then
lexer/ast in any order. What I'm proposing is that the parser would emit
lines like
ID = 7;
'int' = 3;
in its tokens file (it already does). Currently, the lexer can read this
with no problems, but then it doesn't do anything with the 'int'=3
tokens - imaginary token ids auto-generated in the parser.
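To make the ordering claim concrete, here is a minimal sketch of the dependency sort in Python. It is only an illustration, not anything ANTLR does internally: the `deps` table is a hypothetical hand-written description of which grammar imports whose tokenVocab, and it assumes Python 3.9+ for `graphlib`.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency table: each grammar maps to the set of grammars
# whose tokens file it imports through tokenVocab.
deps = {
    "Parser": set(),     # takes tokens from nobody
    "Lex": {"Parser"},   # lexer grammar Lex; options {tokenVocab = Parser;}
    "AST": {"Parser"},   # tree grammar AST; options {tokenVocab = Parser;}
}

# static_order() yields each grammar only after everything it depends on.
# It would raise graphlib.CycleError if the dependencies formed a cycle.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

Parser always comes out first; the relative order of Lex and AST is unspecified, which matches "lexer/ast in any order".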
What I want is for the lexer to automatically do what it ALREADY does in
combined mode - generate rules for recognizing the 'int'=3 tokens, and
return them to the parser.
What is interesting is that this all happens smoothly in combined
mode. If you hack at the separated parser/lexer, you can eventually
get around the error messages and get both of them on the same page,
token-list-wise. Even then, the lexer stubbornly refuses to generate
rules for the magic tokens if it isn't a combined grammar.
I *can* get it to generate with a sed script, by taking the parser
tokens file and generating a bogus rule inside the lexer grammar
[[not_used: 'int' '(' ';' ')' ',' ID ; ]] and then running the lexer
grammar as a combined grammar. But I'm not confident that the generated
output is going to be valid, and I don't know where to begin testing it. :(
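For illustration, the workaround can be sketched in Python rather than sed. This is a rough sketch under assumptions, not the actual script: it assumes each line of the tokens file has the form `NAME=number` or `'literal'=number` (tolerating optional spaces and a trailing semicolon), and the helper names `literal_tokens` and `bogus_rule` are made up for the example. Unlike the rule shown above, it emits only the quoted literals and leaves ID out.

```python
import re

# Assumed line shape: 'literal'=N, with optional spaces and trailing ';'.
LITERAL_LINE = re.compile(r"^\s*('(?:\\.|[^'\\])*')\s*=\s*(\d+)\s*;?\s*$")

def literal_tokens(tokens_text):
    """Return the quoted literals from a tokens file, ordered by token id."""
    found = []
    for line in tokens_text.splitlines():
        m = LITERAL_LINE.match(line)
        if m:  # named tokens such as ID=7 do not match and are skipped
            found.append((int(m.group(2)), m.group(1)))
    return [literal for _, literal in sorted(found)]

def bogus_rule(literals, name="not_used"):
    """Build a throwaway parser rule that mentions every literal once."""
    return f"{name}: {' '.join(literals)} ;"

sample = "ID=7\n'int'=3\n';'=4\n"
print(bogus_rule(literal_tokens(sample)))
```

The generated line would then be appended to the lexer grammar before running it as a combined grammar; whether the resulting lexer is valid is exactly the open question above.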
=Austin
Gavin Lambert wrote:
> Mere moments ago, I wrote:
> >What you'd need to be able to do to resolve this is to build an
> >initial lexer ignoring the vocab, then build the parser and figure
> >out its tokens, then go back and build the lexer again, inserting
> >the new tokens, and finally build the parser yet again since the
>
> Gah, sorry, accidentally pressed the Send key sequence. Anyway:
>
> ... finally build the parser yet again since the lexer may have
> changed its token ids around.
>
> All of which is quite a mess, and is what combined grammars do for you
> anyway.
>
> >The multiple parsers problem just doesn't happen - at least for me
> >- because there's one syntax parser and the rest are tree parsers.
> >The tree parsers depend on the output (AST) of the syntax parser,
> >so basically everybody wants to use the tokenVocab from the
> >syntax parser, instead of the lexer.
>
> But you can already do that. And if you don't have multiple syntax
> parsers then I don't see why you're not using a combined grammar
> anyway, since that would do everything you seem to want to do.
>
>
>