[antlr-interest] ENHANCEMENT - Have "lexer grammar" generate recognition for string literals in tokenVocab

Wed Oct 10 04:34:05 PDT 2007

Gavin,

Let's try this again. I'm pretty sure there's no cycle.

Let's say my DESIRED lexer looks like this:

lexer grammar Lex;
options {tokenVocab = Parser;}
ID : ('a'..'z')+ ;
WS : (' '|'\t'|'\n')+ {skip();} ;

Further, my DESIRED parser looks like this:

parser grammar Parser;
options {output=AST;}    // Note: no tokenVocab here - this is the source of tokens

file: (decl | call )*;

decl: 'int' ID ';' ;

call : ID '(' args? ')' ';'
     ;

args: ID ( ',' ID )* ;

My DESIRED AST parser looks like this:

tree grammar AST;
options {tokenVocab = Parser;}
tokens { CALL; DECLARE; ARGS; }

file: (decl|call)* ;
decl: 'int' ID ';'
    -> ^(DECLARE ID)
    ;
call: ID '(' args? ')' ';'
    -> ^(CALL ID args?)
    ;
args: a+=ID (',' a+=ID)*
    -> ^(ARGS $a+)
    ;

Notice that the lexer takes tokens from the parser. The AST parser takes 
tokens from the parser, then adds a few of its own. The parser takes 
tokens from nobody. So the tsort is going to be Parser first, then 
lexer/ast in any order. What I'm proposing is that the parser would emit 
lines like

ID=7
'int'=3

in its tokens file (it already does). Currently, the lexer can read this 
with no problems, but then it doesn't do anything with the 'int'=3 
tokens - imaginary token ids auto-generated in the parser.
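
To make the ordering argument concrete, here is a minimal sketch of the tsort in Python. The dependency table is my own illustration (it is not something ANTLR reads); each grammar maps to the grammars whose tokens file it imports via tokenVocab.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency table: grammar -> grammars whose tokens it imports.
token_vocab_deps = {
    "Parser": set(),      # source of tokens, imports nothing
    "Lex": {"Parser"},    # options {tokenVocab = Parser;}
    "AST": {"Parser"},    # options {tokenVocab = Parser;}
}

# static_order() yields every grammar after the grammars it depends on,
# and raises graphlib.CycleError if the dependencies form a cycle.
build_order = list(TopologicalSorter(token_vocab_deps).static_order())
print(build_order)
```

Parser always comes out first; Lex and AST may follow in either order, which is the point: nothing in this layout forces a cycle.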

What I want is for the lexer to automatically do what it ALREADY does in 
combined mode - generate rules for recognizing the 'int'=3 tokens, and 
return them to the parser.
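
As a rough sketch of what I am asking for, written as a Python text transformation over the tokens file: split the entries into named tokens and quoted literals, then emit one lexer rule per literal. The function names and the T<n> rule names are placeholders of mine, not what ANTLR generates internally, and the parsing assumes one name=number entry per line.

```python
def parse_tokens_file(text):
    """Split tokens-file entries into named tokens and quoted literals."""
    named, literals = {}, {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last '=' so a literal such as '=' still parses.
        name, _, type_id = line.rpartition("=")
        name = name.strip()
        # Tolerate optional spaces and a trailing ';' after the number.
        number = int(type_id.strip().rstrip(";").strip())
        target = literals if name.startswith("'") else named
        target[name] = number
    return named, literals


def literal_rules(literals):
    """Emit one placeholder lexer rule per literal, ordered by token type."""
    ordered = sorted(literals.items(), key=lambda item: item[1])
    return [f"T{number} : {literal} ;" for literal, number in ordered]


named, literals = parse_tokens_file("ID=7\n'int'=3\n")
print(named)
print(literal_rules(literals))
```

The named entries are the ones the lexer grammar must define itself (ID here); the literal entries are the ones I would like it to pick up for free.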

What is interesting is that this all happens smoothly in combined 
mode. If you hack at the separated parser/lexer, you can eventually 
get around the error messages and get both of them on the same page, 
token-list-wise. Even then, the lexer stubbornly refuses to generate 
the magic tokens unless it is part of a combined grammar.

I *can* get it to generate with a sed script, by taking the parser 
tokens file and generating a bogus rule inside the lexer grammar 
[[not_used: 'int' '(' ';' ')' ',' ID ; ]] and then running the lexer 
grammar as a combined grammar. But I'm not confident that the generated 
output is going to be valid, and I don't know where to begin testing it. :(
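
For reference, the workaround amounts to something like the following, shown in Python rather than sed. This is a sketch under my own assumptions: the function name is made up, it expects one name=number entry per line, and it lists only the quoted literals in the bogus rule (the rule in my actual hack also mentions ID).

```python
def bogus_rule(tokens_text, rule_name="not_used"):
    """Build a throwaway parser rule that mentions every quoted literal."""
    literals = []
    for line in tokens_text.splitlines():
        # Everything before the last '=' is the token name or literal.
        name = line.rpartition("=")[0].strip()
        if name.startswith("'"):
            literals.append(name)
    return f"{rule_name} : {' '.join(literals)} ;"


sample = "ID=7\n'int'=3\n'('=4\n';'=5\n"
print(bogus_rule(sample))
```

Pasting the resulting rule into the lexer grammar and running it as a combined grammar is what triggers the literal rules; whether the generated token types then match the parser's is exactly the part I have not verified.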

=Austin

Gavin Lambert wrote:
> Mere moments ago, I wrote:
> >What you'd need to be able to do to resolve this is to build an
> >initial lexer ignoring the vocab, then build the parser and figure
> >out its tokens, then go back and build the lexer again, inserting
> >the new tokens, and finally build the parser yet again since the
>
> Gah, sorry, accidentally pressed the Send key sequence.  Anyway:
>
> ... finally build the parser yet again since the lexer may have 
> changed its token ids around.
>
> All of which is quite a mess, and is what combined grammars do for you 
> anyway.
>
> >The multiple parsers problem just doesn't happen - at least for me
> >- because there's one syntax parser and the rest are tree parsers.
> >The tree parsers depend on the output (AST) of the syntax parser,
> >so basically everybody wants to use the tokenVocab from the
> >syntax parser, instead of the lexer.
>
> But you can already do that.  And if you don't have multiple syntax 
> parsers then I don't see why you're not using a combined grammar 
> anyway, since that would do everything you seem to want to do.
>