[antlr-interest] Help needed upgrading java.g to support Generics

Thu Mar 13 08:49:18 PST 2003

I'm not sure that's the best approach.  I haven't thought it through but it
seems like it would work in the LR world but not in the LL world.  I would
suggest trying this instead:

1. Eliminate ">>", ">>=", ">>>", and ">>>=" as tokens, make them all ">".
Then make parser rules sr: ">" ">" and zr:">" ">" ">".  Modify grammar to
use grammar rules instead of the tokens for those operators.

2. Compile, inspect and test.  Syntactic predicates may be necessary and may
need to be manually hoisted.

3. If that works then add in your generic stuff and test it out.  Only use
">" for your generics, don't use sr or zr.

4. There might be a better approach than this.  Can generics be initialized?
Then you have to worry about ">>=" as well.
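
A rough sketch of suggestion 1, written as plain Java rather than ANTLR
grammar syntax, might look like the following. The class and method names
are made up for illustration and are not part of java.g. The adjacency
check is an added assumption: the sr and zr rules exactly as written above
would also accept "> >" with whitespace in between, which may or may not
matter to you.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of suggestion 1; none of these names come from java.g.
class SplitShiftDemo {
    // A token plus its start offset, so the parser can test adjacency.
    static final class Tok {
        final String text;
        final int offset;
        Tok(String text, int offset) { this.text = text; this.offset = offset; }
    }

    // Lexer: every '>' is its own token; no ">>" or ">>>" tokens exist.
    static List<Tok> tokenize(String input) {
        List<Tok> out = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) { i++; continue; }
            int start = i;
            if (Character.isJavaIdentifierStart(c)) {
                while (i < input.length()
                        && Character.isJavaIdentifierPart(input.charAt(i))) {
                    i++;
                }
            } else {
                i++; // any other character is a one-character token, including '>'
            }
            out.add(new Tok(input.substring(start, i), start));
        }
        return out;
    }

    // Parser-side stand-in for the sr and zr rules: counts the '>' tokens
    // starting at pos that directly touch each other (2 = sr, 3 = zr).
    static int adjacentGtRun(List<Tok> toks, int pos) {
        int n = 0;
        while (pos + n < toks.size()
                && toks.get(pos + n).text.equals(">")
                && (n == 0
                    || toks.get(pos + n).offset == toks.get(pos + n - 1).offset + 1)) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(adjacentGtRun(tokenize("a >> b"), 1));  // 2
        System.out.println(adjacentGtRun(tokenize("a > > b"), 1)); // 1
        System.out.println(adjacentGtRun(tokenize("x >>> 1"), 1)); // 3
    }
}
```

With a lexer like this, the generics rules only ever see single ">" tokens,
and the expression rules ask for a run of two or three when they want a
shift operator.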

Email me privately if you would like to discuss this over the phone.

Monty

-----Original Message-----
From: Matt Quail [mailto:matt at cortexebusiness.com.au]
Sent: Wednesday, March 12, 2003 7:20 PM
To: antlr-interest at yahoogroups.com
Subject: [antlr-interest] Help needed upgrading java.g to support
Generics

Hi all,

I'm trying to update the java.g grammar with support for Generics (as defined
by JSR14, grab the pdf spec at
http://www.jcp.org/aboutJava/communityprocess/review/jsr014/index.html ). My
intent is to upgrade the grammar and submit a patch back to the "official"
java.g; so any help will hopefully help us all.

The MAJOR problem is that JDK1.5 will allow this:

List<List<String>> x = ...;
                ^^
The problem is that the lexer will match ">>" as a shift-right token, but we
really want to parse it as two GT tokens in this context. The JSR pdf has a BNF
grammar that solves this problem, and it is that pattern that I am trying to
implement in ANTLR. (A re-cap of this trick is given at the end of the email.)

(Note that there is also a problem lexing ">>>", but let's just confine
ourselves to ">>" for the moment.)
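
To make the lexing problem concrete, here is a minimal sketch of a
longest-match tokenizer in plain Java. It is only a stand-in for the real
JavaLexer (the class name and the tiny operator table are assumptions made
for this example), but it shows the same behaviour: the two closing
brackets come out as one ">>" token unless they are separated by a space.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; not part of java.g or the ANTLR runtime.
class MaximalMunchDemo {
    // Operators ordered longest first, so the longest match wins.
    private static final String[] OPERATORS = {">>>", ">>", ">", "<", ",", ";"};

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
                continue;
            }
            if (Character.isJavaIdentifierStart(c)) {
                int start = i;
                while (i < input.length()
                        && Character.isJavaIdentifierPart(input.charAt(i))) {
                    i++;
                }
                tokens.add(input.substring(start, i));
                continue;
            }
            String matched = null;
            for (String op : OPERATORS) {
                if (input.startsWith(op, i)) {
                    matched = op;
                    break;
                }
            }
            if (matched == null) {
                throw new IllegalArgumentException(
                        "unexpected character '" + c + "' at " + i);
            }
            tokens.add(matched);
            i += matched.length();
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [List, <, List, <, String, >>, x, ;]
        System.out.println(tokenize("List<List<String>> x;"));
        // prints [List, <, List, <, String, >, >, x, ;]
        System.out.println(tokenize("List<List<String> > x;"));
    }
}
```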

Okay, after a few false starts, I've come up with the following grammar (note
that it is not the full JavaRecognizer parser, just enough to parse a SEMICOLON
separated list of types) (it uses the standard JavaLexer):

--------
// Entry point: a SEMI-terminated list of types.
compilationUnit
	:	( type SEMI )*
		EOF!
	;

type
	:	referenceType
	|	builtInType (arrayDecl)?
	;

referenceType
	:	identifier
		(	arrayDecl
		|	LT referenceTypeList1
		)?
	;

// Type argument list whose last element supplies the closing GT.
referenceTypeList1
	:	(referenceType1)=> referenceType1
	|	(options{greedy=false;}: referenceType COMMA)+
		referenceType1
	;

referenceType1
	:	(referenceType GT)=> referenceType GT
	|	identifier LT referenceTypeList2
	;

// Type argument list one level deeper, whose last element ends in SR (">>")
// and so closes two levels at once.
referenceTypeList2
	:	(referenceType2)=> referenceType2
	|	(options{greedy=false;}: referenceType COMMA)+
		referenceType2
	;

referenceType2
	:	referenceType SR
	;

arrayDecl
	:	(LBRACK RBRACK)+
	;

// The primitive types.
builtInType
	:	"void"
	|	"boolean"
	|	"byte"
	|	"char"
	|	"short"
	|	"int"
	|	"float"
	|	"long"
	|	"double"
	;

identifier
	:	IDENT ( DOT^ IDENT)*
	;
--------

This grammar will successfully parse these constructs:
--------
String;
java.lang.String;
int;
float;
int[];
String[];
float[][][];
List<String>;
List<String[]>;
List<List<String[]> >;
List<List<String[]>>;

Map<String,Integer>;
Map<String,List<Integer> >;
Map<String,List<Integer>>;
Map<List<Integer>,String>;
Map<List<Integer>,List<String>>;

Map3<String,Integer,Float>;

Map<Map<String,String>,Map3<String,Integer,Float>>;
Map<List<String>,List<Integer>>;
--------

But it will not parse these:
Map3<List<String>,List<Integer>,List<Float>>;
Map3<String,List<Integer>,Float>;

The errors are:
G1.java:20:18: unexpected token: Integer
and
G1.java:24:24: unexpected token: Integer

Now, I can see why this is happening: it is caused by my non-greedy rules in
referenceTypeList1 and referenceTypeList2. But I need them to be non-greedy (in
some fashion), because I don't want them to match the last "referenceType" that
precedes the next GT or SR token.

(Making them both greedy means that it matches too many times...)

I'm starting to get to the limits of my understanding of ANTLR... I started
thinking it was a look-ahead problem... but it really requires "lots" of
lookahead (that's why I have those syntactic predicates everywhere).

Any help will be greatly appreciated! Have I gone down the wrong track?

=Matt

PS: The 'trick' JSR14 uses to parse ">>" and ">>>":
The 'naive' grammar for parameterized type declarations (using the notation 
used in the JLS) is:

ReferenceType ::= ClassOrInterfaceType
                 | ArrayType
                 | TypeVariable

TypeVariable ::= Identifier

ClassOrInterfaceType ::= ClassOrInterface TypeArgumentsOpt

ClassOrInterface ::= Identifier
                    | ClassOrInterfaceType . Identifier

TypeArguments ::= < ReferenceTypeList >

ReferenceTypeList ::= ReferenceType
                     | ReferenceTypeList , ReferenceType

The "trick" is as follows (copied verbatim from the JSR14 spec):

ReferenceType ::= ClassOrInterfaceType
                 | ArrayType
                 | TypeVariable

ClassOrInterfaceType ::= Name
                        | Name < ReferenceTypeList1

ReferenceTypeList1 ::= ReferenceType1
                      | ReferenceTypeList , ReferenceType1

ReferenceType1 ::= ReferenceType >
                  | Name < ReferenceTypeList2

ReferenceTypeList2 ::= ReferenceType2
                      | ReferenceTypeList , ReferenceType2

ReferenceType2 ::= ReferenceType >>
                  | Name < ReferenceTypeList3

ReferenceTypeList3 ::= ReferenceType3
                      | ReferenceTypeList , ReferenceType3

ReferenceType3 ::= ReferenceType >>>
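
For comparison, the same idea can be sketched as a hand-written
recursive-descent recognizer in plain Java. This is not the JSR14 grammar
itself: instead of the numbered ReferenceType1/2/3 rules it keeps a counter
of closing brackets "left over" from a ">>" or ">>>" token, which is a
different technique aimed at the same result. The class and method names
are hypothetical, and the sketch only handles simple (undotted) names,
type arguments and array brackets.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical recognizer; a hand-written stand-in, not ANTLR output.
class PendingCloseDemo {
    private final List<String> toks;
    private int pos = 0;
    private int pendingCloses = 0; // '>' characters left over from ">>" or ">>>"

    PendingCloseDemo(List<String> toks) { this.toks = toks; }

    // Longest-match lexer: ">>" and ">>>" come out as single tokens.
    static List<String> tokenize(String s) {
        String[] ops = {">>>", ">>", ">", "<", ",", "[", "]"};
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (Character.isWhitespace(c)) { i++; continue; }
            if (Character.isJavaIdentifierStart(c)) {
                int start = i;
                while (i < s.length() && Character.isJavaIdentifierPart(s.charAt(i))) i++;
                out.add(s.substring(start, i));
                continue;
            }
            String m = null;
            for (String op : ops) {
                if (s.startsWith(op, i)) { m = op; break; }
            }
            if (m == null) throw new IllegalArgumentException("bad character at " + i);
            out.add(m);
            i += m.length();
        }
        return out;
    }

    private String peek() { return pos < toks.size() ? toks.get(pos) : "<EOF>"; }

    // type := NAME ( '<' type (',' type)* close )? ( '[' ']' )*
    private void type() {
        String name = peek();
        if (!Character.isJavaIdentifierStart(name.charAt(0))) {
            throw new IllegalStateException("expected a name, got " + name);
        }
        pos++;
        if (pendingCloses == 0 && peek().equals("<")) {
            pos++;
            type();
            // A leftover '>' means the enclosing list has already ended.
            while (pendingCloses == 0 && peek().equals(",")) { pos++; type(); }
            close();
        }
        while (pendingCloses == 0 && peek().equals("[")) {
            pos++;
            if (!peek().equals("]")) throw new IllegalStateException("expected ']'");
            pos++;
        }
    }

    // Consumes one logical '>'; ">>" and ">>>" leave 1 or 2 closes pending.
    private void close() {
        if (pendingCloses > 0) { pendingCloses--; return; }
        String t = peek();
        if (t.equals(">") || t.equals(">>") || t.equals(">>>")) {
            pos++;
            pendingCloses = t.length() - 1;
        } else {
            throw new IllegalStateException("expected '>', got " + t);
        }
    }

    static boolean accepts(String src) {
        try {
            PendingCloseDemo p = new PendingCloseDemo(tokenize(src));
            p.type();
            return p.pendingCloses == 0 && p.pos == p.toks.size();
        } catch (RuntimeException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(accepts("Map3<List<String>,List<Integer>,List<Float>>")); // true
        System.out.println(accepts("Map3<String,List<Integer>,Float>"));             // true
        System.out.println(accepts("List<String>>"));                                // false
    }
}
```

Because the comma loop has no fixed depth baked into it, the two inputs
that fail with the grammar above are accepted here, while an unbalanced
"List<String>>" is rejected.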
