[antlr-interest] Help needed upgrading java.g to support Generics

mzukowski at yci.com mzukowski at yci.com
Thu Mar 13 13:58:56 PST 2003


If you are counting columns then you can enforce no spaces between the two >
characters of the operator with a semantic predicate.  Otherwise you could
emit a different token based on what preceded it by maintaining some state in
your lexer: CGT if the > was immediately preceded by another >, and plain GT
otherwise.  I haven't thought that one through.  The shift operators would be
GT CGT or GT CGT CGT.  A generic end token could be GT or CGT.
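
To make the lexer-state idea concrete, here is a minimal sketch in plain Java
rather than ANTLR lexer code.  It only classifies '>' characters; the class
name, the Kind enum and the classify method are made up for illustration and
are not part of java.g.

```java
import java.util.ArrayList;
import java.util.List;

final class CloseAngleLexer {
    enum Kind { GT, CGT, OTHER }

    /**
     * Classifies every non-whitespace character of src.  A '>' becomes CGT
     * when the character immediately before it was also '>', GT otherwise.
     */
    static List<Kind> classify(String src) {
        List<Kind> out = new ArrayList<>();
        char prev = '\0'; // the one piece of lexer state: the previous character
        for (int i = 0; i < src.length(); i++) {
            char c = src.charAt(i);
            if (c == '>') {
                out.add(prev == '>' ? Kind.CGT : Kind.GT);
            } else if (!Character.isWhitespace(c)) {
                out.add(Kind.OTHER);
            }
            prev = c; // whitespace is remembered too, so "> >" yields GT GT
        }
        return out;
    }
}
```

With this, ">>" comes out as GT CGT while "> >" comes out as GT GT, so a shift
rule can insist on GT CGT and a generics rule can accept either kind.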

The semantic predicate is a good possible approach.  You might need a way to
propagate the end matches up the parse stack.  It depends on how nested all
the rules for declarations are; I haven't inspected it, so I'm not sure.  I'm
just thinking aloud here.

Try this for typeArgsEnd

typeArgsEnd:
(  // matching zero doesn't make sense
   GT  {ltCount-=1;}
|  SR  {ltCount-=2;}
|  BSR {ltCount-=3;}
)
// if there are more, match some more
// (ANTLR 2 semantic predicates are written {...}?; the recursion is optional)
( {ltCount > 0}? typeArgsEnd )?
;

You know, I don't think that will work as sketched below.  It'll choke on
Map<List<Integer>,String>; because you aren't nesting your calls.
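
One way to nest the calls, and to propagate leftover end matches up the parse
stack, is to let each level of type arguments consume exactly one '>' and hand
any extra ones from an SR or BSR token to its callers.  A rough sketch in plain
Java, not ANTLR: the NestedTypeArgs class, its hand-rolled tokenizer and the
pending counter are all assumptions for illustration, and every identifier
(including int and friends) is treated as a plain name.

```java
import java.util.ArrayList;
import java.util.List;

final class NestedTypeArgs {
    private final List<String> toks = new ArrayList<>();
    private int pos = 0;
    // '>' characters already taken from a ">>" or ">>>" token that still
    // have to close enclosing levels.
    private int pending = 0;

    private NestedTypeArgs(String src) {
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (Character.isJavaIdentifierStart(c)) {
                while (i < src.length() && Character.isJavaIdentifierPart(src.charAt(i))) i++;
                toks.add("ID");
            } else if (c == '>') {
                // Maximal munch, like the standard lexer: ">", ">>" or ">>>".
                int start = i;
                while (i < src.length() && src.charAt(i) == '>' && i - start < 3) i++;
                toks.add(src.substring(start, i));
            } else {
                toks.add(String.valueOf(c));
                i++;
            }
        }
    }

    private boolean accept(String t) {
        if (pos < toks.size() && toks.get(pos).equals(t)) { pos++; return true; }
        return false;
    }

    // type : ID ('.' ID)* ('<' type (',' type)* close)? ('[' ']')*
    private boolean type() {
        if (!accept("ID")) return false;
        while (accept(".")) {
            if (!accept("ID")) return false;
        }
        if (accept("<")) {
            do {
                if (!type()) return false;
            } while (pending == 0 && accept(","));
            if (!close()) return false;
        }
        // Brackets belong to this level only if no '>' is still owed above.
        while (pending == 0 && accept("[")) {
            if (!accept("]")) return false;
        }
        return true;
    }

    // Closes one level: use up an owed '>' first, otherwise read a new token.
    private boolean close() {
        if (pending > 0) { pending--; return true; }
        if (pos < toks.size() && toks.get(pos).startsWith(">")) {
            pending = toks.get(pos).length() - 1;
            pos++;
            return true;
        }
        return false;
    }

    static boolean recognizes(String src) {
        NestedTypeArgs p = new NestedTypeArgs(src);
        return p.type() && p.pending == 0 && p.pos == p.toks.size();
    }
}
```

Because the counter only ever holds what one multi-character token left over,
a comma-separated argument after a closed inner type is handled by the normal
recursion, and the operator tokens can stay as they are in the lexer.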

Play with it and report back.

Monty

-----Original Message-----
From: Matt Quail [mailto:matt at cortexebusiness.com.au]
Sent: Thursday, March 13, 2003 1:45 PM
To: antlr-interest at yahoogroups.com
Subject: Re: [antlr-interest] Help needed upgrading java.g to support
Generics


Monty,

Thanks Monty! That has definitely given me something to think about. I will
try what you suggest, and remove the ">>", etc. tokens and parse them as GT GT
instead.

So we may have a parser rule:

sr: GT GT;

The one issue with this is that it will allow WS between the two ">"
characters in the ">>" operator (which Java does not allow). I might have a
play with this approach, in any case. I may be able to solve this problem by
changing WS from "skip" tokens to an options { ignore=WS; } setting. Will need
to think some more on that one; any ideas?
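
One idea, sketched here in plain Java with a made-up Tok class standing in for
whatever token object the parser exposes: keep sr as GT GT but reject the match
unless the two tokens sit in adjacent columns on the same line. This assumes
the lexer records a line and column for every token and that columns advance
by one per character.

```java
final class AdjacentTokens {
    /** Hypothetical token record: line and column of a one-character '>' token. */
    static final class Tok {
        final int line;
        final int col;

        Tok(int line, int col) {
            this.line = line;
            this.col = col;
        }
    }

    /** True when b starts in the column immediately after a, on the same line. */
    static boolean adjacent(Tok a, Tok b) {
        return a.line == b.line && b.col == a.col + 1;
    }
}
```

A check like this could sit in a semantic predicate on the sr rule, so that
"> >" in an expression is rejected while ">>" is accepted.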

The other idea I was tinkering with last night was to leave SR as is, and
have some rule like this for matching the end of a "double-nested" template:

.... (GT GT | SR)

Then for "triple-nested" we might have something like

.... (GT GT GT | SR GT | GT SR | BSR)

But I'm not sure what the "...." would be :) Maybe I need to use some
semantic predicates and actually count the number of ">" I need to match.
Something like this:

typeArgs: typeArgsBody typeArgsEnd;

typeArgsBody:
   LT {ltCount++;}
   referenceType
   (typeArgsBody)?
   ;

typeArgsEnd:
( // match 0,1,2 or 3 '>'
    {ltCount == 0}?
|  {ltCount == 1}? GT {ltCount-=1;}
|  {ltCount == 2}? (GT GT | SR) {ltCount-=2;}
|  {ltCount == 3}?
       (GT GT GT | SR GT | GT SR | BSR) {ltCount-=3;}
)
// if there are more, match some more
( {ltCount > 0}? typeArgsEnd )?
;
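
The bookkeeping behind ltCount can be tried out on its own. The following
plain-Java sketch (the CloseCounter class and consume method are hypothetical,
and it looks only at the closing tokens, not at the types between them)
subtracts one, two or three per GT, SR or BSR and reports a mismatch if a
token would close more '<' than are open.

```java
final class CloseCounter {
    /**
     * Applies closing tokens (">", ">>" or ">>>") to a count of open '<'.
     * Returns the count left over, or -1 if a token closes more than is open.
     */
    static int consume(int ltCount, String... closers) {
        for (String t : closers) {
            if (!t.matches(">{1,3}")) {
                throw new IllegalArgumentException("not a closing token: " + t);
            }
            if (t.length() > ltCount) {
                return -1; // e.g. a BSR arriving when only two '<' are open
            }
            ltCount -= t.length();
        }
        return ltCount;
    }
}
```

A result of 0 means every '<' found its '>', whichever mix of GT, SR and BSR
the lexer happened to produce.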

(Hmmm... it is ugly to have to use a semantic predicate... but this may be a
"quick win".)

I will try your suggestion and my idea above and report back to this list.

=Matt

mzukowski at yci.com wrote:
> I'm not sure that's the best approach.  I haven't thought it through but it
> seems like it would work in the LR world but not in the LL world.  I would
> suggest trying this instead:
> 
> 1. Eliminate ">>", ">>=", ">>>", and ">>>=" as tokens, make them all ">".
> Then make parser rules sr: ">" ">" and zr:">" ">" ">".  Modify grammar to
> use grammar rules instead of the tokens for those operators.
> 
> 2. Compile, inspect and test.  Syntactic predicates may be necessary and may
> need to be manually hoisted.
> 
> 3. If that works then add in your generic stuff and test it out.  Only use
> ">" for your generics, don't use sr or zr.
> 
> 4. There might be a better approach than this.  Can generics be initialized?
> Then you have to worry about ">>=" as well.
> 
> Email me privately if you would like to discuss this over the phone.
> 
> Monty
> 
> -----Original Message-----
> From: Matt Quail [mailto:matt at cortexebusiness.com.au]
> Sent: Wednesday, March 12, 2003 7:20 PM
> To: antlr-interest at yahoogroups.com
> Subject: [antlr-interest] Help needed upgrading java.g to support
> Generics
> 
> 
> Hi all,
> 
> I'm trying to update the java.g grammar with support for Generics (as defined
> by JSR14, grab the pdf spec at
> http://www.jcp.org/aboutJava/communityprocess/review/jsr014/index.html ). My
> intent is to upgrade the grammar and submit a patch back to the "official"
> java.g; so any help will hopefully help us all.
> 
> The MAJOR problem is that JDK1.5 will allow this:
> 
> List<List<String>> x = ...;
>                  ^^
> The problem is that the lexer will match ">>" as a shift-right token, but we
> really want to parse it as two GT tokens in this context. The JSR pdf has a
> BNF grammar that solves this problem, and it is that pattern that I am trying
> to implement in ANTLR. (A re-cap of this trick is given at the end of the
> email.)
> 
> (Note that there is also a problem lexing ">>>", but let's just confine
> ourselves to ">>" for the moment.)
> 
> Okay, after a few false starts, I've come up with the following grammar (note
> that it is not the full JavaRecognizer parser, just enough to parse a
> SEMICOLON-separated list of types) (it uses the standard JavaLexer):
> 
> --------
> compilationUnit
> 	:
>          ( type SEMI ) *
> 		EOF!
> 	;
> 
> type
> 	:	referenceType
> 	|	builtInType (arrayDecl)?
> 	;
> 
> referenceType:
>          identifier
>          (  arrayDecl
>          |  LT referenceTypeList1
>          )?
>      ;
> 
> referenceTypeList1:
>          (referenceType1)=> referenceType1
>      |
>          (options{greedy=false;}: referenceType COMMA)+
>          referenceType1
>      ;
> 
> referenceType1:
>          (referenceType GT)=> referenceType GT
>      |
>          identifier LT referenceTypeList2
>      ;
> 
> referenceTypeList2 :
>          (referenceType2)=> referenceType2
>      |
>          (options{greedy=false;}: referenceType COMMA)+
>          referenceType2
>      ;
> 
> referenceType2:
>          referenceType SR
>      ;
> 
> arrayDecl:
>          (LBRACK RBRACK)+
>      ;
> // The primitive types.
> builtInType
> 	:	"void"
> 	|	"boolean"
> 	|	"byte"
> 	|	"char"
> 	|	"short"
> 	|	"int"
> 	|	"float"
> 	|	"long"
> 	|	"double"
> 	;
> 
> identifier
> 	:	IDENT ( DOT^ IDENT)*
> 	;
> --------
> 
> This grammar will successfully parse these constructs:
> --------
> String;
> java.lang.String;
> int;
> float;
> int[];
> String[];
> float[][][];
> List<String>;
> List<String[]>;
> List<List<String[]> >;
> List<List<String[]>>;
> 
> Map<String,Integer>;
> Map<String,List<Integer> >;
> Map<String,List<Integer>>;
> Map<List<Integer>,String>;
> Map<List<Integer>,List<String>>;
> 
> Map3<String,Integer,Float>;
> 
> Map<Map<String,String>,Map3<String,Integer,Float>>;
> Map<List<String>,List<Integer>>;
> --------
> 
> But it will not parse these:
> Map3<List<String>,List<Integer>,List<Float>>;
> Map3<String,List<Integer>,Float>;
> 
> The errors are:
> G1.java:20:18: unexpected token: Integer
> and
> G1.java:24:24: unexpected token: Integer
> 
> Now, I can see why this is happening: it is caused by my non-greedy rules in
> referenceTypeList1 and referenceTypeList2. But I need them to be non-greedy
> (in some fashion), because I don't want them to match the last
> "referenceType" that precedes the next GT or SR token.
> 
> (Making them both greedy means that it matches too many times...)
> 
> I'm starting to get to the limits of my understanding of ANTLR... I started
> thinking it was a look-ahead problem... but it really requires "lots" of
> lookahead (that's why I have those syntactic predicates everywhere).
> 
> Any help will be greatly appreciated! Have I gone down the wrong track?
> 
> =Matt
> 
> PS: The 'trick' JSR14 uses to parse ">>" and ">>>":
> The 'naive' grammar for parameterized type declarations (using the notation
> used in the JLS) is:
> 
> ReferenceType ::= ClassOrInterfaceType
>                  | ArrayType
>                  | TypeVariable
> 
> TypeVariable ::= Identifier
> 
> ClassOrInterfaceType ::= ClassOrInterface TypeArgumentsOpt
> 
> ClassOrInterface ::= Identifier
>                     | ClassOrInterfaceType . Identifier
> 
> TypeArguments ::= < ReferenceTypeList >
> 
> ReferenceTypeList ::= ReferenceType
>                      | ReferenceTypeList , ReferenceType
> 
> 
> The "trick" is as follows (copied verbatim from the JSR14 spec):
> 
> ReferenceType ::= ClassOrInterfaceType
>                  | ArrayType
>                  | TypeVariable
> 
> ClassOrInterfaceType ::= Name
>                         | Name < ReferenceTypeList1
> 
> ReferenceTypeList1 ::= ReferenceType1
>                       | ReferenceTypeList , ReferenceType1
> 
> ReferenceType1 ::= ReferenceType >
>                   | Name < ReferenceTypeList2
> 
> ReferenceTypeList2 ::= ReferenceType2
>                       | ReferenceTypeList , ReferenceType2
> 
> ReferenceType2 ::= ReferenceType >>
>                   | Name < ReferenceTypeList3
> 
> ReferenceTypeList3 ::= ReferenceType3
>                       | ReferenceTypeList , ReferenceType3
> 
> ReferenceType3 ::= ReferenceType >>>
> 
> 
> 
>  
> 
> Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/ 