[antlr-interest] Re: more lexical determinism

howardckatz howardk at fatdog.com
Wed Dec 5 21:29:51 PST 2001


This has been an interesting exercise. I can see that this particular 
problem -- where two tokens consist of closely overlapping character 
sets -- is one that antlr doesn't handle that well. I can see one 
other approach that might work -- sticking some string-parsing Java 
code of my own either into the parser grammar or maybe in a
downstream TokenStream. Time to play I guess ...
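
A rough sketch of what such a downstream filter might look like. It is written in plain Java with made-up Tok and RetypeFilter types, not ANTLR's own TokenStream interface. The rule that a Word directly followed by a colon should become an Identifier is only an assumption, drawn from the "id : word" example quoted below:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a lexer token; this is not an ANTLR class.
final class Tok {
    enum Kind { WORD, IDENTIFIER, COLON }
    final Kind kind;
    final String text;
    Tok(Kind kind, String text) { this.kind = kind; this.text = text; }
}

// Hypothetical post-lexer pass that fixes up token types using context.
final class RetypeFilter {
    // Assumption: the lexer emits WORD for every letters-only token, and
    // the grammar expects an IDENTIFIER wherever a COLON follows directly.
    static List<Tok> retype(List<Tok> in) {
        List<Tok> out = new ArrayList<>(in.size());
        for (int i = 0; i < in.size(); i++) {
            Tok t = in.get(i);
            boolean colonNext =
                i + 1 < in.size() && in.get(i + 1).kind == Tok.Kind.COLON;
            if (t.kind == Tok.Kind.WORD && colonNext) {
                // Same text, new type: the one-token lookahead decides.
                out.add(new Tok(Tok.Kind.IDENTIFIER, t.text));
            } else {
                out.add(t);
            }
        }
        return out;
    }
}
```

With something like this, the lexer would stay simple (everything made of letters is a Word) and the context-dependent decision would move into ordinary Java code that can look one token ahead.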

Thanks for your help,
Howard

--- In antlr-interest at y..., Sinan <sinan.karasu at b...> wrote:
> howardckatz wrote:
> > 
> > --- In antlr-interest at y..., Terence Parr <parrt at j...> wrote:
> > 
> >  ...
> > 
> > > As for distinguishing between the two kinds of words/ids, you could
> > > do the following in one rule (assume Word unless you see _ or
> > > digit):
> > >
> > > Word: ( Letter | '_' {$setType(Identifier);} )
> > >       ( Letter | Digit {$setType(Identifier);} )* ;
> > 
> > That didn't quite do it, I think. Doesn't the above say that anything
> > starting with a Letter is a Word? But that's not what I want, since
> > valid Identifiers can start with Letters too. The following should be
> > legal input,
> > 
> >      id : word
> > 
> > but throws an "Unexpected token: id" error. I would guess the parser
> > sees this as "Word : Word" and accordingly chokes. Or am I
> > misunderstanding something?
> > 
> > Howard
> 
> There is no way the lexer can distinguish between word and id, since they
> have the same production (or id is a subset of word...).
> 
> If you want to make the distinction in the lexer, then you have to do
> something like
> 
> AnId : (Id Colon Word)=> Id ;
> 
> 
> But then you can't have an Id without a Colon following.
> 
> One expensive way to do it is to pull everything into the Parser except
> characters, then
> 
> rule1 : id Colon word ;
> 
> id:  Character+ ;
> 
> word : Character+ ; 
> 
> or whatever....
> 
> But now you will get a zillion non-determinisms, which you fix by
> 
> rules:
> 	(rule1)=> rule1
> 	| (rule2)=> rule2
> 	| etc....
> 	;
> 	
> This tends to be very expensive, but almost unavoidable in cases like
> Fortran, where whitespace has no meaning.
> 
> Don't forget that the lexer rules (productions/methods) are not called
> by the parser. Actually, if a rule is not protected, it is called from
> nextToken in some magical order, and the first maximum match will win....
> 
> So you'll get either all words or all ids (except when "_" is
> present)....
> 
> Sinan
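
To make the failure on "id : word" concrete, here is a small plain-Java imitation of the single-rule suggestion quoted above: start out as Word and switch to Identifier only on a leading underscore or a later digit. The WordRuleSketch class is hypothetical, and treating Letter as ASCII a-z/A-Z is an assumption, since the thread never shows the Letter rule:

```java
// Hypothetical imitation of the quoted rule; it does not use ANTLR.
final class WordRuleSketch {
    // Returns "Word" or "Identifier", or null if the text does not match
    // ( Letter | '_' ) ( Letter | Digit )* at all.
    static String classify(String text) {
        if (text.isEmpty()) return null;
        String type = "Word"; // default type until '_' or a digit is seen
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // Assumption: Letter means an ASCII letter.
            boolean letter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
            if (letter) continue;
            // '_' is only allowed as the first character.
            if (c == '_' && i == 0) { type = "Identifier"; continue; }
            // Digits are only allowed after the first character.
            if (c >= '0' && c <= '9' && i > 0) { type = "Identifier"; continue; }
            return null;
        }
        return type;
    }
}
```

Under these assumptions classify("id") comes back as "Word", so the input "id : word" reaches the parser as Word Colon Word, which matches the "Unexpected token: id" error reported above. Only inputs such as "_id" or "id2" would come back as "Identifier".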


 
