[antlr-interest] Re: more lexical determinism

Wed Dec 5 15:54:42 PST 2001

howardckatz wrote:
> 
> --- In antlr-interest at y..., Terence Parr <parrt at j...> wrote:
> 
>  ...
> 
> > As for distinguishing between the two kinds of words/ids, you could
> > do the following in one rule (assume Word unless you see _ or
> > digit):
> >
> > Word: ( Letter | '_'  {$setType(Identifier);}) (Letter |
> > Digit{$setType(Identifier);})*;
> 
> That didn't quite do it, I think. Doesn't the above say that anything
> starting with a Letter is a Word? But that's not what I want, since
> valid Identifiers can start with Letters too. The following should be
> legal input,
> 
>      id : word
> 
> but throws an "Unexpected token: id" error. I would guess the parser
> sees this as "Word : Word" and accordingly chokes. Or am I
> misunderstanding something?
> 
> Howard

There is no way the lexer can distinguish between word and id, since
they have the same production (or id is a subset of word...).

If you want to make the distinction in lexer, then you have to do
something like

AnId : (Id Colon Word)=> Id ;

But then you can't have an Id without a Colon following.
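
To make the trade-off concrete, here is a rough sketch in Python (not
ANTLR) of what such a predicate amounts to: the lexer peeks past the
name for a colon and a second name before committing to the Id type.
The function classify_names, the token type names, the letters-only
shape of a name, and the tolerance for whitespace around the colon are
all assumptions made for illustration.

```python
import re

# Assumed token shape: Id and Word are both runs of letters, so the
# matched text alone cannot tell them apart.
NAME = re.compile(r"[A-Za-z]+")
# Lookahead standing in for the (Id Colon Word)=> predicate.
ID_CONTEXT = re.compile(r"[A-Za-z]+\s*:\s*[A-Za-z]+")


def classify_names(text):
    """Return (type, lexeme) pairs; a name is an ID only if 'name : name' follows."""
    tokens = []
    pos = 0
    while pos < len(text):
        ch = text[pos]
        if ch.isspace():
            pos += 1
            continue
        if ch == ":":
            tokens.append(("COLON", ":"))
            pos += 1
            continue
        m = NAME.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {ch!r} at {pos}")
        # Speculative lookahead decides the type; the token is still just the name.
        kind = "ID" if ID_CONTEXT.match(text, pos) else "WORD"
        tokens.append((kind, m.group()))
        pos = m.end()
    return tokens
```

With this, "id : word" comes out as ID, COLON, WORD, but a lone "id" is
reported as a WORD, which is exactly the limitation noted above.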

One expensive way to do it is to pull everything into the parser except
characters, then

rule1 : id Colon word ;

id:  Character+ ;

word : Character+ ; 

or whatever....

But now you will get a zillion non-determinisms, which you fix by

rules:
	(rule1)=> rule1
	| (rule2)=> rule2
	| etc....
	;
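
As a minimal sketch of that backtracking idea, written in Python rather
than ANTLR: each alternative is tried speculatively on the raw
characters, in order, and the first one that succeeds is taken. The
helpers match_name and skip_ws are made up for this example, and rule2
is a hypothetical second alternative (a bare word), since the grammar
above does not say what it is.

```python
def skip_ws(chars, pos):
    """Advance past whitespace and return the new position."""
    while pos < len(chars) and chars[pos].isspace():
        pos += 1
    return pos


def match_name(chars, pos):
    """Character+ : consume one or more letters; return the end position or None."""
    end = pos
    while end < len(chars) and chars[end].isalpha():
        end += 1
    return end if end > pos else None


def rule1(chars, pos):
    """rule1 : id Colon word ; returns the end position, or None on failure."""
    pos = match_name(chars, skip_ws(chars, pos))
    if pos is None:
        return None
    pos = skip_ws(chars, pos)
    if pos >= len(chars) or chars[pos] != ":":
        return None
    return match_name(chars, skip_ws(chars, pos + 1))


def rule2(chars, pos):
    """Hypothetical alternative: a single bare word."""
    return match_name(chars, skip_ws(chars, pos))


def rules(chars, pos=0):
    """Try the alternatives in order; the first that matches wins."""
    for name, rule in (("rule1", rule1), ("rule2", rule2)):
        end = rule(chars, pos)  # speculative attempt; failure costs a rescan
        if end is not None:
            return name, end
    raise SyntaxError(f"no alternative matches at position {pos}")
```

Every failed attempt rescans the same characters, which is where the
cost comes from. This sketch reuses the result of the speculative
attempt, whereas a real syntactic predicate would match the rule a
second time after the guess succeeds.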

This tends to be very expensive, but it is almost unavoidable in cases
like Fortran, where whitespace has no meaning.

Don't forget that the lexer rules (productions/methods) are not called
by the parser. Actually, if a rule is not protected, it is called from
nextToken in some magical order, and the first maximum match will
win....

So you'll get either all words or all ids (except when "_" is
present)....
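
A simplified model of that effect in Python, for illustration only (it
is not how the generated nextToken is actually implemented): every rule
is tried at the current position, the longest match wins, and on a tie
the rule listed first wins. The regular expressions are assumptions
based on the Word/Identifier rule quoted above.

```python
import re

# Assumed rule shapes; order matters because ties go to the earlier rule.
RULES = [
    ("WORD", re.compile(r"[A-Za-z]+")),
    ("ID", re.compile(r"[A-Za-z_][A-Za-z0-9]*")),
]


def next_token(text, pos, rules=RULES):
    """Return (type, lexeme) for the longest match; the first rule wins ties."""
    best = None
    for kind, pattern in rules:
        m = pattern.match(text, pos)
        # Strictly longer only, so an equal-length later rule never replaces an earlier one.
        if m and (best is None or m.end() > best[1].end()):
            best = (kind, m)
    if best is None:
        raise ValueError(f"no rule matches at position {pos}")
    return best[0], best[1].group()
```

With WORD listed first, next_token("id", 0) reports a WORD; swap the
order of the rules and the same text is reported as an ID. Only input
such as "_id" or "x1", which WORD cannot fully match, is an ID either
way.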

Sinan
