[antlr-interest] Re: more lexical determinism
howardckatz
howardk at fatdog.com
Wed Dec 5 21:29:51 PST 2001
This has been an interesting exercise. I can see that this particular
problem -- where two tokens consist of closely overlapping character
sets -- is one that antlr doesn't handle that well. I can see one
other approach that might work -- sticking some string-parsing Java
code of my own either into the parser grammar or maybe in a
downstream TokenStream. Time to play I guess ...
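
A minimal sketch of the downstream-filter idea, with assumptions stated up front: the lexer emits WORD for anything made only of letters, and a WORD directly followed by a COLON should really be an IDENTIFIER (as in the "id : word" input quoted below). Kind, Tok and Retyper are hypothetical stand-ins for illustration, not ANTLR classes; a real filter would presumably wrap the lexer's TokenStream instead of a List.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical token kinds and token class; these are NOT ANTLR types.
enum Kind { WORD, IDENTIFIER, COLON }

final class Tok {
    final Kind kind;
    final String text;
    Tok(Kind kind, String text) { this.kind = kind; this.text = text; }
    @Override public String toString() { return kind + "(" + text + ")"; }
}

final class Retyper {
    // Assumed rule: a WORD immediately followed by a COLON is an IDENTIFIER.
    // Everything else passes through unchanged.
    static List<Tok> retype(List<Tok> in) {
        List<Tok> out = new ArrayList<>();
        for (int i = 0; i < in.size(); i++) {
            Tok t = in.get(i);
            boolean colonNext =
                i + 1 < in.size() && in.get(i + 1).kind == Kind.COLON;
            if (t.kind == Kind.WORD && colonNext) {
                out.add(new Tok(Kind.IDENTIFIER, t.text));
            } else {
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // What a "Word unless _ or digit" lexer would emit for: id : word
        List<Tok> tokens = Arrays.asList(
            new Tok(Kind.WORD, "id"),
            new Tok(Kind.COLON, ":"),
            new Tok(Kind.WORD, "word"));
        // prints [IDENTIFIER(id), COLON(:), WORD(word)]
        System.out.println(retype(tokens));
    }
}
```

The one-token lookahead lives outside the lexer here, so the lexer rules themselves stay deterministic.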
Thanks for your help,
Howard
--- In antlr-interest at y..., Sinan <sinan.karasu at b...> wrote:
> howardckatz wrote:
> >
> > --- In antlr-interest at y..., Terence Parr <parrt at j...> wrote:
> >
> > ...
> >
> > > As for distinguishing between the two kinds of words/ids, you
> > > could do the following in one rule (assume Word unless you see _
> > > or digit):
> > >
> > > Word: ( Letter | '_' {$setType(Identifier);}) (Letter |
> > > Digit{$setType(Identifier);})*;
> >
> > That didn't quite do it, I think. Doesn't the above say that
> > anything starting with a Letter is a Word? But that's not what I
> > want, since valid Identifiers can start with Letters too. The
> > following should be legal input,
> >
> > id : word
> >
> > but throws an "Unexpected token: id" error. I would guess the
> > parser sees this as "Word : Word" and accordingly chokes. Or am I
> > misunderstanding something?
> >
> > Howard
>
> There is no way the lexer can distinguish between word and id, since
> they have the same production (or id is a subset of word....)
>
> If you want to make the distinction in lexer, then you have to do
> something like
>
> AnId : (Id Colon Word)=> Id ;
>
>
> But then you can't have an Id without a Colon following.
>
> One expensive way to do it is to pull everything into the Parser
> except characters, then
>
> rule1 : id Colon word ;
>
> id: Character+ ;
>
> word : Character+ ;
>
> or whatever....
>
> But now you will get a zillion non-determinisms , which you fix by
>
> rules:
> (rule1)=> rule1
> | (rule2)=> rule2
> | etc....
> ;
>
> This tends to be very expensive, but almost unavoidable in cases
> like Fortran where whitespace has no meaning.
>
> Don't forget that the lexer rules (productions/methods) are not
> called by the parser. Actually, if a rule is not protected, then it
> is called from nextToken in some magical order, and the first
> maximum match will win....
>
> So you'll get either all words or all ids (except when "_" is
> present)....
>
> Sinan
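
The "first maximum match wins" behaviour Sinan describes can be illustrated with a toy longest-match classifier. This is only a model of the tie-breaking, not ANTLR's generated nextToken code. The two patterns are assumptions: Word is taken to be letters only, Identifier follows the shape of the quoted rule (letter or '_', then letters or digits), and Letter is assumed to be ASCII. TieBreak is a hypothetical name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TieBreak {
    // Rule order matters: on equal match length the earlier rule wins.
    static final String[] NAMES = { "Word", "Identifier" };
    static final Pattern[] RULES = {
        Pattern.compile("[A-Za-z]+"),              // assumed Word
        Pattern.compile("[A-Za-z_][A-Za-z0-9]*")   // assumed Identifier
    };

    // Returns the name of the rule with the longest match at the start of
    // the input; a later rule must match strictly more to displace an
    // earlier one.
    static String classify(String input) {
        String best = "none";
        int bestLen = 0;
        for (int i = 0; i < RULES.length; i++) {
            Matcher m = RULES[i].matcher(input);
            if (m.lookingAt() && m.end() > bestLen) {
                bestLen = m.end();
                best = NAMES[i];
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(classify("id"));   // prints Word (tie, first rule wins)
        System.out.println(classify("_id"));  // prints Identifier
        System.out.println(classify("x1"));   // prints Identifier (longer match)
    }
}
```

With these assumed patterns, any all-letter input ties and goes to whichever rule is listed first, which is why "id" can never come out as an Identifier without extra context.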