[antlr-interest] parser controlling lexer (was non-determinism.)

Greg Lindholm glindholm at yahoo.com
Wed Mar 26 14:12:03 PST 2003


It's been noted many times on this list that it's not safe to do this because of lookahead in the parser.   When you set the flag in the parser there can be many tokens already buffered and the lexer can be far ahead.
 Albert Huh <albert.huh at embarcadero-ca.com> wrote:actually, you can control the lexer from the parser to a degree.  i'm not sure about switching rules, but you can definitely make the lexer change the token type depending on if you've set a flag or not. the parser will need a reference to the lexer to do this. within the lexer, you can simply add some actions like {  if (nameFlag) {    setType(NAME);  }} i probably didn't use the proper method name for changing the type, but you get the idea. though this probably isn't the greatest idea.  ideally the lexer and parser should be able to run independantly. -----Original Message-----
From: Greg Lindholm [mailto:glindholm at yahoo.com]
Sent: Wednesday, March 26, 2003 1:12 PM
To: antlr-interest at yahoogroups.com
Subject: RE: [antlr-interest] non-determinism.


Sorry, lexers and parsers (certainly Antlr) don't work the way you want them to.  There is no facility for the parser to tell the lexer what tokens to look for.  
The lexer acts mostly independently from the parser and it's job is to translate a stream of characters into a stream of tokens. (The tokens are then consumed by the parser. ) So the lexer has to be able to look at a sequence of characters and decide what token type to give it. 
You probably need to read the Antlr documentation again and study some of the examples in order to create a workable approach to constructing a solution. 
If you want to describe what you are trying to accomplish someone may be able to suggest an approach you can take. 
Greg 
 mark kant <markkant2001 at yahoo.com> wrote: 1. it is not "required" distinguishing character but
that it may be present. The point is that there is a
list of token which are supersets of some other
tokens.

So in your question - what type of TOKEN is 'a' ?
The answer is that 'a' can be NAME, ID and also TOKEN.
If a digit follows 'a', then a lexer can either return
NAME or an ID. If I let the lexer consume the digit
and return an ID, but the parser was expecting only a
NAME, then ID would be incorrect.
To me a parser tell lexer to find a token from a set
of X tokens.
In my problem there are 2 or more sets of tokens. The
parser needs to tell lexer which set to look for.

Thanks again for the discussion. I will really
appreciate any help in this.

Mark

----------------------
--- Greg Lindholm wrote:
> 
> If each token type has a "required" distinguishing
> character then there would not be an
> non-determiniism, but that is not what you have
> written in the rules below.
> Did you decide which token type an 'a' is? How about
> a '9'? You're not going to get very far building a
> lexer until you make these basic decisions.
> Once you have some example cases, if you then have
> trouble building the lexer to match your examples,
> then people on this list will help you.
> mark kant wrote:There is a
> slight difference. Each of them also has
> extra characters to distinguish. Example TOKEN also
> has '~' character in it. If I expected an ID, but I
> return TOKEN_OR_ID, then how do I know it is a valid
> ID (it may have '~' in it, which makes it invalid
> ID,
> but valid TOKEN )
> 
> 
> Mark
> 
> ------------------------------
> --- Greg Lindholm wrote:
> > 
> > To understand the non-determinism it might help
> you
> > if you consider some example tokens with this
> lexer.
> > If your lexer sees the single character 'a' what
> > type of token would you like it to return? One of
> > the non-determinism this lexer has is that 'a'
> > matches the NAME, ID, and TOKEN rules. Which is
> it?
> > Note that ANTLR doesn't care what order the rules
> > appear in (unlike lex). Same thing goes with the
> > single character '9', it matches both TOKEN and
> > NUMBER.
> > So I recommend work up some example cases and
> decide
> > what you want your lexer to return for each case. 
> > In some languages a given sequence of characters
> can
> > mean completely different things (different token
> > type) based on the context of those characters. 
> > Antlr is basically a context-free lexer
> (predicates
> > can help sometimes). In these cases you might need
> > to delay exact identification of the token type
> > until you know the context (symantic analysis
> > phase). For example you might have the lexer
> return
> > a token type NAME_OR_ID then later figure out
> which
> > it is once you know the context.
> > Hope this helps,
> > Greg
> > 
> > mark kant wrote:How about
> > the following lexer
> > 
> > 
> > protected: 
> > ALPHA: ('a'..'z'|'A'..'Z')
> > ;
> > protected:
> > ALPHA_NUM: ('a'..'z'|'A'..'Z'|'0'..'9')
> > ;
> > protected:
> > DIGIT: '0'..'9'
> > ;
> > 
> > 
> > NAME: (ALPHA) ((ALPHA) | '_' | '.') )*
> > ;
> > 
> > ID: (ALPHA) ( (ALPHA_NUM) |'_'|'.'|'@')*
> > ;
> > 
> > TOKEN: (ALPHANUM|'_'|'.'|'@'|'%'|';'|'~')+
> > ;
> > 
> > NUMBER: ( DIGITS )+
> > ;
> > 
> > 
> > Thanks
> > 
> > Mark
> > --- mzukowski at yci.com wrote:
> > > remove your AT rule and then add a literal
> keyword
> > > AT='@' to the keywords
> > > section and test for it in TOKEN by turning on
> the
> > > option testLiterals=true.
> > > See the docs on literals.
> > > 
> > > Monty
> > > 
> > > -----Original Message-----
> > > From: mark kant [mailto:markkant2001 at yahoo.com]
> > > Sent: Tuesday, March 25, 2003 9:42 AM
> > > To: antlr-interest at yahoogroups.com
> > > Subject: [antlr-interest] non-determinism.
> > > 
> > > 
> > > Hi,
> > > 
> > > I get non-determinism in the following lexer
> > > (relevant
> > > portion of parser and lexer)
> > > 
> > > hosport: host COLON password
> > > 
> > > password: TOKEN
> > > 
> > > host: NAME AT TOKEN
> > > 
> > > 
> > > lexer ...............
> > > 
> > > COLON: ':'
> > > 
> > > SEMI: ';'
> > > 
> > > AT: '@'
> > > 
> > > TOKEN: ('a'..'z' | 'A'..'Z'
> > > |'0'..'9'|'.'|':'|';'|'@')+
> > > 
> > > 
> > > What is the best way to resolve it:
> > > 1. multiple lexers
> > > 2. syntactic predicates - not appropriate as I
> > have
> > > other similar rules for special characters
> > > 3. some kind of flag set in parser and lexer
> > checks
> > > it
> > > before matching a rule in lexer (how do I
> > > communicate
> > > the flag state from parser to lexer). I have
> done
> > > this
> > > in Lex and YAcc.
> > > 
> > > Thanks
> > > 
> > > Mark
> > > 



---------------------------------
Do you Yahoo!?
Yahoo! Platinum - Watch CBS' NCAA March Madness, live on your desktop! 
Your use of Yahoo! Groups is subject to the Yahoo! Terms of Service. 

Your use of Yahoo! Groups is subject to the Yahoo! Terms of Service. 



---------------------------------
Do you Yahoo!?
Yahoo! Platinum - Watch CBS' NCAA March Madness, live on your desktop!
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.antlr.org/pipermail/antlr-interest/attachments/20030326/175820e6/attachment.html


More information about the antlr-interest mailing list