[antlr-interest] dfa-based lexers versus top-down antlr lexers

mzukowski at yci.com mzukowski at yci.com
Fri May 2 12:04:55 PDT 2003


Lexers also keep state internally and/or through selectors: line number
and column number at the very least.

Also, if you unget through to the lexer, you must back up k tokens in the
parser as well, because the rule you are in now may not be the one that
would have been predicted k tokens ago.  The only sane way to accomplish
that would be syntactic predicates, but of course there are no actions in
syntactic predicates, so how would you know what to unget?

So I'm leaning toward saying it would not be worth doing.

Can you think of any examples of how you would use such a feature?  

Monty

-----Original Message-----
From: Pete Forman [mailto:pete.forman at westerngeco.com]
Sent: Thursday, May 01, 2003 1:56 AM
To: antlr-interest at yahoogroups.com
Subject: Re: [antlr-interest] dfa-based lexers versus top-down antlr
lexers


At 2003-04-29 09:48 -0700, Terence Parr wrote:
>Concerning the interface between parser / lexer, yes: it's simple:
>
>interface TokenStream {
>         public Token nextToken();
>}

Would it be worth developing a lexer with a second method?

     public void unget(int numTokens);

This would allow lexer switching from within the parser.  The first
lexer could push back the input it has consumed, allowing the second
lexer to make a clean start.

The semantics of the method are closest to unget() in C++ istream.
There is little point in doing putback(Token) as in C++ istream or
unread(Token) as in Java PushbackReader as the Token will not in
itself contain enough information to reconstruct the input to the
lexer.

Implementation would probably require extra buffering at the input
to the lexer to support backtracking.  Most of the lexers that I've
deployed work on string inputs, which would suit this proposal.
Rather more work would be needed for non-random-access file inputs.
The amount of buffering probably ought to be limited explicitly, with
unget() throwing an exception if too much backtracking is attempted.
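A minimal sketch of that bounded-buffering scheme might look like the
following.  All names here (PushbackInput, tokenConsumed) are
illustrative, not part of any ANTLR release, and a String stands in for
the real input stream: the wrapper records how many characters each
token consumed, keeps only a fixed number of those records, and
unget(n) rewinds the input by n tokens or throws once the history has
been discarded.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch only: a bounded pushback buffer between the raw input and the
// lexer.  The lexer reports how many characters each token consumed
// (including adjacent skipped characters such as whitespace); unget(n)
// rewinds that many tokens' worth of input, throwing if the bounded
// history has already dropped the needed record.
class PushbackInput {
    private final String input;          // stand-in for the real stream
    private int pos = 0;                 // next character to hand out
    private final Deque<Integer> tokenLengths = new ArrayDeque<>();
    private final int maxBufferedTokens; // explicit backtracking limit

    PushbackInput(String input, int maxBufferedTokens) {
        this.input = input;
        this.maxBufferedTokens = maxBufferedTokens;
    }

    int read() {
        return pos < input.length() ? input.charAt(pos++) : -1;
    }

    // Called by the lexer when it finishes a token of the given length.
    void tokenConsumed(int length) {
        tokenLengths.addLast(length);
        if (tokenLengths.size() > maxBufferedTokens) {
            tokenLengths.removeFirst(); // oldest token no longer rewindable
        }
    }

    // Rewind the input by numTokens tokens, as in the proposal above.
    void unget(int numTokens) {
        if (numTokens > tokenLengths.size()) {
            throw new IllegalStateException("cannot back up " + numTokens
                + " tokens; only " + tokenLengths.size() + " buffered");
        }
        for (int i = 0; i < numTokens; i++) {
            pos -= tokenLengths.removeLast();
        }
    }

    int position() { return pos; }
}
```

After the first lexer has scanned, say, two tokens, unget(1) moves the
read position back to the start of the last token so a second lexer can
rescan from there.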

Another approach to implementation would be to make Token a bit
heavier.  It might carry the array of characters that formed it,
including those of adjacent skipped tokens.  Then putback(Token) or
unread(Token) would have enough information to back up the input to
the lexer.  An intermediate approach would have a buffer in the
lexer or its input and store a pair of indexes in the token.
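That intermediate approach could be sketched like this (again, all
names are illustrative, and a String stands in for the buffered
stream): each token stores a start/stop index pair into the buffer of
delivered characters, so unread(token) only has to reset the read
position to the token's start, and the token's text can be recovered
from the buffer on demand.

```java
// Sketch of the "intermediate approach": tokens carry a pair of
// indexes into a buffer held by the lexer's input, which is enough
// information to back the input up to the start of any buffered token.
class IndexedToken {
    final int type, start, stop; // [start, stop) into the input buffer
    IndexedToken(int type, int start, int stop) {
        this.type = type; this.start = start; this.stop = stop;
    }
}

class BufferedInput {
    private final String data;   // stand-in for the buffered stream
    private int pos = 0;
    BufferedInput(String data) { this.data = data; }

    int mark() { return pos; }   // lexer calls this at token start
    int read() { return pos < data.length() ? data.charAt(pos++) : -1; }

    // Build a token covering everything read since the given mark.
    IndexedToken emit(int type, int start) {
        return new IndexedToken(type, start, pos);
    }

    // Rewind to the first character of the given token.
    void unread(IndexedToken t) { pos = t.start; }

    // The token's text, recovered from the buffer on demand.
    String text(IndexedToken t) { return data.substring(t.start, t.stop); }
}
```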

One benefit already mentioned is allowing lexer switching in the
parser.  Another might be to switch to a different lexer when a parse
error is detected, allowing more powerful recovery from that error.

-- 
Pete Forman                -./\.-  Disclaimer: This post is originated
WesternGeco                  -./\.-   by myself and does not represent
pete.forman at westerngeco.com    -./\.-   opinion of Schlumberger, Baker
http://petef.port5.com           -./\.-   Hughes or their divisions.


 

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/