[antlr-interest] Another parsing question

Wed Aug 6 07:10:44 PDT 2008

Lexer states do not depend on communication between parser and lexer and are invisible to the parser--you just see different token types--and the lexer only uses lookahead to find token boundaries.  You switch lexer states between tokens.

I think that much of this discussion would be moot if ANTLR 3 lexers had the capabilities of ANTLR  2 lexers; unfortunately, that requires an efficient way of doing FOLLOW sets for unicode ranges--and no solution has yet presented itself for that.

--Loring

----- Original Message ----
From: Foust <javafoust at gmail.com>
To: Terence Parr <parrt at cs.usfca.edu>; antlr-interest at antlr.org
Sent: Tuesday, August 5, 2008 10:17:25 PM
Subject: Re: [antlr-interest] Another parsing question

RE: a lazy token evaluator, Ter wrote:
> the infinite lookahead makes comm between the two
impossible.
> It will simply consume less memory and allow
interactive parsers.

Backtracking would certainly be slower, but isn’t
there *some* method that would allow both infinite lookahead *and*
lexer states, in theory?

What do you think about keeping multiple lookahead
buffers, one per state?

1.  The
position in the input stream is saved before every rule, in case the parser
sets a state (using a provided API, so that the lexer will be able to handle the
housekeeping)

2.  The
parser asks the lexer for the next token (on demand -- if it needs more while
looking at that rule, it asks for more, up to the rest of the tokens in the
file, for an LL(*) parse)

3.  These
tokens are buffered, just as they are now, so they can be reused for
backtracking, but each lexer state has its own buffer of lookahead tokens.

4.  If
a parser rule alters the lexer state, housekeeping is performed to switch to
that state’s buffer and reset the input stream’s current position
appropriately

5.  Whenever
a rule is chosen, lookahead buffers for all other lexer states are invalidated.
(This is required because tokens
in different states could easily overlap. For example, in one state, an entire
line, or the contents of an entire block could be returned as a single token.)

6.  Whenever
a rule that changed state is exited, the previous state is restored, if int Lexer.pushState(int) was called, instead of Lexer.setState(int),
which would leave the state unchanged, until called again.

Brent

> -----Original Message-----
> From: antlr-interest-bounces at antlr.org
[mailto:antlr-interest-
> bounces at antlr.org] On Behalf Of Terence Parr
> Sent: Tuesday, August 05, 2008 2:53 PM
> To: Randall R Schulz
> Cc: antlr-interest at antlr.org
> Subject: Re: [antlr-interest] Another parsing
question
> 
> 
> On Aug 5, 2008, at 2:35 PM, Randall R Schulz wrote:
> 
> > On Tuesday 05 August 2008 14:22, Terence Parr
wrote:
> >> On Aug 5, 2008, at 12:44 PM, Foust wrote:
> >>>> Lexing is all done up front with no
input from the parser at all.
> >>>
> >>> Why was that done? Is it the price you
pay for infinite lookahead?
> >>
> >> Speed of access for ith char / token. 
Will build a lazy version. :)
> >
> > That's probably a good idea, but will you
preserve the current
> > semantics, making the lazy version simply a
performance variant, or
> > will you allow the parser to influence the
behavior of the lexer, as
> > so
> > many people initially believe to be possible
when the come to ANTLR?
> 
> Nope...the infinite lookahead makes comm between the
two impossible.
> It will simply consume less memory and allow
interactive parsers.
> 
> Ter

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.antlr.org/pipermail/antlr-interest/attachments/20080806/0e406fc7/attachment.html