[antlr-interest] Re: Positioning input stream (was EOL sequence)

Thu Dec 18 08:53:22 PST 2003

Yes, you do need to reset the lookahead buffer.  Doing your read from the
parser is a bad idea in general.  If you are strictly k=1 in your parser and
don't use any syntactic predicates then you may be able to do it reliably,
but I would strongly recommend doing it in the lexer.  ANTLR lexers are
powerful enough to actually be parsers in their own right.

Not being familiar with PostScript I'm not sure how practical that is.  For
this one rule you could use lexer states.  But wait, does the interpreter
use information on the stack for how many bytes to read?  

If so, you may be better off maintaining your stack in the lexer.  

The core of the problem is that the parser needs k tokens ahead of the
current match to be able to predict what to match next.  That "k" is at
least what you say k is, but with a syntactic predicate k is unbounded.  So
in the extreme case you may have already lexed the entire input stream
before you even start parsing.  The lookahead buffer is filled as needed, so
it doesn't always have k elements in it.
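
To make the fill-as-needed behaviour concrete, here is a minimal Java sketch.
It is not ANTLR's own buffer class; TokenBuffer, LA, consume and the "pulled"
counter are names made up for this illustration. It only shows that asking
for LA(i) drags i tokens out of the source before anything is matched.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical on-demand lookahead buffer; an illustration, not ANTLR's class. */
class LookaheadDemo {
    static class TokenBuffer {
        private final Iterator<String> source;
        private final List<String> buffer = new ArrayList<>();
        int pulled = 0; // number of tokens taken from the source so far

        TokenBuffer(Iterator<String> source) { this.source = source; }

        /** Returns the i-th lookahead token (1-based), filling only as far as needed. */
        String LA(int i) {
            while (buffer.size() < i && source.hasNext()) {
                buffer.add(source.next());
                pulled++;
            }
            return i <= buffer.size() ? buffer.get(i - 1) : "<EOF>";
        }

        /** Discards the current token, filling first so there is one to discard. */
        void consume() {
            LA(1);
            if (!buffer.isEmpty()) buffer.remove(0);
        }
    }

    public static void main(String[] args) {
        TokenBuffer tb = new TokenBuffer(
            Arrays.asList("currentfile", "read", "X").iterator());
        System.out.println("pulled before any lookahead: " + tb.pulled);
        tb.LA(1);
        System.out.println("pulled after LA(1): " + tb.pulled);
        tb.LA(2);
        System.out.println("pulled after LA(2): " + tb.pulled);
    }
}
```

Nothing has been consumed in main, yet the source has already moved on by
two tokens after LA(2). With a syntactic predicate the same thing happens
with no fixed bound on i.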

What is really happening below is that the lexer, which also has a lookahead
buffer, has already read the 'CR' and has it in its lookahead buffer.  It
has not lexed the whitespace yet.  The input stream has not read the LF yet.
Luckily for you, in this particular production the parser didn't need to
know LA(1) yet.  If it needed that then the whitespace would have been lexed
and skipped and then X would have been lexed, turned into a Token and put
into the parser's lookahead buffer.  The lexer would have read the following
LF to know to end lexing X and the input stream would be set at the
following CR.
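
The same effect at the character level can be reproduced in a few lines of
Java. This is only a sketch under my own assumptions: OneCharLookahead and
byteSeenByRawRead are hypothetical names, the scanner keeps a single
character of lookahead, and the "raw read" simply calls read() on the
underlying stream behind the scanner's back, the way a read done from the
parser would.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical one-character-lookahead scanner; not ANTLR's generated code. */
class OneCharLookahead {
    private static final int EMPTY = -2; // marks an empty lookahead slot
    private final InputStream in;
    private int la = EMPTY;

    OneCharLookahead(InputStream in) { this.in = in; }

    /** Returns the next character without consuming it, pulling it from the stream if needed. */
    int peek() throws IOException {
        if (la == EMPTY) la = in.read();
        return la;
    }

    /** Consumes and returns the next character. */
    int next() throws IOException {
        int c = peek();
        la = EMPTY;
        return c;
    }

    /** Lexes a name; it must peek one character past the name to see where it ends. */
    String name() throws IOException {
        StringBuilder sb = new StringBuilder();
        while (peek() != -1 && peek() != ' ' && peek() != '\r' && peek() != '\n') {
            sb.append((char) next());
        }
        return sb.toString();
    }

    /** Lexes the leading name, then reads one byte directly from the stream, bypassing the scanner. */
    static int byteSeenByRawRead(String text) {
        try {
            InputStream raw = new ByteArrayInputStream(text.getBytes(StandardCharsets.US_ASCII));
            new OneCharLookahead(raw).name(); // the first EOL byte now sits in the lookahead slot
            return raw.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // LF only: the LF is in the lookahead slot, so the raw read gets X.
        System.out.println(byteSeenByRawRead("read\nX\n") == 'X');
        // CR LF: the CR is in the lookahead slot, so the raw read gets the LF.
        System.out.println(byteSeenByRawRead("read\r\nX\r\n") == '\n');
    }
}
```

In the LF-only case the raw read happens to land on X, but the LF that was
already pulled into the lookahead slot is still there afterwards, which is
why the lexer later reports two whitespace sequences.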

Solution?  Do it in the lexer and switch lexer states when you know you're
going to read a fixed amount of data.  And before switching call the WS rule
to read all of the whitespace before the data.  I believe there is a note on
the antlr website or FAQ or manual about reading fixed length records for
more details.
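
As a rough illustration of "do it in the lexer", here is a hand-written Java
sketch rather than an ANTLR grammar. RawReadAfterEol, readName,
skipOneWhitespace and readOperand are hypothetical names. The point is that
the name, the whitespace and the raw byte all go through the same stream
object, so nothing is stranded in a buffer. I have assumed the whitespace to
skip is a single character, with CR LF counted as one, as in the four
examples quoted below; skipping a whole run of whitespace instead would be a
loop around the same method.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch: name, whitespace and raw data all read through one stream. */
class RawReadAfterEol {
    static boolean isWhite(int c) {
        return c == ' ' || c == '\t' || c == '\r' || c == '\n';
    }

    /** Reads a name and pushes the terminating character back onto the stream. */
    static String readName(PushbackInputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = in.read()) != -1 && !isWhite(c)) sb.append((char) c);
        if (c != -1) in.unread(c);
        return sb.toString();
    }

    /** Consumes one white-space character, treating CR LF as a single EOL. */
    static void skipOneWhitespace(PushbackInputStream in) throws IOException {
        int c = in.read();
        if (c == '\r') {
            int d = in.read();
            if (d != '\n' && d != -1) in.unread(d); // lone CR: keep the next byte
        } else if (c != -1 && !isWhite(c)) {
            in.unread(c); // not whitespace at all: leave it for the caller
        }
    }

    /** Lexes the operator name, skips its trailing EOL, and returns the next raw byte. */
    static int readOperand(String text) {
        try {
            PushbackInputStream in = new PushbackInputStream(
                new ByteArrayInputStream(text.getBytes(StandardCharsets.US_ASCII)));
            String name = readName(in);
            if (!name.equals("read")) {
                throw new IllegalArgumentException("expected read, got " + name);
            }
            skipOneWhitespace(in);
            return in.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String[] inputs = { "read 3", "read\r3", "read\n3", "read\r\n3" };
        for (String s : inputs) {
            System.out.println((char) readOperand(s));
        }
    }
}
```

All four EOL forms hand the same byte to the read operator here, because
the code that skips the EOL and the code that reads the data share one
one-byte pushback buffer.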

Monty

-----Original Message-----
From: skapp at rochester.rr.com [mailto:skapp at rochester.rr.com] 
Sent: Wednesday, December 17, 2003 7:01 PM
To: antlr-interest at yahoogroups.com
Subject: [antlr-interest] Re: Positioning input stream (was EOL sequence)

Thanks for all the replies to date. Terence, I did look at your 
parser, which was a partial PostScript parser, but I am currently 
far past your example. I am using "k=2" for the lexer. I have 
cleaned up some of the ambiguity warnings - thanks to many people.

I have no problem consuming whitespace when I am *parsing* or 
*lexing*. The problem arises with PostScript's read operators, which 
permit interruption of the parsing process to read arbitrary data 
from the current input stream. 

PostScript has almost no productions. Once a token is recognized, it 
is immediately executed by the parser. The parser does not have to 
match against any sequence of tokens - all tokens are standalone. In 
this example, 

currentfile read<LF>X<LF>

"currentfile" is recognized as a name token, passed to the parser, 
and immediately executed by the parser. Then "read" is recognized 
as a name token, passed to the parser, and immediately executed. Now 
the read operator pulls one byte from the input stream, in this case 
the "X" byte. For an EOL sequence of LF or CR, this executes as 
expected: the next read from the input stream does indeed return 
the "X" byte. However, when I return from executing the read 
operator, the lexer recognizes two whitespace sequences, an LF and 
another LF. I expected one, since the input stream should now be 
positioned past the X, so why is there another? Do I need to clear 
out the lookahead buffer, and if so, how do I do this? 

For PostScript, standalone white space is tossed out, so this 
particular sequence is not a big problem unless I want an accurate 
line number. But the following sequence is a problem.

currentfile read<CR><LF>X<CR><LF>

Here the read operator picks up the <LF> instead of the X. When I 
return from executing the read operator, the lexer recognizes a CR 
and the "X" character. Since "X" is not a valid PostScript name 
operator (semantics not syntax), the interpretation fails. 
PostScript expects the read operator to obtain the "X" character and 
the next whitespace sequence to be the final CR-LF. 

It seems like I need advance warning that a CR-LF sequence is coming 
before a name operator like "read" is executed. But the next token 
has not yet been requested by the parser.

Any thoughts on how to get out of this?

   Regards,

      Steve

--- In antlr-interest at yahoogroups.com, Terence Parr <parrt at c...> 
wrote:
> Don't forget my mini postscript interpreter lab I had my students do
> last semester....link on my course page at USF (CS652).
> 
> Ter
> On Wednesday, December 17, 2003, at 09:08  AM, Albert Huh wrote:
> 
> > i don't know too much about ps syntax, but you could simply make your
> > whitespace rule consume spaces as well as newlines in your lexer.  the
> > java example that comes with antlr does this.
> >
> > -----Original Message-----
> > From: skapp at r... [mailto:skapp at r...]
> > Sent: Wednesday, December 17, 2003 12:04 AM
> > To: antlr-interest at yahoogroups.com
> > Subject: [antlr-interest] Positioning input stream (was EOL sequence)
> >
> >
> > I have worked out enough details with the EOL sequences to
> > understand where my PostScript parser is failing. PostScript parsers
> > have to be able to handle the following four example sequences
> > identically:
> >
> > currentfile read 3
> > currentfile read<CR>3
> > currentfile read<LF>3
> > currentfile read<CR><LF>3
> >
> > where the "currentfile read" operator sequence instructs the
> > PostScript interpreter to read one byte from the input stream.
> >
> > There is no issue with the first three examples. The input stream
> > points just past the EOL byte once the "read" operator has been
> > recognized. Then the read operator simply has to pull one byte from
> > the input stream (a FileInputStream in this case).
> >
> > However, in the fourth case, the input stream points to the <LF>
> > character when the "read" operator has been recognized. The
> > PostScript spec states that "Any of the three forms of EOL ... is
> > treated as a single white-space character."
> >
> > How do I handle this? What can or should I do in the lexer versus in
> > the parser?
> >
> > Regards,
> >
> >    Steve
> >
> >
> >
> >
> >
> > Yahoo! Groups Links
> >
> > To visit your group on the web, go to:
> >  http://groups.yahoo.com/group/antlr-interest/
> >
> > To unsubscribe from this group, send an email to:
> >  antlr-interest-unsubscribe at yahoogroups.com
> >
> > Your use of Yahoo! Groups is subject to:
> >  http://docs.yahoo.com/info/terms/
> >
> --
> Professor Comp. Sci., University of San Francisco
> Creator, ANTLR Parser Generator, http://www.antlr.org
> Co-founder, http://www.jguru.com
> Co-founder, http://www.knowspam.net enjoy email again!
> Co-founder, http://www.peerscope.com link sharing, pure-n-simple
