[antlr-interest] Re: Positioning input stream (was EOL sequence)

Wed Dec 17 19:00:50 PST 2003

Thanks for all the replies to date. Terence, I did look at your 
parser, which was a partial PostScript parser, but I am currently 
far past your example. I am using "k=2" for the lexer. I have 
cleaned up some of the ambiguity warnings - thanks to many people.

I have no problem consuming whitespace when I am *parsing* or 
*lexing*. The problem arises with PostScript's read operators, which 
permit interruption of the parsing process to read arbitrary data 
from the current input stream. 

PostScript has almost no productions. Once a token is recognized, it 
is immediately executed by the parser. The parser does not have to 
match against any sequence of tokens - all tokens are standalone. In 
this example, 

currentfile read<LF>X<LF>

"currentfile" is recognized as a name token, passed to the parser, 
and is immediately executed by the parser. Then "read" is recognized 
as a name token, passed to the parser, and immediately executed. Now 
the read operator pulls one byte from the input stream, in this case 
the "X" byte. For an EOL sequence of LF or CR, this sequence 
executes as expected - the next read from the input stream does 
indeed return the "X" byte. However, when I return from executing 
the read operator, two whitespace sequences are recognized by the 
lexer, an LF and another LF. I expected one, since the input 
stream should now be positioned past the X - but why is there 
another? Do I need to clear out the lookahead buffer, and if so, how 
do I do this? 
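
To make the lookahead question concrete, here is a minimal sketch of 
one possible arrangement, not something ANTLR provides: the lexer 
peeks at upcoming bytes without consuming them, and the read operator 
pulls from the very same source, so the two cannot disagree about the 
stream position. The class SharedSource and its methods peek and next 
are hypothetical names for illustration only. The sketch assumes the 
lexer can be pointed at such a shared source, which I have not 
verified against the ANTLR version in use.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

/** Hypothetical helper: one byte source shared by the lexer and the read operator. */
final class SharedSource {
    private final PushbackInputStream in;

    /** maxLookahead is the largest k the lexer will ever peek (2 for a k=2 lexer). */
    SharedSource(InputStream raw, int maxLookahead) {
        this.in = new PushbackInputStream(raw, maxLookahead);
    }

    /**
     * Lexer-side lookahead: returns the i-th upcoming byte (1-based)
     * without consuming anything, or -1 if the stream ends first.
     * Requires 1 <= i <= maxLookahead.
     */
    int peek(int i) throws IOException {
        int[] buf = new int[i];
        int n = 0;
        while (n < i) {
            int b = in.read();
            if (b < 0) {
                break;
            }
            buf[n++] = b;
        }
        int result = (n == i) ? buf[i - 1] : -1;
        // Push the bytes back in reverse so they are re-read in original order.
        for (int j = n - 1; j >= 0; j--) {
            in.unread(buf[j]);
        }
        return result;
    }

    /** Consumes one byte; used by the lexer and by the read operator alike. */
    int next() throws IOException {
        return in.read();
    }
}
```

With this arrangement nothing has to be cleared after the read 
operator runs, because peeking never moves the position that the 
read operator sees.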

For PostScript, standalone white space is tossed out, so this 
particular sequence is not a big problem unless I want an accurate 
line number. But the following sequence is a problem.

currentfile read<CR><LF>X<CR><LF>

Here the read operator picks up the <LF> instead of the X. When I 
return from executing the read operator, the lexer recognizes a CR 
and the "X" character. Since "X" is not a valid PostScript name 
operator (semantics not syntax), the interpretation fails. 
PostScript expects the read operator to obtain the "X" character and 
the next whitespace sequence to be the final CR-LF. 

It seems like I need advance warning that a CR-LF sequence is coming 
before a name operator like "read" is executed. But the next token 
has not yet been requested by the parser.
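
One possible direction, sketched under assumptions rather than 
tested against ANTLR: instead of getting advance warning, consume 
the single white-space character that terminates a name token as 
part of finishing that token, and treat a CR-LF pair as that one 
character. The class EolSkipper and its methods are hypothetical 
names; the sketch works on a plain java.io.PushbackInputStream and 
assumes the lexer and the read operator share that stream. The 
white-space set (NUL, tab, LF, FF, CR, space) is my reading of the 
PostScript spec.

```java
import java.io.IOException;
import java.io.PushbackInputStream;

/** Hypothetical helper: consumes the one white-space unit that ends a token. */
final class EolSkipper {

    /** PostScript white-space bytes: NUL, tab, LF, FF, CR, space. */
    static boolean isWhitespace(int b) {
        return b == 0 || b == '\t' || b == '\n' || b == '\f' || b == '\r' || b == ' ';
    }

    /**
     * Consumes exactly one white-space unit if one is next in the stream.
     * A CR followed by LF counts as a single unit, so both bytes are consumed.
     * Any other byte is pushed back untouched.
     */
    static void skipOneWhitespace(PushbackInputStream in) throws IOException {
        int b = in.read();
        if (b < 0) {
            return; // end of stream, nothing to skip
        }
        if (b == '\r') {
            int following = in.read();
            if (following >= 0 && following != '\n') {
                in.unread(following); // lone CR: keep the following byte
            }
            return;
        }
        if (!isWhitespace(b)) {
            in.unread(b); // token was not terminated by white space
        }
    }
}
```

If skipOneWhitespace is called once right after the "read" name 
token has been matched and before the operator executes, the next 
byte in the stream is the "X" for all three EOL forms.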

Any thoughts on how to get out of this?

   Regards,

      Steve

--- In antlr-interest at yahoogroups.com, Terence Parr <parrt at c...> 
wrote:
> Don't forget my mini postscript interpreter lab I had my students do
> last semester....link on my course page at USF (CS652).
> 
> Ter
> On Wednesday, December 17, 2003, at 09:08  AM, Albert Huh wrote:
> 
> > i don't know too much about ps syntax, but you could simply make your
> > whitespace rule consume spaces as well as newlines in your lexer.  the
> > java example that comes with antlr does this.
> >
> > -----Original Message-----
> > From: skapp at r... [mailto:skapp at r...]
> > Sent: Wednesday, December 17, 2003 12:04 AM
> > To: antlr-interest at yahoogroups.com
> > Subject: [antlr-interest] Positioning input stream (was EOL sequence)
> >
> >
> > I have worked out enough details with the EOL sequences to
> > understand where my PostScript parser is failing. PostScript parsers
> > have to be able to handle the following four example sequences
> > identically:
> >
> > currentfile read 3
> > currentfile read<CR>3
> > currentfile read<LF>3
> > currentfile read<CR><LF>3
> >
> > where the "currentfile read" operator sequence instructs the
> > PostScript interpreter to read one byte from the input stream.
> >
> > There is no issue with the first three examples. The input stream
> > points just past the EOL byte once the "read" operator has been
> > recognized. Then the read operator simply has to pull one byte from
> > the input stream (a FileInputStream in this case).
> >
> > However, in the fourth case, the input stream points to the <LF>
> > character when the "read" operator has been recognized. The
> > PostScript spec states that "Any of the three forms of EOL ... is
> > treated as a single white-space character."
> >
> > How do I handle this? What can or should I do in the lexer versus in
> > the parser?
> >
> > Regards,
> >
> >    Steve
> >
> >
> >
> >
> >
> > Yahoo! Groups Links
> >
> > To visit your group on the web, go to:
> >  http://groups.yahoo.com/group/antlr-interest/
> >
> > To unsubscribe from this group, send an email to:
> >  antlr-interest-unsubscribe at yahoogroups.com
> >
> > Your use of Yahoo! Groups is subject to:
> >  http://docs.yahoo.com/info/terms/
> >
> --
> Professor Comp. Sci., University of San Francisco
> Creator, ANTLR Parser Generator, http://www.antlr.org
> Co-founder, http://www.jguru.com
> Co-founder, http://www.knowspam.net enjoy email again!
> Co-founder, http://www.peerscope.com link sharing, pure-n-simple
