[antlr-interest] Re: Positioning input stream (was EOL sequence)
skapp at rochester.rr.com
Wed Dec 17 19:00:50 PST 2003
Thanks for all the replies to date. Terence, I did look at your
parser, which was a partial PostScript parser, but I am currently
far past your example. I am using "k=2" for the lexer. I have
cleaned up some of the ambiguity warnings - thanks to many people.
I have no problem consuming whitespace when I am *parsing* or
*lexing*. The problem arises with PostScript's read operators, which
permit interruption of the parsing process to read arbitrary data
from the current input stream.
PostScript has almost no productions. Once a token is recognized, it
is immediately executed by the parser. The parser does not have to
match against any sequence of tokens - all tokens are standalone. In
this example,
currentfile read<LF>X<LF>
"currentfile" is recognized as a name token, passed to the parser,
and immediately executed. Then "read" is recognized as a name token,
passed to the parser, and immediately executed. The read operator
now pulls one byte from the input stream, in this case the "X" byte.
For an EOL sequence of LF or CR, this executes as expected: the next
read from the input stream does indeed return the "X" byte. However,
when I return from executing the read operator, the lexer recognizes
two whitespace sequences, an LF and another LF. I expected one, since
the input stream should now be positioned past the X, so why is there
another? Do I need to clear out the lookahead buffer, and if so, how
do I do this?
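
One plausible reading of this symptom is that the lexer has already
peeked at the delimiter after "read" and still holds it in its
lookahead when the operator reads directly from the underlying stream.
The sketch below reproduces that behaviour with plain java.io and a
hypothetical one-character lookahead class (OneCharLookahead and
simulate are illustrative names, not ANTLR classes); it is a model
under that assumption, not a description of ANTLR's internals.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class LookaheadDemo {

    /** Hypothetical one-character lookahead over a raw stream (not an ANTLR class). */
    static final class OneCharLookahead {
        private final InputStream raw;
        private int peeked;
        private boolean hasPeeked = false;

        OneCharLookahead(InputStream raw) { this.raw = raw; }

        /** Peek at the next character, pulling it from the raw stream if needed. */
        int la() throws IOException {
            if (!hasPeeked) { peeked = raw.read(); hasPeeked = true; }
            return peeked;
        }

        /** Discard the character that la() would return. */
        void consume() throws IOException {
            la();
            hasPeeked = false;
        }
    }

    private static boolean isNameChar(int c) {
        return c >= 'a' && c <= 'z';
    }

    /**
     * Lexes the leading name, then lets the "operator" read one byte straight
     * from the raw stream (bypassing the lookahead), then lets the lexer take
     * two more characters. Returns those three characters in that order.
     */
    static String simulate(String input) {
        try {
            InputStream raw = new ByteArrayInputStream(
                    input.getBytes(StandardCharsets.ISO_8859_1));
            OneCharLookahead lexer = new OneCharLookahead(raw);
            while (isNameChar(lexer.la())) lexer.consume();
            // The delimiter now sits in the lookahead, peeked but not consumed.
            char fromOperator = (char) raw.read();
            char first = (char) lexer.la();
            lexer.consume();
            char second = (char) lexer.la();
            lexer.consume();
            return "" + fromOperator + first + second;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Print character codes so CR (13) and LF (10) are visible.
        for (String input : new String[] {"read\nX\n", "read\r\nX\r\n"}) {
            StringBuilder codes = new StringBuilder();
            for (char c : simulate(input).toCharArray()) codes.append((int) c).append(' ');
            System.out.println(codes.toString().trim());
        }
    }
}
```

In this model the LF case gives the operator the X and then shows the
lexer two LFs (the stale one from the lookahead, then the real one),
while the CR-LF case gives the operator the LF and the lexer a CR
followed by X.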
For PostScript, standalone white space is tossed out, so this
particular sequence is not a big problem unless I want an accurate
line number. But the following sequence is a problem.
currentfile read<CR><LF>X<CR><LF>
Here the read operator picks up the <LF> instead of the X. When I
return from executing the read operator, the lexer recognizes a CR
and the "X" character. Since "X" is not a valid PostScript name
operator (semantics not syntax), the interpretation fails.
PostScript expects the read operator to obtain the "X" character and
the next whitespace sequence to be the final CR-LF.
It seems like I need advance warning that a CR-LF sequence is coming
before a name operator like "read" is executed. But the next token
has not yet been requested by the parser.
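
One way to avoid needing advance warning might be to have the lexer
swallow the single delimiter (including a CR-LF pair) as soon as it
finishes a name, and to have the read operator pull its byte through
the same stream object the lexer uses. The following is a minimal
sketch of that idea with java.io.PushbackInputStream. The class and
method names (SharedStreamScanner, nextName, readByte,
readAfterOperator) are made up for illustration, the white-space set
is simplified, and none of this is ANTLR-generated code.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class SharedStreamScanner {
    // One stream shared by the scanner and by the read operator.
    private final PushbackInputStream in;

    SharedStreamScanner(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        in = new PushbackInputStream(new ByteArrayInputStream(bytes), 1);
    }

    // Simplified white-space set (assumption: space, tab, CR, LF only).
    private static boolean isWhite(int c) {
        return c == ' ' || c == '\t' || c == '\r' || c == '\n';
    }

    /**
     * Scans one name and consumes exactly one trailing delimiter,
     * where CR LF counts as a single delimiter. Returns null at end of input.
     */
    String nextName() throws IOException {
        int c = in.read();
        while (isWhite(c)) c = in.read();          // skip leading white space
        if (c == -1) return null;
        StringBuilder name = new StringBuilder();
        while (c != -1 && !isWhite(c)) {
            name.append((char) c);
            c = in.read();
        }
        // c is the delimiter and has already been consumed.
        // If it was CR, also swallow an immediately following LF.
        if (c == '\r') {
            int next = in.read();
            if (next != '\n' && next != -1) in.unread(next);
        }
        return name.toString();
    }

    /** Stand-in for the read operator: one byte from the shared stream. */
    int readByte() throws IOException {
        return in.read();
    }

    /** Scans two names ("currentfile", "read"), then reads one byte. */
    static char readAfterOperator(String text) {
        try {
            SharedStreamScanner s = new SharedStreamScanner(text);
            s.nextName();
            s.nextName();
            return (char) s.readByte();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String[] inputs = {
            "currentfile read 3", "currentfile read\r3",
            "currentfile read\n3", "currentfile read\r\n3",
            "currentfile read\r\nX\r\n"
        };
        for (String input : inputs) System.out.println(readAfterOperator(input));
    }
}
```

Because the delimiter is consumed eagerly and nothing is left in a
separate lookahead buffer, the byte handed to the read operator is
the one immediately after the EOL in all of these inputs.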
Any thoughts on how to get out of this?
Regards,
Steve
--- In antlr-interest at yahoogroups.com, Terence Parr <parrt at c...>
wrote:
> Don't forget my mini postscript interpreter lab I had my students do
> last semester....link on my course page at USF (CS652).
>
> Ter
> On Wednesday, December 17, 2003, at 09:08 AM, Albert Huh wrote:
>
> > i don't know too much about ps syntax, but you could simply make your
> > whitespace rule consume spaces as well as newlines in your lexer. the
> > java example that comes with antlr does this.
> >
> > -----Original Message-----
> > From: skapp at r... [mailto:skapp at r...]
> > Sent: Wednesday, December 17, 2003 12:04 AM
> > To: antlr-interest at yahoogroups.com
> > Subject: [antlr-interest] Positioning input stream (was EOL sequence)
> >
> >
> > I have worked out enough details with the EOL sequences to
> > understand where my PostScript parser is failing. PostScript parsers
> > have to be able to handle the following four example sequences
> > identically:
> >
> > currentfile read 3
> > currentfile read<CR>3
> > currentfile read<LF>3
> > currentfile read<CR><LF>3
> >
> > where the "currentfile read" operator sequence instructs the
> > PostScript interpreter to read one byte from the input stream.
> >
> > There is no issue with the first three examples. The input stream
> > points just past the EOL byte once the "read" operator has been
> > recognized. Then the read operator simply has to pull one byte from
> > the input stream (a FileInputStream in this case).
> >
> > However, in the fourth case, the input stream points to the <LF>
> > character when the "read" operator has been recognized. The
> > PostScript spec states that "Any of the three forms of EOL ... is
> > treated as a single white-space character."
> >
> > How do I handle this? What can or should I do in the lexer versus in
> > the parser?
> >
> > Regards,
> >
> > Steve
> >
> >
> >
> >
> >
> > Yahoo! Groups Links
> >
> > To visit your group on the web, go to:
> > http://groups.yahoo.com/group/antlr-interest/
> >
> > To unsubscribe from this group, send an email to:
> > antlr-interest-unsubscribe at yahoogroups.com
> >
> > Your use of Yahoo! Groups is subject to:
> > http://docs.yahoo.com/info/terms/
> --
> Professor Comp. Sci., University of San Francisco
> Creator, ANTLR Parser Generator, http://www.antlr.org
> Co-founder, http://www.jguru.com
> Co-founder, http://www.knowspam.net enjoy email again!
> Co-founder, http://www.peerscope.com link sharing, pure-n-simple