[antlr-interest] Re: Positioning input stream (was EOL sequence)
mzukowski at yci.com
Fri Dec 19 08:05:30 PST 2003
Are you calling read() from your lexer or parser? If it's from the parser,
there are still potential synchronization issues. If you find yourself
wondering how in the world it could be screwing up, I would look there first.
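To make the synchronization issue concrete, here is a small stand-alone sketch in plain Java, with no ANTLR involved. BufferedLexer and DesyncDemo are made-up names, and the single character of lookahead is an assumption chosen for illustration; a real lexer may buffer more. It shows what a byte read taken directly from the underlying stream returns after "currentfile" and "read" have been lexed.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

class DesyncDemo {
    // Lex "currentfile" and "read", then read one byte directly from the
    // stream, bypassing the lexer's lookahead.
    static int readBypassingLexer(String text) {
        ByteArrayInputStream in =
            new ByteArrayInputStream(text.getBytes(StandardCharsets.US_ASCII));
        BufferedLexer lexer = new BufferedLexer(in);
        lexer.nextName(); // currentfile
        lexer.nextName(); // read
        return in.read();
    }

    public static void main(String[] args) {
        System.out.println(readBypassingLexer("currentfile read\nX\n"));     // 88, i.e. 'X'
        System.out.println(readBypassingLexer("currentfile read\r\nX\r\n")); // 10, i.e. LF
    }
}

// Hypothetical stand-in for a lexer that keeps one character of lookahead.
class BufferedLexer {
    private final ByteArrayInputStream in;
    private int lookahead; // pulled from the stream but not yet consumed

    BufferedLexer(ByteArrayInputStream in) {
        this.in = in;
        this.lookahead = in.read();
    }

    // Skip leading white space, then collect characters up to the next white space.
    String nextName() {
        while (lookahead != -1 && Character.isWhitespace(lookahead)) {
            lookahead = in.read();
        }
        StringBuilder sb = new StringBuilder();
        while (lookahead != -1 && !Character.isWhitespace(lookahead)) {
            sb.append((char) lookahead);
            lookahead = in.read();
        }
        return sb.toString();
    }
}
```

With a one-byte EOL the terminating character sits in the lexer's lookahead and the stream happens to be positioned at the X. With CR LF only the CR is in the lookahead, so the direct read picks up the LF.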
Monty
-----Original Message-----
From: skappskapp [mailto:skapp at rochester.rr.com]
Sent: Friday, December 19, 2003 3:04 AM
To: antlr-interest at yahoogroups.com
Subject: [antlr-interest] Re: Positioning input stream (was EOL sequence)
Thanks for the advice from all. Unfortunately, the records are not
fixed length - in general they can be any length. The image
operators usually read a fixed amount of data. I've run into other
PostScript code that uses the read operators to discard lines of
input that should not be interpreted. So I cannot make any
assumptions about the amount of data to be read.
However, I did find a solution, albeit a somewhat strange one.
Instead of attempting to ensure that I could synchronize the input
stream that the read operators were using with the lookahead stream
that the scanner was using, I simply built on top of the scanner.
Since I have no syntactic predicates in my grammar, I added a read()
method to my lexer:
public int read() throws CharStreamException {
    // Peek at the lexer's next character; consume it unless at end of input.
    int output = LA(1);
    if (output == EOF_CHAR) {
        output = PSFile.EOF;
    }
    else {
        consume();
    }
    return output;
}
Now the file operators are reading from the same "stream" that the
lexer is. Once I match a name object in the lexer, the
consumeWhiteSpace() method is called:
public void consumeWhiteSpace() throws CharStreamException {
    char value = LA(1);
    if (value == '\r') {
        // CR alone or CR LF counts as a single end-of-line.
        consume();
        if (LA(1) == '\n') {
            consume();
        }
        newline();
    }
    else if (value == '\n') {
        consume();
        newline();
    }
    else if ((value == ' ') || (value == '\t') ||
             (value == '\f') || (value == '\0')) {
        // Any other single white-space character.
        consume();
    }
}
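For anyone who wants to try the idea outside ANTLR, here is a minimal stand-alone sketch of the same approach. SharedScanner, la1(), and nextName() are hypothetical stand-ins for the generated lexer, LA(1), and the name rule; the input is a String rather than a file, and EOF is simply -1 in place of PSFile.EOF. The point it illustrates is that the tokenizer and the read operator advance one shared position.

```java
class SharedScanner {
    static final int EOF = -1; // plays the role of PSFile.EOF
    private final String input;
    private int pos = 0;
    private int line = 1;

    SharedScanner(String input) { this.input = input; }

    // One character of lookahead, analogous to LA(1).
    int la1() { return pos < input.length() ? input.charAt(pos) : EOF; }

    void consume() { if (pos < input.length()) pos++; }

    int getLine() { return line; }

    static boolean isWhite(int c) {
        return c == ' ' || c == '\t' || c == '\r' || c == '\n' || c == '\f' || c == '\0';
    }

    // Match one name, then consume the single white-space character that ends it.
    String nextName() {
        StringBuilder sb = new StringBuilder();
        while (la1() != EOF && !isWhite(la1())) {
            sb.append((char) la1());
            consume();
        }
        consumeWhiteSpace();
        return sb.toString();
    }

    // CR, LF, and CR LF each count as one white-space character.
    void consumeWhiteSpace() {
        int c = la1();
        if (c == '\r') {
            consume();
            if (la1() == '\n') consume();
            line++;
        } else if (c == '\n') {
            consume();
            line++;
        } else if (c == ' ' || c == '\t' || c == '\f' || c == '\0') {
            consume();
        }
    }

    // What the read operator calls: same buffer, same position as the tokenizer.
    int read() {
        int c = la1();
        if (c != EOF) consume();
        return c;
    }

    public static void main(String[] args) {
        for (String eol : new String[] {" ", "\r", "\n", "\r\n"}) {
            SharedScanner s = new SharedScanner("currentfile read" + eol + "X" + eol);
            s.nextName(); // currentfile
            s.nextName(); // read
            System.out.println((char) s.read()); // X in all four cases
        }
    }
}
```

Because the white space after "read" is consumed as part of matching the name, the following read() sees the X regardless of which EOL form was used.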
Any objections out there as to why this should not or could not work?
Steve
--- In antlr-interest at yahoogroups.com, mzukowski at y... wrote:
> Yes, you do need to reset the lookahead buffer. Doing your read from the
> parser is a bad idea in general. If you are strictly k=1 in your parser and
> don't use any syntactic predicates then you may be able to do it reliably,
> but I would strongly recommend doing it in the lexer. ANTLR lexers are
> powerful enough to actually be parsers in their own right.
>
> Not being familiar with PostScript I'm not sure how practical that is. For
> this one rule you could use lexer states. But wait, does the interpreter
> use information on the stack for how many bytes to read?
>
> If so, you may be better off maintaining your stack in the lexer.
>
> The core of the problem is that the parser needs k tokens ahead of the
> current match to be able to predict what to match next. That "k" is at
> least what you say k is, but with a syntactic predicate k is unbounded. So
> in the extreme case you may have already lexed the entire input stream
> before you even start parsing. The lookahead buffer is filled as needed, so
> it doesn't always have k elements in it.
>
> What is really happening below is that the lexer, which also has a lookahead
> buffer, has already read the 'CR' and has it in its lookahead buffer. It
> has not lexed the whitespace yet. The input stream has not read the LF yet.
> Luckily for you, in this particular production the parser didn't need to
> know LA(1) yet. If it needed that then the whitespace would have been lexed
> and skipped and then X would have been lexed, turned into a Token and put
> into the parser's lookahead buffer. The lexer would have read the following
> LF to know to end lexing X and the input stream would be set at the
> following CR.
>
> Solution? Do it in the lexer and switch lexer states when you know you're
> going to read a fixed amount of data. And before switching call the WS rule
> to read all of the whitespace before the data. I believe there is a note on
> the antlr website or FAQ or manual about reading fixed length records for
> more details.
>
> Monty
>
> -----Original Message-----
> From: skapp at r... [mailto:skapp at r...]
> Sent: Wednesday, December 17, 2003 7:01 PM
> To: antlr-interest at yahoogroups.com
> Subject: [antlr-interest] Re: Positioning input stream (was EOL sequence)
>
> Thanks for all the replies to date. Terence, I did look at your
> parser, which was a partial PostScript parser, but I am currently
> far past your example. I am using "k=2" for the lexer. I have
> cleaned up some of the ambiguity warnings - thanks to many people.
>
> I have no problem consuming whitespace when I am *parsing* or
> *lexing*. The problem arises with PostScript's read operators, which
> permit interruption of the parsing process to read arbitrary data
> from the current input stream.
>
> PostScript has almost no productions. Once a token is recognized, it
> is immediately executed by the parser. The parser does not have to
> match against any sequence of tokens - all tokens are standalone. In
> this example,
>
> currentfile read<LF>X<LF>
>
> "currentfile" is recognized as a name token, passed to the parser,
> and is immediately executed by the parser. Then "read" is recognized
> as a name token, passed to the parser and immediately executed. Now
> the read operator pulls one byte from the input stream, in this case
> the "X" byte. For an EOL sequence of LF or CR,
> this sequence executes as expected - the next read from the input
> stream does indeed return the "X" byte. However, when I return from
> executing the read operator, two whitespace sequences are recognized
> by the lexer, an LF and another LF. I expected one since the input
> stream should now be positioned past the X - but why is there
> another? Do I need to clear out the lookahead buffer, and if so, how
> do I do this?
>
> For PostScript, standalone white space is tossed out, so this
> particular sequence is not a big problem unless I want an accurate
> line number. But the following sequence is a problem.
>
> currentfile read<CR><LF>X<CR><LF>
>
> Here the read operator picks up the <LF> instead of the X. When I
> return from executing the read operator, the lexer recognizes a CR
> and the "X" character. Since "X" is not a valid PostScript name
> operator (semantics not syntax), the interpretation fails.
> PostScript expects the read operator to obtain the "X" character and
> the next whitespace sequence to be the final CR-LF.
>
> It seems like I need advance warning that a CR-LF sequence is coming
> before a name operator like "read" is executed. But the next token
> has not yet been requested by the parser.
>
> Any thoughts on how to get out of this?
>
> Regards,
>
> Steve
>
> --- In antlr-interest at yahoogroups.com, Terence Parr <parrt at c...> wrote:
> > Don't forget my mini postscript interpreter lab I had my students do
> > last semester....link on my course page at USF (CS652).
> >
> > Ter
> > On Wednesday, December 17, 2003, at 09:08 AM, Albert Huh wrote:
> >
> > > i don't know too much about ps syntax, but you could simply make your
> > > whitespace rule consume spaces as well as newlines in your lexer. the
> > > java example that comes with antlr does this.
> > >
> > > -----Original Message-----
> > > From: skapp at r... [mailto:skapp at r...]
> > > Sent: Wednesday, December 17, 2003 12:04 AM
> > > To: antlr-interest at yahoogroups.com
> > > Subject: [antlr-interest] Positioning input stream (was EOL sequence)
> > >
> > >
> > > I have worked out enough details with the EOL sequences to
> > > understand where my PostScript parser is failing. PostScript parsers
> > > have to be able to handle the following four example sequences
> > > identically:
> > >
> > > currentfile read 3
> > > currentfile read<CR>3
> > > currentfile read<LF>3
> > > currentfile read<CR><LF>3
> > >
> > > where the "currentfile read" operator sequence instructs the
> > > PostScript interpreter to read one byte from the input stream.
> > >
> > > There is no issue with the first three examples. The input stream
> > > points just past the EOL byte once the "read" operator has been
> > > recognized. Then the read operator simply has to pull one byte from
> > > the input stream (a FileInputStream in this case).
> > >
> > > However, in the fourth case, the input stream points to the <LF>
> > > character when the "read" operator has been recognized. The
> > > PostScript spec states that "Any of the three forms of EOL ... is
> > > treated as a single white-space character."
> > >
> > > How do I handle this? What can or should I do in the lexer versus in
> > > the parser?
> > >
> > > Regards,
> > >
> > > Steve
> > >
> > >
> > >
> > >
> > >
> > > Yahoo! Groups Links
> > >
> > > To visit your group on the web, go to:
> > > http://groups.yahoo.com/group/antlr-interest/
> > >
> > > To unsubscribe from this group, send an email to:
> > > antlr-interest-unsubscribe at yahoogroups.com
> > >
> > > Your use of Yahoo! Groups is subject to:
> > > http://docs.yahoo.com/info/terms/
> > >
> > --
> > Professor Comp. Sci., University of San Francisco
> > Creator, ANTLR Parser Generator, http://www.antlr.org
> > Co-founder, http://www.jguru.com
> > Co-founder, http://www.knowspam.net enjoy email again!
> > Co-founder, http://www.peerscope.com link sharing, pure-n-simple
>