[antlr-interest] Re: Positioning input stream (was EOL sequence)

Fri Dec 19 08:05:30 PST 2003

Are you calling read() from your lexer or parser?  If it's from the parser
there are still potentially issues of synchronization.  If you find yourself
wondering how in the world it could be screwing up I would look there first.

Monty

-----Original Message-----
From: skappskapp [mailto:skapp at rochester.rr.com] 
Sent: Friday, December 19, 2003 3:04 AM
To: antlr-interest at yahoogroups.com
Subject: [antlr-interest] Re: Positioning input stream (was EOL sequence)

Thanks for the advice from all. Unfortunately, the records are not 
fixed length - in general they can be any length. The image 
operators usually read a fixed amount of data. I've run into other 
PostScript code that uses the read operators to discard lines of 
input that should not be interpreted. So I cannot make any 
assumptions about the amount of data to be read.

However I did find a solution, albeit a somewhat strange one. 
Instead of attempting to ensure that I could synchronize the input 
stream that the read operators were using with the lookahead stream 
that the scanner was using, I simply built on top of the scanner. 
Since I have no syntactic predicates in my grammar, I added a read() 
method to my lexer:

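// Lets the PostScript file operators read through the lexer's own
// lookahead, so both share a single position in the input.
public int read() throws CharStreamException {
    int output = LA(1);
    if (output == EOF_CHAR) {
        // Map the lexer's end-of-input marker to the interpreter's EOF value.
        output = PSFile.EOF;
    }
    else {
        consume();
    }
    return output;
}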

Now the file operators are reading from the same "stream" that the 
lexer is. Once I match a name object in the lexer, the 
consumeWhiteSpace() method is called:

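// Consumes the single white-space character that ends a name token.
// CR, LF and CR LF each count as one character and advance the line count.
public void consumeWhiteSpace() throws CharStreamException {
    char value = LA(1);
    if (value == '\r') {
        consume();
        if (LA(1) == '\n') {
            consume();   // CR LF is a single EOL
        }
        newline();
    }
    else if (value == '\n') {
        consume();
        newline();
    }
    else if ((value == ' ') || (value == '\t') ||
             (value == '\f') || (value == '\0')) {
        consume();
    }
}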
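
For anyone who wants to try the idea without ANTLR on the classpath, here is a minimal standalone sketch of the same approach. It is only an illustration under stated assumptions: SharedCharSource, la(), nextName() and byteSeenByRead() are hypothetical names invented for this sketch, not ANTLR API and not the lexer described above. The point it shows is that when the read operator and the scanner share one position, all four EOL forms hand the same byte to read.

```java
// Hypothetical stand-in for a lexer's character lookahead. None of these
// names come from ANTLR or from the code in this thread.
final class SharedCharSource {
    static final int EOF = -1;

    private final String input;
    private int pos = 0;
    private int line = 1;

    SharedCharSource(String input) {
        this.input = input;
    }

    /** Peeks at the i-th character ahead (1-based) without consuming it. */
    int la(int i) {
        int index = pos + i - 1;
        return index < input.length() ? input.charAt(index) : EOF;
    }

    /** Advances past the current character, if there is one. */
    void consume() {
        if (pos < input.length()) {
            pos++;
        }
    }

    /** What a read operator would call: one character from the shared position. */
    int read() {
        int c = la(1);
        if (c != EOF) {
            consume();
        }
        return c;
    }

    /** Consumes one white-space character; CR, LF and CR LF each count as one. */
    void consumeWhiteSpace() {
        int c = la(1);
        if (c == '\r') {
            consume();
            if (la(1) == '\n') {
                consume();
            }
            line++;
        } else if (c == '\n') {
            consume();
            line++;
        } else if (c == ' ' || c == '\t' || c == '\f' || c == '\0') {
            consume();
        }
    }

    /** Reads a run of non-white-space characters as a name token. */
    String nextName() {
        StringBuilder sb = new StringBuilder();
        while (la(1) != EOF && !isWhiteSpace(la(1))) {
            sb.append((char) la(1));
            consume();
        }
        return sb.toString();
    }

    int line() {
        return line;
    }

    private static boolean isWhiteSpace(int c) {
        return c == ' ' || c == '\t' || c == '\r' || c == '\n'
                || c == '\f' || c == '\0';
    }

    /** Lexes "currentfile read", then returns the character read would see. */
    static int byteSeenByRead(String text) {
        SharedCharSource src = new SharedCharSource(text);
        src.nextName();           // currentfile
        src.consumeWhiteSpace();
        src.nextName();           // read
        src.consumeWhiteSpace();  // the single EOL after the operator
        return src.read();
    }
}
```

Calling SharedCharSource.byteSeenByRead() with "currentfile read 3", "currentfile read\r3", "currentfile read\n3" and "currentfile read\r\n3" returns '3' in each case, because the CR LF pair is consumed as one white-space character before read runs.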

Any objections out there as to why this should not or could not work?

   Steve

--- In antlr-interest at yahoogroups.com, mzukowski at y... wrote:
> Yes, you do need to reset the lookahead buffer.  Doing your read from the
> parser is a bad idea in general.  If you are strictly k=1 in your parser and
> don't use any syntactic predicates then you may be able to do it reliably,
> but I would strongly recommend doing it in the lexer.  ANTLR lexers are
> powerful enough to actually be parsers in their own right.
> 
> Not being familiar with PostScript I'm not sure how practical that is.  For
> this one rule you could use lexer states.  But wait, does the interpreter
> use information on the stack for how many bytes to read?
> 
> If so, you may be better off maintaining your stack in the lexer.
> 
> The core of the problem is that the parser needs k tokens ahead of the
> current match to be able to predict what to match next.  That "k" is at
> least what you say k is, but with a syntactic predicate k is unbounded.  So
> in the extreme case you may have already lexed the entire input stream
> before you even start parsing.  The lookahead buffer is filled as needed, so
> it doesn't always have k elements in it.
> 
> What is really happening below is that the lexer, which also has a lookahead
> buffer, has already read the 'CR' and has it in its lookahead buffer.  It
> has not lexed the whitespace yet.  The input stream has not read the LF yet.
> Luckily for you, in this particular production the parser didn't need to
> know LA(1) yet.  If it needed that then the whitespace would have been lexed
> and skipped and then X would have been lexed, turned into a Token and put
> into the parser's lookahead buffer.  The lexer would have read the following
> LF to know to end lexing X and the input stream would be set at the
> following CR.
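
The effect described in the paragraph above can be reproduced with a small standalone model. This is a simplified sketch, not ANTLR code: LookaheadDesyncDemo and its methods are hypothetical names, and the model assumes exactly one character of lookahead over a ByteArrayInputStream. The lexer in this thread uses k=2 and ANTLR's own buffering, so exact positions there may differ. The sketch shows that a read which bypasses the lexer sees whatever follows the lexer's lookahead character.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical model of a lexer holding one character of lookahead.
final class LookaheadDesyncDemo {
    private final ByteArrayInputStream in;
    private int la;  // already pulled from 'in' but not yet consumed

    LookaheadDesyncDemo(String text) {
        in = new ByteArrayInputStream(text.getBytes(StandardCharsets.US_ASCII));
        la = in.read();
    }

    private void consume() {
        la = in.read();
    }

    private static boolean isWhiteSpace(int c) {
        return c == ' ' || c == '\t' || c == '\r' || c == '\n'
                || c == '\f' || c == '\0';
    }

    /** Skips leading white space, then lexes one name token. */
    String nextName() {
        while (la != -1 && isWhiteSpace(la)) {
            consume();
        }
        StringBuilder sb = new StringBuilder();
        while (la != -1 && !isWhiteSpace(la)) {
            sb.append((char) la);
            consume();
        }
        return sb.toString();
    }

    /** Reads directly from the stream, ignoring the lexer's lookahead. */
    int readFromUnderlyingStream() {
        return in.read();
    }

    /** Lexes "currentfile" and "read", then reads behind the lexer's back. */
    static int byteSeenByRead(String text) {
        LookaheadDesyncDemo lexer = new LookaheadDesyncDemo(text);
        lexer.nextName();  // currentfile
        lexer.nextName();  // read; the EOL character now sits in 'la'
        return lexer.readFromUnderlyingStream();
    }
}
```

In this model, byteSeenByRead("currentfile read\nX\n") returns 'X' only because the single LF happens to be the lookahead character, while byteSeenByRead("currentfile read\r\nX\r\n") returns the LF, since the CR is in the lookahead and the stream is positioned just after it.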
> 
> Solution?  Do it in the lexer and switch lexer states when you know you're
> going to read a fixed amount of data.  And before switching call the WS rule
> to read all of the whitespace before the data.  I believe there is a note on
> the antlr website or FAQ or manual about reading fixed length records for
> more details.
> 
> Monty
> 
> -----Original Message-----
> From: skapp at r... [mailto:skapp at r...] 
> Sent: Wednesday, December 17, 2003 7:01 PM
> To: antlr-interest at yahoogroups.com
> Subject: [antlr-interest] Re: Positioning input stream (was EOL sequence)
> 
> Thanks for all the replies to date. Terence, I did look at your 
> parser, which was a partial PostScript parser, but I am currently 
> far past your example. I am using "k=2" for the lexer. I have 
> cleaned up some of the ambiguity warnings - thanks to many people.
> 
> I have no problem consuming whitespace when I am *parsing* or
> *lexing*. The problem arises with PostScript's read operators, which
> permit interruption of the parsing process to read arbitrary data
> from the current input stream.
> 
> PostScript has almost no productions. Once a token is recognized, it
> is immediately executed by the parser. The parser does not have to
> match against any sequence of tokens - all tokens are standalone. In
> this example,
> 
> currentfile read<LF>X<LF>
> 
> "currentfile" is recognized as a name token, passed to the parser,
> and is immediately executed by the parser. Then "read" is recognized
> as a name token, passed to the parser and immediately executed. Now
> the read operator pulls one byte from the input stream, in this case
> the "X" byte. For an EOL sequence of LF or CR, this sequence executes
> as expected - the next read from the input stream does indeed return
> the "X" byte. However, when I return from executing the read
> operator, two whitespace sequences are recognized by the lexer, a LF
> and another LF. I expected one since the input stream should now be
> positioned past the X - but why is there another? Do I need to clear
> out the lookahead buffer, and if so, how do I do this?
> 
> For PostScript, standalone white space is tossed out, so this 
> particular sequence is not a big problem unless I want an accurate 
> line number. But the following sequence is a problem.
> 
> currentfile read<CR><LF>X<CR><LF>
> 
> Here the read operator picks up the <LF> instead of the X. When I
> return from executing the read operator, the lexer recognizes a CR
> and the "X" character. Since "X" is not a valid PostScript name
> operator (semantics not syntax), the interpretation fails.
> PostScript expects the read operator to obtain the "X" character and
> the next whitespace sequence to be the final CR-LF.
> 
> It seems like I need advance warning that a CR-LF sequence is coming
> before a name operator like "read" is executed. But the next token
> has not yet been requested by the parser.
> 
> Any thoughts on how to get out of this?
> 
>    Regards,
> 
>       Steve
>  
> 
> 
> 
> 
> --- In antlr-interest at yahoogroups.com, Terence Parr <parrt at c...> wrote:
> > Don't forget my mini postscript interpreter lab I had my students do
> > last semester....link on my course page at USF (CS652).
> > 
> > Ter
> > On Wednesday, December 17, 2003, at 09:08  AM, Albert Huh wrote:
> > 
> > > i don't know too much about ps syntax, but you could simply make your
> > > whitespace rule consume spaces as well as newlines in your lexer.  the
> > > java example that comes with antlr does this.
> > >
> > > -----Original Message-----
> > > From: skapp at r... [mailto:skapp at r...]
> > > Sent: Wednesday, December 17, 2003 12:04 AM
> > > To: antlr-interest at yahoogroups.com
> > > Subject: [antlr-interest] Positioning input stream (was EOL sequence)
> > >
> > >
> > > I have worked out enough details with the EOL sequences to
> > > understand where my PostScript parser is failing. PostScript parsers
> > > have to be able to handle the following four example sequences
> > > identically:
> > >
> > > currentfile read 3
> > > currentfile read<CR>3
> > > currentfile read<LF>3
> > > currentfile read<CR><LF>3
> > >
> > > where the "currentfile read" operator sequence instructs the
> > > PostScript interpreter to read one byte from the input stream.
> > >
> > > There is no issue with the first three examples. The input stream
> > > points just past the EOL byte once the "read" operator has been
> > > recognized. Then the read operator simply has to pull one byte from
> > > the input stream (a FileInputStream in this case).
> > >
> > > However, in the fourth case, the input stream points to the <LF>
> > > character when the "read" operator has been recognized. The
> > > PostScript spec states that "Any of the three forms of EOL ... is
> > > treated as a single white-space character."
> > >
> > > How do I handle this? What can or should I do in the lexer versus in
> > > the parser?
> > >
> > > Regards,
> > >
> > >    Steve
> > >
> > >
> > >
> > >
> > >
> > > Yahoo! Groups Links
> > >
> > > To visit your group on the web, go to:
> > >  http://groups.yahoo.com/group/antlr-interest/
> > >
> > > To unsubscribe from this group, send an email to:
> > >  antlr-interest-unsubscribe at yahoogroups.com
> > >
> > > Your use of Yahoo! Groups is subject to:
> > >  http://docs.yahoo.com/info/terms/
> > >
> > --
> > Professor Comp. Sci., University of San Francisco
> > Creator, ANTLR Parser Generator, http://www.antlr.org
> > Co-founder, http://www.jguru.com
> > Co-founder, http://www.knowspam.net enjoy email again!
> > Co-founder, http://www.peerscope.com link sharing, pure-n-simple
> 
