[antlr-interest] Re: Positioning input stream (was EOL sequence)

Fri Dec 19 08:06:44 PST 2003

Also, I would inspect the generated code around any of your read()s to
convince yourself that it will behave.  Reading the generated code is a
great way to learn what antlr is doing, and that code is designed to be read
by humans (except the tree building parts :))

Monty

-----Original Message-----
From: skappskapp [mailto:skapp at rochester.rr.com] 
Sent: Friday, December 19, 2003 3:04 AM
To: antlr-interest at yahoogroups.com
Subject: [antlr-interest] Re: Positioning input stream (was EOL sequence)

Thanks for the advice from all. Unfortunately, the records are not 
fixed length - in general they can be any length. The image 
operators usually read a fixed amount of data. I've run into other 
PostScript code that uses the read operators to discard lines of 
input that should not be interpreted. So I cannot make any 
assumptions about the amount of data to be read.

However I did find a solution, albeit a somewhat strange one. 
Instead of attempting to ensure that I could synchronize the input 
stream that the read operators were using with the lookahead stream 
that the scanner was using, I simply built on top of the scanner. 
Since I have no syntactic predicates in my grammar, I added a read() 
method to my lexer:

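// Read one character for the PostScript file operators, going through
// the lexer's own lookahead so that both stay in step.
public int read() throws CharStreamException {
    int output = LA(1);
    if (output == EOF_CHAR) {
        // Map the lexer's EOF_CHAR to the interpreter's EOF marker.
        output = PSFile.EOF;
    }
    else {
        consume();
    }
    return output;
}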
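
To illustrate why reading through the lexer's lookahead avoids the
synchronization problem, here is a minimal standalone sketch. It does not
use ANTLR: PeekingScanner, la1(), skipName() and LookaheadDemo are
hypothetical stand-ins, assuming a single character of lookahead over a
java.io.Reader.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical scanner with one character of lookahead (not ANTLR code). */
class PeekingScanner {
    static final int EOF = -1;
    private final Reader in;          // underlying stream
    private int buffered;             // the single lookahead character
    private boolean filled = false;   // true once 'buffered' holds a character

    PeekingScanner(Reader in) { this.in = in; }

    /** Peek at the next character without consuming it, in the spirit of LA(1). */
    int la1() throws IOException {
        if (!filled) {
            buffered = in.read();
            filled = true;
        }
        return buffered;
    }

    /** Discard the lookahead character, in the spirit of consume(). */
    void consume() throws IOException {
        la1();
        filled = false;
    }

    /** Read one character through the lookahead buffer. */
    int read() throws IOException {
        int c = la1();
        if (c != EOF) {
            consume();
        }
        return c;
    }

    /** Skip a run of lowercase letters; this peeks one character past the name. */
    void skipName() throws IOException {
        while (la1() >= 'a' && la1() <= 'z') {
            consume();
        }
    }
}

class LookaheadDemo {
    /** Scan a name, then read one character directly from the underlying stream. */
    static int rawReadAfterName(String input) {
        try {
            Reader raw = new StringReader(input);
            PeekingScanner lexer = new PeekingScanner(raw);
            lexer.skipName();
            return raw.read();   // bypasses whatever sits in the lookahead buffer
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Scan a name, then read up to n characters through the scanner itself. */
    static String sharedReadAfterName(String input, int n) {
        try {
            PeekingScanner lexer = new PeekingScanner(new StringReader(input));
            lexer.skipName();
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < n; i++) {
                int c = lexer.read();
                if (c == PeekingScanner.EOF) break;
                sb.append((char) c);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String input = "read\r\nX\r\n";
        // The CR is already buffered, so a raw read sees the LF, not the X.
        System.out.println(rawReadAfterName(input) == '\n');
        // Reading through the scanner returns every character in order.
        System.out.println(sharedReadAfterName(input, 3).equals("\r\nX"));
    }
}
```

In this sketch the raw read skips the buffered CR and returns the LF, while
reads routed through the scanner see CR, LF and X in order. White space is
still delivered to the reader here; skipping it is a separate step.
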

Now the file operators are reading from the same "stream" that the 
lexer is. Once I match a name object in the lexer, the 
consumeWhiteSpace() method is called:

// Consume the single white-space character that follows a name,
// treating CR, LF and CR-LF each as one EOL.
public void consumeWhiteSpace() throws CharStreamException {
    char value = LA(1);
    if (value == '\r') {
        consume();
        // A CR followed by an LF counts as one EOL.
        if (LA(1) == '\n') {
            consume();
        }
        newline();
    }
    else if (value == '\n') {
        consume();
        newline();
    }
    else if ((value == ' ') || (value == '\t') ||
             (value == '\f') || (value == '\0')) {
        consume();
    }
}
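
As a rough check of the idea, the following standalone sketch runs the four
"currentfile read" sequences from earlier in the thread through a
hypothetical one-character-lookahead scanner. EolScanner, EolDemo and
byteSeenByRead() are illustrative names only, not ANTLR classes, and the
line counter stands in for newline().

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical one-character-lookahead scanner; not ANTLR's CharScanner. */
class EolScanner {
    private static final int EMPTY = -2;   // marks an unfilled lookahead slot
    private final Reader in;
    private int la = EMPTY;
    int line = 1;                           // stands in for newline() bookkeeping

    EolScanner(Reader in) { this.in = in; }

    /** Peek at the next character, filling the lookahead slot if needed. */
    int la1() throws IOException {
        if (la == EMPTY) {
            la = in.read();
        }
        return la;
    }

    /** Drop the current lookahead character. */
    void consume() throws IOException {
        la1();
        la = EMPTY;
    }

    /** Skip the lowercase letters of a name such as "read". */
    void skipName() throws IOException {
        while (la1() >= 'a' && la1() <= 'z') {
            consume();
        }
    }

    /** Consume exactly one white-space character; CR, LF and CR-LF are one EOL. */
    void consumeWhiteSpace() throws IOException {
        int c = la1();
        if (c == '\r') {
            consume();
            if (la1() == '\n') {   // this peek may pull the data byte into lookahead
                consume();
            }
            line++;
        } else if (c == '\n') {
            consume();
            line++;
        } else if (c == ' ' || c == '\t' || c == '\f' || c == '\0') {
            consume();
        }
    }

    /** Read one character through the lookahead; returns -1 at end of input. */
    int read() throws IOException {
        int c = la1();
        if (c != -1) {
            consume();
        }
        return c;
    }
}

class EolDemo {
    /** Scan "currentfile read", skip one white-space character, then read one byte. */
    static int byteSeenByRead(String input) {
        try {
            EolScanner s = new EolScanner(new StringReader(input));
            s.skipName();            // currentfile
            s.consumeWhiteSpace();
            s.skipName();            // read
            s.consumeWhiteSpace();   // the single white-space character after the name
            return s.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String[] inputs = {
            "currentfile read 3",
            "currentfile read\r3",
            "currentfile read\n3",
            "currentfile read\r\n3"
        };
        for (String input : inputs) {
            System.out.println((char) byteSeenByRead(input));
        }
    }
}
```

Note that in the lone-CR case the check for a following LF has already
pulled the data byte into the lookahead slot, which is why read() has to go
through the same buffer rather than the underlying stream.
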

Any objections out there as to why this should not or could not work?

   Steve

--- In antlr-interest at yahoogroups.com, mzukowski at y... wrote:
> Yes, you do need to reset the lookahead buffer.  Doing your read from the
> parser is a bad idea in general.  If you are strictly k=1 in your parser and
> don't use any syntactic predicates then you may be able to do it reliably,
> but I would strongly recommend doing it in the lexer.  ANTLR lexers are
> powerful enough to actually be parsers in their own right.
> 
> Not being familiar with PostScript I'm not sure how practical that is.  For
> this one rule you could use lexer states.  But wait, does the interpreter
> use information on the stack for how many bytes to read?
> 
> If so, you may be better off maintaining your stack in the lexer.  
> 
> The core of the problem is that the parser needs k tokens ahead of the
> current match to be able to predict what to match next.  That "k" is at
> least what you say k is, but with a syntactic predicate k is unbounded.  So
> in the extreme case you may have already lexed the entire input stream
> before you even start parsing.  The lookahead buffer is filled as needed, so
> it doesn't always have k elements in it.
> 
> What is really happening below is that the lexer, which also has a lookahead
> buffer, has already read the 'CR' and has it in its lookahead buffer.  It
> has not lexed the whitespace yet.  The input stream has not read the LF yet.
> Luckily for you, in this particular production the parser didn't need to
> know LA(1) yet.  If it needed that then the whitespace would have been lexed
> and skipped and then X would have been lexed, turned into a Token and put
> into the parser's lookahead buffer.  The lexer would have read the following
> LF to know to end lexing X and the input stream would be set at the
> following CR.
> 
> Solution?  Do it in the lexer and switch lexer states when you know you're
> going to read a fixed amount of data.  And before switching call the WS rule
> to read all of the whitespace before the data.  I believe there is a note on
> the antlr website or FAQ or manual about reading fixed length records for
> more details.
> 
> Monty
> 
> -----Original Message-----
> From: skapp at r... [mailto:skapp at r...] 
> Sent: Wednesday, December 17, 2003 7:01 PM
> To: antlr-interest at yahoogroups.com
> Subject: [antlr-interest] Re: Positioning input stream (was EOL sequence)
> 
> Thanks for all the replies to date. Terence, I did look at your 
> parser, which was a partial PostScript parser, but I am currently 
> far past your example. I am using "k=2" for the lexer. I have 
> cleaned up some of the ambiguity warnings - thanks to many people.
> 
> I have no problem consuming whitespace when I am *parsing* or 
> *lexing*. The problem arises with PostScript's read operators, which 
> permit interruption of the parsing process to read arbitrary data 
> from the current input stream. 
> 
> PostScript has almost no productions. Once a token is recognized, it 
> is immediately executed by the parser. The parser does not have to 
> match against any sequence of tokens - all tokens are standalone. In 
> this example, 
> 
> currentfile read<LF>X<LF>
> 
> "currentfile" is recognized as a name token, passed to the parser, 
> and is immediately executed by the parser. Then "read" is recognized 
> as a name token, passed to the parser and immediately executed. Now 
> the read operator pulls one byte from the input stream, in this case 
> the "X" byte. For an EOL sequence of LF or CR, this sequence executes 
> as expected - the next read from the input stream does indeed return 
> the "X" byte. However, when I return from executing the read 
> operator, two whitespace sequences are recognized by the lexer, an LF 
> and another LF. I expected one since the input stream should now be 
> positioned past the X - but why is there another? Do I need to clear 
> out the lookahead buffer, and if so, how do I do this? 
> 
> For PostScript, standalone white space is tossed out, so this 
> particular sequence is not a big problem unless I want an accurate 
> line number. But the following sequence is a problem.
> 
> currentfile read<CR><LF>X<CR><LF>
> 
> Here the read operator picks up the <LF> instead of the X. When I 
> return from executing the read operator, the lexer recognizes a CR 
> and the "X" character. Since "X" is not a valid PostScript name 
> operator (semantics not syntax), the interpretation fails. 
> PostScript expects the read operator to obtain the "X" character and 
> the next whitespace sequence to be the final CR-LF. 
> 
> It seems like I need advance warning that a CR-LF sequence is coming 
> before a name operator like "read" is executed. But the next token 
> has not yet been requested by the parser.
> 
> Any thoughts on how to get out of this?
> 
>    Regards,
> 
>       Steve
>  
> 
> 
> 
> 
> --- In antlr-interest at yahoogroups.com, Terence Parr <parrt at c...> wrote:
> > Don't forget my mini postscript interpreter lab I had my students do 
> > last semester....link on my course page at USF (CS652).
> > 
> > Ter
> > On Wednesday, December 17, 2003, at 09:08  AM, Albert Huh wrote:
> > 
> > > i don't know too much about ps syntax, but you could simply make your 
> > > whitespace rule consume spaces as well as newlines in your lexer.  the 
> > > java example that comes with antlr does this.
> > >
> > > -----Original Message-----
> > > From: skapp at r... [mailto:skapp at r...]
> > > Sent: Wednesday, December 17, 2003 12:04 AM
> > > To: antlr-interest at yahoogroups.com
> > > Subject: [antlr-interest] Positioning input stream (was EOL sequence)
> > >
> > >
> > > I have worked out enough details with the EOL sequences to
> > > understand where my PostScript parser is failing. PostScript parsers
> > > have to be able to handle the following four example sequences
> > > identically:
> > >
> > > currentfile read 3
> > > currentfile read<CR>3
> > > currentfile read<LF>3
> > > currentfile read<CR><LF>3
> > >
> > > where the "currentfile read" operator sequence instructs the
> > > PostScript interpreter to read one byte from the input stream.
> > >
> > > There is no issue with the first three examples. The input stream
> > > points just past the EOL byte once the "read" operator has been
> > > recognized. Then the read operator simply has to pull one byte from
> > > the input stream (a FileInputStream in this case).
> > >
> > > However, in the fourth case, the input stream points to the <LF>
> > > character when the "read" operator has been recognized. The
> > > PostScript spec states that "Any of the three forms of EOL ... is
> > > treated as a single white-space character."
> > >
> > > How do I handle this? What can or should I do in the lexer versus in
> > > the parser?
> > >
> > > Regards,
> > >
> > >    Steve
> > >
> > >
> > >
> > >
> > >
> > > Yahoo! Groups Links
> > >
> > > To visit your group on the web, go to:
> > >  http://groups.yahoo.com/group/antlr-interest/
> > >
> > > To unsubscribe from this group, send an email to:
> > >  antlr-interest-unsubscribe at yahoogroups.com
> > >
> > > Your use of Yahoo! Groups is subject to:
> > >  http://docs.yahoo.com/info/terms/
> > >
> > >
> > >
> > >
> > >
> > --
> > Professor Comp. Sci., University of San Francisco
> > Creator, ANTLR Parser Generator, http://www.antlr.org
> > Co-founder, http://www.jguru.com
> > Co-founder, http://www.knowspam.net enjoy email again!
> > Co-founder, http://www.peerscope.com link sharing, pure-n-simple
> 
> 
>  
> 