[antlr-interest] Accessing input stream object with ANTLR and C++?

Sun Sep 8 23:51:32 PDT 2002

(Apologies again for taking so long to follow up.  I've been trying to
read up on these things and avoid asking for help again, but I don't
quite have my head in the ANTLR mode....  I very much appreciate your
replies!)

 Ric Klaren writes:

 RK> On Tue, Aug 27, 2002 at 02:47:22PM -0600, Reid Rivenburgh wrote:
 >> So there's no good way to do my own reading of the input a la
 >> flex?

 RK> Well you have to go about it in a different way than in flex.

And that's the part that has me confused....  How?!  See below....

 >> Perhaps some trick of finding a token, marking the location,
 >> finding the next token, and processing the data between the two as
 >> a string...?

 RK> Well this is what you can do with tokenstream multiplexing. You
 RK> see the token, you switch to another lexer until the endmarker
 RK> then you switch back. Inside the special lexer for the part
 RK> between the markers you can do whatever you want with
 RK> it. e.g. accumulate in a string, just ignore it, feed it to
 RK> something else.

I just don't quite see how this would work.  Briefly, I have defined a
grammar that matches certain bytes in a binary stream; each matched
byte is followed by a variable number of bytes associated with that
token (AKA a "segment").  The grammar defines the order in which these
segments can occur.  As far as the parser is concerned, the segments
are black boxes.

Unfortunately, there is no end marker to indicate the end of a
segment.  The two bytes following the matched token byte do contain
the number of bytes in that segment, though, which could help.

My original thinking was that when I match a token, I would switch
control to my own code for reading the segment bytes from the
(std::)stream.  When finished reading the segment, I would reset the
parser (perhaps implicitly by modifying the stream) to resume parsing
at the end of the segment, where the stream should be pointing at the
next token.  This is my naive approach, but....
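
To make the segment layout concrete, here is a minimal standalone
sketch in plain C++ (no ANTLR involved) of reading one segment directly
from a std::istream.  The names Segment, readSegment and
readAllSegments are hypothetical, and two details are assumptions that
the description above does not settle: that the two length bytes are
big-endian, and that the length counts only the bytes after the length
field.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical representation of one segment: the token byte the
// grammar matches, plus the opaque bytes that belong to it.
struct Segment {
    std::uint8_t tag = 0;
    std::string  payload;
};

// Reads one segment from `in` into `out`.
// Returns false at end of input or if the data is truncated.
// ASSUMPTION: the length is big-endian and counts payload bytes only.
bool readSegment(std::istream& in, Segment& out) {
    const int tag = in.get();
    const int hi  = in.get();
    const int lo  = in.get();
    if (!in) return false;  // ran out of input inside the 3-byte header

    // get() yields values in 0..255 whenever the stream is still good.
    assert(tag >= 0 && tag <= 0xFF && hi >= 0 && hi <= 0xFF && lo >= 0 && lo <= 0xFF);

    const std::size_t length =
        (static_cast<std::size_t>(hi) << 8) | static_cast<std::size_t>(lo);

    out.tag = static_cast<std::uint8_t>(tag);
    out.payload.assign(length, '\0');
    in.read(&out.payload[0], static_cast<std::streamsize>(length));

    // After a successful read the stream sits on the next token byte.
    return static_cast<std::size_t>(in.gcount()) == length;
}

// Convenience wrapper: splits an in-memory buffer into segments,
// stopping at the first incomplete one.
std::vector<Segment> readAllSegments(const std::string& bytes) {
    std::istringstream in(bytes);
    std::vector<Segment> result;
    Segment seg;
    while (readSegment(in, seg)) result.push_back(seg);
    return result;
}
```

Feeding readAllSegments the bytes 01 00 03 'a' 'b' 'c' 02 00 00 would
give two segments under these assumptions: tag 1 with payload "abc",
and tag 2 with an empty payload.  The open question is how to do the
equivalent without reading behind the lexer's back.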

You stated that I could create a second lexer for reading a segment, a
scenario where ANTLR would still control reading the input.  Since
there are no tokens defined for this second lexer, would it just match
"."  (all characters), calling my code to populate a string or other
structure with each incoming byte?  That sounds wasteful....  Bear in
mind I don't know how long each segment is until I start reading it,
so I can't create a static rule to outright match N bytes.

As soon as I've read the proper number of bytes, I could switch back
to the main lexer.
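
As a rough model of that counting idea, the sketch below is a
hand-written byte-at-a-time scanner, not ANTLR-generated code and not
the ANTLR API.  It only illustrates the mode switching: a "main" mode
that sees the token byte, and a "segment" mode that accepts any byte,
counts down, and hands control back when the count reaches zero.
SegmentScanner and its members are hypothetical names, and the
big-endian, payload-only length is again an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical segment record: token byte plus opaque payload.
struct Segment {
    std::uint8_t tag = 0;
    std::string  payload;
};

class SegmentScanner {
public:
    // Feeds one input byte.  Returns true when this byte completed a
    // segment, i.e. when control would switch back to the main lexer.
    bool feed(std::uint8_t byte) {
        switch (mode_) {
        case Mode::Tag:          // main mode: the byte is a token
            current_.tag = byte;
            current_.payload.clear();
            mode_ = Mode::LengthHigh;
            return false;
        case Mode::LengthHigh:   // ASSUMPTION: big-endian length
            remaining_ = static_cast<std::size_t>(byte) << 8;
            mode_ = Mode::LengthLow;
            return false;
        case Mode::LengthLow:
            remaining_ |= byte;
            current_.payload.reserve(remaining_);
            if (remaining_ == 0) { mode_ = Mode::Tag; return true; }
            mode_ = Mode::Payload;
            return false;
        case Mode::Payload:      // segment mode: match any byte, count down
            assert(remaining_ > 0);
            current_.payload.push_back(static_cast<char>(byte));
            if (--remaining_ == 0) { mode_ = Mode::Tag; return true; }
            return false;
        }
        return false;
    }

    // The segment most recently started or completed.
    const Segment& segment() const { return current_; }

    // True when the next byte is expected to be a token byte.
    bool inMainMode() const { return mode_ == Mode::Tag; }

private:
    enum class Mode { Tag, LengthHigh, LengthLow, Payload };
    Mode        mode_ = Mode::Tag;
    std::size_t remaining_ = 0;
    Segment     current_;
};
```

The per-byte work here is one append and one decrement, which is the
overhead I was worried about above; whether it matters in practice
would depend on the segment sizes.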

 RK> Going around the lexer input state will of course yield funny but
 RK> probably very 'interesting' behaviour.

I understand, but it's unfortunate; it's exactly what I'd like to do.

 RK> Maybe have a look at the doxygen info of the C++ support library
 RK> you can find a preliminary version on my antlr hacking page:

 RK> http://wwwhome.cs.utwente.nl/~klaren/antlr/

Very nice, much better than running "less" on the various files....

 RK> Or read through the code, see how the lexers work, keywords:
 RK> InputBuffer, LexerSharedInputState, CharScanner
 RK> (consume/LA). Read the code generated for a few lexers
 RK> (preferably a few that use backtracking). That way you'll easily
 RK> get a feel for how it works.

Well, it's all a bit overwhelming!  I'm sure I'd have an easier time
of it if I were doing something normal, like parsing HTML or a
programming language....

Thanks again for your help,
Reid
