[antlr-interest] Overriding INPUT->istream->consume
Jim Idle
jimi at temporal-wave.com
Thu Sep 13 09:22:40 PDT 2012
That's pretty much the way to do it.
Jim
> -----Original Message-----
> From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-
> bounces at antlr.org] On Behalf Of Justin Murray
> Sent: Thursday, September 13, 2012 6:25 AM
> To: antlr-interest at antlr.org
> Subject: [antlr-interest] Overriding INPUT->istream->consume
>
> Jim,
>
> I've decided that for my current project, I need to override the
> functionality in antlr3UTF8Consume(). I need to correctly handle '\r'
> when setting the token line numbers. This means counting a lone '\r' or
> a lone '\n' as one newline each, and counting '\r' '\n' in sequence as
> a single newline. This was easy enough to do (attached as a reference
> for others, since I could not find this anywhere).
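
As a standalone illustration of the newline rule described above, here is a
minimal sketch in plain C. The function name count_newlines is hypothetical
and is not part of the ANTLR runtime; it only shows the counting logic,
independent of the input stream structures.

```c
#include <stddef.h>

/* Hypothetical helper, not an ANTLR API.
 * Counts line terminators in buf[0..len):
 *   - a lone '\r' counts as one newline
 *   - a lone '\n' counts as one newline
 *   - "\r\n" in sequence counts as one newline
 */
static size_t count_newlines(const char *buf, size_t len)
{
    size_t lines = 0;
    size_t i;

    for (i = 0; i < len; i++) {
        if (buf[i] == '\r') {
            lines++;
        } else if (buf[i] == '\n') {
            /* The '\n' of a "\r\n" pair was already counted at the '\r'. */
            if (i == 0 || buf[i - 1] != '\r') {
                lines++;
            }
        }
    }
    return lines;
}
```

The look-behind on the previous byte is the same idea the attached consume
function uses when it sees '\n'.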
>
> What I have attached works, but notice that I had to redefine the
> arrays trailingBytesForUTF8 and offsetsFromUTF8 to use them in my
> version of the code. This is because they are declared as static in
> antlr3inputstream.c. I don't like the idea of modifying the distributed
> source for the runtime directly to make it not static (this is hard to
> maintain properly). I also don't like my current solution of just
> duplicating the code. My question for Jim is, is there a better way to
> do this, or is this something that could be improved in later versions
> of the runtime (this is 3.4)?
>
> Cheers,
>
> - Justin Murray
>
> ----
> @lexer::apifuncs
> {
> INPUT->istream->consume = customUTF8Consume;
> }
>
> @lexer::members
> {
> // ------------------------------------------------------
> // Following is from Unicode.org (see antlr3convertutf.c)
> //
>
> /// Index into the table below with the first byte of a UTF-8 sequence to
> /// get the number of trailing bytes that are supposed to follow it.
> /// Note that *legal* UTF-8 values can't have 4 or 5-bytes. The table is
> /// left as-is for anyone who may want to do such conversion, which was
> /// allowed in earlier algorithms.
> ///
> static const ANTLR3_UINT32 trailingBytesForUTF8[256] = {
> 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
> 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
> 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
> 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
> 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
> 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
> 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
> 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
> 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
> 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
> 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
> 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
> 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
> 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
> 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
> 3,3,3,3,3,3,3,3,4,4,4,4,5,5,5,5
> };
>
> /// Magic values subtracted from a buffer value during UTF8 conversion.
> /// This table contains as many values as there might be trailing bytes
> /// in a UTF-8 sequence.
> ///
> static const UTF32 offsetsFromUTF8[6] =
> { 0x00000000UL, 0x00003080UL, 0x000E2080UL, 0x03C82080UL,
> 0xFA082080UL, 0x82082080UL };
>
> // End of Unicode.org tables
> // -------------------------
>
> static void customUTF8Consume(pANTLR3_INT_STREAM is)
> {
>     pANTLR3_INPUT_STREAM input;
>     ANTLR3_UINT32 extraBytesToRead;
>     ANTLR3_UCHAR ch;
>     pANTLR3_UINT8 nextChar;
>
>     input = ((pANTLR3_INPUT_STREAM) (is->super));
>
>     nextChar = (pANTLR3_UINT8)input->nextChar;
>
>     if (nextChar < (((pANTLR3_UINT8)input->data) + input->sizeBuf))
>     {
>         // Indicate one more character in this line
>         //
>         input->charPositionInLine++;
>
>         // Are there more bytes needed to make up the whole thing?
>         //
>         extraBytesToRead = trailingBytesForUTF8[*nextChar];
>
>         if (nextChar + extraBytesToRead >=
>             (((pANTLR3_UINT8)input->data) + input->sizeBuf))
>         {
>             input->nextChar =
>                 (((pANTLR3_UINT8)input->data) + input->sizeBuf);
>             return;
>         }
>
>         // Cases deliberately fall through (see note A in
>         // antlrconvertutf.c)
>         // Legal UTF8 is only 4 bytes but 6 bytes could be used in old
>         // UTF8 so we allow it.
>         //
>         ch = 0;
>         switch (extraBytesToRead) {
>             case 5: ch += *nextChar++; ch <<= 6;
>             case 4: ch += *nextChar++; ch <<= 6;
>             case 3: ch += *nextChar++; ch <<= 6;
>             case 2: ch += *nextChar++; ch <<= 6;
>             case 1: ch += *nextChar++; ch <<= 6;
>             case 0: ch += *nextChar++;
>         }
>
>         // Magically correct the input value
>         //
>         ch -= offsetsFromUTF8[extraBytesToRead];
>         if (ch == '\n')
>         {
>             /* Reset for start of a new line of input */
>             if ((input->nextChar == input->data) ||
>                 (*((pANTLR3_UINT8)input->nextChar-1) != '\r'))
>             {
>                 // if it is the first character, or the previous
>                 // character was not a \r
>                 input->line++;
>             }
>
>             input->charPositionInLine = 0;
>             input->currentLine = (void *)nextChar;
>         }
>         else if (ch == '\r')
>         {
>             /* Reset for start of a new line of input */
>             input->line++;
>             input->charPositionInLine = 0;
>             input->currentLine = (void *)nextChar;
>         }
>
>         // Update input pointer
>         //
>         input->nextChar = nextChar;
>     }
> }
> }
> ----
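
One possible way to avoid copying the two tables, offered only as a sketch and
not as something the 3.4 runtime provides: the trailing-byte count can be
derived from the lead byte with a few comparisons. Because '\r' and '\n' are
always single-byte in UTF-8, line tracking only needs to look at the first
byte of each sequence, so under that assumption offsetsFromUTF8 would not be
needed for this purpose. The names utf8_trailing_bytes and utf8_next are
hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper, not an ANTLR API.
 * Returns the number of trailing bytes that follow the given lead byte.
 * The ranges mirror the trailingBytesForUTF8 table above, including the
 * legacy 5- and 6-byte forms that are not legal UTF-8 today.
 */
static unsigned utf8_trailing_bytes(unsigned char lead)
{
    if (lead < 0xC0) return 0;  /* ASCII, or a stray continuation byte */
    if (lead < 0xE0) return 1;
    if (lead < 0xF0) return 2;
    if (lead < 0xF8) return 3;
    if (lead < 0xFC) return 4;  /* legacy form */
    return 5;                   /* legacy form */
}

/* Hypothetical helper: advance past one UTF-8 sequence starting at p.
 * Requires p < end. If the sequence would run past end, returns end,
 * matching the truncation handling in the consume function above.
 */
static const unsigned char *utf8_next(const unsigned char *p,
                                      const unsigned char *end)
{
    size_t n = (size_t)utf8_trailing_bytes(*p) + 1;

    return ((size_t)(end - p) < n) ? end : p + n;
}
```

With helpers like these, a custom consume could test *nextChar against '\r'
and '\n' directly and then step the pointer with utf8_next, without decoding
the full code point.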