[antlr-interest] Both hidden and required whitespace

Thu May 8 14:26:33 PDT 2008

On Thu, May 8, 2008 at 1:58 PM, Gavin Lambert <antlr at mirality.co.nz> wrote:
>  >where Token 9 is 'numbers'.  I presume that WS is consuming the
>  >INDENT and thus I'm not seeing it in the stream.
>
>  It's more serious than that -- your grammar cannot possibly produce INDENT
> tokens, since it's a fragment rule.  So for starters, remove the 'fragment'
> from the INDENT rule.

That makes perfect sense; thanks for clarifying, Gavin.

>  This will work in most cases, but not all; for example, because it requires
> a leading newline it won't work on the first line of the file, and it also
> won't work if there is trailing whitespace on the line before the newline

Understood.

> (since it will already be in the WS rule at that point, and it will continue
> matching).  It also won't work if you have Windows end-of-lines, for a
> similar reason.

I'm using the standard "common token" definition of WS as it is on the
wiki, so each WS token is a single character.  My main question is,
therefore, how ANTLR would decide which token to generate (assuming I
make INDENT a token) once I hit strings or characters that fall into
the intersection of the strings matched by WS and INDENT.  With
lex/yacc, the token that matches the longer string would be the one
generated, e.g. INDENT, but I'm too much of an ANTLR newbie to know
exactly how ANTLR handles it.  READMEs / URLs accepted ;).
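
For reference, the longest-match rule described above can be sketched
in Python as follows.  This is only an illustration of lex-style
scanning, not a description of how ANTLR resolves the overlap; the
rule names and patterns are hypothetical stand-ins for WS and INDENT.

```python
import re

# Hypothetical token rules (assumptions, not taken from the real grammar).
# Listing order only matters when two rules match the same length.
RULES = [
    ("INDENT", re.compile(r"\n[ \t]+")),   # newline followed by leading blanks
    ("WS",     re.compile(r"[ \t\r\n]")),  # one whitespace character per token
    ("WORD",   re.compile(r"[A-Za-z]+")),
]

def tokenize(text):
    """Longest-match (maximal munch) scanning, as lex-style tools do."""
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_end = None, pos
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Strict '>' keeps the earlier rule when lengths are equal.
            if m and m.end() > best_end:
                best_name, best_end = name, m.end()
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens
```

Under this rule, "\n  " after a word comes out as one INDENT token
because it is longer than the single-character WS match, while a lone
"\n" followed directly by text falls back to WS.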

>  Where you should go from here depends on how complicated your grammar is
> already.  I had a similar need to express indentation in a grammar I worked
> on recently, but in that case it was simple enough (and had enough weird
> edge cases) that unhiding the WS rule and splitting it into separated WS and
> NL rules made the most sense.  Obviously this requires modifying all the
> parser rules to explicitly indicate where whitespace and newlines are
> permitted.

Yeah, my grammar is simple enough that the above is possible, but long
enough that it would be cumbersome.  I find the grammar much more
readable when it relies on the hidden token stream.

>  I believe someone has written a python grammar for ANTLR; that's an
> indent-sensitive language, so it might be useful looking at how it's handled
> there.

Good idea! I believe Terence wrote it and I'll check to see if it's
been ported to v3.
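
Indent-sensitive languages such as Python are commonly tokenized with
a stack of indentation widths, emitting INDENT when a line is deeper
than the top of the stack and DEDENT for each level it closes.  The
sketch below shows that general idea in Python; it is not taken from
the ANTLR Python grammar, and the function name and token shapes are
hypothetical.

```python
def indent_tokens(lines):
    """Turn leading-space counts into INDENT/DEDENT/LINE tokens."""
    stack = [0]   # indentation widths that are currently open
    out = []
    for line in lines:
        stripped = line.lstrip(" ")
        if not stripped.strip():
            continue                      # blank lines do not change indentation
        width = len(line) - len(stripped)
        if width > stack[-1]:
            stack.append(width)           # deeper than before: open a level
            out.append(("INDENT", width))
        while width < stack[-1]:
            stack.pop()                   # shallower: close levels one by one
            out.append(("DEDENT", stack[-1]))
        if width != stack[-1]:
            raise ValueError(f"inconsistent dedent to column {width}")
        out.append(("LINE", stripped.rstrip()))
    while len(stack) > 1:                 # close whatever is still open at EOF
        stack.pop()
        out.append(("DEDENT", stack[-1]))
    return out
```

Because the width is measured per line rather than matched as
"newline plus blanks", this approach does not depend on a leading
newline, so an indented first line still produces an INDENT.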

Thank you.

--Kaleb