[antlr-interest] Whitespace: More than meets the eye?

Graham Wideman gwlist at grahamwideman.com
Thu Aug 6 11:54:28 PDT 2009


Hi Sam:

>I'd lex $id and id entirely separately, as they are syntactically 
>distinct entities. $blah is always a variable, a "true" variable,

Tempting, but not necessarily the immediate winner because PHP also allows things like:

    $myvar = 'othervar';
    $$myvar = 'xxx';

... which means: "Get the string value of $myvar (here 'othervar'), use that string as the name of another variable ($othervar), and operate on that variable." So here $othervar gets assigned 'xxx'.

This use of $ can be repeated:   $$$$$$$$$myvar  (though it is hard to see a practical use for that).
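
As a rough illustration of the indirection described above (a Python sketch, not PHP or ANTLR code), the lookup can be modelled with a plain dict standing in for the symbol table. The function name resolve and the dict-based table are assumptions made up for this example; they are not part of PHP or of any grammar under discussion.

```python
def resolve(dollar_count, name, symbols):
    """Return the name of the variable that dollar_count '$' signs plus name refer to.

    Hypothetical helper: symbols is a dict mapping variable names to values.
    """
    if dollar_count < 1:
        raise ValueError("need at least one '$'")
    # The innermost '$' simply marks the identifier as a variable name.
    # Each additional '$' reads the current variable's string value and
    # uses it as the next variable name.
    for _ in range(dollar_count - 1):
        name = str(symbols[name])
    return name


symbols = {"myvar": "othervar"}      # $myvar = 'othervar';
target = resolve(2, "myvar", symbols)
symbols[target] = "xxx"              # $$myvar = 'xxx';
```

After this runs, the table holds an entry named othervar, mirroring the PHP example. A longer run of $ signs just repeats the same dereference step.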

Then there's

    ${'some' . 'var'}

Here { } encloses an arbitrary expression that yields a string (in this case 'somevar'), and $ again uses that string to access the variable $somevar.

Multiple $ and { } can be combined.  (And of course these appear in lengthier variable or method access expressions, with -> for member access and [ ] for array access.)

So, this all led me to want $ to be handled uniformly in the parser grammar (not lexer).

However, this discussion has prompted me to test exactly where whitespace is permitted or not permitted.  So far as my brief test has discovered, it turns out that the only place whitespace is NOT permitted is the simple case of $myvar:

    $myvar = 'x';      // OK
    $  myvar = 'y';    // Syntax error
    $  $myvar = 'z';   // Syntax OK!

So, this really argues in favor of making a special lexer token DollarIdentifier to handle this case, as you were arguing. Essentially, it distinguishes $ used as an id marker from $ used as an indirection operator.
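
To make the token split concrete, here is a minimal tokenizer sketch in Python (not ANTLR, and not the actual grammar). DollarIdentifier is the token name proposed above; Dollar, Identifier, Other, tokenize and is_simple_variable are names invented for this illustration. The identifier pattern is simplified to ASCII letters, digits and underscore, and the ${ expr } form is left out.

```python
import re

# Order matters: try the fused "$name" form before a bare "$".
TOKEN_RE = re.compile(r"""
    (?P<DollarIdentifier>\$[A-Za-z_][A-Za-z0-9_]*)
  | (?P<Dollar>\$)
  | (?P<Identifier>[A-Za-z_][A-Za-z0-9_]*)
  | (?P<WS>\s+)
  | (?P<Other>.)
""", re.VERBOSE)


def tokenize(text):
    """Return (token_type, text) pairs, dropping whitespace tokens."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        if match.lastgroup != "WS":
            tokens.append((match.lastgroup, match.group()))
    return tokens


def is_simple_variable(tokens):
    """Accept zero or more Dollar tokens followed by one DollarIdentifier."""
    kinds = [kind for kind, _ in tokens]
    return (bool(kinds)
            and kinds[-1] == "DollarIdentifier"
            and all(kind == "Dollar" for kind in kinds[:-1]))


ok_plain = is_simple_variable(tokenize("$myvar"))        # accepted
bad_gap = is_simple_variable(tokenize("$  myvar"))       # rejected
ok_indirect = is_simple_variable(tokenize("$  $myvar"))  # accepted
```

Because whitespace is skipped between tokens but cannot occur inside DollarIdentifier, "$  myvar" becomes Dollar followed by a plain Identifier and is rejected, while "$  $myvar" becomes Dollar followed by DollarIdentifier and is accepted, which matches the three test cases above.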

>$c->member should be three tokens - a VARIABLEID ($c), a MEMBER (->) and 
>an ID (member).

Yes, this is the form of the existing solution for this part of the grammar.

Anyhow -- thanks for prompting this advance in thinking :-).

-- Graham


