[antlr-interest] Whitespace: More than meets the eye?

Sam Barnett-Cormack s.barnett-cormack at lancaster.ac.uk
Thu Aug 6 01:59:11 PDT 2009


Graham Wideman wrote:
> Hi Sam,
> 
> Thanks for your comments. More below on your questions:
> 
>> I'm curious as to why you want to sometimes consider whitespace, though. 
>> Is this a self-designed language, or a specification you're working from 
>> that makes whitespace 'sometimes' significant?
>>
>> Your example was a function call or declaration. You can always get help 
>> from the lexer here if there are situations where there *must* be a 
>> space, and situations where there *mustn't* be a space, and nothing 
>> else... have tokens that include the lparen.
> 
> Yes, I am considering the least-messy way to tackle a few of these issues in PHP. (And the function example I gave was just a simple example, not a problem in PHP.)
> 
> One example that PHP has is the use of "$" as a prefix to identifiers, sometimes.
> 
> An ordinary variable:
> 
>     $myvar    = 'hello';
>     $othervar = $myvar;
> 
> Everywhere that such a variable appears, the dollar prefix is required, and no space is allowed. Now it's tempting to write the grammar as:
> 
> variableName 
>     : Dollar Identifier ...
> ...
> Identifier
>     : ('a'..'z' | 'A'..'Z' | '_')  ('a'..'z' | 'A'..'Z' | '0'..'9' | '_')*
> 
> This Identifier rule is good for all named things in PHP, but the parser rule would allow whitespace between $ and Identifier, which is not accepted by the actual PHP parser.  
> 
> So, maybe it's better to stick the "$" at the beginning of the lexer rule for Identifier (call it DollarIdentifier or something).
> 
> But then you get to variables that are members of a class/object. 
> 
>     class C {
>         var $mymember = 'Hello';
>         ...
>     }
>     $c = new C();
>     print $c->mymember;
> 
> Note how the declaration uses a $ prefix, but the usage does not (the only $ is on the object variable, not the id of the member variable).  But I'm somewhat loath to handle the $ sometimes in lexer rules, and sometimes in parser rules, as this seems apt to confuse later. (Maybe not... I haven't assessed how messy it gets going down this path.)
> 
> I do indeed see ways to lex/parse this more strictly, I'm just exploring for the least messy way.

My limited experience has shown me that the more complex way usually 
ends up less messy in the end...

I'd lex $id and id entirely separately, as they are syntactically 
distinct entities. $blah is always a variable, a "true" variable, and 
$c->member should be three tokens - a VARIABLEID ($c), a MEMBER (->) and 
an ID (member). If PHP requires there be no space between those tokens, 
then that might be a problem, but conceptually you'd parse it to a tree like

^(MEMBER VARIABLEID ID)

or, filling in values,

^(MEMBER $c member)
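
A rough sketch of that tokenisation, in Python rather than ANTLR so it can be run directly. The token names follow the ones above; the functions tokenize and parse_member_access are made up for illustration, and the identifier pattern is the one from the quoted Identifier rule. It only handles a single member access, not a real PHP expression.

```python
import re

# Token names follow the post; the patterns mirror the quoted Identifier rule.
# VARIABLEID includes the '$', so "$ c" (with a space) cannot lex as a variable.
TOKEN_SPEC = [
    ("VARIABLEID", r"\$[A-Za-z_][A-Za-z0-9_]*"),
    ("MEMBER", r"->"),
    ("ID", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("WS", r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(text):
    """Split text into (kind, value) pairs, dropping whitespace tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        if match.lastgroup != "WS":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


def parse_member_access(tokens):
    """Build the tuple equivalent of ^(MEMBER VARIABLEID ID) from three tokens."""
    kinds = [kind for kind, _ in tokens]
    if kinds != ["VARIABLEID", "MEMBER", "ID"]:
        raise ValueError(f"expected VARIABLEID MEMBER ID, got {kinds}")
    return ("MEMBER", tokens[0][1], tokens[2][1])


if __name__ == "__main__":
    print(parse_member_access(tokenize("$c->member")))
```

Because the '$' is part of the VARIABLEID pattern, a stray space after it is rejected by the lexer and the parser never has to think about it. This sketch skips whitespace between tokens, so it says nothing either way about whether PHP permits spaces around ->.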

The point being that -> is a member operator. Your tree walker would see 
the declaration of $member and give that class a member called member, 
perhaps, which the MEMBER operator would then find. It's easy to trim/add 
a '$' from a string.
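
As a sketch of the trimming and adding of the '$', again in Python and with invented helper names (strip_dollar, add_dollar, declare_member, has_member), assuming the class's members are kept as a plain set of names without the '$' and the tree is a tuple shaped like ^(MEMBER VARIABLEID ID):

```python
def strip_dollar(name):
    """Return name without a single leading '$', if it has one."""
    return name[1:] if name.startswith("$") else name


def add_dollar(name):
    """Return name with a leading '$', adding one only if it is missing."""
    return name if name.startswith("$") else "$" + name


def declare_member(members, declared_name):
    """Record a declaration such as 'var $member' under the bare name."""
    members.add(strip_dollar(declared_name))


def has_member(members, tree):
    """Check the ID child of a ("MEMBER", variable, id) tuple against members."""
    op, _variable, member_id = tree
    if op != "MEMBER":
        raise ValueError(f"expected a MEMBER node, got {op!r}")
    return member_id in members


if __name__ == "__main__":
    members = set()
    declare_member(members, "$member")
    print(has_member(members, ("MEMBER", "$c", "member")))
```

Storing the bare name means the declaration (with '$') and the usage (without) meet at the same key, so the lexer can keep $id and id as separate token types.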

-- 
Sam Barnett-Cormack

