[antlr-interest] Parsing HAML - significant and insignificant whitespaces

J. Stephen Riley Silber jsrs701 at yahoo.com
Wed Jul 15 04:08:02 PDT 2009


Hi,

In cases like this, I prefer not to jump through the parsing hoops that Python-style whitespace requires.  Instead, I write a preprocessor (in Perl, usually) that adds extra tokens to the source file, making the whitespace irrelevant again.

To modify your example, I would take

%A
    %B
        %B1
        %B2
    %C
        %C1

and add something (perhaps curly braces?) to indicate opening and closing a node:

%A {
    %B {
        %B1
        %B2
    }
    %C {
        %C1
    }
}

(It's a very easy hack.  Significant indentation makes for easy preprocessors, too. :-) 

For this format it's extremely simple to write an ANTLR grammar.  And of course the extra tokens are throw-away: they needn't be represented in the final AST at all.
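
The preprocessing step described above can be sketched roughly as follows. This is only an illustration, written in Python rather than Perl; the function name add_braces is made up for this example, and it assumes indentation is done with spaces (a tab would count as a single column) and that blank lines can be dropped. It keeps a stack of open indentation widths, so it does not need to know the indent step in advance.

```python
def add_braces(source: str) -> str:
    """Wrap every indented block in curly braces (hypothetical helper)."""
    # Drop blank lines and trailing whitespace; assumption: neither is significant.
    lines = [ln.rstrip() for ln in source.splitlines() if ln.strip()]
    out = []
    stack = []  # indentation widths of the nodes that are still open

    def width(line: str) -> int:
        # Number of leading whitespace characters on the line.
        return len(line) - len(line.lstrip())

    for i, line in enumerate(lines):
        indent = width(line)
        # Close every open node that is not an ancestor of this line.
        while stack and stack[-1] >= indent:
            out.append(" " * stack.pop() + "}")
        # Look ahead: a deeper next line means this node has children.
        next_indent = width(lines[i + 1]) if i + 1 < len(lines) else -1
        if next_indent > indent:
            out.append(line + " {")
            stack.append(indent)
        else:
            out.append(line)

    # Close whatever is still open at end of input.
    while stack:
        out.append(" " * stack.pop() + "}")
    return "\n".join(out)


if __name__ == "__main__":
    example = "%A\n    %B\n        %B1\n        %B2\n    %C\n        %C1\n"
    print(add_braces(example))
```

Run on the six-line example, this should produce the braced form shown above. Because only the relative depth of consecutive lines matters, the same code handles an indent of 2, 4, or any other width without configuration.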

--S

--- On Tue, 7/14/09, Dmitriy Nagirnyak <dnagir at gmail.com> wrote:

From: Dmitriy Nagirnyak <dnagir at gmail.com>
Subject: [antlr-interest] Parsing HAML - significant and insignificant whitespaces
To: antlr-interest at antlr.org
Date: Tuesday, July 14, 2009, 9:43 AM

Hi,

I am researching the possibility of parsing HAML syntax in order to port it to .NET. There is a project called NHAML, but it uses regular expressions instead of a real parser.
While it works great, it has certain limitations.


So people have started thinking about a real parser. Years ago I did some work with ANTLR, and now I have a chance to revisit it.

My question is about whitespace.
In NHAML, whitespace is significant at the beginning of a line.


What I would like to parse is this (each * stands for one whitespace character):

%A
**%B
****%B1
****%B2
**%C
****%C1

It would correspond to the same type of tree (A at the root; B and C as second-level nodes; B1, B2, and C1 as third-level nodes).


It would be easy if lines were always indented by the same number of whitespace characters (here 2).
But this should be configurable. Moreover, tabs might be used instead of spaces, but let's skip that for now.


So a grammar like this (just a quick draft) won't satisfy that:
nhaml    :    line*    
    ;
line    :    indent? rule
    ;
indent    :    WS WS indent? // How to consume different number of WSs depending on provided settings?
    ;
rule    :    ~WS (~NL)*
    ;

So the actual question is about the rule "indent".
If I don't know the required number of WS matches at development time, how can I write a grammar for that?

Cheers,

Dmitriy Nagirnyak.

