[antlr-interest] Comments and questions on a recent project

Tiller, Michael (M.M.) mtiller at ford.com
Mon Aug 26 06:06:15 PDT 2002

I recently used ANTLR to polish up a parser I had on the back burner, and I have a few questions.  I should qualify my remarks by pointing out that I don't really have much experience with parsers (I'm a mechanical engineer), and my applications are probably relatively simple compared to most.  This was just a pet project of mine to help me parse Modelica modeling language (http://www.modelica.org) files for some work-related utilities.  I'm not quite sure where to start, so I'll just try to cast this as a narrative and try not to ramble too much...
The first issue that comes to mind is that at the start of the project I essentially had to choose my target language up front.  I saw some comments in the group about Ter not wanting to put language options on the command line.  I'm not sure I ever saw the exact argument against this, but as a practical matter I found it quite annoying.  The result is that I had to put a few (not many) language-specific bits in my grammar (the target language option, handling of newlines, comments, etc.).  I would have preferred to keep it "pristine" for other potential users.
As part of the parsing stage, I tried to do my best to construct a succinct AST.  I eliminated lots of punctuation and extraneous keywords from the AST, and I tried to introduce some high-level "imaginary" tokens to represent the important nodes.  I suppose I could have reused the token types generated for the keywords/punctuation, but I feel it is slightly cleaner to create new ones specifically for "rule nodes".  I had a few difficulties with this.  The first was that there are some funny constructs in the language where some qualifiers appear in one rule and some appear in the rule above it.  Simple example:
stored_definition
  : ("final")? class_definition
  ;
class_definition
  : ("encapsulated")? ("partial")? restricted_class IDENT ...
      { ## = #([DEFINITION, "DEFINITION"], ##); }
  ;
The problem here is that while the "encapsulated" and "partial" qualifiers will appear within the "DEFINITION" node in my AST, it is hard to get the "final" one in there since it comes "from above".  I realize I could pass it down, but that introduces more target-language-specific modifications to the grammar.  I could also have added it to the "class_definition" AST after parsing the "stored_definition", but I couldn't see a way to do that without writing target-language-specific code.  Did I miss any built-in tree construction capabilities that would have allowed me to do this easily?
Another issue with my "imaginary nodes" comes on the tree parser side.  I tried to create a nice clean AST format on the parser side, since this allowed for a fairly simple tree parser grammar.  One concern I have (again, I'm not a parser person, so I'm probably missing something) is that I have different applications for my tree parser, and I'd like to embellish the AST in different ways depending on the application.  For example, in some cases I might be interested in resolving the fully qualified names of all my data elements.  So I'd like to associate such names with the instance names in my AST (which are not fully qualified).  What is the best way to do this?  I thought about using heterogeneous AST nodes, but that would make things quite complicated since the node types would potentially conflict from one application to another.  Using homogeneous AST nodes, I could certainly add sub-nodes with the information I want, but there are two issues I'm concerned about:
  - First, I'm not quite sure I appreciate yet how this will affect my tree parser.  Won't this essentially change my tree grammar, so that I end up with a specific tree grammar for each application?  Or can I do something clever to get it to "ignore" this extra information unless I specifically require it?
  - Second, it seems that I need to create additional imaginary node types to include such information (e.g. an "FQ_NAME" node type).  In order to be sure that such a definition is mutually exclusive with all other node type definitions, it has to be done in the original parser grammar file, doesn't it?
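To make the homogeneous-node idea concrete, here is a minimal Python sketch (these class and token names are made up for illustration; this is not ANTLR's actual C++ API).  The point is that extra annotation children can coexist with the original tree as long as readers recurse past node types they don't recognize:

```python
FQ_NAME = 200  # assumed imaginary token type, not taken from the real grammar

class Node:
    """Minimal homogeneous AST node: one token type, its text, and children."""
    def __init__(self, ttype, text, children=None):
        self.ttype = ttype
        self.text = text
        self.children = children or []

def annotate(component, fq_name):
    """Attach a fully qualified name as an extra FQ_NAME child node."""
    component.children.append(Node(FQ_NAME, fq_name))

def instance_names(node):
    """Collect FQ_NAME annotations, recursing past node types we don't know."""
    names = []
    for child in node.children:
        if child.ttype == FQ_NAME:
            names.append(child.text)
        else:
            names.extend(instance_names(child))
    return names
```

An application that doesn't care about the annotations simply never matches FQ_NAME, while one that does can fish the extra children out of any node.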
This reminds me of another issue.  It seems to me (again, I'm probably missing something) that the tree grammar stuff makes reuse difficult.  I imagine the "correct way" to do things is to define a complete tree grammar and then extend it for different applications.  That way you can override just the rules you are interested in.  But don't you have to rewrite the entire rule?  If so, that seems like a real pain when you are likely to attach actions to all rules, because you end up re-typing the whole grammar anyway.  I'm comparing this to something like a "visitor" approach, where you can just write the action and name the rule it is associated with.  Am I missing something?  Also, it seems like every application considerably complicates my Makefile, because I have to establish lots of new targets for the generated code from each tree grammar.
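The visitor-style alternative mentioned above can be sketched in a few lines of Python (names like Node and DEFINITION are illustrative, not ANTLR's API): actions are registered per node type, so each application writes only the handlers it cares about instead of restating a whole tree grammar.

```python
DEFINITION = 100  # assumed imaginary token type

class Node:
    """Minimal homogeneous AST node for the sketch."""
    def __init__(self, ttype, text, children=None):
        self.ttype = ttype
        self.text = text
        self.children = children or []

def walk(node, actions):
    """Depth-first walk; invoke the action registered for this node's type."""
    actions.get(node.ttype, lambda n: None)(node)
    for child in node.children:
        walk(child, actions)
```

A new application is then just a new dictionary of actions passed to the same walker, with no regenerated code and no new Makefile targets.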
Because of the complications with having lots of little tree grammar walkers, I took a slightly different approach.  I created a Python extension for my AST.  I tried to use SWIG (http://www.swig.org) to create a general extension for the AST types used in ANTLR.  In theory this should work, but I was never able to get SWIG to properly grok all the complicated C++ header files.  So I ended up creating a simplified wrapper class around RefAST.  It works great, but it bothers me that my "Node" class is wrapped around "RefAST", which is itself a smart-pointer wrapper around "AST *".  In the end, though, it worked, and I was able to do really simple things easily.  For example, I can count the number of different definition types in my libraries with a simple script like:
counts = {}

def count(ast):
    for i in ast.find_all(modelica.tokens.DEFINITION, 1):
        node = i.getFirstChild()  # this is the restricted_class node
        counts[node.getText()] = counts.get(node.getText(), 0) + 1

I had a problem with "lexing" numbers.  I looked in the FAQ, and for some reason I couldn't get the solution mentioned there to work.  What I finally settled on was simple enough that perhaps someone might consider including it as an alternative.  The basic problem was knowing whether a "." was the start of ".32e+5" or just a ".".  The rules I came up with, using a syntactic predicate, look like this:

protected UINT: (DIGIT)+;
protected EXP: ("e"|"E") (OP_PLUS|OP_MINUS)? UINT;
NUMBER
    : UINT ( "." (UINT)? )? (EXP)?
    | ("." DIGIT) => "." (UINT)? (EXP)?
    | "." { $setType(DOT); }
    ;
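The same disambiguation can be sketched by hand in Python (an illustrative fragment, not generated code).  The regex mirrors the "." (UINT)? (EXP)? alternative, and matching it at the current position plays the role of the ("." DIGIT) => predicate: a digit after the "." means the start of a number, anything else means a lone DOT token.

```python
import re

# "." followed by digits, optionally with an exponent, e.g. ".32e+5"
DOT_NUMBER = re.compile(r'\.\d+([eE][+-]?\d+)?')

def lex_dot(text, i):
    """At text[i] == '.', decide between a NUMBER token and a lone DOT."""
    m = DOT_NUMBER.match(text, i)
    if m:                        # "." followed by a digit: part of a number
        return ("NUMBER", m.group(0))
    return ("DOT", ".")          # a plain ".", e.g. in a qualified name
```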

Another "lexing" issue I ran into was that Modelica allows for nested \n's inside strings.  As a result, the line number information was thrown off, so I ended up having to create a slightly more complicated rule for string parsing that recognized the embedded newlines and called newline() for each one.  Again, this might make a good example for the FAQ.  The solution I came up with was:
STRING: '"' (SESCAPE | '\n' { newline(); } | ~'"')* '"';

protected SESCAPE
    : "\\'"
    | "\\\""
    | "\\?"
    | "\\\\"
    | "\\a"
    | "\\b"
    | "\\f"
    | "\\n"
    | "\\r"
    | "\\t"
    | "\\v"
    ;
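The behavior of that STRING rule can be sketched in plain Python (illustrative, not ANTLR output): scan a double-quoted string, treat a backslash escape as two characters, and bump the line counter for each embedded newline so later line numbers stay correct.

```python
def scan_string(text, i, line):
    """text[i] is the opening '"'; return (lexeme, updated line number)."""
    j = i + 1
    while j < len(text) and text[j] != '"':
        if text[j] == '\\':        # SESCAPE: skip the escaped character
            j += 2
        else:
            if text[j] == '\n':    # nested newline: keep line count in sync
                line += 1
            j += 1
    return (text[i:j + 1], line)
```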

Finally, one other problem I ran into, and never found a good solution for, was how to deal with a keyword that can also be used as an identifier.  The fact that the word "initial" is both a keyword and a built-in function is one of the few warts in the Modelica specification.  I saw something in the FAQ about how to handle this for SQL, but that solution seemed to leave the parser generator complaining about ambiguities, and I was hoping for a way to fully disambiguate the problem.  I wasn't able to come up with anything simple.  I could have passed some information down through the parser, but that would have impacted a huge number of rules.  Similarly, I could have created two different expression-parsing rule sets, but they would have both been huge and would have differed only in the "leaf" rules at the bottom.
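The post leaves this unsolved; one common workaround (a sketch under my own naming, not from the original message) is to let the lexer always emit the keyword token, and have the parser's identifier rule accept that token too, re-interpreting it as a name wherever an identifier is expected:

```python
# Illustrative token-type names; a real parser would use the generated ones.
INITIAL, IDENT = "INITIAL", "IDENT"

def ident(token_type, token_text):
    """Accept IDENT, or the keyword 'initial' re-interpreted as a name."""
    if token_type == IDENT:
        return token_text
    if token_type == INITIAL:      # keyword doubling as a built-in function
        return "initial"
    raise SyntaxError("expected an identifier")
```

This keeps the change local to the single identifier rule instead of threading context through every expression rule.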
My hope in writing this all down is to pass along some ideas of how people who are not parsing experts will approach parsing applications and to document some of the problems and misconceptions they might have.  Hopefully somebody will find this useful and perhaps some of these issues can be addressed in future versions of ANTLR.  For the most part, I'm quite pleased with ANTLR and the results.
Michael Tiller
Ford Research Laboratory