[antlr-interest] Parser generator philosophy

Johannes Luber jaluber at gmx.de
Sat Jan 5 07:27:43 PST 2008


Mark Whitis schrieb:
...
> ANTLR does a lot of neat stuff.   But I think that it has yet to
> pass a critical usability threshold.

That's right. I believe we'll have to get to ANTLR 3.2 or 3.3 to close
most of the remaining gaps.

...

> Here is a basic, crucial, philosophical point: A grammar file should be
> able to define everything that is not application specific about a
> grammar in the grammar language itself.  Consider converting lex/yacc
> grammars to ANTLR.  That which was done by dirty tricks inside actions
> cannot be understood by a next generation tool, specifically antlr.
> Don't repeat that mistake.  If someone comes out with a tool tomorrow
> that uses a more efficient algorithm (and doesn't duplicate antlr's
> runtime internals) it should still be able to use the grammar file
> either directly or by running through a converter.  The problems that
> show up today as language dependency are largely the same problems
> that would screw things up for the next generation.  And they are
> problems that were obvious, to me at least, a couple decades ago.

Reading further, I first thought you were thinking of something like
<http://www.antlr.org/share/1196371900868/yggdrasil.pdf>, but you aren't.

...
>    rule: blah blah
>      syntactical {
>         // syntax specific rules, that can't be integrated into antlr rules
>         // portable antlrcode
>      }
>      semantic {
>         // semantic tagging actions in portable language (described below)
>         // portable antlrcode
>      }
>      language antlrcode {
>         // portable application specific code
>      }
>      language "C" {
>         // C code, application specific only
>      }
>      language "Java" {
>         // Java code, application specific only
>      }
>      ;
> The first three blocks, if present, are all executed.   One of the last
> two is executed.   Rarely would you need all these on the same rule.
>

A major problem with your approach seems to be that tree rewriting
isn't supported, as everything goes into one file. And I can't help
feeling that the four categories for code are somewhat arbitrary.

...

> 
> 
> The syntax of the language can't depend on antlr's runtime internals
> or whether you are using LR/GLR/LL(k).  These are ephemeral.  Someone
> should be able to extend ANTLR, the tool, or write another tool that
> uses a choice of LL(k), GLR, LR, or recursive descent parser and use
> the same files.  Granted, writing a bottom up parser that implements
> all the features of the ANTLR++ language may be tricky, but it should
> be doable.  An LR parser generator would need to rewrite rules to make
> them LR friendly, and then convert the tree back to the original form.
...

I'm really wondering whether one can automatically convert an LR
grammar into LL or the other way around. That may introduce ambiguities
which aren't there in the original version. How can a tool extract
enough information to resolve them? IMO, having something like
Yggdrasil is enough, as automatic conversion between different grammar
forms is only a nice-to-have feature. Building a target-independent
grammar for one particular parsing method seems like a reasonable deal.

> Consider the FAQ and Mailing list.   If a question is asked that is
> grammatical in nature, the answer better not be target specific or
> runtime specific code.

Looking at the examples shipped for the four target languages, I'd say
that ANTLR is still lacking in a few areas.

...
> 
> @precedence: ('*', '/'), ('+', '-')
> expr:
>    a=expr '+' b=expr   SEMANTICS { ADDITION_OPERATION(a,b); }
>    | a=expr '-' b=expr   SEMANTICS { SUBTRACTION_OPERATION(a,b); }
>    | a=expr '*' b=expr   SEMANTICS { MULTIPLY_OPERATION(a,b) }
>    | a=expr '/' b=expr   SEMANTICS { DIVIDE_OPERATION(a,b) }
>    ;
> variable_declaration:
>    'variable' id=IDENTIFIER ':' type=type_expression
> SEMANTICS::VARIABLE_DECLARATION(id, type);
> 
> function_declaration:
>    'function' id=IDENTIFIER ('returns' type=type_expression) '('
> parameters=parameter_list ')' (body="{" statement* "}")+ ';'
> SEMANTICS::FUNCTION_DECLARATION(id, type, parameters, body);
> 
> Note that these are abstract markups and not code.   There is a huge
> difference between the expression parser example in the definitive
> guide with its actions statements and the abstract markup here.

The major problem with your abstract markup is that it is limited to
the things you have included. Yggdrasil is probably not limited in that
way, but Loring Craymer knows best.

> Abstract semantic markup lets you do a lot of things.  Here are a few
> examples,
> mostly from the programming language domain:
>   - move from parser generator towards a true compiler compiler and
>     interpreter compiler.  ANTLR+=2 is the front end.   The backend
>     isn't much harder than a normal compiler backend for a single
>     complex language.  Once the backend supports a construct from
>     one language, it is ready to support that construct for other
> languages.

I'm not sure what the difference between a compiler compiler and a
parser generator is supposed to be. For me, they refer to the same thing.

...
> 
> This is somewhat similar to existing AST/parse tree behavior.   But it
> adds standardized tagging.     The grammar writer can call x a 'function',
> 'procedure', or a "frobnitz" but the tree will still contain the necessary
> information.   Notice that there is not a single action in this example.
> The actions are implied by the semantics and the application, not hard
> coded into the grammar file.    One tool may produce a compiler,
> one tool may produce an interpreter, one tool may produce a translator,
> another tool produces documentation, another tool analyzes variable
> dependencies, another tool does logic synthesis, etc.
> All work from the same grammar and the same AST++ tree.   They work
> whether you write the expression as "a+b" or "a b +".   It doesn't matter
> what style the grammar is written in (for example antlr style or LR style)
> or what style the language expresses them in.

That sounds utopian. I'm not sure all tools can derive the information
they require in the way you describe.

...
> With operand_a and operand_b perhaps prefixed or linked to OPERATOR_PLUS
> entry, which might be necessary to unravel more complicated versions.
> This wouldn't be limited to procedural languages.   Abstract tagging
> could describe lists, sets, etc. for data files.   Data languages would
> add more tags, thus a node would be tagged as a) a list and b) information
> about what information it contains (Phone numbers).   Thus there would
> be two tags from different domains, one application specific.
> Additional tags for protocols.   The tags might indicate that
> the data is a variable length list with hash and/or subscript lookup.
> Additional tags for 2D/3D graphics modelling.

This proves that your approach is indeed one level too low.

...
> A couple hundred semantic tags could probably define most of the commonly
> used features of most programming languages.  The tags would be
> written in function call style.   Possibly some simple control structures:
>    IF(SYMBOL_LOOKUP(a).basic_type == STRING) {
>       OPERATOR_CONCATENATE(a,b);
>    } else {
>       OPERATOR_PLUS(a,b);
>    }
> But this decision making ability may not be necessary if you can
> do:
>   {SYMBOL_LOOKUP(a).basic_type == STRING} (a=expr "+" b=expr)
>     semantics { OPERATOR_CONCATENATE(a,b); }
>   | (a=expr "+" b=expr) semantics { OPERATOR_PLUS(a,b); }

If you need more than fifty different tags, and even those will
nonetheless never be enough, then your design isn't orthogonal.

> Note that this example is not meant to suggest that you try to
> put a bunch of type specific stuff in the grammar file.  Instead
> it is dealing with a common special case where a language misuses
> the addition operator for concatenation.  The D language, for example,
> defines a concatenation operator "~".   The back end should not have
> to be modified for this.  If for some reason the back end has to deal with
> it, you use OPERATOR_PLUS_OR_CONCATENATE(), preserving the ambiguity.

The use of '+' as a string concatenation operator is indeed bad, but
not all languages have that use truly built in. OO languages like C#
employ operator overloading instead. In those cases I don't see your
approach working.

> An initial version of the tagging mechanism itself could be incorporated
> in ANTLR++.   Defining the tags would be the job of ANTLR+=2.  TAG
> definitions would be in a file that could be updated.   And it would
> probably contain some rewriting mechanisms.
> 
> Users would define their own, nonstandard tags using the common "X_NAME()"
> convention until the tags were standardized.

This approach makes me shudder...

> In many ways, ANTLR+=2 could be easier than ANTLR++.   The existing
> rewriting rules are close.  Not there, yet, as far as I understand them.
> When you use
>   a=expr '+' b=expr ^(PLUS a b)
> you appear to lose information.  You lose more when you flatten the
> tree.  I am talking about a system that retains all of the text and
> all of the intermediate nodes.  You can retain the information that
> "C+D" used the "+" operator and that C is both an expr and an atom.
> or maybe that one was an integer_atom and the other was a float_atom.
> I am adding information, not subtracting it and doing it without
> increasing the depth of the tree.

I thought you could prevent the loss of any information while
rewriting. In my view, you can save everything needed somewhere.

> In the long run, all this could make ANTLR easier to maintain and
> extend.    Consider that there are JAVA, C, Python, C#, and Objective
> C runtimes.    Now consider that one of the reasons to make ANTLR+=2
> is to make translators easy to write.   Thus, the C, Python, C# and
> Objective C runtimes could be translated from the Java runtime.
>
> The fact that Java is fairly abstract and yet crippled helps.   Java is
> more notationally complete in many respects but functionally incomplete.
> You can't automatically translate C pointer code to Java very easily
> but you can do the reverse.   Converting classes to C isn't very
> hard, structs with function pointers.  Expressions with member
> methods a.b(...) get translated to a->b(a,...).   In the case of
> Java to C, there is already a (not maintained) translator called Toba.
> It deals with exception handling, garbage collecting, etc.
> ANTLR probably isn't using a lot of the Java proprietary libraries (SWING,
> etc.) that would complicate translating GUI apps.   Translating to
> C#, C++, and Objective-C can largely reuse C translation while
> turning off certain translation steps (classes, exception handling,
> member methods).   There are a few Java to C++ programs out there.
> And translation code can be shared with an implementation that
> converts antlrcode actions to the various target languages for
> those cases where you really need actions but want portability
> for things that still won't fit into the improved abstract notation.
> Alternatively, something like Comeau C++ to C translator ($50) could
> be used as a temporary measure to simplify the translation process
> and LLVM is rumored to do C++ to C, though I have my doubts as
> to whether that is true.

I have to disagree there. Runtimes always have specific ways of dealing
with problems - and those specifics vary not only in how they differ,
but also in where they differ. So I consider fully automatic runtime
translation - without any human intervention afterwards - far more work
than doing it by hand (except for basic syntax translation).

...
> In the next section (taken from the second email), I list the flaws
> in ANTLR that I see.  Most of these are serious limitations in the
> abstract notation of ANTLR that limit its use as a portable grammar
> language between tools and even between different language targets
> of ANTLR.
> 
> The following are NOT rare edge cases that should be handled by
> user supplied target language and antlr runtime specific kludges:
>   - case insensitivity
>     (VHDL, BSDL, Pascal, ADA, SQL, BASIC, HTML, CQL, FORTRAN, HTTP,
>     Oracle (database), many assemblers, DOS SHELL, many non-unix
>     shells, human languages, common lisp, FORTH, Logo, PL/I, Scheme,
>     SGML, PHP (keywords), E, MUMPS,...)

Unicode handling makes case insensitivity more complicated. In Turkish,
for example, the uppercase of 'i' isn't 'I' but the dotted capital 'İ'
(and 'I' lowercases to the dotless 'ı'). In any case, you have to add
locales. I know that .NET supports locales, so at least there it may be
easy to compare the input. But I don't know if the templates used to
generate the parser can be easily updated.
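The Turkish problem is easy to demonstrate in plain Java (this is only
an illustration, not ANTLR code):

```java
import java.util.Locale;

public class TurkishCase {
    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");
        // In the Turkish locale, 'i' uppercases to dotted capital İ (U+0130)
        // and 'I' lowercases to dotless ı (U+0131).
        System.out.println("i".toUpperCase(turkish));   // İ
        System.out.println("I".toLowerCase(turkish));   // ı
        // So a case fold that is fine elsewhere misfires on Turkish input:
        System.out.println("if".toUpperCase(turkish).equals("IF"));  // false
    }
}
```

This is why case-insensitive lexers usually fold case with a fixed
locale (e.g. Locale.ROOT) rather than the platform default.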

...
>   - selectively disallowing whitespace between tokens

Something I did by checking the indices of the supposedly neighboring
tokens (the difference may only be one).
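A self-contained sketch of that trick (names here are illustrative, not
ANTLR's API, though ANTLR's real Token interface exposes a comparable
token index):

```java
public class Adjacency {
    // Stand-in for a lexed token; ANTLR's Token offers a similar index.
    record Tok(int index, String text) {}

    // True only if 'right' was lexed immediately after 'left', i.e. no
    // (whitespace) token was emitted between them.
    static boolean adjacent(Tok left, Tok right) {
        return right.index() == left.index() + 1;
    }

    public static void main(String[] args) {
        Tok name   = new Tok(4, "foo");
        Tok paren1 = new Tok(5, "(");   // directly follows "foo"
        Tok paren2 = new Tok(7, "(");   // index 6 was a whitespace token
        System.out.println(adjacent(name, paren1)); // true
        System.out.println(adjacent(name, paren2)); // false
    }
}
```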

>   - choosing between multiple token rules that match the same input
>     based on parser context.

This can be done already. Just use a neutral name (like "TILDE") for
scanning, and via rewriting and imaginary tokens you can get to
"CONCATENATION[TILDE]".
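In ANTLR 3 rewrite notation that would look roughly like this (a
hypothetical sketch; the rule and token names are made up):

```antlr
tokens { CONCATENATION; }   // imaginary token, never produced by the lexer

concat_expr
    : a=atom TILDE b=atom -> ^(CONCATENATION[$TILDE] $a $b)
    // CONCATENATION[$TILDE] builds a new root token of the imaginary
    // type while reusing TILDE's text and position information
    ;
```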

>   - communicating between parser rules/actions and lexer rules/actions
>     - shared variables
>     - arguments

The situation right now: the lexer is entirely separate from the parser.
In fact, the default implementation lexes everything before the parser
sees the first token. Unfortunately, there are enough situations where
the parser HAS to tell the lexer something. Using globals isn't enough
(scopes don't work because the lexer and parser are in separate files
and classes).

Which reminds me: lexer tokens can't have arguments unless they are
fragment rules. I forgot the reason for this, but orthogonality-wise
it's not a good decision, even if implementation-wise the reason may be
sound.
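One workaround is to hand both classes a reference to a shared context
object at construction time. A hedged Java sketch (all class and method
names are invented, not ANTLR's API; with ANTLR 3 this only helps when
tokens are produced on demand, since the default strategy lexes
everything up front):

```java
import java.util.HashSet;
import java.util.Set;

// Mutable state visible to both sides, instead of "globals".
class SharedContext {
    // e.g. the parser registers type names so identifier tokens can be
    // reclassified - the classic C "typedef problem".
    final Set<String> typeNames = new HashSet<>();
}

class DemoParser {
    private final SharedContext ctx;
    DemoParser(SharedContext ctx) { this.ctx = ctx; }
    void onTypedef(String name) { ctx.typeNames.add(name); }
}

class DemoLexer {
    private final SharedContext ctx;
    DemoLexer(SharedContext ctx) { this.ctx = ctx; }
    String classify(String word) {
        return ctx.typeNames.contains(word) ? "TYPE_NAME" : "IDENTIFIER";
    }
}

public class SharedStateDemo {
    public static void main(String[] args) {
        SharedContext ctx = new SharedContext();
        DemoLexer lexer = new DemoLexer(ctx);
        DemoParser parser = new DemoParser(ctx);

        System.out.println(lexer.classify("size_t")); // IDENTIFIER
        parser.onTypedef("size_t");                   // parser saw a typedef
        System.out.println(lexer.classify("size_t")); // TYPE_NAME
    }
}
```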

>   - selecting tree format, where more than one is supported.  AST vs Parse
>     trees in this case.

I'm not sure what you mean here.

>   - keyword abbreviation
>     (very common in command languages, a really bad idea to use in scripts
>     written in those languages)

In combination with case insensitivity? In any case, this feature has
already been requested.

...
>   - grammar include files
>     This is needed for many large and/or modular grammars such as those
>     produced  by w3.org and iso.   Some examples are X3D, STEP (3D),

That feature is supposed to be included in 3.1.

>   - operator precedence.
>     (Almost every language which handles expressions).
>     This isn't too hard to do by rewriting grammars but the result
>    is cluttered grammars that are harder for readers to understand.

I'm not sure whether that is a good idea or not...
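For reference, the manual rewriting the quoted text alludes to is the
classic one-rule-per-precedence-level encoding (an illustrative sketch
in ANTLR-style notation, not from the original post):

```antlr
// '*' and '/' bind tighter than '+' and '-' simply because they sit
// one rule deeper; three rules now say what a single @precedence
// declaration could.
expr   : term   (('+' | '-') term)*   ;
term   : factor (('*' | '/') factor)* ;
factor : INT | '(' expr ')'           ;
```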

>   - access to parser class members from lexer actions, access
>     to lexer class members from parser actions, and access to other
>     utility classes from either.

See a few comments above.

>   - Non-finite input streams, streams larger than memory, and
>     streams where not all data is immediately available
>     (such as protocols).

Definitely missing.

>   - operator tokens defined by the user
>     of a grammar, not the grammar itself.   Requires a runtime
>     table lookup.   Multiple character operators "++" would be harder to
>     implement, though possible.   This gets around, for example, the
>     c++ limitation on defining new operators.
>      U+2200 .. U+22FF (mathematical symbols) are prime candidates.
>     as are U+0391..U+03A9 and U+03B1..U+03C9 (greek letters).

Huh? At which step of the usual compiler development process are users
supposed to add their new operators?

>   - layered parsing
>     For example, You might layer an SVG parser on top of an XML parser.
>     @include might be enough, then again it might not be.

That could be done via another tree grammar after XML.

> These are for the most part perfectly ordinary things that any user
> would expect a compiler compiler to be able to do automatically.   Yes,
> users have often been disappointed by existing compiler compilers.  But
> these are common reasons why users abandon existing tools in search
> of a better compiler compiler.    These are reasons why 5000 users
> a month download ANTLR in search of a better tool (but only a
> minuscule percentage apparently actually produce grammars for
> real world languages with ANTLR).   Yet most of
> these require ugly target language specific code hacks or have other
> problems in antlr.   Some are even handled worse in antlr than
> in its predecessors.
> 
> If the core of ANTLR is well written, most of these won't be too hard
> to do.
> 
> 
> My next message will deal with case sensitivity with a lot of discussion
> of unicode as well.

Unicode support isn't that good. I'd like to be able to declare the use
of character classes and to deal with characters beyond \uFFFF. That
would make a lexer for C# very easy.
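The \uFFFF limit matters because UTF-16 encodes supplementary
characters as surrogate pairs, so a lexer that matches char by char
never sees such a character as a single unit. A small Java
illustration:

```java
public class Supplementary {
    public static void main(String[] args) {
        // MATHEMATICAL DOUBLE-STRUCK CAPITAL X, U+1D54F - beyond \uFFFF
        String s = "\uD835\uDD4F";
        System.out.println(s.length());                      // 2 UTF-16 units
        System.out.println(s.codePointCount(0, s.length())); // 1 code point
        System.out.println(Integer.toHexString(s.codePointAt(0))); // 1d54f
    }
}
```

A character-class check in a lexer therefore has to operate on code
points, not on individual chars.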

Johannes

