[antlr-interest] Parser generator philosophy

Mark Whitis whitis at freelabs.com
Sat Jan 5 04:33:28 PST 2008


This is a long message, but it is the most important message I will
ever post to this group.  If it takes a while to read, it took more
than a day to write.

I have taken parts from two different draft emails I have written on
other topics and combined them here, since this belongs in a separate 
topic.
Because of this origin, the order of this message is a bit backwards.
I describe the path to take before I describe the specific bugs
in detail.   But the abstract stuff is more important for impatient
people to read than a detailed list of flaws.   The abstract context
also will make it easier to appreciate the gravity of the flaws.
But in the middle, I diverge into the more distant future.  And
the sections overlap.

In the preface to the definitive guide, Terence, you talk about being
two levels of indirection from doing anything productive.   But a tool
such as ANTLR could give you enormous leverage to accomplish more
through improving the productivity of others than you could accomplish
individually.   If you take ANTLR to the tipping point where critical
mass is reached, it has the potential to catalyze major improvements.
Archimedes said "Give me a lever long enough and a fulcrum on which to
place it, and I shall move the world".   ANTLR isn't that lever, yet,
but it could be.   Due to time constraints, I will only scratch the
surface of the implications.   The absence of a solid grammar language
has severely impacted the history of computing.

Now, I may have made some mistakes.   I am criticising ANTLR before
I fully understand it.   But if I waited until I did, I might forget
a lot of the problems.   My naivety with ANTLR specifics, combined
with my vast experience in many areas that are not ANTLR specific and
my knack for seeing the big picture, is actually helpful in
identifying flaws.   I am not accustomed to doing things "the ANTLR
way" or working around the flaws.   Thus, I am in a good position to
catch flaws that will turn away potential users by the hundreds or
thousands, to see how those flaws fit into the big picture, and I am
ticked off enough to be motivated to write about it.   But it also
means that my suggested notations may need to be altered to fit into
the ANTLR scheme.

ANTLR does a lot of neat stuff.   But I think that it has yet to
pass a critical usability threshold.   You can take my blunt
constructive criticism here as a compliment of sorts: I think
ANTLR has potential.   While ANTLR's fans may compare it to
older parser generators, I compare it to what a parser generator
should be.   Thus, I will be talking more about what is wrong with
ANTLR than about what is right.

ANTLR is supposed to be designed to work in the real world where grammars
are often ugly.   No parser generator can be expected to
handle languages that are really abysmal.   But the bar is much lower
than it should be.

Here is a basic, crucial, philosophical point: a grammar file should
be able to define everything that is not application specific about a
grammar in the grammar language itself.   Consider converting lex/yacc
grammars to ANTLR.   Whatever was done by dirty tricks inside actions
cannot be understood by a next generation tool, in this case ANTLR.
Don't repeat that mistake.   If someone comes out with a tool tomorrow
that uses a more efficient algorithm (and doesn't duplicate ANTLR's
runtime internals), it should still be able to use the grammar file,
either directly or by running it through a converter.   The problems
that show up today as target language dependency are largely the same
problems that would screw things up for the next generation.   And
they are problems that were obvious, to me at least, a couple of
decades ago.

Think of ANTLR++ not as a tool but as a language; the tool is nothing
more than a specific implementation of the language.   How fast it
runs, while important for some applications, is entirely secondary to
the expressive power and reusability of the language itself.   Whether
ANTLR can do what I am describing here will ultimately have an order
of magnitude more social impact than how fast it does it.   If the
tool is slow, people can optimize, come up with new tools where
needed, or drop in a faster chip.   Think of it more as the answer to
EBNF, plus a reference implementation, than as the replacement for
lex/yacc, et al.   In other words, the job is to create a portable
grammar language and the first tool that uses it.   Expect that other
tools will use it.   Some will be parser generators, some won't.
The missing features I describe at the bottom of this message are more
important than how many languages it natively targets.   A version of
ANTLR that supported these features with only a single target, C,
would be more useful than the current version (dozens of languages can
link to a parser library written in C).   But the problems in
supporting multiple target languages are very similar to the problems
of making a portable grammar.   Thus, ANTLR's strength in almost
supporting 5 target languages focuses attention on the core notational
deficiencies.   It also means that a lot of the groundwork has
probably already been done.   But it also means there is more work to
do where the runtime API code needs improvement.

There are two sorts of actions:
   - grammar actions
   - application specific actions
Syntactical grammar actions need to be implemented as extensions to
the basic rules themselves or as a small embedded language:
   - an expression syntax with magic function calls and simple
     assignment, so that strings can be handled something like this:
        PARSE(COMBINE(STRIP(...)))
   - a stripped down language with basic control structures that is
     reasonable to reimplement in other tools and which can be easily
     translated into all the target languages ANTLR supports.
Both would use a notation for accessing data that is not ANTLR runtime
specific.   Basic object oriented rules of hiding the internal
structure apply.
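
For concreteness, here is a minimal sketch of what such a portable
syntactical action might look like, using the syntactical block
notation proposed just below.   STRIP and UNESCAPE are hypothetical
portable built-ins, and $value is assumed notation, not existing
ANTLR syntax:

    STRING : '"' ( '\\' . | ~('\\'|'"') )* '"'
             // strip the surrounding quotes, then decode the escape
             // sequences; every step is a portable built-in rather
             // than target language code
             syntactical { $value = UNESCAPE(STRIP($text, 1, 1)); }
           ;

Any tool, generating a parser in any target language, could implement
this, because nothing in it refers to a particular runtime.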

Application specific actions would also benefit, in terms of
portability across target languages, from a minimal embedded language
that is translated to the target language (target language in this
text refers to the language the parser runs in).   However, any
application specific action blocks should be separable from the core
syntactical and/or semantic actions.

    rule: blah blah
      syntactical {
         // syntax specific rules that can't be expressed in the
         // grammar notation itself; portable antlrcode
      }
      semantic {
         // semantic tagging actions in the portable language
         // (described below); portable antlrcode
      }
      language antlrcode {
         // portable application specific code
      }
      language "C" {
         // C code, application specific only
      }
      language "Java" {
         // Java code, application specific only
      }
      ;
The first three blocks, if present, are all executed.   Only one of
the last two is executed.   Rarely would you need all of these on the
same rule.

Every language should have a single, permissively licensed grammar
file that is both human and machine readable and complete enough that
it can immediately be run through the parser generator tool of the
user's choice (as long as that tool is good enough) to build a
recognizer.   Any code that is added would be application specific
(i.e. are you building a compiler, a syntax highlighting editor,
etc.).   In fact, if what you are building is a syntax highlighting
editor, all you should need is a stylesheet once you have the basic
infrastructure.   Consider, for example, applying CSS to grammar
rules:
    IDENTIFIER { background: red; }
    STRING { background: blue; }
Now, if you implement the features and fix the problems I have
described in my past messages and the one I haven't sent yet, to make
ANTLR++, then the ANTLR++ grammar would be very close to satisfying
this goal.

Everything about the syntax would be defined in the grammar file,
and maybe a few semantic details as well (such as whether an
identifier is a class, a typedef, a constant, a scalar, etc.).

Translators should be able to convert EXPRESS, XML Schema, etc. into
ANTLR++.


The syntax of the language can't depend on ANTLR's runtime internals
or on whether you are using LR/GLR/LL(k).   These are ephemeral.
Someone should be able to extend ANTLR, the tool, or write another
tool that uses a choice of LL(k), GLR, LR, or recursive descent
parsing and use the same files.   Granted, writing a bottom up parser
that implements all the features of the ANTLR++ language may be
tricky, but it should be doable.   An LR parser generator would need
to rewrite rules to make them LR friendly, and then convert the tree
back to the original form.   Not necessarily worth undertaking, but it
should be possible.   Potential support for an LR parser itself
shouldn't hold back language features.   The most important use of an
LR (or other) backend would be simply to tell a language designer
whether an LR parser is possible, so they can weigh the benefits of a
representation against the cost.   LR, GLR, and recursive descent are
used here as examples of the basic principle that the grammar
description language should not be tied to one algorithm.

Consider the FAQ and the mailing list.   If a question that is
grammatical in nature is asked, the answer had better not be target
specific or runtime specific code.

I believe that what I am suggesting can be done and should be done.
In fact, I think it could have been done 20 years ago.

When undertaking ANTLR++, one should consider ANTLR+=2.   Forward
thinking helps.   While ANTLR++ would just deal with syntax and those
basic features, such as knowing the basic type of a symbol (which is
sometimes necessary to parse the syntax), ANTLR+=2 would begin to
tackle semantics.   You would tag all operators in an abstract way:


@precedence: ('*', '/'), ('+', '-')
expr:
    a=expr '+' b=expr   SEMANTICS { ADDITION_OPERATION(a,b); }
    | a=expr '-' b=expr SEMANTICS { SUBTRACTION_OPERATION(a,b); }
    | a=expr '*' b=expr SEMANTICS { MULTIPLY_OPERATION(a,b); }
    | a=expr '/' b=expr SEMANTICS { DIVIDE_OPERATION(a,b); }
    ;
variable_declaration:
    'variable' id=IDENTIFIER ':' type=type_expression
    SEMANTICS::VARIABLE_DECLARATION(id, type);

function_declaration:
    'function' id=IDENTIFIER ('returns' type=type_expression)
    '(' parameters=parameter_list ')' (body="{" statement* "}")+ ';'
    SEMANTICS::FUNCTION_DECLARATION(id, type, parameters, body);

Note that these are abstract markups and not code.   There is a huge
difference between the expression parser example in the definitive
guide, with its action statements, and the abstract markup here.
Abstract semantic markup lets you do a lot of things.   Here are a few
examples, mostly from the programming language domain:
   - move from parser generator towards a true compiler compiler and
     interpreter compiler.  ANTLR+=2 is the front end.   The backend
     isn't much harder than a normal compiler backend for a single
     complex language.  Once the backend supports a construct from
     one language, it is ready to support that construct for other
     languages.   And projects like LLVM can be used for code
     generation.
   - documentation generators can automatically extract the information
     that x is a variable and y is a function.
   - Documentation generators or more sophisticated tools, can actually
     tell what variables can be derived from which other variables.
   - for simple examples like an expression parser, tools can be written
     that automatically generate the evaluator.
   - automatic translators get a big boost because grammar files are
     usable on both ends.   For example, it isn't that hard to map
     corresponding operators, precedence, etc. such that expressions
     can be translated.
     StringTemplate could come in handy here.   N languages shouldn't
     require N^2 translators written from scratch.   And a unified
     translator could have some advantages in converting mixed language
     programs because it could use the original source of a library
     and avoid lost information.
   - users will not be limited in their choice of language based on
     what languages are available for a particular target CPU or
     environment.   If you have ever done embedded work, you know what
     I mean.
   - users will be able to construct supersets of languages fairly easily.
     Enforced subsets will be trivial.
   - it is an important step towards a truly modular and sane compiler
     with a permissive license and good cross platform compilation.
     Front ends, generated with ANTLR+=2, are loaded as plugins.
     Code generators are loaded as plugins.   Optimizers are loaded as
     plugins.   Commercial and free plugins can be mixed.
   - languages can be developed by people doing real work in the real
     world and thus have a better idea what is really needed rather than
     by people who have time to build a compiler.
   - because the same grammar can be used to update all the other tools
     associated with a language (IDEs, documentation generators, call graph
     generators, pretty printers, code review tools, syntax aware
     version control systems, etc.) people can choose languages more
     based on their merits than their popularity.
   - domain specific data languages start to be as usable as XML.
   - syntax aware editors.  Not just syntax highlighting but
     knowing what is valid in any context.   And not just for
     programs but for data as well.   It might create a template
     to fill in.

This is somewhat similar to existing AST/parse tree behavior, but it
adds standardized tagging.   The grammar writer can call x a
"function", a "procedure", or a "frobnitz", but the tree will still
contain the necessary information.   Notice that there is not a single
action in this example.   The actions are implied by the semantics and
the application, not hard coded into the grammar file.   One tool may
produce a compiler, one tool may produce an interpreter, one tool may
produce a translator, another tool produces documentation, another
tool analyzes variable dependencies, another tool does logic
synthesis, etc.   All work from the same grammar and the same AST++
tree.   They work whether you write the expression as "a+b" or
"a b +".   It doesn't matter what style the grammar is written in (for
example ANTLR style or LR style) or what style the language expresses
them in.

Thus, given the markup OPERATOR_PLUS(operand_a, operand_b), the input
"C+D", and a rule fragment along the lines of the expr example above,
you might get a tree that looks something like this:

              "+":plus:OPERATOR_PLUS
                      |
           +----------+-----------------+
           |                            |
  "C":atom:operand_a         "D":atom:operand_b

Note that this can be used whether the tree ends up as (+ (C D)) or
(+ (D C)).   Also note that without tagging, we would have two
"atoms" without knowing the semantics of those atoms.   Also note
that at no time does the backend rely on the text "+" to determine
the type of operation.   Some operators are written differently
in different languages.   "<", "<<", "@<", "lt", and ".LT." mean
the same thing in different languages.
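
As a minimal sketch of the payoff, in the hypothetical SEMANTICS
notation from above (the rule names are illustrative, not existing
ANTLR syntax): two grammars spell the operator differently but tag it
identically, so one backend serves both.

    // C-family grammar
    relation: a=expr '<' b=expr     SEMANTICS { LESS_THAN(a,b); } ;

    // FORTRAN-style grammar
    relation: a=expr '.LT.' b=expr  SEMANTICS { LESS_THAN(a,b); } ;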

ANTLR users who haven't seen this will probably appreciate it:
http://merd.sourceforge.net/pixel/language-study/syntax-across-languages/Vrs.html

The tree could have hidden channel info mixed in for tools that need
it, and operand_a and operand_b could be prefixed by, or linked to,
the OPERATOR_PLUS entry, which might be necessary to unravel more
complicated versions.   This wouldn't be limited to procedural
languages.   Abstract tagging could describe lists, sets, etc. for
data files.   Data languages would add more tags; a node would be
tagged both as a) a list and with b) information about what it
contains (phone numbers, say).   Thus there would be two tags from
different domains, one of them application specific.   There would be
additional tags for protocols; these might indicate that the data is
a variable length list with hash and/or subscript lookup.   And there
would be additional tags for 2D/3D graphics modelling.
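
To make the data language case concrete, a hedged sketch in the same
hypothetical notation (LIST and PHONE_NUMBER are made up tags, not
part of any defined standard):

    phone_list: '[' p+=phone (',' p+=phone)* ']'
                SEMANTICS { LIST(p); PHONE_NUMBER(p); } ;

A query tool that understands the LIST tag can iterate over the
entries, and one that also understands PHONE_NUMBER knows what the
entries mean, without either tool knowing anything else about this
particular grammar.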

Given a data file, you could use a query program to extract
information.   A tool could generate a query program specific to one
data language.   Grammars written by 6 different people could be used
to automatically construct most of a compiler that handled C, C++,
Ada, Java, C#, and D.   And that is an easy example, because most of
those use the same basic expression syntax.   Enough so that a
compiler or interpreter could be built automatically that might not
handle the more exotic features of a language but would be able to
handle the basic features from day one, so you could start throwing
test cases at it.   The creator of a language can start using a subset
of that language before a complete compiler is built.


A couple hundred semantic tags could probably define most of the
commonly used features of most programming languages.   The tags would
be written in function call style.   Possibly with some simple control
structures:
    IF(SYMBOL_LOOKUP(a).basic_type == STRING) {
       OPERATOR_CONCATENATE(a,b);
    } else {
       OPERATOR_PLUS(a,b);
    }
But this decision making ability may not be necessary if you can do:
   {SYMBOL_LOOKUP(a).basic_type == STRING} (a=expr "+" b=expr)
     semantics { OPERATOR_CONCATENATE(a,b); }
   | (a=expr "+" b=expr) semantics { OPERATOR_PLUS(a,b); }

Note that this example is not meant to suggest that you try to put a
bunch of type specific stuff in the grammar file.   Instead, it is
dealing with a common special case where a language misuses the
addition operator for concatenation.   The D language, for example,
defines a separate concatenation operator, "~".   The back end should
not have to be modified for this.   If for some reason the back end
has to deal with it, you use OPERATOR_PLUS_OR_CONCATENATE(),
preserving the ambiguity.

An initial version of the tagging mechanism itself could be
incorporated in ANTLR++.   Defining the tags would be the job of
ANTLR+=2.   Tag definitions would be in a file that could be updated,
and it would probably contain some rewriting mechanisms.

Users would define their own, nonstandard tags using the common
"X_NAME()" convention until the tags were standardized.
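
For example (a hypothetical sketch; the tag and rule names are made
up), a grammar writer experimenting with a new construct might write:

    frobnitz_declaration:
        'frobnitz' id=IDENTIFIER ';'
        SEMANTICS { X_FROBNITZ_DECLARATION(id); } ;

and drop the X_ prefix later if the tag gets standardized.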

In many ways, ANTLR+=2 could be easier than ANTLR++.   The existing
rewriting rules are close, but not there yet, as far as I understand
them.   When you use
   a=expr '+' b=expr -> ^(PLUS $a $b)
you appear to lose information.   You lose more when you flatten the
tree.   I am talking about a system that retains all of the text and
all of the intermediate nodes.   You can retain the information that
"C+D" used the "+" operator and that C is both an expr and an atom,
or maybe that one was an integer_atom and the other was a float_atom.
I am adding information, not subtracting it, and doing it without
increasing the depth of the tree.
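
A hedged sketch of the difference (the ":tag" annotations in the
second rule are hypothetical notation, echoing the labels in the tree
diagram above, not existing ANTLR syntax):

    // today: rewrite, discarding the original token and tree shape
    expr: a=expr '+' b=expr -> ^(PLUS $a $b) ;

    // proposed: annotate in place, keeping all text and structure
    expr: a=expr:operand_a '+':OPERATOR_PLUS b=expr:operand_b ;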

In the long run, all this could make ANTLR easier to maintain and
extend.   Consider that there are Java, C, Python, C#, and Objective-C
runtimes.   Now consider that one of the reasons to make ANTLR+=2 is
to make translators easy to write.   Thus, the C, Python, C#, and
Objective-C runtimes could be translated from the Java runtime.   The
fact that Java is fairly abstract and yet crippled helps.   Java is
more notationally complete in many respects but functionally
incomplete.   You can't automatically translate C pointer code to Java
very easily, but you can do the reverse.   Converting classes to C
isn't very hard: structs with function pointers.   Expressions with
member methods a.b(...) get translated to a->b(a,...).   In the case
of Java to C, there is already a (no longer maintained) translator
called Toba.   It deals with exception handling, garbage collection,
etc.   ANTLR probably isn't using a lot of the Java proprietary
libraries (Swing, etc.) that would complicate translating GUI apps.
Translating to C#, C++, and Objective-C can largely reuse the C
translation while turning off certain translation steps (classes,
exception handling, member methods).   There are a few Java to C++
programs out there.   And translation code can be shared with an
implementation that converts antlrcode actions to the various target
languages, for those cases where you really need actions but want
portability for things that still won't fit into the improved
abstract notation.   Alternatively, something like the Comeau C++ to
C translator ($50) could be used as a temporary measure to simplify
the translation process, and LLVM is rumored to do C++ to C, though I
have my doubts as to whether that is true.

The difficulty of translating from C, where hand coded classes,
exceptions, or pointer operations must be recognized and converted
into the appropriate higher level notation, is somewhat analogous to
the current state of ANTLR.   While C is functionally complete, it
just isn't abstract enough.   While you might be able to write
something in C or ANTLR, it isn't abstracted to the point that it can
be translated and reused.

In the next section (taken from the second email), I list the flaws
in ANTLR that I see.  Most of these are serious limitations in the
abstract notation of ANTLR that limit its use as a portable grammar
language between tools and even between different language targets
of ANTLR.

The following are NOT rare edge cases that should be handled by
user supplied target language and antlr runtime specific kludges:
   - case insensitivity
     (VHDL, BSDL, Pascal, Ada, SQL, BASIC, HTML, CQL, FORTRAN, HTTP,
     Oracle (database), many assemblers, DOS shell, many non-unix
     shells, human languages, Common Lisp, FORTH, Logo, PL/I, Scheme,
     SGML, PHP (keywords), E, MUMPS, ...)
     See the ANTLR 3 workaround sketched after this list.
   - nested parsing (VHDL, XML, SVG, C, C++...)
     - parsing inside strings (VHDL, XML, SVG, ...)
     - handling of include files (C, C++, ...)
     - selecting a parser and lexer subgrammar when doing nested parsing
     - putting nested grammars in user's choice of separate grammar
       files or in the same file.
     - selectively sharing rules among nested grammars and parent
       grammar.
   - keywords as identifiers
     (SQL, COBOL, Natural/Adabas, JCL, BSDL (sort of), ...)
     Plus, this handles the general problem of code breakage in
     non-ambiguous cases due to official and unofficial extensions
     to a language.
   - selectively disallowing whitespace between tokens
   - choosing between multiple token rules that match the same input
     based on parser context.
   - communicating between parser rules/actions and lexer rules/actions
     - shared variables
     - arguments
   - selecting tree format, where more than one is supported.  AST vs Parse
     trees in this case.
   - keyword abbreviation
     (very common in command languages; a really bad idea to use in
     scripts written in those languages)
   - conditional inclusion of parts of the grammar file based on
     variables  (i.e. similar to C #if or #ifdef) including user
     defined variables and options to the language.
     @if language=="C"
     @else
     @endif
     And also defining those from the ANTLR command line.
   - grammar include files
     This is needed for many large and/or modular grammars, such as
     those produced by w3.org and ISO.   Some examples are X3D and
     STEP (3D).
   - operator precedence
     (almost every language which handles expressions).
     This isn't too hard to do by rewriting grammars, but the result
     is cluttered grammars that are harder for readers to understand.
   - access to parser class members from lexer actions, access
     to lexer class members from parser actions, and access to other
     utility classes from either.
   - Non-finite input streams, streams larger than memory, and
     streams where not all data is immediately available
     (such as protocols).
   - operator tokens defined by the user of a grammar, not the grammar
     itself.   This requires a runtime table lookup.   Multiple
     character operators like "++" would be harder to implement,
     though possible.   This gets around, for example, the C++
     limitation on defining new operators.   U+2200..U+22FF
     (mathematical symbols) are prime candidates, as are
     U+0391..U+03A9 and U+03B1..U+03C9 (Greek letters).
   - layered parsing
     For example, you might layer an SVG parser on top of an XML
     parser.   @include might be enough; then again, it might not be.
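
To illustrate the case insensitivity item above, here is roughly what
ANTLR 3 users resort to today (a common workaround pattern, not the
only one): one fragment rule per letter, with every keyword spelled
out from the fragments.

    fragment A : 'a' | 'A';
    fragment B : 'b' | 'B';
    fragment E : 'e' | 'E';
    fragment G : 'g' | 'G';
    fragment I : 'i' | 'I';
    fragment N : 'n' | 'N';
    // ... and so on for the rest of the alphabet ...

    BEGIN : B E G I N ;
    END   : E N D ;

That is twenty-six boilerplate rules, repeated in every grammar that
needs them, for something a single grammar level option could handle.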

These are, for the most part, perfectly ordinary things that any user
would expect a compiler compiler to be able to do automatically.
Yes, users have often been disappointed by existing compiler
compilers.   But these are common reasons why users abandon existing
tools in search of a better compiler compiler.   These are reasons why
5000 users a month download ANTLR in search of a better tool (though
only a minuscule percentage apparently go on to produce grammars for
real world languages with ANTLR).   Yet most of these require ugly
target language specific code hacks or have other problems in ANTLR.
Some are even handled worse in ANTLR than in its predecessors.

If the core of ANTLR is well written, most of these won't be too hard
to do.


My next message will deal with case sensitivity, with a lot of
discussion of Unicode as well.



