[antlr-interest] Parser generator philosophy

Johannes Luber jaluber at gmx.de
Tue Jan 8 03:44:32 PST 2008


I accidentally sent this mail to Mark himself instead of to the list, so
I'm resending it properly.

Mark Whitis wrote:
>
> I never said tree rewriting isn't supported.   Everything for
> the very front end goes into one file or set of files (user's
> choice, not the tool's).   In practice, one file would usually be
> used except for modular grammars where a set of grammars
> share modules.

Well, it still LOOKS that way. I don't know how your approach can support
this, but that isn't proof that it can't.

> In fact, I am dismayed to read that antlr itself doesn't yet
> support tree rewriting rules for tree grammars, though that
> is planned for future versions.   Odd because you would
> think that it would take more to remove it.

"...take more to remove it."??? What do you mean? In any case, tree
rewriting is supported by the latest intermediate builds. 3.1 is
supposed to be released this January - or February, if other targets
than Java and C# don't work at the end of this month. (I haven't read
anything about the state of the C target yet, which should be included
in a working form. After all, it is already a working target for 3.0.)
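
To give an idea of what that looks like, here is a rough sketch of a 3.1
tree grammar with rewrites (the Expr token vocabulary and the PAREN/PLUS
token names are made up for the example, and I haven't tried this against
an intermediate build):

    tree grammar Simplify;
    options {
        tokenVocab   = Expr;       // token types from the (hypothetical) parser grammar
        ASTLabelType = CommonTree;
        output       = AST;
        rewrite      = true;       // the 3.1 feature: rewrite the input tree in place
    }

    // Drop redundant PAREN wrapper nodes; the alternatives without a
    // rewrite simply pass their subtree through unchanged.
    expr
        : ^(PAREN e=expr)   -> $e
        | ^(PLUS expr expr)
        | INT
        | ID
        ;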

> You might, as a simplified example,  have many passes in a compiler:
>   - The first pass parses the syntax.
>     it executes the portable "syntax { }" actions.
>     builds a symbol table.
>   - the second pass does the semantic tagging.
>     Can be merged with the first pass on some languages like
>     C that require forward declarations.
>     executes the portable "semantic { }" actions.
>     Does the semantic tagging.   AST tree is built here or
>     in pass 1.
>   - The third pass resolves constant subexpressions as these
>     will need to be resolved before you can determine the
>     size of objects "char a[SIZE+1]".
>     They are either removed from the tree or kept but tagged
>     as dead weight but still available for things like
>     error messages, so the compiler can tell you that
>     "a+b+c", which resolves to "15", is not a valid value,
>     rather than just confusing you with "15".
>   - The fourth pass, which might be merged with the third,
>     would do source level optimization like pruning code where:
>        if(never) {
>        }
>     It might also factor out constant subexpressions, do
>     loop unrolling, etc.
>   - The fifth pass outputs LLVM assembler
>   - subsequent optimization and code generation passes handled by LLVM

I'm not that much of an expert, compared to Ter, but the first two
passes seem awfully complicated to me...
> 
> The division between first and second pass is a little fuzzy, still.
> The basic issue is that many languages require you to build at least
> a minimal symbol table before you can finish parsing.   In C, is "a*b"
> multiplication (with discarded result) or declaring b to be a
> pointer to type a?   If the language has an ambiguous syntax and
> lets you write things in any order like
>    a*b;
>    typedef int a;
> Then things start to get really ugly.

There are probably too many different variations of these corner cases
for ANTLR to support them directly.
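
What ANTLR does give you for this particular C ambiguity is semantic
predicates that consult a symbol table built up during the parse. A
minimal sketch - the rule and helper names are made up, and a real C
grammar is of course much hairier:

    @members {
        java.util.Set<String> typeNames = new java.util.HashSet<String>();
        boolean isTypeName(String name) { return typeNames.contains(name); }
    }

    typedefDecl
        : 'typedef' 'int' name=ID ';' { typeNames.add($name.text); }
        ;

    statement
        : {isTypeName(input.LT(1).getText())}? declaration   // "a*b;" where a is a type
        | expressionStatement                                 // otherwise a multiplication
        ;

    declaration         : ID '*' ID ';' ;   // declares b as pointer to type a
    expressionStatement : ID '*' ID ';' ;   // result is discarded

    ID : ('a'..'z' | 'A'..'Z')+ ;
    WS : (' ' | '\t' | '\r' | '\n')+ { $channel = HIDDEN; } ;

The predicate is evaluated while the parser decides between the two
alternatives, so the symbol table has to be filled in before the reference
is seen - which is exactly why C insists on forward declarations.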

...

> Yes, the core will need to be extended
> some as languages are added.

And that's exactly the kind of situation which shouldn't happen. ANTLR-X
has to be complete and abstract enough that it doesn't require this.

...

> The language specific stuff is only for the stuff
> that can't be handled by the portable actions.
...

That illustrates the ironic point: you want ANTLR to become language
independent, yet you still allow language-specific code. I wonder whether
YGGDRASIL can manage a truly language-independent implementation, or
whether it breaks down at a later point. Maybe it's enough to keep the
grammar file free of language specifics, but to require translating an
additional backend (beyond what is strictly necessary to compile the
translated grammar file itself), implemented in separate files. That
would definitely make things cleaner, as replacing the innards of an
ANTLR grammar file can be quite tricky.

...
> Can all tools derive 100%?  Maybe not.   But getting more than 90% is
> probably reasonable.

How do you add the last 10% without risking losing it in a later
regeneration?
...
>> If you need more than fifty different things, which will nonetheless
>> never be enough, then you aren't doing things orthogonally.
> 
> 
> You could say the same meaningless statement about needing 50 different
> grammars for 50 different languages which will never be enough because
> there will always be a language 51.    There will always be another
> language.    However, the number of basic tags increases at a much
> lower rate than the number of languages.   It is possible for the
> number of tags to be smaller than the number of languages supported.
> A couple hundred tags could support an infinite number of programming
> languages, if not real ones.   If you group the tags into
> sets (TRY, CATCH, FINALLY, RAISE) then the number of tag groups is
> roughly equivalent to the number of constructs supported.
> 
> Your criticism seems analogous to criticizing English because
> "noun" and "verb" are defined in a dictionary and new nouns
> and verbs need to be added.    I am sure we could both
> rip into English for a whole lot of reasons but that isn't
> really one of them.   Now, Lojban, IIRC, tries to address
> the vocabulary issue by using compound words such that
> "small-ice-planet" is used for meteor.   And it wouldn't
> hurt to apply a little of that to the TAG naming to make
> the names more consistent (or "orthogonal" if you prefer).

Actually, I come from XAML (WPF from Microsoft), where each class has its
own tag instead of using a generic tag like "<class name='...'>". And I
happen to know a few RPG magic systems that try to deal with sensible
spell creation. If you end up adding effectively the same effect to every
element of a certain layer, that screams that two different things are
being mixed. How applicable my otherwise sensible criticism is here is an
entirely different question - one that I, at least, can't answer.

> Tags are much better than rewriting compiler glue code for every
> language that supports a construct.    The latter is of order N*M where N
> is the number of languages and M is the number of constructs.
> 
> For data files, the secondary attributes I refer to like
> 'NAME','PHONE_NUMBER', 'CITY', 'STATE', 'POSTAL CODE', 'COUNTRY', EMAIL,
> URL, CREDIT_CARD_NUMBER, etc.) are strings.  It costs very little to
> convey additional information.  These are just taken from a domain
> specific dictionary with shared subsets.  Thus, if a program sees a tree
> like:
>   ^('Numero Telephono:':PHONES:LIST:
>       '+1-234-567-8901':PH_ENTRY:LIST_ITEM:'PHONE_NUMBER'
>       '+1-234-567-8902':PH_ENTRY:LIST_ITEM:'PHONE_NUMBER'
>       '+1-234-567-8903':PH_ENTRY:LIST_ITEM:'PHONE_NUMBER'
>    )
> without understanding the grammar rule names (which may be derived from
> one of many standards and thus different for each grammar that contains
> contact info) or the language (Spanish) the data file was written in,
> it immediately understands that it has a list of three
> phone numbers and can dial a number if you click on it, can search
> based on phone number, etc.

This looks as if you are trying to add DSLs to the ANTLR grammar syntax.
I prefer the LISP approach instead: use a generic mechanism to define a
DSL and then write in it. I'm not sure whether that is feasible, but it's
cleaner, prevents accidental mixing of tags, and even allows different
meanings (e.g., "bank" can mean two different things).

...
>>> Users would define their own, nonstandard tags using the common
>>> "X_NAME()"
>>> convention until the tags were standardized.
>>
>> This approach makes me shudder...
> 
> Why?   It works fine in many other areas.   Any language which can't
> be extended is limited in its expressive power.

Because your approach looks too open-ended and muddled to me. Firstly,
too many people adding things creates confusion ("Where do I put feature
X? Should I use feature Y here?"). Secondly, I believe overlapping tags
may force a bigger overhaul later - at least if the definitions can't be
exchanged in a way that doesn't affect existing code.
Am I too pessimistic here...?

...
>> Unicode handling makes case insensitivity more complicated. In Turkish
>> the uppercase of i isn't a plain I, but an I with a dot above it. In any
>> case, you have to add locales. I know that .NET supports locales, so at
>> least there it may be easy to compare the input. But I don't know if the
>> templates used to generate the parser can be easily updated.
> 
> I deal with this at length in my case sensitivity post, which I mentioned
> was forthcoming.  The short version is that there isn't much excuse for
> not creating the basic infrastructure and providing an ASCII
> implementation, and you can substitute Unicode methods that will work for
> a lot of Unicode using existing libraries.  But if you want proper
> handling, you will have to abandon the notion of a string as an array of
> 8/16/32 bit characters and instead treat it as a string of variable
> length objects that you use standard library functions to sequence
> through, compare, etc.  For ASCII you can optimize by making your string
> methods inline and suffer very little performance penalty.  Full Unicode
> support would be a gradual transition.  Would have been pretty easy to do
> in the ANTLR3 rewrite, probably much harder now.   Spend a couple of
> hours skimming the Unicode standard and you have a pretty good idea what
> your string class should look like.

I hope you aren't suggesting implementing your own string class rather
than using existing implementations.
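
For what it's worth, the usual ASCII-only workaround today is to spell
case-insensitive keywords out with fragment rules, roughly as below. That
sketch obviously says nothing about locale-correct folding such as the
Turkish dotted/dotless i - for that an existing Unicode library is the
right tool.

    SELECT : S E L E C T ;   // matches "select", "SELECT", "SeLeCt", ...

    fragment S : 's' | 'S' ;
    fragment E : 'e' | 'E' ;
    fragment L : 'l' | 'L' ;
    fragment C : 'c' | 'C' ;
    fragment T : 't' | 'T' ;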
> 
>>>   - selectively disallowing whitespace between tokens
>>
>> Something I did by checking the indices of the supposedly neighboring
>> tokens (the difference may be only one).
> 
> Target language and runtime specific but I might borrow that as a
> workaround until ANTLR is fixed.

Indices are actually available as ANTLR token attributes, although the
implementation is indeed target specific.
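
The trick looks roughly like this - a sketch that assumes whitespace is
sent to the hidden channel rather than skipped, so it still occupies a
slot in the token stream, and that I remember the index attribute
correctly:

    // accept "++" only when the two '+' tokens are directly adjacent
    increment
        : a=PLUS b=PLUS {$a.index + 1 == $b.index}?
        ;

    PLUS : '+' ;
    WS   : (' ' | '\t' | '\r' | '\n')+ { $channel = HIDDEN; } ;

If anything sits between the two tokens, the validating predicate fails
and the parser reports a FailedPredicateException.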
> 
>>>   - choosing between multiple token rules that match the same input
>>>     based on parser context.
>>
>> Can be done already. Just use a normal token name (like "TILDE") for
>> scanning, and via rewriting and imaginary tokens you can get to
>> "CONCATENATION[TILDE]".
> 
> Not sure exactly what you are saying but it sounds target language and
> runtime specific.

Nope, normal ANTLR grammar. This is explained in TDAR (The Definitive
ANTLR Reference) in the section about imaginary tokens.
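
In grammar form it is roughly the following (rule names made up; the
grammar is assumed to have output=AST):

    tokens { CONCATENATION; }   // imaginary token, never emitted by the lexer

    concatExpr
        : a=atom TILDE b=atom
          -> ^(CONCATENATION[$TILDE] $a $b)   // new node keeps the tilde's text/position
        ;

    atom  : ID ;

    TILDE : '~' ;
    ID    : ('a'..'z' | 'A'..'Z')+ ;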
...
>> Which reminds me: lexer tokens can't have arguments unless they are
>> fragment rules. I forgot the reason for this, but orthogonality-wise
>> it's not a good decision, even if implementation-wise the reason may be
>> sound.
> 
> Yes, this could be a pain to implement.  I can suggest a number of
> reasons:
>   - general absence of communication between parser/lexer.
>   - parser doesn't call lexer functions directly, but through
>     a stream class that isn't yet flexible enough to communicate.
>   - precomputed state machines
>   - the need to purge lookahead token cache, etc. when changing
>     values
>   - a lot of optimizations may assume that the breakdown of text
>     into tokens is constant regardless of context.
>   - possible use of code that resembles
>      if(get_token()==FOO)
>    rather than
>      if(match_token(FOO))
>    You can add arguments to the second version and modify the meddling
>    classes in between to pass the data through.
> 
> There are ways around this.    They aren't necessarily easy.  But I
> have mentioned some of them.
> 
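
To make the fragment distinction concrete, here is a small sketch of my
point above: an ordinary token rule may not declare parameters, because
nextToken() invokes it with none, while a fragment rule is only ever
called explicitly from other lexer rules and therefore may.

    STRING : '"' ( ESC[true] | ~('\\' | '"') )* '"' ;

    fragment
    ESC[boolean allowUnicode]
        : '\\'
          ( 'n' | 'r' | 't' | '\\' | '"'
          | {allowUnicode}?=> 'u' HEXDIGIT HEXDIGIT HEXDIGIT HEXDIGIT
          )
        ;

    fragment HEXDIGIT : '0'..'9' | 'a'..'f' | 'A'..'F' ;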
...
>>>   - operator tokens defined by the user
>>>     of a grammar, not the grammar itself.   Requires a runtime
>>>     table lookup.   Multiple character operators "++" would be harder to
>>>     implement, though possible.   This gets around, for example, the
>>>     C++ limitation on defining new operators.
>>>      U+2200 .. U+22FF (mathematical symbols) are prime candidates,
>>>     as are U+0391..U+03A9 and U+03B1..U+03C9 (Greek letters).
>>
>> Huh? At which step of the usual compiler development is the user supposed
>> to add their new operators?
> 
> At no stage of compiler development.   The end-user declares their tokens
> in the file being compiled.    This should not be as hard as it sounds
> at first.   It means that each state of a state machine where this
> token may be relevent consults a lookup table.  This will typically
> affect a few special rules:
>     NEW_OPERATOR_PRECEDENCE_1:  :new-token[1]: ;
>     NEW_OPERATOR_PRECEDENCE_2:  :new-token[2]: ;
>     NEW_OPERATOR_PRECEDENCE_3:  :new-token[3]: ;
>     NEW_OPERATOR_PRECEDENCE_4:  :new-token[4]: ;
>     ...
> This gets tagged NEW_OPERATOR and the compiler back end takes advantage
> of the fact that it gets both tags and the original text.  Lexer states
> just call is_new_operator(precedence, char_reference);  you arrange
> things so these rules are called last.   Just a tad more and you
> can define unary vs binary operators.  Presto, you
> have the ability to use a whole bunch of new operators and define
> their precedence.    http://www.unicode.org/charts/PDF/U2200.pdf
> 
> Making the changes to ANTLR grammar and lexer would be done at the
> same time unicode property matching is added, so the marginal cost
> is small.

If I read this right, you allow people to modify the behaviour of an
already compiled parser?
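
If so, I imagine it would look roughly like the following with today's
ANTLR (all names are made up and I haven't tried it): the operator table
lives in the parser, and gated predicates consult it while the file being
compiled is parsed.

    @parser::members {
        // operators declared by the file being compiled, mapped to a precedence level
        java.util.Map<String, Integer> userOps = new java.util.HashMap<String, Integer>();

        boolean isUserOp(int prec) {
            Integer p = userOps.get(input.LT(1).getText());
            return p != null && p.intValue() == prec;
        }
    }

    // e.g. "operator <some U+2200..U+22FF symbol> 2 ;" declares a binary operator
    operatorDecl
        : 'operator' op=USER_OP prec=INT ';'
          { userOps.put($op.text, Integer.valueOf($prec.text)); }
        ;

    // precedence 1 binds loosest, 2 binds tighter
    expr : term ( {isUserOp(1)}?=> USER_OP term )* ;
    term : atom ( {isUserOp(2)}?=> USER_OP atom )* ;
    atom : ID | INT ;

    USER_OP : ('\u2200'..'\u22FF')+ ;   // the mathematical operators block from the post
    ID      : ('a'..'z' | 'A'..'Z')+ ;
    INT     : ('0'..'9')+ ;
    WS      : (' ' | '\t' | '\r' | '\n')+ { $channel = HIDDEN; } ;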

> One problem here is that ANTLR currently doesn't let treeparsers
> output trees.

3.1 can do that.
...
> Thanks for your input.
> 

No problem.

Johannes


