[antlr-interest] Parsing Lisp into C++

Ger Hobbelt ger.hobbelt at bermuda-holding.com
Tue Sep 29 17:16:02 PDT 2009


> Richard Lewis wrote:
>> This is not an intellectual exercise or a homework assignment. The Lisp
>> in question is a peculiar dialect of CLOS developed for our clients to be
>> able to embed scripting behavior in our application suite. The interpreter
>> has been tweaked to the point of which we can no longer squeeze out any
>> more performance gains. There are 10's of thousands of lines of code
>> distributed amongst many clients so changing the language in any way is
>> not an option.
>>
>> So the next step is to have it compiled natively.

Oh boy...

When I first read this, my very first thought was 'you should see
what Autodesk did and take that as a [strong] hint', and the more I
see appear in this thread, the stronger that feeling gets.

For those who say 'Auto-who?!?!': think the number one vendor of CAD
solutions worldwide (AFAIK). They've offered LISP 'scripting' since
about the stone age and they've hit this barrier (and others) over
the years. FYI:

http://en.wikipedia.org/wiki/AutoLISP

or google 'autoLISP' or 'Visual LISP'


You mention your bottleneck is *performance*, and this is me reading
between your lines, but my impression is that it's such an important
bottleneck that you're willing to spend the effort to change from the
existing interpreter to something entirely different, e.g. a [JIT?]
compiler, or even another scripting language. That means the number
of passes of the compiler/translator-to-be is not the most important
issue here.


Since it's SPEED that you need, let me be very clear: if you do not
have extensive, readily available knowledge of and experience with
constructing optimizing compilers and/or language translators, you're
at the start of an unfathomable battle.
And assuming you have those abilities aplenty, there's still the
issue of 'maintenance', which splits two ways:

a) when your 'translated' code will ever (if only 'just once') hit
human retinae, you've just chosen to walk the path to utter damnation,
because machine-translated code is /never/ maintainable in the long
run; code like that gets rewritten as quickly as possible, driving up
maintenance cost because you're effectively paying for the translation
twice, once by machine, once by hand, with the manual rewrite at an
additional disadvantage, so TCO goes through the roof (but smart
accountants know where to hide that long-term result; what's the
average tenure of a board-level position, after all?)

b) when your 'translated' result is strictly for machine eyes only,
which means you're not translating but /compiling/ your source code,
you can either do it all from scratch or buy a [half-]product. In this
case, one can argue it's a technology-only issue (budgets allowing),
where your need for speed (there's that again) implies your major
effort will /not/ go into creating the parser (i.e. the ANTLR work)
but into the optimizer part of your compiler, i.e. adding the smart
AST rewriter(s), etc. In other words: ANTLR will only be the start of
it, if you wish to build your own translator. (A human expert
interpreting and re-implementing the 'original intent' by hand can be
quite competitive here.)

Alternatively, you can seek an already available [partial] solution
for choice (b): a solution which you pick because its generated
output is already extremely fast, then spend your remaining budget
and time on integrating and tweaking that engine. Run-time speed of
the generated output would win over the parser generator tools used,
I'd say. If your company decides to take the (b) route, I suggest you
take this latter choice: acquire an existing translator/compiler
engine and tweak and embed that one. If it happens to use ANTLR, more
power to it and you, but ANTLR, in this particular case, is part of
the easy 20%, not the backbreaking 80%.



Please backpedal, and your board of directors should (re)convene on
this as well, as it's /them/ who must make the /strategic/ choice of
where your money (& time) will be spent:

1) building a compiler/translator from scratch (and state very
explicitly whether the output is for humans or 'strictly machine
only'; others will need to phrase this as 'output is editable
(yes/no)' to be able to discover the significant consequences of such
a decision). Building or purchasing is only a secondary choice here.

2) take the route Autodesk effectively did: if it's not fast enough
for 'you' (for any user value of 'you'), there's an alternative
user-scriptable ABI/API for you. Autodesk offers native C++
interfaces for those folks who need near-bare-metal speed. Note that
in 20+ years they have switched their LISP interpreter only once,
replacing the old one with a new development environment, but it has
remained an interpreter all the same. The major /strategic/ choice
/they/ made was adding other extensibility options: their modern C++
and .NET interfaces.

3) phase out your LISP support and substitute something else for it;
during the transition phase your company can offer translation
services for those users who need them. (I wouldn't advise this right
away, because LISP is quite a different animal, and once you get used
to it... If Autodesk were to announce a choice like this, you'd get
spontaneous lynch mobs. If the flame wars on their discussion boards
over the years about whether to lisp or not to lisp are any
indication, oy oy oy!)


I like ANTLR very much, and for what it does I consider it the best
tool around by a very large margin, but this way lies lunacy as well
if you're not mindful of the consequences of taking this road.
Context is everything, sometimes.




Reread the messages by Mr. Kaplan twice for good measure, and also
reread the messages by the other posters.




Then, if you wish to persevere, my reply to

"My question was one of style with using Antlr... Is it better to
muddy the parser grammar with a single pass or split the AST creation
into 2 passes?"
or rather: "Where is the best place to attach semantic information?"

is that the 'right' answer is one of taste. My answer: "Spread the
wealth. Do both".

Elaboration:
I've done both, single- and multi-pass, and I've even considered it a
particularly enticing challenge to roll it all into a strictly single
pass (a challenge I've put to myself twice, just to prove to myself I
could do it. Sigh. The male species...). The point here: single-pass
sounds nice, but taking it to the extreme will make it a
maintenance/readability headache.
In actual practice, the engines which I still like to 'grok' today,
even when I wrote them a long time ago, are the ones that use a
single, fast pass to turn text into an AST, add the semantics as far
as was /easily/ doable during that phase, then update/extend those
semantic attributes in subsequent 'passes', i.e. AST
traversals/rewrites. Not too amazingly, those 'both passes' solutions
also happen to be the fastest, even when compared to my
oh-so-mentally-challenging single-pass compilers/interpreters. So:
'add what you can, when and where it feels right'. There is no 'best'
there; it quickly becomes similar to the question 'which programming
language is best?' -- too little context to remain truly meaningful.
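
To make that 'spread the wealth' idea a bit more concrete, here's a
rough C++ sketch (the node layout and attribute names are purely my
own illustration, not anything from your codebase or from ANTLR):

// Pass 1 builds the AST and fills in the cheap semantic attributes;
// later passes walk the finished tree to refine the expensive ones.
#include <cctype>
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct Node {
    std::string op;                          // token text, e.g. "defun", "+", "x"
    std::vector<std::unique_ptr<Node>> kids;
    // semantic attributes: set as far as /easily/ possible in pass 1,
    // refined in later AST traversals
    bool is_constant = false;
    int  scope_depth = -1;
};

// Pass 1 (parser action / tree construction): only what is cheap right now.
std::unique_ptr<Node> make_node(std::string op, int current_scope_depth) {
    auto n = std::make_unique<Node>();
    n->op = std::move(op);
    n->scope_depth = current_scope_depth;    // trivially known while parsing
    return n;
}

// Pass 2..n (AST traversal): attributes that need the whole subtree.
void mark_constants(Node& n) {
    bool all_const = true;
    for (auto& k : n.kids) {
        mark_constants(*k);
        all_const = all_const && k->is_constant;
    }
    // a numeric leaf is constant; an operator node is constant iff its kids are
    n.is_constant = n.kids.empty()
        ? (!n.op.empty() && std::isdigit(static_cast<unsigned char>(n.op[0])))
        : all_const;
}

The point being: scope_depth is free while parsing, whereas
is_constant wants the whole subtree, so it naturally lands in a later
traversal.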


And here's a thought about that single- versus double-pass doodah:
all compilers, even the single-pass ones, are multi-pass. It just
depends on what you define as 'a pass'. The only really 'one-pass
compilers' (and the 'one-pass interpreter' I once created using
bison/yacc) are non-optimizing straight
translators/interpreters/compilers.
Since LISP and C/C++ are culturally (to borrow Mr. Kaplan's / Prof.
Eco's phrasing) very far apart, I'm willing to bet you'll need
multiple passes anyway to get anywhere near a viable translation.
And then... what is an optimizer but another couple of passes --
where 'pass' == 'traversal of the current code representation', i.e.
an AST traversal, once you're past the lexer+parser (ANTLR) and have
arrived at the 'backend'. That is multiple passes, or at /least/ one
more with a bit of extra hither-and-yon, which is mandatory if you
wish to rewrite (== optimize) your AST after all.
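
For illustration only (again a sketch of my own, not your grammar or
anyone's library): such an 'extra pass' can be as small as a
constant-folding rewrite over the AST --

#include <memory>
#include <vector>

struct Expr {
    char op = 0;                 // '+' for an addition node, 0 for a literal
    long value = 0;              // meaningful when op == 0
    std::vector<std::unique_ptr<Expr>> kids;
};

// 'pass' == one traversal of the current code representation:
// fold (+ 1 2 3) into the single literal 6 when all operands are literals.
void fold_constants(Expr& e) {
    bool all_literal = true;
    long sum = 0;
    for (auto& k : e.kids) {
        fold_constants(*k);                  // rewrite the subtree first
        all_literal = all_literal && k->op == 0;
        sum += k->value;
    }
    if (e.op == '+' && all_literal && !e.kids.empty()) {
        e.op = 0;                            // rewrite: operator node -> literal
        e.value = sum;
        e.kids.clear();
    }
}

Every such rewrite is 'just' one more traversal; strip away the
terminology and that's all an optimizer is.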

The number of passes might seem to matter at first, but in the end I
find it's not that important. There are more O(N) factors than just
the number of passes.
What I try to do myself is one fast pass lexing+parsing the
human-readable stuff into some form of AST (fast, because this pass
cuts multi-character 'slow' text access down to 'fast' single
ID/integer-based token access), do it fast and easy (ANTLR can be
quite useful there, though I confess I tend to handcode almost all my
lexers), then do the complex voodoo in the AST traversals. Try to cut
down on the number of traversals, or rather, the number of nodes
visited -- and when you can, cut down on heap dynamics
(malloc/free/new/delete), both the explicit and implicit ones, as
those can eat an inordinate amount of time when you're not careful.
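
As a small illustration of that last point (my own sketch, under my
own assumptions, nothing more): one common trick is to carve all AST
nodes out of a single arena, so the parse doesn't pay for a separate
new/delete per node --

#include <deque>

struct Token { int id; int text_offset; };   // ID/offset only, no string copies

struct AstNode {
    Token tok;
    AstNode* first_child  = nullptr;
    AstNode* next_sibling = nullptr;
};

class NodeArena {
public:
    // std::deque keeps element addresses stable on push_back, so the
    // pointers handed out here stay valid; every node is released in one
    // shot when the arena goes out of scope -- no per-node delete traffic.
    AstNode* alloc(Token t) {
        nodes_.push_back(AstNode{t, nullptr, nullptr});
        return &nodes_.back();
    }
private:
    std::deque<AstNode> nodes_;
};

In the same spirit, keeping token IDs and text offsets in the nodes
instead of copied strings keeps the /implicit/ allocations down too.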

And that's enough for now. It's high time for some shuteye here.

--
Met vriendelijke groeten / Best regards,

Ger Hobbelt

--------------------------------------------------
web:    http://www.hobbelt.com/
       http://www.hebbut.net/
mail:   ger at hobbelt.com
mobile: +31-6-11 120 978
--------------------------------------------------

