[antlr-interest] Complementing ANTLR with parboiled

Mathias mathias at parboiled.org
Fri Mar 5 10:23:50 PST 2010


Ron,

thanks for your feedback.

> OK, I was bemused by your "motivation" page. The motive is built
> around the desire to create domain-specific languages with Java.
> But then, the first disadvantage you claim for existing parser generators
> is this:
> 
>  Special, non-java grammar syntax in separate project files
> 
> Um, that's because parser generators *are* domain-specific
> languages!  So, you don't like special, non-java syntaxes,
> but your goal is to create a tool that lets people create special,
> non-java syntaxes. :-)

I understand that at first glance the point you mention might seem like a contradiction.
However, it's not that I don't like DSLs, au contraire!
You could say that parboiled also uses a DSL for defining grammars; the difference is that it is an internal rather than an external DSL.
I think that when developing a DSL one should take into account the environment in which the targeted DSL users will be working.
A business user of my application might be perfectly content entering short snippets of a business-rule DSL on a website without further support (apart from documentation). However, any serious present-day Java developer relies heavily on his/her IDE to manage large code bases and to offset Java's relatively high verbosity.
When designing the underlying grammar description DSL for a parser generator written in Java, one has two choices:
a) Choose an external DSL (like ANTLR) and gain conciseness, but forego automatic IDE support, which can only be achieved through the tedious development of custom plugins for all major IDEs.
b) Choose an internal Java DSL (like parboiled) and trade the compactness and expressive power of a custom syntax for automatic support in all IDEs.

IMHO, whether a) or b) yields the better compromise depends on the size and complexity of the languages the parser generator is designed for.
For large projects, where big, complicated languages have to be defined, a) might be the better choice, since otherwise the limitations of Java as a "carrier" for the grammar description DSL might be too restrictive and make the grammar description bloated and unmanageable.
However, when smaller, less complicated grammars are the main target of a parser generator, I would argue that b) is the better approach.
Defining the target language grammar directly in Java instead of in a special syntax puts it under the full power of modern IDEs: syntax highlighting, code completion, code navigation, inspections, reference analysis, refactoring support... they all work out of the box.
Not having to learn another syntax speeds things up, as does not having an additional build step for an external generator.
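
To make this concrete, here is a minimal sketch of what such an internal Java grammar DSL can look like, written against parboiled's BaseParser API. The grammar itself is just a toy example of mine, and the runner invocation follows later parboiled releases, so the exact calls may differ from version to version:

    import org.parboiled.BaseParser;
    import org.parboiled.Parboiled;
    import org.parboiled.Rule;
    import org.parboiled.parserunners.ReportingParseRunner;
    import org.parboiled.support.ParsingResult;

    // A tiny arithmetic grammar, defined directly in Java: every rule
    // is an ordinary method returning a Rule, so syntax highlighting,
    // completion, navigation and refactoring all apply to the grammar.
    class CalculatorParser extends BaseParser<Object> {

        public Rule Expression() {   // Expression <- Term (('+' / '-') Term)*
            return Sequence(Term(), ZeroOrMore(Sequence(AnyOf("+-"), Term())));
        }

        public Rule Term() {         // Term <- Factor (('*' / '/') Factor)*
            return Sequence(Factor(), ZeroOrMore(Sequence(AnyOf("*/"), Factor())));
        }

        public Rule Factor() {       // Factor <- Number / '(' Expression ')'
            return FirstOf(Number(), Sequence('(', Expression(), ')'));
        }

        public Rule Number() {       // Number <- [0-9]+
            return OneOrMore(CharRange('0', '9'));
        }
    }

    class CalculatorExample {
        public static void main(String[] args) {
            CalculatorParser parser = Parboiled.createParser(CalculatorParser.class);
            ParsingResult<Object> result =
                    new ReportingParseRunner<Object>(parser.Expression()).run("1+2*(3-4)");
            System.out.println(result.matched);  // true
        }
    }

This is of course noisier than the equivalent ANTLR notation, which is exactly the compactness-for-tooling trade-off described above.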

> But seriously, much of the other motivation points also suffer
> the same self-contradictory problem vis a vis the basic nature
> of a domain-specific language. OTOH, this point deserves special
> note:
> 
>    More complicated design and maintenance through divided
>    parsing process in lexing (token generation) and token
>    parsing phases
> 
> The division of labor between lexing and parsing is more than
> half a century old, and it was arrived at (and survived) because
> it does exactly the opposite of what you say: it makes the
> code more modular and easier to maintain. When you try to
> pack the two together for any non-trivial language, you
> inevitably see the hacks multiply (e.g., something as trivial
> as white space becomes some kind of "special case").

You are right, the division into lexing and parsing is very old, and it has performance advantages and can make things like whitespace handling easier. However, it also has drawbacks: lexing differs from parsing in its underlying logic and is therefore an additional concept to understand; it requires a separate specification; and it makes it difficult to compose grammars.
On today's hardware, performance is not a problem for most applications. And the other main reason the split was introduced decades ago, grouping input characters into tokens so that parsers with limited look-ahead could "see further", is irrelevant for Parsing Expression Grammars, which do not suffer from any look-ahead limitation.
So again, whether or not to split the whole process into separate lexing and parsing phases depends on the application.
If performance and whitespace handling are really important, a separate lexing phase might make sense. Otherwise, things are easier to build and maintain without it, IMHO.
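
To illustrate the whitespace point: in a scannerless PEG the usual convention is to handle whitespace explicitly but locally, for example by letting every terminal rule consume its own trailing whitespace. A sketch in the same parboiled style (the rule names and the Terminal helper are illustrative inventions of mine, not part of any shipped grammar):

    import org.parboiled.BaseParser;
    import org.parboiled.Rule;

    // Scannerless whitespace handling: there is no separate lexer that
    // silently skips whitespace, so each terminal rule swallows any
    // whitespace following it, and the higher-level rules stay clean.
    class AssignmentParser extends BaseParser<Object> {

        public Rule Assignment() {   // matches e.g. "foo = 42"
            return Sequence(Identifier(), Terminal("="), Number());
        }

        public Rule Identifier() {
            return Sequence(OneOrMore(CharRange('a', 'z')), WhiteSpace());
        }

        public Rule Number() {
            return Sequence(OneOrMore(CharRange('0', '9')), WhiteSpace());
        }

        // Helper: match a literal, then consume trailing whitespace.
        public Rule Terminal(String literal) {
            return Sequence(String(literal), WhiteSpace());
        }

        public Rule WhiteSpace() {
            return ZeroOrMore(AnyOf(" \t\r\n"));
        }
    }

The cost is that every terminal has to remember to call WhiteSpace(); the benefit is that there is no second specification and no token stream to coordinate.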

> Finally, as we live in an age where CPU speed has peaked
> and even begun to decline, there is increasing pressure to
> parallelize code to take advantage of the only remaining
> practical advantage of Moore's law -- increasing numbers
> of CPUs. For a language processor, one of the few neat
> and modular divisions of labor that can easily be put
> in parallel is the division between lexing and parsing.
> Often, efficiency doesn't matter for parsing, but since
> you list ANTLR's footprint as a disadvantage, it becomes
> contradictory to claim that combining lexing and parsing
> so they can't be parallelized is an unvarnished advantage.

Yes, ANTLR's footprint in KB certainly isn't the main point.
But the overall size and complexity of all its parts can make it hard to get started with.

> None of this is by way of criticism of the project, which
> I find interesting reading (thanks for the pointer!).

parboiled's raison d'être is not to replace ANTLR, JavaCC or any other traditional parser generator.
All it aims to offer is an alternative for applications where ANTLR & Co. are currently used outside of their primary target areas.

Cheers,
Mathias

---
mathias at parboiled.org
http://www.parboiled.org


