[antlr-interest] Can subrules be set to 'n-to-m'?

John D. Mitchell johnm-antlr at non.net
Sat Mar 26 13:42:34 PST 2005


>>>>> "Richard" == Richard Matthias <richard at exaflop.org> writes:
[...]

> As an example from the CSS grammar, because it doesn't allow spaces
> between some tokens the lexer cannot just discard whitespace which means
> the parser rules have to be peppered with (mostly optional) whitespace
> tokens. So you get lots of rules like this:-

Yeah, that's usually a sign of a poorly designed language.  Alas, there are
all too many of those kinds of problems in the languages we have to deal
with. :-(


> Needless to say you have to be very careful where you place those (S)*
> sub-rules to avoid non-determinism. Oh, the comments on the ends of the
> lines are where the original yacc grammar had what I think are
> superfluous whitespace swallowing sub-rules. Actually I'd like to open a
> discussion on the best way to handle a language that needs to allow
> whitespace but only in certain places.

Well, besides "don't"? :-)

> Like I could allow the lexer to drop whitespace but then make everything
> where whitespace wasn't allowed into a single custom token, but I don't
> know if ANTLR's lexer could handle that.

Personally, I've never found a completely satisfactory solution using any
tool.

What I've done a couple of times is manually create a lexer that has just
enough understanding of the parse level to deal with the whitespace
vagaries.
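
For illustration only, here is a minimal Python sketch of that kind of
lexer.  The token names, the regular expressions, and the rule "keep
whitespace only between two selector parts" are all assumptions made up for
this example, loosely modeled on the CSS descendant combinator; none of it
is taken from a real CSS grammar or from the lexers mentioned above.

```python
import re

# Hypothetical token set for a tiny CSS-selector-like language.
TOKEN_RE = re.compile(r"""
    (?P<S>\s+)
  | (?P<IDENT>[A-Za-z_][A-Za-z0-9_-]*)
  | (?P<HASH>\#[A-Za-z0-9_-]+)
  | (?P<CLASS>\.[A-Za-z_][A-Za-z0-9_-]*)
  | (?P<PUNCT>[>+,{}:;])
""", re.VERBOSE)

# Token kinds between which whitespace is assumed to be meaningful.
SELECTOR_PARTS = {"IDENT", "HASH", "CLASS"}


def tokenize(text):
    """Split text into (kind, value) pairs, keeping only significant whitespace."""
    raw = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        raw.append((m.lastgroup, m.group()))
        pos = m.end()

    tokens = []
    for i, (kind, value) in enumerate(raw):
        if kind == "S":
            prev_kind = raw[i - 1][0] if i > 0 else None
            next_kind = raw[i + 1][0] if i + 1 < len(raw) else None
            # This is the bit of "parse level" knowledge: whitespace survives
            # only where it separates two selector parts.
            if prev_kind in SELECTOR_PARTS and next_kind in SELECTOR_PARTS:
                tokens.append(("S", " "))
            continue
        tokens.append((kind, value))
    return tokens
```

With this, the parser sees an S token in "div p" but none in "div > p" or
"div.note", so the grammar no longer needs optional whitespace sprinkled
through every rule.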


> While we're on the subject of lexers, one of John D. Mitchell's emails on
> this subject appears to denigrate the regex as something that's only
> useful for simple operations or hacks.

Hmm... I can see how it could be read that way.  To be clear, at heart,
regexps are just another tool.  One of the big problems is that, because
they are so easy to get started with, people have gone well and truly
insane, both in overusing regexps and in gerrymandering them into would-be
full-blown grammars.  That's become a vicious cycle.

And yes, I too have written ridiculously impenetrable regexp-abusing code
in a number of languages, including Perl. :-)

> That may be so, but I'd kill for a lexer right now that could handle
> common left prefixes without requiring syntactic predicates (like I want
> a load of exception-based backtracking on every token). There are some
> clever things you can do with a LL(k) based lexer but there are also some
> very basic things that you can do with lex that are an absolute nightmare
> with antlr. Hopefully the DFA-based LL(*) algorithm for antlr3 will sort
> most of this.

Yeah, the LL* stuff seems to kick ass on that sort of thing.

Do you have some good examples of some easy-using-lex constructs readily at
hand?  Those would be good for us to keep in mind as we beat on Antlr v3.
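
To make the "common left prefix" issue concrete, here is a small Python
sketch of lex-style longest-match tokenizing.  The INT/FLOAT/RANGE/DOT rule
set is a hypothetical example chosen for illustration, not one supplied in
this thread, and the code only mimics lex's "longest match, earliest rule
wins ties" behaviour rather than reproducing how lex is implemented.

```python
import re

# Hypothetical rules that share left prefixes: INT vs FLOAT, DOT vs RANGE.
RULES = [
    ("FLOAT", re.compile(r"[0-9]+\.[0-9]+")),
    ("INT", re.compile(r"[0-9]+")),
    ("RANGE", re.compile(r"\.\.")),
    ("DOT", re.compile(r"\.")),
]


def lex(text):
    """Tokenize by trying every rule and keeping the longest match."""
    pos = 0
    out = []
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        best = None
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Strictly longer wins, so on a tie the earlier rule is kept.
            if m and (best is None or m.end() > best[1].end()):
                best = (name, m)
        if best is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        out.append((best[0], best[1].group()))
        pos = best[1].end()
    return out
```

Here "1.5" comes out as a single FLOAT while "1..2" comes out as INT,
RANGE, INT, with no predicates or backtracking written by hand.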

Thanks,
	John

