[antlr-interest] dfa-based lexers versus top-down antlr lexers

Fri Apr 25 18:53:24 PDT 2003

Hi,

> PROS
>
> Very readable lexers.  ('0'..'9')+ turns into a while look you can
> debug/read.

Which is a *very* big pro IMHO.

I've had some serious trouble with flex lexers in the (more distant) past and it
was just awful hard to track down the problem (one example of this: of course my
flex lexer spec had a bug in one of its regexes but I didn't know that until I
ran a compiled binary trough it for a joke. (garbage in; parser crash out. lexer
barfed on the binary input a few megs down the road and started coughing up some
very peculiar token sequences. 2 days later. Final diagnosis: two *typos* in the
flex lexer definition file annulling each other... except when.... Bummer.
Generated code that somehow still resembled my definition file at least a _bit_
would certainly have helped...)

> CONS
>
> Some lexer rules are a huge pain to specify in ANTLR because of the
> limitations of lookahead.

Can I have a head-numbing example of this? I'm mighty curious.

(Something that's not about 'comment scanning', please? ;-) )

> ANYWAY, who has thoughts on this?  I'd like thoughts also from people
> with *no* experience using DFA-based tools like lex/flex.  Do ANTLR
> lexers seem "natural"?

Err, the 'reverse' applies to me: no experience with ANTLR 2.x lexers. So sorry.
(I'm still a very happy user of PCCTS 1.33 but am looking into 2.x now too.)

FYI, since my discovery of the then-cool 'flex' tool and it's initial couple of
applications (more and less appropriate ones), I've reverted to hand-coded
lexers again; these allow me the most flexibility and readability at all times.
One of the more interesting ones was one where I ran a pascal/c like script
interpreter inside the yacc-generated code (*no* p-code or anything like that),
so yacc had to signal my lexer at the proper time to 'jump around' in the input
stream when loops, subroutine calls or 'subprogram file loads' came up. Granted,
it was slo-o-o-o-ow but it was fun and proof of concept.

Some time later I ran into PCCTS because the error-handling in yacc sucks
bigtime (for want of a better description of this, ah, elaborate 'system') and
LALR(1) doesn't fit my bill either 5 out of 10. DLG was nice, but the first
thing I checked was the way to link hand-coded scanners to ANTLR 1.x. Worked
like a charm in 'C' mode.

Since I fear I'll be using (partially) hand-coded scanners quite often, I'd vote
for the 'sneak away' approach: dear ANTLR could give us the choice to:

a) generate lexers as they are right now, i.e. top-down and with the pros & cons
like you specified them. (I read from the other replies ANTLR 2.x allows me to
'hand-code parts of the lexer' by adding scanning code in actions. yum yum :-) I
would love that.)

b) generate some interesting output for use by flex et al (the antlr 1.x + dlg
mix revisited) -- iff you feel like doing such a thing

Of course this isn't 'clean', it's not hip and it's definitely isn't *the*
answer, but I like my tools flexible and if I get something which offers me the
3 choices 'use me', 'here's something you can feed to my neighbour' and 'roll
your own, dude' (or at least choices 1 & 3), I'm happy as a pig in s[a-z]+ ;-)

Assumption: ANTLR 2.x comes with a well-defined, *documented* lexer-to-parser
interface, including those dreaded 'parser to scanner' feedback mechanisms you
often find around type-versus-variable scanner problems (antlr can resolve these
in the parser alone, but I'm sure you can think of better examples where parser
provides the lexer with feedback for mandatory 'improved' subsequent
token-scanning).

<digress mode>
(Thinking about lexers, top-down & dfa, the evil idea about the possibility to
make 'self-modifying token streams' in the *parser* keeps cropping up (old idea
of mine), i.e. have antlr [predicate] actions modify (or even add/remove) tokens
in the input stream. Too simple example would be to modify some 'identifier'
token into a 'type definition' token or other token, based on a matched/failed
predicate (or rule?). You'll surely think of other wicked uses ;-) )
</digress mode>

You guessed it: I'm in favor of simple(=human-reviewable) lexers and pushing
some complexity into the parser, thus complicating the grammar an itsy bitsy
bit.
Keep the Antlr scanner as it is. Just offer a properly documented 'way out of
here', i.e. 'HOWTO-roll-your-own-scanner.txt'.

Maybe you can point at something where my 'way of life' isn't very suitable?
Always glad to learn about the err of my ways. ;-)

Best regards,

  Ger Hobbelt

----------------------------------------------------------------------------
http://FlashExperiments.insh-allah.com/           http://www.insh-allah.com/
----------------------------------------------------------------------------
"Senior Software Engineer." Uh-uh. Yep. One of them buggers who created this
bloody fine mess. "To boldly crash where no man has made a bug before."
----------------------------------------------------------------------------
The daily amount of incoming virusses/trojans/SPAM compels me to enforce the
rule: attachments have to be in ZIP,RAR,ARJ,JAR,BZ2,GZ,TXT or TAR.GZ format.
            Graphics accepted in GIF,PNG,JPEG and TIFF formats.
!!!       >>> ANY OTHER ATTACHMENTS WILL BE SILENTLY DELETED. <<<        !!!

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/