[antlr-interest] dfa-based lexers versus top-down antlr lexers
Terence Parr
parrt at jguru.com
Tue Apr 29 09:48:36 PDT 2003
Hi Ger,
Most excellent thoughts. I have absorbed this for future reference.
Concerning the interface between parser / lexer, yes: it's simple:
interface TokenStream {
public Token nextToken();
}
As long as you can answer nextToken(), you can feed an ANTLR parser :)
As for flexibility, yep. I like the idea of allowing whatever lexer.
Also, I like your idea of allowing regexpr and then generating a flex
(is there a Java version) tool. Only question is: how do DFA-based
systems handle unicode well (given DFA size issues)? They don't as far
as I can tell. They resort to NFA-based stuff.
Ter
On Friday, April 25, 2003, at 06:53 PM, Gerrit E.G. 'Insh_Allah'
Hobbelt wrote:
>> PROS
>>
>> Very readable lexers. ('0'..'9')+ turns into a while look you can
>> debug/read.
>
> Which is a *very* big pro IMHO.
>
> I've had some serious trouble with flex lexers in the (more distant)
> past and it
> was just awful hard to track down the problem (one example of this: of
> course my
> flex lexer spec had a bug in one of its regexes but I didn't know that
> until I
> ran a compiled binary trough it for a joke. (garbage in; parser crash
> out. lexer
> barfed on the binary input a few megs down the road and started
> coughing up some
> very peculiar token sequences. 2 days later. Final diagnosis: two
> *typos* in the
> flex lexer definition file annulling each other... except when....
> Bummer.
> Generated code that somehow still resembled my definition file at
> least a _bit_
> would certainly have helped...)
>
>> CONS
>>
>> Some lexer rules are a huge pain to specify in ANTLR because of the
>> limitations of lookahead.
>
> Can I have a head-numbing example of this? I'm mighty curious.
>
> (Something that's not about 'comment scanning', please? ;-) )
>
>> ANYWAY, who has thoughts on this? I'd like thoughts also from people
>> with *no* experience using DFA-based tools like lex/flex. Do ANTLR
>> lexers seem "natural"?
>
> Err, the 'reverse' applies to me: no experience with ANTLR 2.x lexers.
> So sorry.
> (I'm still a very happy user of PCCTS 1.33 but am looking into 2.x now
> too.)
>
> FYI, since my discovery of the then-cool 'flex' tool and it's initial
> couple of
> applications (more and less appropriate ones), I've reverted to
> hand-coded
> lexers again; these allow me the most flexibility and readability at
> all times.
> One of the more interesting ones was one where I ran a pascal/c like
> script
> interpreter inside the yacc-generated code (*no* p-code or anything
> like that),
> so yacc had to signal my lexer at the proper time to 'jump around' in
> the input
> stream when loops, subroutine calls or 'subprogram file loads' came
> up. Granted,
> it was slo-o-o-o-ow but it was fun and proof of concept.
>
> Some time later I ran into PCCTS because the error-handling in yacc
> sucks
> bigtime (for want of a better description of this, ah, elaborate
> 'system') and
> LALR(1) doesn't fit my bill either 5 out of 10. DLG was nice, but the
> first
> thing I checked was the way to link hand-coded scanners to ANTLR 1.x.
> Worked
> like a charm in 'C' mode.
>
>
>
> Since I fear I'll be using (partially) hand-coded scanners quite
> often, I'd vote
> for the 'sneak away' approach: dear ANTLR could give us the choice to:
>
> a) generate lexers as they are right now, i.e. top-down and with the
> pros & cons
> like you specified them. (I read from the other replies ANTLR 2.x
> allows me to
> 'hand-code parts of the lexer' by adding scanning code in actions. yum
> yum :-) I
> would love that.)
>
> b) generate some interesting output for use by flex et al (the antlr
> 1.x + dlg
> mix revisited) -- iff you feel like doing such a thing
>
> Of course this isn't 'clean', it's not hip and it's definitely isn't
> *the*
> answer, but I like my tools flexible and if I get something which
> offers me the
> 3 choices 'use me', 'here's something you can feed to my neighbour'
> and 'roll
> your own, dude' (or at least choices 1 & 3), I'm happy as a pig in
> s[a-z]+ ;-)
>
> Assumption: ANTLR 2.x comes with a well-defined, *documented*
> lexer-to-parser
> interface, including those dreaded 'parser to scanner' feedback
> mechanisms you
> often find around type-versus-variable scanner problems (antlr can
> resolve these
> in the parser alone, but I'm sure you can think of better examples
> where parser
> provides the lexer with feedback for mandatory 'improved' subsequent
> token-scanning).
>
>
> <digress mode>
> (Thinking about lexers, top-down & dfa, the evil idea about the
> possibility to
> make 'self-modifying token streams' in the *parser* keeps cropping up
> (old idea
> of mine), i.e. have antlr [predicate] actions modify (or even
> add/remove) tokens
> in the input stream. Too simple example would be to modify some
> 'identifier'
> token into a 'type definition' token or other token, based on a
> matched/failed
> predicate (or rule?). You'll surely think of other wicked uses ;-) )
> </digress mode>
>
>
>
> You guessed it: I'm in favor of simple(=human-reviewable) lexers and
> pushing
> some complexity into the parser, thus complicating the grammar an itsy
> bitsy
> bit.
> Keep the Antlr scanner as it is. Just offer a properly documented 'way
> out of
> here', i.e. 'HOWTO-roll-your-own-scanner.txt'.
>
>
> Maybe you can point at something where my 'way of life' isn't very
> suitable?
> Always glad to learn about the err of my ways. ;-)
>
>
>
>
>
> Best regards,
>
> Ger Hobbelt
>
>
> -----------------------------------------------------------------------
> -----
> http://FlashExperiments.insh-allah.com/
> http://www.insh-allah.com/
> -----------------------------------------------------------------------
> -----
> "Senior Software Engineer." Uh-uh. Yep. One of them buggers who
> created this
> bloody fine mess. "To boldly crash where no man has made a bug before."
> -----------------------------------------------------------------------
> -----
> The daily amount of incoming virusses/trojans/SPAM compels me to
> enforce the
> rule: attachments have to be in ZIP,RAR,ARJ,JAR,BZ2,GZ,TXT or TAR.GZ
> format.
> Graphics accepted in GIF,PNG,JPEG and TIFF formats.
> !!! >>> ANY OTHER ATTACHMENTS WILL BE SILENTLY DELETED. <<<
> !!!
>
>
>
>
>
>
> Your use of Yahoo! Groups is subject to
> http://docs.yahoo.com/info/terms/
>
>
>
--
Co-founder, http://www.jguru.com
Creator, ANTLR Parser Generator: http://www.antlr.org
Co-founder, http://www.peerscope.com link sharing, pure-n-simple
Lecturer in Comp. Sci., University of San Francisco
Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/
More information about the antlr-interest
mailing list