[antlr-interest] dfa-based lexers versus top-down antlr lexers

Tue Apr 29 09:48:36 PDT 2003

Hi Ger,

Most excellent thoughts.  I have absorbed this for future reference.   
Concerning the interface between parser / lexer, yes: it's simple:

interface TokenStream {
	public Token nextToken();
}

As long as you can answer nextToken(), you can feed an ANTLR parser :)

As for flexibility, yep.  I like the idea of allowing whatever lexer.   
Also, I like your idea of allowing regexpr and then generating a flex  
(is there a Java version) tool.  Only question is: how do DFA-based  
systems handle unicode well (given DFA size issues)?  They don't as far  
as I can tell.  They resort to NFA-based stuff.

Ter

On Friday, April 25, 2003, at 06:53  PM, Gerrit E.G. 'Insh_Allah'  
Hobbelt wrote:
>> PROS
>>
>> Very readable lexers.  ('0'..'9')+ turns into a while look you can
>> debug/read.
>
> Which is a *very* big pro IMHO.
>
> I've had some serious trouble with flex lexers in the (more distant)  
> past and it
> was just awful hard to track down the problem (one example of this: of  
> course my
> flex lexer spec had a bug in one of its regexes but I didn't know that  
> until I
> ran a compiled binary trough it for a joke. (garbage in; parser crash  
> out. lexer
> barfed on the binary input a few megs down the road and started  
> coughing up some
> very peculiar token sequences. 2 days later. Final diagnosis: two  
> *typos* in the
> flex lexer definition file annulling each other... except when....  
> Bummer.
> Generated code that somehow still resembled my definition file at  
> least a _bit_
> would certainly have helped...)
>
>> CONS
>>
>> Some lexer rules are a huge pain to specify in ANTLR because of the
>> limitations of lookahead.
>
> Can I have a head-numbing example of this? I'm mighty curious.
>
> (Something that's not about 'comment scanning', please? ;-) )
>
>> ANYWAY, who has thoughts on this?  I'd like thoughts also from people
>> with *no* experience using DFA-based tools like lex/flex.  Do ANTLR
>> lexers seem "natural"?
>
> Err, the 'reverse' applies to me: no experience with ANTLR 2.x lexers.  
> So sorry.
> (I'm still a very happy user of PCCTS 1.33 but am looking into 2.x now  
> too.)
>
> FYI, since my discovery of the then-cool 'flex' tool and it's initial  
> couple of
> applications (more and less appropriate ones), I've reverted to  
> hand-coded
> lexers again; these allow me the most flexibility and readability at  
> all times.
> One of the more interesting ones was one where I ran a pascal/c like  
> script
> interpreter inside the yacc-generated code (*no* p-code or anything  
> like that),
> so yacc had to signal my lexer at the proper time to 'jump around' in  
> the input
> stream when loops, subroutine calls or 'subprogram file loads' came  
> up. Granted,
> it was slo-o-o-o-ow but it was fun and proof of concept.
>
> Some time later I ran into PCCTS because the error-handling in yacc  
> sucks
> bigtime (for want of a better description of this, ah, elaborate  
> 'system') and
> LALR(1) doesn't fit my bill either 5 out of 10. DLG was nice, but the  
> first
> thing I checked was the way to link hand-coded scanners to ANTLR 1.x.  
> Worked
> like a charm in 'C' mode.
>
>
>
> Since I fear I'll be using (partially) hand-coded scanners quite  
> often, I'd vote
> for the 'sneak away' approach: dear ANTLR could give us the choice to:
>
> a) generate lexers as they are right now, i.e. top-down and with the  
> pros & cons
> like you specified them. (I read from the other replies ANTLR 2.x  
> allows me to
> 'hand-code parts of the lexer' by adding scanning code in actions. yum  
> yum :-) I
> would love that.)
>
> b) generate some interesting output for use by flex et al (the antlr  
> 1.x + dlg
> mix revisited) -- iff you feel like doing such a thing
>
> Of course this isn't 'clean', it's not hip and it's definitely isn't  
> *the*
> answer, but I like my tools flexible and if I get something which  
> offers me the
> 3 choices 'use me', 'here's something you can feed to my neighbour'  
> and 'roll
> your own, dude' (or at least choices 1 & 3), I'm happy as a pig in  
> s[a-z]+ ;-)
>
> Assumption: ANTLR 2.x comes with a well-defined, *documented*  
> lexer-to-parser
> interface, including those dreaded 'parser to scanner' feedback  
> mechanisms you
> often find around type-versus-variable scanner problems (antlr can  
> resolve these
> in the parser alone, but I'm sure you can think of better examples  
> where parser
> provides the lexer with feedback for mandatory 'improved' subsequent
> token-scanning).
>
>
> <digress mode>
> (Thinking about lexers, top-down & dfa, the evil idea about the  
> possibility to
> make 'self-modifying token streams' in the *parser* keeps cropping up  
> (old idea
> of mine), i.e. have antlr [predicate] actions modify (or even  
> add/remove) tokens
> in the input stream. Too simple example would be to modify some  
> 'identifier'
> token into a 'type definition' token or other token, based on a  
> matched/failed
> predicate (or rule?). You'll surely think of other wicked uses ;-) )
> </digress mode>
>
>
>
> You guessed it: I'm in favor of simple(=human-reviewable) lexers and  
> pushing
> some complexity into the parser, thus complicating the grammar an itsy  
> bitsy
> bit.
> Keep the Antlr scanner as it is. Just offer a properly documented 'way  
> out of
> here', i.e. 'HOWTO-roll-your-own-scanner.txt'.
>
>
> Maybe you can point at something where my 'way of life' isn't very  
> suitable?
> Always glad to learn about the err of my ways. ;-)
>
>
>
>
>
> Best regards,
>
>   Ger Hobbelt
>
>
> ----------------------------------------------------------------------- 
> -----
> http://FlashExperiments.insh-allah.com/            
> http://www.insh-allah.com/
> ----------------------------------------------------------------------- 
> -----
> "Senior Software Engineer." Uh-uh. Yep. One of them buggers who  
> created this
> bloody fine mess. "To boldly crash where no man has made a bug before."
> ----------------------------------------------------------------------- 
> -----
> The daily amount of incoming virusses/trojans/SPAM compels me to  
> enforce the
> rule: attachments have to be in ZIP,RAR,ARJ,JAR,BZ2,GZ,TXT or TAR.GZ  
> format.
>             Graphics accepted in GIF,PNG,JPEG and TIFF formats.
> !!!       >>> ANY OTHER ATTACHMENTS WILL BE SILENTLY DELETED. <<<       
>   !!!
>
>
>
>
>
>
> Your use of Yahoo! Groups is subject to  
> http://docs.yahoo.com/info/terms/
>
>
>
--
Co-founder, http://www.jguru.com
Creator, ANTLR Parser Generator: http://www.antlr.org
Co-founder, http://www.peerscope.com link sharing, pure-n-simple
Lecturer in Comp. Sci., University of San Francisco

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/