[antlr-interest] unicode 16bit versus new 21bit stuff

Terence Parr parrt at cs.usfca.edu
Sat Jun 19 11:42:57 PDT 2004


On Jun 19, 2004, at 11:09 AM, Mike Lischke wrote:
>> I'm secretly planning to allow all sorts of cool stuff like
>> parsers that can handle single char tokens w/o going to the
>> lexer and so on.
>> Having the parser at runtime be able to distinguish char from
>> token type just by looking at the value was going to be mighty handy.
>
> How's that going to work? Will you let the parser take over some of 
> the lexer functionality? How often do you expect is
> it that one needs to recognise single character tokens? Isn't it 
> usually so that entire words are being tokenized? Well,
> right, there are the symbols used for example in expressions. But then 
> you'd need a separate path in the parser a la "is
> it a single char token then use it now, if not ask the lexer what it 
> really is". Doesn't sound like a big speed
> improvement to me

Well, SEMI and LPAREN etc... come to mind, but I guess you're 
right--I'm not sure how much the parser will be able to optimize out of 
this.  I do like the channels idea, however. :)

> .What I rather would like is that the lexer would more work so that I 
> could use tokens with overlapping definitions. For
> instance currently I have to make
>
>   INPUT_CHARACTER: ~('\n' | '\r' | '\u2028' | '\u2029');
>
> protected, otherwise it conflicts with almost anything in my grammar. 
> Currently I cannot define:
>
>   DIGIT:           '0'..'9';
>   HEX_DIGIT:       DIGIT | 'a'..'f';
>   OCTAL_DIGIT:     '0'..'7';
> Without making all three rules protected. I think you got the pattern.

The DIGIT isn't a token by itself though is it?  It should be protected.

The LL-star algorithm will gladly predict anything we throw at it now, 
but at the cost of possibly examining the chars for some token twice in 
the worst case.  I expect that we will be able to optimize this out in 
many cases, but the exact algorithm is still being developed as a 
background process in my head. ;)  It is essentially related to the 
idea that most lexer rules have no actions and are inherently regular: 
let the DFA predictor simply set the token type and return w/o ever 
calling the rule.  For example, if you have INT and FLOAT as tokens but 
both have no actions and are clearly regular (no recursion), the DFA 
predictor would match either and then immediately return the token w/o 
calling method mINT() or mFLOAT().  In summary, it builds a DFA like 
all other scanner generators like flex unless you happen to stick an 
action in the middle of a rule or you use recursion in the lexer.  So, 
we get the speed *and* power. :)  Woohoo!

> I think ANTLR is pretty damn fast and going 64 bit wouldn't hurt that 
> much (can't speak for C++, though). I believe you
> will much more gain than you loose when going that route.

I should look at the bytecodes for a simple lexer to see what long vs 
int looks like.

>> Too bad I don't have #define or a typename I could use so the
>> actual type could be changed later.  Would be nice to see
>> LabelType instead of int.
>
> This would make a decision much easier, indeed. And neither generics 
> nor wrapper classes would help, so even in the
> future there seem to be no alternatives.

Oh well.  Every language has it's issues. ;)

PS: The level 1 (just lex/parse, no trees etc...) code generator for 
ANTLR 3.0 is going extremely well.  Literally every single character of 
output for the java code generator is in a template file.  The code 
generator has absolutely no idea what language it is generating.  Swap 
out the template file with another and you will generate a different 
language. :)  I'm probably going to do LISP next just to see how far I 
can stretch this concept. ;)

Ter
--
CS Professor & Grad Director, University of San Francisco
Creator, ANTLR Parser Generator, http://www.antlr.org
Cofounder, http://www.jguru.com
Cofounder, http://www.knowspam.net enjoy email again!
Cofounder, http://www.peerscope.com pure link sharing





 
Yahoo! Groups Links

<*> To visit your group on the web, go to:
     http://groups.yahoo.com/group/antlr-interest/

<*> To unsubscribe from this group, send an email to:
     antlr-interest-unsubscribe at yahoogroups.com

<*> Your use of Yahoo! Groups is subject to:
     http://docs.yahoo.com/info/terms/
 



More information about the antlr-interest mailing list