[antlr-interest] Any thoughts of using java.util.Scanner (jdk5.x) for tokenizing?

Christopher Schultz christopher.d.schultz at comcast.net
Thu Jan 20 21:08:36 PST 2005


Hanasaki,

> http://java.sun.com/developer/JDCTechTips/2004/tt1201.html#1
> 
> Any thoughts of using java.util.Scanner (jdk5.x) for tokenizing?

One major problem with the new Scanner class is that it doesn't work 
well with heterogeneous tokens. ANTLR's scanner (tokenizer), as well as 
the tokenizers shipped with many other compiler compilers, works very 
well at recognizing tokens that are completely orthogonal.

You simply can't write an expression that returns tokens which sometimes 
look like "AN_IDENTIFIER" and sometimes look like "3.141592654288". 
Sure, you can split on whitespace, but that doesn't always work very well.
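
A minimal sketch of the whitespace problem, assuming a made-up input 
string and a hypothetical helper named splitOnWhitespace (neither comes 
from the article). It only uses Scanner's default whitespace delimiter:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class WhitespaceSplitDemo {
    // Hypothetical helper: collects whatever Scanner yields with its
    // default delimiter, which is whitespace.
    static List<String> splitOnWhitespace(String input) {
        List<String> tokens = new ArrayList<String>();
        Scanner scanner = new Scanner(input);
        while (scanner.hasNext()) {
            tokens.add(scanner.next());
        }
        scanner.close();
        return tokens;
    }

    public static void main(String[] args) {
        // "PI*3.14" has no whitespace inside it, so the identifier, the
        // operator and the number come back glued together as one chunk.
        System.out.println(splitOnWhitespace("area = PI*3.14"));
        // prints [area, =, PI*3.14]
    }
}
```

The last chunk would still need a second pass to separate the 
identifier, the operator and the number.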

The approach given in this article for handling heterogeneous tokens is 
to layer one Scanner on top of another. However, the base-level Scanner 
needs to generate very simple tokens, and then you have to layer 
successively smarter Scanners on top of it. I think that having a 
custom-generated tokenizer (a la ANTLR, lex/yacc, JavaCC, JLex/CUP, 
etc.) makes more sense than using a very generic Scanner class (which is 
essentially a regex used to split a String).
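
For contrast, here is a rough hand-written stand-in for what a dedicated 
tokenizer does: it walks the input once and classifies each token by its 
own rule. This is not ANTLR-generated code and not the article's 
approach; the class name TinyLexer, the three token kinds and the 
"KIND:text" output format are all assumptions made for illustration. It 
uses java.util.regex directly rather than java.util.Scanner:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TinyLexer {
    // One alternative per token kind (all hypothetical):
    //   group 1 = float literal, group 2 = identifier, group 3 = operator.
    // Leading whitespace is skipped before each token.
    private static final Pattern TOKEN = Pattern.compile(
        "\\s*(?:(\\d+\\.\\d+)|([A-Za-z_][A-Za-z_0-9]*)|([-+*/=()]))");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<String>();
        Matcher m = TOKEN.matcher(input);
        int pos = 0;
        while (pos < input.length()) {
            m.region(pos, input.length());
            if (!m.lookingAt()) {
                // Only trailing whitespace left: we are done.
                if (input.substring(pos).trim().length() == 0) {
                    break;
                }
                throw new IllegalArgumentException(
                    "Unexpected input at offset " + pos);
            }
            if (m.group(1) != null) {
                tokens.add("FLOAT:" + m.group(1));
            } else if (m.group(2) != null) {
                tokens.add("IDENT:" + m.group(2));
            } else {
                tokens.add("OP:" + m.group(3));
            }
            pos = m.end();
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("area = PI*3.14"));
        // prints [IDENT:area, OP:=, IDENT:PI, OP:*, FLOAT:3.14]
    }
}
```

A generated lexer does the same kind of per-token classification, but 
derives the rules from a grammar instead of a hand-maintained pattern.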

Probably a better reason not to use java.util.Scanner is that it would 
break compatibility: ANTLR would then require Java 1.5, whereas today it 
only requires Java 1.1.

-chris