[antlr-interest] lexing: case insensitivity

Mon Aug 20 01:09:20 PDT 2007

In ANTLR is it possible to manage the file stream *before* the lexer
handles it?

I'd like to show you what I did with my original hand-written lexer and
offer as a solution of sorts, though I don't know if it can be
implemented
directly in ANTLR.

I am replicating a dialect of the BASIC language that is case
insensitve.  

I have read the article on

http://www.antlr.org/pipermail/antlr-interest/2007-January/019008.h
tml

For me I need a different way of handling the input stream.  I
originally
hand-wrote a lexer for my dialect that contains hundreds of keywords.
The way I dealt with case-insensitivity was to make a first pass on the
input stream, removing all of the comments by remapping the characters
to spaces so I wouldn't loose any file position info AND uppercasing
anything
that was not in quotes.  It was not really necessary to remove the 
comments but it made the logic for the real lexer much simpler since I
didn't 
need to keep the comments for any reason because I'm not doing source to

source translation.  How I uppercase the input stream was to setup an
array, 
uppercase it and use it to remap for all the characters in the input
stream.
Done to satisfy an ASCII character set but I don't know of any reason
why
the same could not be done for unicode, ( never tried it ).

WARNING: pseudocode mixed with real code.  Also my Java is a bit crude.

pseudo mixed because I don't recall how to do this in Java off the top
of my head as this was originally done in BASIC and a bit simpler.

        // deal with ASCII  but COULD use unicode set
        static final int  _charSetSize = 256;// ASCII but could be
unicode
        String asciiStr = null;
        char asciiMap[] = new char [ _charSetSize ];
        char charInInputStream, charPos;
        for ( charPos = 0; charPos < _charSetSize; charPos++ )
            asciiStr = asciiStr + charPos;

        asciiStr = asciiStr.toUpperCase(); // change to uppercase

       // remap the initial array so there is no method call overhead
       // when it is time to remap the input array
        for ( charPos = 0; charPos < _charSetSize; charPos++ )
            asciiMap[charPos] = asciiStr.charAt(charPos);

        // replace all characters not inside a string
        // replace characters don't bother deciding if it needs to be
uppercased
        // since the branch decision causes more overhead than its worth
        while ( charInInputStream != EOF ) {
            if ( characterNotInsideString == true ) {
                charInInputStream = streamArray[charPos];
                streamArray[charPos] = arrayMap[charInInputStream] ;
            }
        }

Because lexing takes 25% or more of a compilers times I couldn't
repeatedly
invoke a procedure that would likely double my overhead but went for a 
remapping the input stream.  I went character by character because, one 
I have no choice and two in a lexer I have to touch each character
anyway.  
Initially treating input stream as an array of characters speeds up
lexing 
**tremendously** by getting rid of ALL of the overhead associated with 
the toUpperCase method call and allowing you to make pre-processing 
decisions along the way.

I thought I would throw that out there for anyone who needs an idea for
case insensitivity.

Now the question remains:  
In ANTLR Is it possible to manage the file stream *before* the lexer
handles it?

W.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.antlr.org/pipermail/antlr-interest/attachments/20070820/0232ec1e/attachment.html