[antlr-interest] Lexer Rule Ordering, how to obtain a default token rule??

John B. Brodie jbb at acm.org
Tue Jun 20 09:19:05 PDT 2006


I *REALLY* dislike predicates, although they are essential in some situations.

I think even with a predicate you would still need to inspect the lookahead
character to see whether it is a delimiter (e.g. to make "/1a" a STRING, while
"/1 " is an N_PROXIMITY).

It is a failing of mine that I spend *WAY* too much time trying to get rid of
predicates, and not always with a good cost-benefit ratio ;-(

Anyway, how about this lexer without predicates?

(I assume that " / " yields a STRING (without the WS), that "/google", "g/g",
and "g*g/g/" are likewise all STRINGs, and that "/*", "**", and "a*b/c*" are
all PREFIXED_STRINGs.)

-------------------------
class LuceneLexer extends Lexer;

tokens {
    AND = "AND";
    STRING;
    PREFIXED_STRING;
    N_PROXIMITY;
}

STRING options{ testLiterals=true; } :
        ~( '/' | ' ' | '\t' | '\n' | '\r' )
        ( ~( ' ' | '\t' | '\n' | '\r' ) )*
        { if ((text.length() > 1) && (text.charAt(text.length()-1) == '*')) {
            $setType(PREFIXED_STRING);
            text.setLength(text.length() - 1);
          }
        }
    ;

N_PROXIMITY :
        ( '/' { $setType(STRING);} )
        ( ('0'..'9')+ { $setType(N_PROXIMITY); } )?

        ( ( /*empty*/ {/* need to strip leading '/' here */} )

        | ( /*NB: leading '/' should be kept on this path */
            ~( '0'..'9' | ' ' | '\t' | '\n' | '\r' ) { $setType(STRING); }
             ( ~( ' ' | '\t' | '\n' | '\r' ) )*
             { if(text.charAt(text.length()-1)=='*') {
                 $setType(PREFIXED_STRING);
                 text.setLength(text.length() - 1);
               }
             }
          )
        )
    ;

WS  : ( ' ' | ('\t' { tab(); }) ) { $setType(Token.SKIP); } ;
EOL : ( '\r' ( '\n' )? | '\n' ) { newline(); $setType(Token.SKIP); } ;
-------------------------
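
To sanity-check the intended token types without running ANTLR, the decisions
the two rules make can be mimicked in plain Java. This is only a rough sketch:
the class and method names are made up for illustration, it handles one
whitespace-free chunk at a time, it ignores the "AND" literal lookup, and it
assumes the leading '/' is stripped only when digits were actually matched (so
a lone "/" stays a STRING with its text intact).

```java
// Illustrative only: mimics the token typing of the grammar above for one
// whitespace-free chunk of input. Hypothetical helper, not ANTLR output.
final class LuceneTokenSketch {

    /** Returns "TYPE:text" for a single chunk containing no whitespace. */
    static String classify(String chunk) {
        if (chunk.isEmpty()) {
            throw new IllegalArgumentException("chunk must not be empty");
        }
        if (chunk.charAt(0) != '/') {
            // STRING rule: any first character other than '/'.
            return stringOrPrefixed(chunk);
        }
        // N_PROXIMITY rule: '/' followed by an optional run of '0'..'9'.
        int i = 1;
        while (i < chunk.length()
                && chunk.charAt(i) >= '0' && chunk.charAt(i) <= '9') {
            i++;
        }
        if (i == chunk.length()) {
            // Empty alternative: nothing follows the (possibly absent) digits.
            if (i == 1) {
                return "STRING:/"; // lone '/', type was never changed
            }
            // Assumption: strip the leading '/' only on this path.
            return "N_PROXIMITY:" + chunk.substring(1);
        }
        // Second alternative: a non-digit follows, so the '/' is kept.
        return stringOrPrefixed(chunk);
    }

    /** A trailing '*' on text longer than one character marks a prefix. */
    private static String stringOrPrefixed(String text) {
        if (text.length() > 1 && text.endsWith("*")) {
            return "PREFIXED_STRING:" + text.substring(0, text.length() - 1);
        }
        return "STRING:" + text;
    }
}
```

For example, classify("/1a") gives "STRING:/1a" while classify("/1") gives
"N_PROXIMITY:1", which is the distinction the lookahead check is there for.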

Hope this helps...
   -jbb

