[antlr-interest] Problems with lexing tokens containing blanks

Thu Nov 30 00:41:27 PST 2006

OK, a few words about my project.

I'm going to build a kind of simple translator, whose job is to detect 
some defined tokens in an input string and replace these tokens with a 
given translation.

Therefore I have a config file that maps English tokens to their German 
representations.

The content of the file can look like this:
index of  = der Index von;
index = Index;
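
For illustration, a minimal sketch (in Python rather than ANTLR) of how a 
file in this format could be read into a lookup table. The helper name 
load_translations is hypothetical, and the sketch assumes every entry has 
the form "source = target;" with no ';' or '=' inside the source text.

```python
def load_translations(text):
    """Parse 'source = target;' entries into a dict (hypothetical helper)."""
    table = {}
    for entry in text.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # skip the empty piece after the final ';'
        source, target = entry.split("=", 1)
        # Collapse runs of blanks so 'index of  ' becomes 'index of'.
        table[" ".join(source.split())] = " ".join(target.split())
    return table


config = """index of  = der Index von;
index = Index;"""
translations = load_translations(config)
```

With the two entries above, translations maps 'index of' to 'der Index von' 
and 'index' to 'Index'.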

So my idea was to use the ANTLR 3 lexer to split an input string into the 
given tokens and then replace each token with its foreign representation.

Here is an example lexer grammar:

lexer grammar SimpleLexer;

INDEX : 'index';
INDEX_OF : 'index of';
INT : '0' | '1'..'9' '0'..'9'*;
IDENT: ('a'..'z'|'A'..'Z'|'_'|'$') ('a'..'z'|'A'..'Z'|'_'|'0'..'9'|'$')*;

WS : ( ' ' | '\r' '\n' | '\n' | '\t' ){$channel=HIDDEN;};

OK, for the input string "index of value1" everything works fine. The 
lexer returns the expected token types: INDEX_OF WS IDENT. And I can 
properly translate it into "der Index von value1".

But for the input string "index 4" the lexer throws an exception saying 
that it expects 'o' instead of '4'. I'm expecting the token types INDEX 
and INT, so that I can translate it into "Index 4".

Internally, the generated lexer does something like this:

if ('i' 'n' 'd' 'e' 'x' ' ') -> INDEX_OF
else if ('i' 'n' 'd' 'e' 'x') -> INDEX

Hm, I think I need to teach the lexer to do something like this:

if ('i' 'n' 'd' 'e' 'x' ' ' 'o') -> INDEX_OF
else if ('i' 'n' 'd' 'e' 'x') -> INDEX
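
The fallback described above can be sketched outside ANTLR. The following 
is a minimal Python illustration, not the code ANTLR generates: it relies 
on the regex engine trying 'index of' first and backing up to 'index' when 
the blank is not followed by 'of'. The names TOKEN_RE, tokenize and 
token_types are hypothetical, and the lookahead that keeps 'indexes' or 
'index offset' from being split as keywords is an added assumption that is 
not part of the grammar above.

```python
import re

# Ordered alternatives: 'index of' is tried before 'index'.
# The (?![...]) lookahead is an assumption added for this sketch so that
# a keyword is not recognised in front of further identifier characters.
TOKEN_RE = re.compile(r"""
    (?P<INDEX_OF>index\ of(?![A-Za-z0-9_$]))
  | (?P<INDEX>index(?![A-Za-z0-9_$]))
  | (?P<INT>0|[1-9][0-9]*)
  | (?P<IDENT>[A-Za-z_$][A-Za-z0-9_$]*)
  | (?P<WS>\ |\r\n|\n|\t)
""", re.VERBOSE)


def tokenize(text):
    """Yield (type, text) pairs; raise ValueError on an unmatched character."""
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        yield m.lastgroup, m.group()
        pos = m.end()


def token_types(text):
    """Return only the token type names, in input order."""
    return [kind for kind, _ in tokenize(text)]
```

In this sketch token_types("index 4") gives INDEX, WS, INT, while 
token_types("index of value1") gives INDEX_OF, WS, IDENT.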

Regards
Bernd

------------------------------

Message: 3
Date: Wed, 29 Nov 2006 20:16:25 -0500
From: "Jim Idle" <jimi at intersystems.com>
Subject: Re: [antlr-interest] Problems with lexing tokens containing
    blanks
To: <antlr-interest at antlr.org>
Message-ID: <20061130011626.116A31390 at mail.intersystems.com>
Content-Type: text/plain; charset="windows-1250"

I cannot immediately see why this is not working for you. However, unless 
whitespace is typically significant in the language you are parsing, you 
should not deal with it in the lexer. I suspect that if you supply the 
full source it will be evident why this does not seem to work, for 
example a lexer rule that captures whitespace before the INDEX_OF 
definition, or something similar.

However, I think you are confusing lexical definitions with 
disambiguation that the parser should be handling. For instance, what 
would happen if the source code was:

index         <tab><tab>      of

You are well advised to think of the source input as you would any other 
language. In English you would not tokenize "index of" as one semantic 
element, and should not really do this in ANTLR (it isn't as clear cut 
as this of course).

Is there any reason you cannot have:

WS        : (' ' | '\t') {$channel=HIDDEN;} ;
INDEX     : 'index' ;
OF        : 'of' ;

And then have parser rules that "know" the difference? Remember that the 
lexer is a simple beast whose only job is to just tokenize the input.

indexer
    : INDEX
      ( OF somerule
      | somethingelse
      )
    ;
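
To make the idea concrete for the translation use case, here is a rough 
Python sketch (not ANTLR output) that tokenizes 'index' and 'of' 
separately, drops whitespace as a hidden channel would, and only then 
decides whether the pair forms a phrase. The names PHRASES and translate 
are hypothetical, the table holds just the two entries from the original 
question, and the output joins tokens with single blanks instead of 
preserving the original spacing.

```python
import re

# Whitespace and non-whitespace runs; WS tokens are discarded below.
TOKEN_RE = re.compile(r"(?P<WS>[ \t]+)|(?P<WORD>[^ \t]+)")

# Hypothetical translation table keyed by token sequences.
PHRASES = {("index", "of"): "der Index von", ("index",): "Index"}


def translate(text):
    """Translate known one- and two-token phrases, pass the rest through."""
    # Keep only visible tokens, mirroring a hidden whitespace channel.
    words = [m.group() for m in TOKEN_RE.finditer(text)
             if m.lastgroup == "WORD"]
    out, i = [], 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        # Look one token ahead, as the parser rule does after INDEX.
        if len(pair) == 2 and pair in PHRASES:
            out.append(PHRASES[pair])
            i += 2
        elif (words[i],) in PHRASES:
            out.append(PHRASES[(words[i],)])
            i += 1
        else:
            out.append(words[i])  # untranslated tokens pass through
            i += 1
    return " ".join(out)
```

Because whitespace never reaches the phrase matching, an input such as 
"index \t\t of value1" is handled the same way as "index of value1".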

Jim