[antlr-interest] Problems with lexing tokens containing blanks
Bernd Vogt
Bernd.Vogt at Innovations.de
Thu Nov 30 00:41:27 PST 2006
Ok, some words about my project.
I'm going to build a simple kind of translator whose job is to detect
certain defined tokens in an input string and replace these tokens with a
given translation.
For this I have a config file that maps English tokens to their German
representations.
The content of the file can look like this:
index of = der Index von;
index = Index;
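A minimal sketch of reading such a mapping file, written in Python for brevity rather than for any ANTLR target. The helper name `parse_mapping` and the assumption that every entry has the form `source = target;` are made up for illustration and are not taken from the actual project:

```python
def parse_mapping(text):
    """Parse 'source = target;' entries into a dict (hypothetical helper)."""
    mapping = {}
    for entry in text.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # skip the empty chunk after the last ';'
        source, sep, target = entry.partition("=")
        if not sep:
            raise ValueError(f"missing '=' in entry: {entry!r}")
        mapping[source.strip()] = target.strip()
    return mapping


# The two sample entries from the config file shown above
CONFIG = """
index of = der Index von;
index = Index;
"""

mapping = parse_mapping(CONFIG)
```

Keys may contain blanks ("index of"), which is exactly what makes the lexing step below awkward.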
So my idea was to use the ANTLR 3 lexer to split an input string into the
given tokens and then replace each token with its foreign representation.
Here is an example lexer grammar:
lexer grammar SimpleLexer;
INDEX : 'index';
INDEX_OF : 'index of';
INT : '0' | '1'..'9' '0'..'9'*;
IDENT: ('a'..'z'|'A'..'Z'|'_'|'$') ('a'..'z'|'A'..'Z'|'_'|'0'..'9'|'$')*;
WS : ( ' ' | '\r' '\n' | '\n' | '\t' ){$channel=HIDDEN;};
Ok, for the input string "index of value1" everything works fine. The
lexer returns the expected token types: INDEX_OF WS IDENT, and I can
translate it properly into "der Index von value1".
But for the input string "index 4" the lexer throws an exception saying that
it expected 'o' instead of '4'. I am expecting the token types INDEX
and INT, so that I can translate it into "Index 4".
Internally, the generated lexer does something like this:
if ('i' 'n' 'd' 'e' 'x' ' ') -> INDEX_OF
else if ('i' 'n' 'd' 'e' 'x') -> INDEX
Hm, I think I need to teach the lexer to do something like this:
if ('i' 'n' 'd' 'e' 'x' ' ' 'o') -> INDEX_OF
else if ('i' 'n' 'd' 'e' 'x') -> INDEX
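The desired behaviour can be sketched outside ANTLR as a "longest match with fallback" loop: at each position every rule is tried in full, and the longest complete match wins, so a failed attempt at 'index of' simply falls back to 'index'. This is only an illustration in Python, not what the ANTLR 3 generated lexer does. The rule table mirrors the SimpleLexer grammar above, while the names `RULES`, `TRANSLATIONS`, `tokenize` and `translate` are assumptions made for this example:

```python
import re

# Token rules mirroring the SimpleLexer grammar; on a tie the earlier rule wins
RULES = [
    ("INDEX_OF", re.compile(r"index of")),
    ("INDEX", re.compile(r"index")),
    ("INT", re.compile(r"0|[1-9][0-9]*")),
    ("IDENT", re.compile(r"[A-Za-z_$][A-Za-z_0-9$]*")),
    ("WS", re.compile(r" |\r\n|\n|\t")),
]

# Translations taken from the sample config file
TRANSLATIONS = {"index of": "der Index von", "index": "Index"}


def tokenize(text):
    """Return (type, text) pairs, picking the longest complete match each time."""
    pos = 0
    tokens = []
    while pos < len(text):
        best = None  # (rule name, end offset) of the longest match so far
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            if m and (best is None or m.end() > best[1]):
                best = (name, m.end())
        if best is None:
            raise ValueError(f"no rule matches at position {pos}: {text[pos]!r}")
        name, end = best
        tokens.append((name, text[pos:end]))
        pos = end
    return tokens


def translate(text):
    """Replace every token that has a translation, keep all others as they are."""
    return "".join(TRANSLATIONS.get(tok, tok) for _, tok in tokenize(text))
```

With these rules `translate("index 4")` gives "Index 4" and `translate("index of value1")` gives "der Index von value1", because a rule only counts once it has matched completely.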
Regards
Bernd
------------------------------
Message: 3
Date: Wed, 29 Nov 2006 20:16:25 -0500
From: "Jim Idle" <jimi at intersystems.com>
Subject: Re: [antlr-interest] Problems with lexing tokens containing
blanks
To: <antlr-interest at antlr.org>
Message-ID: <20061130011626.116A31390 at mail.intersystems.com>
Content-Type: text/plain; charset="windows-1250"
I cannot immediately see why this is not working for you. However, unless
whitespace is typically significant in the language you are parsing, you
should not deal with this in the lexer. I suspect that if you
supply the full source it will be evident why this does not seem
to work; for example, you may have a lexer rule that captures whitespace
before the INDEX_OF definition, or something similar.
However, I think you are confusing lexical definitions with
disambiguation that the parser should be handling. For instance, what
would happen if the source code was:
index <tab><tab> of
You are well advised to think of the source input as you would any other
language. In English you would not tokenize "index of" as one semantic
element, and should not really do this in ANTLR (it isn't as clear cut
as this of course).
Is there any reason you cannot have:
WS : ( ' ' | '\t' ) {$channel=HIDDEN;};
INDEX : 'index' ;
OF : 'of' ;
And then have parser rules that "know" the difference? Remember that the
lexer is a simple beast whose only job is to just tokenize the input.
indexer:
    INDEX
    (
        OF somerule
      | somethingelse
    )
    ;
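As a rough illustration of this split, here is a Python sketch (not ANTLR output) in which the lexer only produces single-word tokens and drops whitespace, and a parser-level function decides whether INDEX is followed by OF. The function names, the keyword table and the hard-coded German strings are assumptions made for the example:

```python
import re

# One alternative per token class; whitespace runs of any length form one WS match
TOKEN_RE = re.compile(
    r"(?P<WS>[ \t\r\n]+)|(?P<INT>0|[1-9][0-9]*)|(?P<WORD>[A-Za-z_$][A-Za-z_0-9$]*)"
)
KEYWORDS = {"index": "INDEX", "of": "OF"}


def tokenize(text):
    """Return (type, text) pairs; whitespace never reaches the parser."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at position {pos}: {text[pos]!r}")
        pos = m.end()
        kind = m.lastgroup
        if kind == "WS":
            continue  # comparable to putting WS on the hidden channel
        value = m.group()
        if kind == "WORD":
            kind = KEYWORDS.get(value, "IDENT")
        tokens.append((kind, value))
    return tokens


def indexer(tokens):
    """Parser-level decision: INDEX OF ... versus a plain INDEX."""
    out = []
    i = 0
    while i < len(tokens):
        kind, value = tokens[i]
        if kind == "INDEX" and i + 1 < len(tokens) and tokens[i + 1][0] == "OF":
            out.append("der Index von")
            i += 2  # consume both INDEX and OF
        elif kind == "INDEX":
            out.append("Index")
            i += 1
        else:
            out.append(value)
            i += 1
    return " ".join(out)
```

Because whitespace is discarded before the decision is made, `index <tab><tab> of value1` and `index of value1` are handled identically. The price in this sketch is that the original spacing is not preserved in the output.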
Jim