[antlr-interest] Problems with lexing tokens containing blanks
Jim Idle
jimi at intersystems.com
Wed Nov 29 17:16:25 PST 2006
I cannot immediately see why this is not working for you. However, unless whitespace is significant in the language you are parsing, you should not deal with this in the lexer. I suspect that if you supply the full source the cause will be evident, for example a lexer rule that captures whitespace before the INDEX_OF definition, or something similar.
However, I think you are confusing lexical definitions with disambiguation that the parser should be handling. For instance, what would happen if the source code was:
index <tab><tab> of
You are well advised to think of the source input as you would any other language. In English you would not tokenize “index of” as one semantic element, and should not really do this in ANTLR (it isn’t as clear cut as this of course).
Is there any reason you cannot have:
WS : (' ' | '\t')+ {$channel=HIDDEN;} ;
INDEX : 'index' ;
OF : 'of' ;
And then have parser rules that ‘know’ the difference? Remember that the lexer is a simple beast whose only job is to just tokenize the input.
indexer
    : INDEX
      ( OF somerule
      | somethingelse
      )
    ;
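To make the division of labour concrete, here is a minimal hand-written sketch in Java. It is not ANTLR output; the class IndexOfSketch, the Type enum and both methods are hypothetical names made up for illustration. The tokenizer discards blanks and tabs (roughly what a hidden WS token achieves) and the "parser" step alone decides whether INDEX is followed by OF.

```java
import java.util.ArrayList;
import java.util.List;

public class IndexOfSketch {
    // Hypothetical token types mirroring the INDEX and OF rules, plus a catch-all ID.
    enum Type { INDEX, OF, ID }

    // Split on runs of blanks and tabs; the whitespace itself never becomes a token.
    static List<Type> tokenize(String input) {
        List<Type> tokens = new ArrayList<>();
        for (String word : input.trim().split("[ \t]+")) {
            if (word.isEmpty()) continue; // empty input yields one empty string
            switch (word) {
                case "index": tokens.add(Type.INDEX); break;
                case "of":    tokens.add(Type.OF);    break;
                default:      tokens.add(Type.ID);    break;
            }
        }
        return tokens;
    }

    // Parser-level decision: INDEX followed by OF is the "index of" form.
    static boolean startsWithIndexOf(List<Type> tokens) {
        return tokens.size() >= 2
            && tokens.get(0) == Type.INDEX
            && tokens.get(1) == Type.OF;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("index \t\t of x"));
        System.out.println(startsWithIndexOf(tokenize("index x")));
    }
}
```

With this split, "index of", "index  of" and "index <tab><tab> of" all produce the same token sequence, so the amount of whitespace between the two words stops mattering.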
Jim
_____
From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-bounces at antlr.org] On Behalf Of Ryan Hollom
Sent: Wednesday, November 29, 2006 1:15 PM
To: antlr-interest at antlr.org
Subject: Re: [antlr-interest] Problems with lexing tokens containing blanks
Terence,
Putting the INDEX_OF rule first doesn't seem to do the trick for me. For instance, the full lexer grammar:
lexer grammar testgrammarlexer;
INDEX_OF : 'index of' ;
INDEX : 'index' ;
NEWLINE : (('\r')? '\n' )+ ;
ID : ( 'A' .. 'Z' | '0' .. '9') ( 'A' .. 'Z' | 'a' .. 'z' | '0' .. '9')*;
WS : (' '|'\t')+ {$channel=HIDDEN;};
This still generates an mTokens section that checks for 'i' 'n' 'd' 'e' 'x' ' ', at which point it assumes the token is 'index of'. In detail, it generates this:
public void mTokens() throws RecognitionException {
    int alt5=5;
    switch ( input.LA(1) ) {
    case 'i':
        int LA5_1 = input.LA(2);
        if ( (LA5_1=='n') ) {
            int LA5_5 = input.LA(3);
            if ( (LA5_5=='d') ) {
                int LA5_6 = input.LA(4);
                if ( (LA5_6=='e') ) {
                    int LA5_7 = input.LA(5);
                    if ( (LA5_7=='x') ) {
                        int LA5_8 = input.LA(6);
                        if ( (LA5_8==' ') ) {
                            alt5=1; // <- INDEX_OF
                        }
                        else {
                            alt5=2;} // <- INDEX
                    }
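The effect of that decision can be imitated with a small standalone Java sketch. This is a simplification I wrote to illustrate the point, not code produced by ANTLR; the class PredictionSketch and both method names are hypothetical. predictEarly commits to INDEX_OF as soon as a blank follows "index", as the excerpt above does, while predictFull looks at all eight characters of "index of" before committing.

```java
public class PredictionSketch {
    // Mimics the quoted decision: a blank after "index" is taken as proof of INDEX_OF.
    static String predictEarly(String input) {
        if (input.startsWith("index")) {
            return input.startsWith("index ") ? "INDEX_OF" : "INDEX";
        }
        return "OTHER";
    }

    // Alternative: require the whole literal "index of" before choosing INDEX_OF.
    static String predictFull(String input) {
        if (input.startsWith("index of")) return "INDEX_OF";
        if (input.startsWith("index")) return "INDEX";
        return "OTHER";
    }

    public static void main(String[] args) {
        String input = "index foo";
        System.out.println(predictEarly(input)); // chooses INDEX_OF, which cannot match "index foo"
        System.out.println(predictFull(input));  // chooses INDEX, leaving " foo" for later rules
    }
}
```

On input such as "index foo" the early prediction picks INDEX_OF and the match would then fail at 'f', whereas the full check falls back to INDEX.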
I've run into this issue in other ways for my grammar, and even if putting INDEX_OF as the first rule did work, what if you're not directly creating a lexer rule for each multi word keyword (that is, just referencing the keywords in the parser rules like 'index of' and 'index')? Do all of the parser rules therefore need to be in the proper order to generate the correct lexer? Sometimes this is not possible, and likely not desired.
Do lexer predicates need to be used, or perhaps a fixed lookahead (of at least 7 in this case)?
Thanks,
-Ryan
Terence Parr <parrt at cs.usfca.edu>
Sent by: antlr-interest-bounces at antlr.org
11/29/2006 02:22 PM
To: ANTLR Interest <antlr-interest at antlr.org>
Subject: Re: [antlr-interest] Problems with lexing tokens containing blanks
On Nov 29, 2006, at 8:44 AM, Bernd Vogt wrote:
> Hi all,
>
> in my current project I have the requirement to lex some tokens
> like this:
>
> lexer grammar ExampleLexer;
> …
> INDEX : 'index' ;
> INDEX_OF : 'index of' ;
> INT : '0' | '1'..'9' '0'..'9'* ;
Hi, try putting
INDEX_OF : 'index of' ;
before INDEX.
Ter