[antlr-interest] Problems with lexing tokens containing blanks

Jim Idle jimi at intersystems.com
Wed Nov 29 17:16:25 PST 2006


I cannot immediately see why this is not working for you. I suspect that if you supply the full source the cause will be evident, for example a lexer rule that captures whitespace before the INDEX_OF definition, or something similar. That said, unless whitespace is significant in the language you are parsing, you should not deal with this in the lexer.
 
However, I think you are confusing lexical definitions with disambiguation that the parser should be handling. For instance, what would happen if the source code was:
 
index         <tab><tab>      of
 
 
You are well advised to think of the source input as you would any other language. In English you would not tokenize “index of” as one semantic element, and should not really do this in ANTLR (it isn’t as clear cut as this of course). 
 
Is there any reason you cannot have:
 
WS        : (' ' | '\t')+ {$channel=HIDDEN;} ;
INDEX     : 'index' ;
OF        : 'of' ;
 
And then have parser rules that ‘know’ the difference? Remember that the lexer is a simple beast whose only job is to just tokenize the input.
 
indexer
     : INDEX
          ( OF somerule
          | somethingelse
          )
     ;
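
To make that split concrete, here is a minimal hand-written Java sketch, not ANTLR output. It assumes a toy lexer that throws whitespace away and emits INDEX and OF as separate tokens, plus a check that mirrors the indexer rule. The names IndexOfSketch, tokenize and classify are made up for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is hand-written, not ANTLR output.
class IndexOfSketch {

    // Split the input into word tokens, discarding spaces and tabs
    // (the role the hidden-channel WS rule plays in the grammar).
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == ' ' || c == '\t') {
                i++; // whitespace never reaches the parser
                continue;
            }
            int start = i;
            while (i < input.length()
                    && input.charAt(i) != ' ' && input.charAt(i) != '\t') {
                i++;
            }
            String word = input.substring(start, i);
            if (word.equals("index")) tokens.add("INDEX");
            else if (word.equals("of")) tokens.add("OF");
            else tokens.add("ID(" + word + ")");
        }
        return tokens;
    }

    // Mirrors the indexer rule: INDEX ( OF somerule | somethingelse ).
    static String classify(List<String> tokens) {
        if (tokens.isEmpty() || !tokens.get(0).equals("INDEX")) return "not an indexer";
        if (tokens.size() > 1 && tokens.get(1).equals("OF")) return "index-of form";
        return "plain index form";
    }

    public static void main(String[] args) {
        System.out.println(classify(tokenize("index of x")));
        System.out.println(classify(tokenize("index \t\t of x")));
        System.out.println(classify(tokenize("index x")));
    }
}
```

Because the whitespace is gone before the parser-level check runs, "index of" and "index <tab><tab> of" produce the same token sequence, which is the point of keeping the two keywords as separate tokens.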
 
Jim
 
   _____  

From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-bounces at antlr.org] On Behalf Of Ryan Hollom
Sent: Wednesday, November 29, 2006 1:15 PM
To: antlr-interest at antlr.org
Subject: Re: [antlr-interest] Problems with lexing tokens containing blanks
 

Terence, 
Putting the INDEX_OF rule first doesn't seem to do the trick for me.  For instance, the full lexer grammar: 

lexer grammar testgrammarlexer; 

INDEX_OF :        'index of' ; 
INDEX         :        'index' ; 

NEWLINE :   (('\r')? '\n' )+ ; 
ID        : ( 'A' .. 'Z' | '0' .. '9') ( 'A' .. 'Z' | 'a' .. 'z' | '0' .. '9')*; 
WS        :        (' '|'\t')+ {$channel=HIDDEN;}; 

This still generates an mTokens method that checks for 'i' 'n' 'd' 'e' 'x' ' ', at which point it assumes the token is 'index of'.  In detail, it generates this: 
    public void mTokens() throws RecognitionException { 
        int alt5=5; 
        switch ( input.LA(1) ) { 
        case 'i': 
            int LA5_1 = input.LA(2); 
            if ( (LA5_1=='n') ) { 
                int LA5_5 = input.LA(3); 
                if ( (LA5_5=='d') ) { 
                    int LA5_6 = input.LA(4); 
                    if ( (LA5_6=='e') ) { 
                        int LA5_7 = input.LA(5); 
                        if ( (LA5_7=='x') ) { 
                            int LA5_8 = input.LA(6); 
                            if ( (LA5_8==' ') ) { 
                                alt5=1; <- INDEX_OF 
                            } 
                            else { 
                                alt5=2;} <- INDEX 
                        } 
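
The effect of that decision can be shown with a small hand-written Java approximation. This is a sketch under assumptions, not the code ANTLR generates: it only models the choice made on the sixth character, as in the excerpt above, and the names LookaheadSketch, predict and matchesIndexOf are hypothetical.

```java
// Hypothetical, hand-written approximation of the prediction logic quoted
// above; it is not the code ANTLR generates.
class LookaheadSketch {

    // Returns the token type the decision would commit to for input
    // starting at position 0, based only on the first six characters.
    static String predict(String input) {
        if (!input.startsWith("index")) return "OTHER";
        // The sixth character (LA(6) in the quoted code) picks the alternative.
        boolean spaceFollows = input.length() > 5 && input.charAt(5) == ' ';
        return spaceFollows ? "INDEX_OF" : "INDEX";
    }

    // True if the literal 'index of' is really present at the start.
    static boolean matchesIndexOf(String input) {
        return input.startsWith("index of");
    }

    public static void main(String[] args) {
        for (String s : new String[] {"index of x", "index x", "index"}) {
            System.out.println("\"" + s + "\" -> predicted " + predict(s)
                    + ", literal 'index of' present: " + matchesIndexOf(s));
        }
    }
}
```

For an input such as "index x" the sketch predicts INDEX_OF although the literal 'index of' is not there, which is the mismatch described above.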

I've run into this issue in other ways in my grammar. Even if putting INDEX_OF as the first rule did work, what if you're not directly creating a lexer rule for each multi-word keyword (that is, you just reference the keywords in the parser rules, like 'index of' and 'index')?  Do all of the parser rules then need to be in the proper order to generate the correct lexer?  Sometimes this is not possible, and likely not desired. 

Do lexer predicates need to be used, or perhaps a fixed lookahead (of at least 7 in this case)? 

Thanks, 
-Ryan 




Terence Parr <parrt at cs.usfca.edu> 
Sent by: antlr-interest-bounces at antlr.org 
11/29/2006 02:22 PM 

To: ANTLR Interest <antlr-interest at antlr.org> 
Subject: Re: [antlr-interest] Problems with lexing tokens containing blanks

On Nov 29, 2006, at 8:44 AM, Bernd Vogt wrote:

> Hi all,
>
> in my current project I have the requirement to lex some tokens  
> like this:
>
> lexer grammar ExampleLexer;
> INDEX : 'index' ;
> INDEX_OF : 'index of' ;
> INT : '0' | '1'..'9' '0'..'9'* ;

Hi, try putting

INDEX_OF : 'index of' ;

before INDEX.

Ter



