[antlr-interest] Semantic Predicates in a Lexer

"Paul Bouché (NSN)" paul.bouche at nsn.com
Fri Mar 20 11:02:45 PDT 2009


Hi Jim,

thanks for your response.

Jim Idle schrieb:
> Paul Bouché (NSN) wrote:
>> Hi,
> Firstly, do not forget that you cannot set such a flag from the parser 
> as the lexer runs first and creates all the tokens.
I know, and I personally consider that a bug: if there were no token 
buffer, one could steer the lexer depending on parser state. This would 
make the lexer much more powerful. Ter stated in his book that we need 
tokens because this makes things easier and because this is also how 
humans analyse text, but humans don't buffer up all the words on a page; 
they read word by word and keep a short buffer of recently read tokens 
so as to make context-sensitive decisions ;-)

Sorry, I don't mean to come off as abrasive; I am glad for your reply!
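
To make the point about steering the lexer from parser state concrete, here is a minimal Java sketch of a hand-written lexer that scans one token per call, so a flag changed between two calls affects the next token. It is not ANTLR-generated code: the class `OnDemandLexer` and its string token format are hypothetical, and only the flag name `noColonInNames` and the token names are taken from the grammar excerpt quoted below.

```java
/** Hypothetical on-demand lexer for illustration; not ANTLR-generated code. */
class OnDemandLexer {
    /** Mode flag that a parser could flip between two nextToken() calls. */
    boolean noColonInNames = false;

    private final String input;
    private int pos = 0;

    OnDemandLexer(String input) {
        this.input = input;
    }

    private static boolean isLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    private static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    /** Scans exactly one token, consulting the flag at the moment of the call. */
    String nextToken() {
        while (pos < input.length() && input.charAt(pos) == ' ') {
            pos++; // skip blanks
        }
        if (pos >= input.length()) {
            return "EOF";
        }
        int start = pos;
        char first = input.charAt(pos);
        if (isDigit(first)) {
            while (pos < input.length() && isDigit(input.charAt(pos))) {
                pos++;
            }
            return "NUMBER(" + input.substring(start, pos) + ")";
        }
        if (noColonInNames && first == ':') {
            pos++;
            return "COLON(:)";
        }
        // Letters always belong to a name; colons only while the flag is off.
        while (pos < input.length()
                && (isLetter(input.charAt(pos))
                    || (!noColonInNames && input.charAt(pos) == ':'))) {
            pos++;
        }
        if (pos == start) {
            pos++; // unknown character: consume it so scanning always advances
            return "ERROR(" + first + ")";
        }
        String type = noColonInNames ? "SIMPLENAME" : "NAME";
        return type + "(" + input.substring(start, pos) + ")";
    }

    public static void main(String[] args) {
        OnDemandLexer lexer = new OnDemandLexer("a:b c:d");
        System.out.println(lexer.nextToken()); // scanned while the flag is false
        lexer.noColonInNames = true;           // the "parser" changes the mode
        String token;
        while (!(token = lexer.nextToken()).equals("EOF")) {
            System.out.println(token);
        }
    }
}
```

For the input `a:b c:d`, the first call returns `NAME(a:b)`; once the flag is set, the remaining calls return `SIMPLENAME(c)`, `COLON(:)` and `SIMPLENAME(d)`. A lexer that tokenizes the whole input before the parser starts cannot react to such a change.
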
>> Here is a lexer excerpt:
>> NUMBER : DIGIT_+;
>> SIMPLENAME: {noColonInNames}?=> LETTER_+;
>> COLON: {noColonInNames}?=> COLON_;
>> NAME: {!noColonInNames}?=> (LETTER_ | COLON_)+;
>> fragment DIGIT_: '0'..'9';
>> fragment LETTER_: 'a'..'z' | 'A'..'Z';
>>   
> Assuming that you can configure these flags in lexer context and are 
> not expecting them to be respected by the lexer if the parser sets 
> them, then you should be able to do this:
>
> grammar ttt;
>
> @lexer::members
> {
>     boolean noColonInNames = false;
> }
>
> test
>     : (SIMPLENAME | COLON | NAME)* EOF ;
>    
> fragment LETTER_
>     :    ('a'..'z'|'A'..'Z') ('a'..'z'|'A'..'Z')*
>     ;
>
> fragment COLON
>     :     ':'
>     ;
>    
> fragment SIMPLENAME
>     :   
>     ;
>    
> NAME
>     : {!noColonInNames}?=> (LETTER_ | COLON)+ { noColonInNames = true; }
>     |  LETTER_+ { $type = SIMPLENAME; }
>     | COLON { $type = COLON; }
>     ;
>
> However, I suspect that you will find it much easier to use predicates 
> in the parser, even if it is only the first one you come across that 
> should be NAME COLON NAME:
>
> grammar ttt;
>
> @members
> {
>     boolean noColonInNames = false;
> }
>
> test
>     : names* EOF ;
>    
> names
>     : {!noColonInNames}?=> name { System.out.println("Var is '" + $name.text + "'"); }
>     | {noColonInNames}?=> NAME (COLON NAME)*
>     ;
>    
> name
>     : ((NAME | COLON)=>(NAME | COLON))+
>     ;
>    
> fragment LETTER_
>     :    ('a'..'z'|'A'..'Z') ('a'..'z'|'A'..'Z')*
>     ;
>
> fragment COLON
>     :     ':'
>     ;
>    
> NAME
>     : LETTER_+
>     ;
I tried what you suggested; basically, you moved the gated semantic 
predicates into alternatives of the NAME rule. Here is what I tried:

grammar LexerPredicates;

@lexer::members
{
    boolean noColonInNames = true;
    boolean parsedSimpleName = false;
}
@members {
    boolean noColonInNames = false;
}

test
    : (simplename | colon | name)* EOF ;

name
    :    NAME
    ;

colon
    :    COLON
    ;

simplename
    :    SIMPLENAME
    ;

  
fragment DIGIT_
    :    '0'..'9'
    ;

NUMBER
    :
    {noColonInNames || !noColonInNames}?=>
    DIGIT_+
    ;  

fragment LETTER_
    :    'a'..'z'|'A'..'Z'
    ;

fragment COLON
    :     ':'
    ;
  
fragment SIMPLENAME
    :  
    ;

fragment LETTERORDIGIT_
    :   
    DIGIT_ | LETTER_
    ;
  
NAME
    : {!noColonInNames}?=> (LETTERORDIGIT_ | COLON)+
    | LETTERORDIGIT_+ { $type = SIMPLENAME; }  //A
    | COLON { $type = COLON; } { noColonInNames = false;} //B
    ;
   
WHITESPACE
        :
        ( ' ' | '\t' /*| NEWLINE_*/)+ // removed new line due to backward compatibility
        { skip(); }
        ;

If I try to compile this I get the following error:
internal error: 
org.antlr.analysis.NFAToDFAConverter.getPredicatesPerNonDeterministicAlt(NFAToDFAConverter.java:1603): 
no AST/token for nonepsilon target w/o predicate
which is better than no error and tells me that I need to cover the 
other alternatives within the NAME rule, marked with //A and //B. I did 
that, i.e.:
NAME
    : {!noColonInNames}?=> (LETTERORDIGIT_ | COLON)+
    | {noColonInNames }?=> LETTERORDIGIT_+ { $type = SIMPLENAME; }
    | {noColonInNames }?=> COLON { $type = COLON; } { noColonInNames = false;}
    ;

If I then set noColonInNames to true, then for the input
a          :     b    a    3    a3   3a
I get
SIMPLENAME COLON NAME NAME NAME NAME NAME
which is not what I want: I want the 3 to be recognized as a NUMBER, as 
it is without predicates. IMO either there is a bug or I simply do not 
understand why it does not work.

I solved this by combining your approach with generating several tokens 
when the NAME lexer rule matches, each with the desired type. The 
NAME COLON NAME solution did not work for me because I would have had to 
do major rework in our tree parser, which I wanted to avoid. Now it works 
nicely, as I wanted, but as praiseworthy as ANTLR is, it is also 
sometimes a real beast, or maybe I am the beast and just didn't get it :)
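
The code for the multi-token solution is not shown in this post, so the following is only a rough Java sketch of the general idea: one match queues several typed tokens, and `nextToken()` drains the queue before scanning further. Everything here is hypothetical (the class `MultiTokenSketch`, the whitespace pre-splitting, the string token format), and it deliberately ignores the flag reset after a COLON. In a generated ANTLR 3 lexer, a commonly described way to get this effect is to override `emit(Token)` and `nextToken()` in `@lexer::members` so that tokens pass through a similar queue.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical sketch: one matched chunk may yield several queued tokens. */
class MultiTokenSketch {
    /** Tokens produced by the most recent match but not yet handed out. */
    private final Deque<String> pending = new ArrayDeque<>();
    private final String[] chunks; // assumption: input is pre-split on whitespace
    private int next = 0;
    private final boolean noColonInNames;

    MultiTokenSketch(String input, boolean noColonInNames) {
        String trimmed = input.trim();
        this.chunks = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        this.noColonInNames = noColonInNames;
    }

    /**
     * Queues one or more tokens for a chunk. Chunks are assumed to contain
     * only letters, digits and colons; nothing else is validated here.
     */
    private void emitFor(String chunk) {
        if (chunk.matches("[0-9]+")) {
            pending.add("NUMBER(" + chunk + ")");
            return;
        }
        if (!noColonInNames) {
            pending.add("NAME(" + chunk + ")");
            return;
        }
        // Split the match into runs without colons and single colons.
        int i = 0;
        while (i < chunk.length()) {
            if (chunk.charAt(i) == ':') {
                pending.add("COLON(:)");
                i++;
            } else {
                int start = i;
                while (i < chunk.length() && chunk.charAt(i) != ':') {
                    i++;
                }
                pending.add("SIMPLENAME(" + chunk.substring(start, i) + ")");
            }
        }
    }

    /** Hands out queued tokens first and only then looks at the next chunk. */
    String nextToken() {
        while (pending.isEmpty()) {
            if (next >= chunks.length) {
                return "EOF";
            }
            emitFor(chunks[next++]);
        }
        return pending.poll();
    }

    /** Convenience helper: collects all tokens up to, but excluding, EOF. */
    static List<String> tokenize(String input, boolean noColonInNames) {
        MultiTokenSketch lexer = new MultiTokenSketch(input, noColonInNames);
        List<String> out = new ArrayList<>();
        for (String t = lexer.nextToken(); !t.equals("EOF"); t = lexer.nextToken()) {
            out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("a:b 3 a3", true));
        System.out.println(tokenize("a:b 3 a3", false));
    }
}
```

With the flag set, `a:b 3 a3` comes out as SIMPLENAME, COLON, SIMPLENAME, NUMBER, SIMPLENAME; with the flag cleared it comes out as NAME, NUMBER, NAME. The parser and tree parser see ordinary tokens either way.
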

Thx,
Paul