[antlr-interest] Lookahead predicates in the Lexer?

Tue Nov 13 10:43:37 PST 2012

IIRC, the old-style "( A B )=>" predicate could define fixed, rule-based 
lookaheads.  I used this to define a modal lexer:

TAG_OPEN         :
         ( ( '<'  LETTER ) => '<' { lexMode = TAG; printState("-TOpen  "); }
         | ( '<' ~LETTER ) => '<' { lexMode = TEXT; $type=PCDATA; printState("BOpen   "); }
         ) ;
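
The decision that rule makes can be sketched in plain Java, roughly as 
follows.  This is an illustration only, not ANTLR-generated code: the 
names TagOpenLookahead, la and modeAfterOpen are hypothetical, and 
Character.isLetter is an assumed stand-in for the LETTER fragment, whose 
definition is not shown here.  The point is that two characters are 
examined while only the '<' would be consumed.

```java
public class TagOpenLookahead {
    enum Mode { TAG, TEXT }

    // 1-based lookahead from pos; returns -1 past the end of the input.
    static int la(String input, int pos, int offset) {
        int i = pos + offset - 1;
        return i < input.length() ? input.charAt(i) : -1;
    }

    // Chooses the mode for a '<' at pos by peeking at the character after it.
    // Nothing is consumed: pos is never advanced here.
    static Mode modeAfterOpen(String input, int pos) {
        if (la(input, pos, 1) != '<') {
            throw new IllegalArgumentException("expected '<' at " + pos);
        }
        int next = la(input, pos, 2);
        // Simplification: end of input after '<' is treated as TEXT,
        // whereas ~LETTER in the grammar would not match at end of input.
        return (next != -1 && Character.isLetter(next)) ? Mode.TAG : Mode.TEXT;
    }

    public static void main(String[] args) {
        System.out.println(modeAfterOpen("<div>", 0));
        System.out.println(modeAfterOpen("< 3", 0));
    }
}
```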

I was hoping that you had folded that functionality into the v4 
predicates.  Having the generated lexer do the scan ahead is always, or 
almost always, going to be more efficient than a hand-written scan -- at 
least Antlr can reuse the results of the scan it performs.  Any thought 
of adding this capability as an enhancement?
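
For comparison, the hand-written scan being weighed here might look 
something like the sketch below.  It is a minimal illustration under 
stated assumptions: the Lookahead interface is a hypothetical stand-in 
for whatever character stream the lexer exposes (1-based, -1 at end of 
input), and the names CommaLookahead, over and notFollowedByComma are 
made up for the example.  Skipping blanks before the comma is also an 
assumption about the input format.

```java
public class CommaLookahead {
    /** Hypothetical stand-in for a lexer's lookahead call: 1-based, -1 at end of input. */
    interface Lookahead {
        int LA(int i);
    }

    /** Wraps a String so the helper can be exercised without a lexer. */
    static Lookahead over(String text, int pos) {
        return i -> {
            int idx = pos + i - 1;
            return idx < text.length() ? text.charAt(idx) : -1;
        };
    }

    /** True when the next non-blank character is not a comma. Nothing is consumed. */
    static boolean notFollowedByComma(Lookahead in) {
        int i = 1;
        while (in.LA(i) == ' ' || in.LA(i) == '\t') {
            i++; // skip blanks between the identifier and a possible comma
        }
        return in.LA(i) != ',';
    }

    public static void main(String[] args) {
        System.out.println(notFollowedByComma(over("abc, def", 3)));
        System.out.println(notFollowedByComma(over("abc def", 3)));
    }
}
```

A helper of this shape could then be called from a native-code predicate 
such as { notFollowedByComma(...) }?, with the argument adapted to the 
lexer's own stream.  The exact runtime call is left open here and should 
be checked against the jar in use.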

BTW, the new ()*? operator is nice -- explicit and succinct.

On 11/13/2012 9:33 AM, Terence Parr wrote:
> predicates have always been native code, though, right?
> Ter
> On Nov 13, 2012, at 12:09 AM, Gerald Rosenberg wrote:
>
>> Well that was what I was hoping for.  Using the v4.0b3 jar, the Lexer rule
>>
>> fragment COMMA        : ','    ;
>> Identifier
>>     : LETTER ( LETTER | DIGIT | UNDERSCORE )* { ~COMMA }? -> popMode
>>     | LETTER ( LETTER | DIGIT | UNDERSCORE )*
>>     ;
>>
>> generates, in relevant part,
>>
>>     public void Identifier_action(RuleContext _localctx, int actionIndex) {
>>         switch (actionIndex) {
>>         case 0: popMode();  break;
>>         }
>>     }
>>     public boolean Identifier_sempred(RuleContext _localctx, int predIndex) {
>>         switch (predIndex) {
>>         case 5: return  ~COMMA ;
>>         }
>>         return true;
>>     }
>>
>> Switching from the fragment rule to a token rule
>>
>> Comma : COMMA ;
>> . . . . { ~Comma }? . . . .
>>
>> generates
>> . . . .
>>         case 5: return  ~Comma ;
>>
>> As if Antlr is only considering the content of the predicate to be a native code boolean expression.
>>
>>
>> On 11/12/2012 5:05 PM, Terence Parr wrote:
>>> That predicate should work.  If that predicate fails, then that rule will fail and the input will not be consumed for B.
>>> Ter
>>> On Nov 12, 2012, at 3:29 PM, Gerald Rosenberg wrote:
>>>
>>>> In Antlr4, is there a way to do a fixed lookahead in the lexer predicate
>>>> without capturing the lookahead token(s)?  In v3, predicates could be
>>>> used for this purpose.
>>>>
>>>> csvRule : A ( Comma B )* ;
>>>>
>>>> A : P Q R -> pushMode(Alphabet) ;
>>>>
>>>> mode Alphabet;
>>>> B : X Y Z { ~Comma }? -> popMode
>>>>     | X Y Z ;
>>>>
>>>> In v4, the "~Comma" is presumed to be native code.
>>>>
>>>> Basically, looking for a clean, workable way to not require the use of a
>>>> semicolon to explicitly terminate input that matches the csvRule, yet
>>>> have a representation in the lexer that can be used as the popMode trigger.
>>>>
>>>> I do realize that I can write a predicate method to do a stream scan,
>>>> but would prefer a non-native code solution if possible.  Also realize
>>>> that, in the simplest case, csvRule could be pushed down into the
>>>> Lexer.  Where A and B  are not just single terminals in the parser,
>>>> other rules would have to be pushed down also, making for a bit of a mess.
>>>>
>>>> Thanks,
>>>> Gerald
>>>>
>>
>