[antlr-interest] Lexer problem

Jim Idle jimi at temporal-wave.com
Tue Mar 11 11:54:20 PDT 2008


Actually I think that what is happening is that your call to the WS rule after POUND is setting the token HIDDEN. This is a side effect of a change meant to fix something else and is (probably ;-) a bug. We are talking about what to do about this at the moment; currently you can't change the token type by calling a fragment either. For now, change your call to WS to explicitly use (' ' | '\t')*, then your token won't be hidden.
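
As a minimal sketch of that workaround (untested, and assuming the rest of the grammar stays exactly as posted), only the optional WS call after POUND in TEST is replaced with an inline whitespace set:

TEST
    :   POUND (' ' | '\t')*        // was: POUND WS?
    (
        ('aaa') => 'aaa' WS DECIMAL_DIGIT       {$channel=HIDDEN;$type=DECIMAL_DIGIT;}
    |   ('bbb') => 'bbb' WS DECIMAL_DIGIT       {$channel=HIDDEN;$type=ID;}
    |   ID
    )
    ;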

 

Also note that your predicate will send 'aaa' down the first alt of your subrule even if there is no WS DECIMAL_DIGIT following it. Hence you will get a lexer mismatch error in some cases. You should try to cover all alternatives, even errors, so you can do something under your own control. Generally, though, it is best to leave ordered construction up to the parser if you can.
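
One way to cover the failing case could look like the sketch below. It is untested and rests on an assumption about the intended behaviour: that a '#' line which does not fit the full 'aaa' or 'bbb' pattern should simply fall through to the ID alternative. Each predicate looks ahead over the whole alternative rather than just the keyword, and an inline whitespace set is used in place of the WS calls:

TEST
    :   POUND (' ' | '\t')*
    (
        // predicate covers the keyword, the whitespace and the digit
        ('aaa' (' ' | '\t')+ DECIMAL_DIGIT) => 'aaa' (' ' | '\t')+ DECIMAL_DIGIT
            {$channel=HIDDEN;$type=DECIMAL_DIGIT;}
    |   ('bbb' (' ' | '\t')+ DECIMAL_DIGIT) => 'bbb' (' ' | '\t')+ DECIMAL_DIGIT
            {$channel=HIDDEN;$type=ID;}
    |   ID      // reached when neither full pattern is present, e.g. '# bbb' or '#hi'
    )
    ;

With predicates like these, '# bbb' followed by a newline should no longer commit to the second alternative, since the lookahead fails before any input is consumed.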

 

Jim

 

From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-bounces at antlr.org] On Behalf Of Brent Yates
Sent: Monday, March 10, 2008 8:51 PM
To: antlr-interest at antlr.org
Subject: [antlr-interest] Lexer problem

 

I need some help understanding syntactic predicates when used in the lexer.

Here is a simple grammar that will run in AntlrWorks:

grammar Simple;

options
    {
    language= Java;
    output=AST;
    }

start
    :   TEST
    ;

POUND   :   '#' ;
ID      :   'a'..'z'+ ;
fragment DECIMAL_DIGIT
    :   '0'..'9'
    ;

TEST
    :   POUND WS?
    (
        ('aaa') => 'aaa' WS DECIMAL_DIGIT       {$channel=HIDDEN;$type=DECIMAL_DIGIT;}
    |   ('bbb') => 'bbb' WS DECIMAL_DIGIT       {$channel=HIDDEN;$type=ID;}
    |   ID
    )
    ;

fragment SPACE_OR_TAB
    :  (' '|'\t')+
    ;

WS
    :   SPACE_OR_TAB+
        {$channel=HIDDEN;}
    ;

NEWLINE
    :   ('\r'? ('\u000C'|'\n') )
        {$channel=HIDDEN;}
    ;

When fed this input:

# aaa 4
# bbb
#hi

I would expect the following:

1) the '# aaa 4' matches alt1 in TEST and should be set to HIDDEN and type DECIMAL_DIGIT.  And that does happen.
2) the '# bbb<nl>#hi' does not match alt2, however it does match the predicate.  I would expect a lexer error.  What happens is that the token's channel is set to HIDDEN and the rule actually matches the ID and returns a type of TEST.  That I don't understand.

It looks like the actions of alt2 are being run even though only the predicate matches.  Also, if the predicate matches, why does the lexer later match alt3?

Thanks for your help,

Brent Yates
brent,yates at gmail.com




