[antlr-interest] Antlr lexer does not try other possible matches when it fails to match a token

Wed Dec 1 07:03:31 PST 2010

Thanks for your quick response Nick.

You are right, it works with the example but I am afraid this is not feasible with my complete grammar.

If I did that for all my possible parameter values (the format depends on the preceding parameter name), I would end up with a lot of lexer rules to sort out, and they would certainly conflict with one another:

STATION_NAME : LETTER LETTER LETTER DIGIT;
ADDRESS      : (LETTER|DIGIT) (LETTER|DIGIT) (LETTER|DIGIT) (LETTER|DIGIT) (LETTER|DIGIT) (LETTER|DIGIT) (LETTER|DIGIT);
LOCATION     : LETTER LETTER LETTER LETTER;
SESSION      : LETTER LETTER DIGIT DIGIT DIGIT DIGIT;
PROVIDER     : LETTER LETTER LETTER;
CODE         : LETTER (LETTER|DIGIT) (LETTER|DIGIT);
DATE         : DIGIT DIGIT DIGIT DIGIT DIGIT DIGIT;
TIME         : (DIGIT DIGIT DIGIT DIGIT) | (DASH DASH DASH DASH) | (SPACE SPACE SPACE SPACE);
...
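
To make the conflict concrete, here is a minimal Python sketch (not ANTLR) that transcribes the value formats above into regular expressions and lists every format a given string satisfies. The `FORMATS` table and the `matching_rules` helper are hypothetical names introduced only for this illustration, and LETTER and DIGIT are assumed to be ASCII only.

```python
import re

# Value formats transcribed by hand from the lexer rules above.
# Assumption: LETTER is [A-Za-z] and DIGIT is [0-9].
FORMATS = {
    "STATION_NAME": r"[A-Za-z]{3}[0-9]",
    "ADDRESS": r"[A-Za-z0-9]{7}",
    "LOCATION": r"[A-Za-z]{4}",
    "SESSION": r"[A-Za-z]{2}[0-9]{4}",
    "PROVIDER": r"[A-Za-z]{3}",
    "CODE": r"[A-Za-z][A-Za-z0-9]{2}",
    "DATE": r"[0-9]{6}",
    "TIME": r"[0-9]{4}|-{4}| {4}",
}


def matching_rules(text):
    """Return the names of all formats that match the whole of `text`."""
    return [name for name, pattern in FORMATS.items()
            if re.fullmatch(pattern, text)]
```

With this table, a three-letter value such as "ABC" satisfies both PROVIDER and CODE, and the keyword text "STATION" itself has the shape of an ADDRESS value, so format alone cannot tell the lexer which token to emit.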

I think it is much preferable to have the lexer return a sequence of DIGIT and LETTER tokens (except for parameter names), and to specify the expected sequence for a given parameter at the parser level. Something like this (but again, this is only an extract):

grammar test;

listOfParameters  : parameterDef (CRLF parameterDef)* EOF;

parameterDef      : stationParameter | addressParameter | locationParameter | sessionParameter
                  | providerParameter | codeParameter | dateParameter | timeParameter;

stationParameter  : STATION SPACE stationName;
stationName       : LETTER LETTER LETTER DIGIT;

addressParameter  : ADDRESS SPACE address;
address           : (LETTER|DIGIT) (LETTER|DIGIT) (LETTER|DIGIT) (LETTER|DIGIT) (LETTER|DIGIT) (LETTER|DIGIT) (LETTER|DIGIT);

locationParameter : LOCATION SPACE location;
location          : LETTER LETTER LETTER LETTER;

sessionParameter  : SESSION SPACE session;
session           : LETTER LETTER DIGIT DIGIT DIGIT DIGIT;

providerParameter : PROVIDER SPACE provider;
provider          : LETTER LETTER LETTER;

codeParameter     : CODE SPACE code;
code              : LETTER (LETTER|DIGIT) (LETTER|DIGIT);

dateParameter     : DATE SPACE date;
date              : DIGIT DIGIT DIGIT DIGIT DIGIT DIGIT;

timeParameter     : TIME SPACE time;
time              : (DIGIT DIGIT DIGIT DIGIT) | (DASH DASH DASH DASH) | (SPACE SPACE SPACE SPACE);

STATION  : 'STATION';
ADDRESS  : 'ADDRESS';
LOCATION : 'LOCATION';
SESSION  : 'SESSION';
PROVIDER : 'PROVIDER';
CODE     : 'CODE';
DATE     : 'DATE';
TIME     : 'TIME';
LETTER   : 'a'..'z' | 'A'..'Z';
DIGIT    : '0'..'9';
DASH     : '-';
SPACE    : ' ';
CRLF     : '\r'? '\n';
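
The idea behind the grammar above (the keyword decides which character sequence is expected next) can be sketched outside ANTLR as follows. This is only an illustration in Python under stated assumptions: the `EXPECTED` table, the shape notation ("L" for LETTER, "D" for DIGIT, "X" for either) and the helper names are all hypothetical, and TIME is left out for brevity because its value may consist of dashes or spaces.

```python
# Expected value shape per parameter keyword (hypothetical table).
# "L" = LETTER, "D" = DIGIT, "X" = LETTER or DIGIT.
EXPECTED = {
    "STATION": "LLLD",
    "ADDRESS": "XXXXXXX",
    "LOCATION": "LLLL",
    "SESSION": "LLDDDD",
    "PROVIDER": "LLL",
    "CODE": "LXX",
    "DATE": "DDDDDD",
}


def char_class(ch):
    """Classify one character the way the single-character lexer rules would."""
    if ch.isascii() and ch.isalpha():
        return "L"
    if ch.isascii() and ch.isdigit():
        return "D"
    return "?"


def value_matches(keyword, value):
    """Check `value` against the shape expected after `keyword`."""
    shape = EXPECTED.get(keyword)
    if shape is None or len(shape) != len(value):
        return False
    return all(want == got or (want == "X" and got in "LD")
               for want, got in zip(shape, map(char_class, value)))


def check_line(line):
    """Split '<PARAM_NAME> <VALUE>' at the first space and validate the value."""
    keyword, _, value = line.partition(" ")
    return value_matches(keyword, value)
```

For example, `check_line("STATION EST1")` accepts the message from the original question, while `check_line("STATION ESTA")` rejects it because the fourth character is not a digit.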

From: Nick Vlassopoulos [mailto:nvlassopoulos at gmail.com]
Sent: 01 December 2010 15:10
To: COUJOULOU, Philippe
Cc: antlr-interest at antlr.org
Subject: Re: [antlr-interest] Antlr lexer does not try other possible matches when it fails to match a token

Hello Philippe,

Although I am not an expert, I think you should let the lexer sort out
the "3 letters, 1 digit" format of the station name. Alternatively, you could
probably lex the station name as a generic identifier and check that it is in
the correct format after parsing it.
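
The second option (accept a generic identifier, then check its format afterwards) could look roughly like the Python sketch below. The function name and the pattern are assumptions made for illustration; in a real ANTLR grammar the check would live in a parser action or in the code that consumes the parse result.

```python
import re

# Assumed station name format: three ASCII letters followed by one digit.
STATION_NAME_FORMAT = re.compile(r"[A-Za-z]{3}[0-9]")


def is_valid_station_name(identifier):
    """Return True if an already-parsed identifier has the station name shape."""
    return STATION_NAME_FORMAT.fullmatch(identifier) is not None
```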

Without being sure if it is a good solution, the following seems to work:

Best regards,

Nikos

-------------------------
grammar Stations;

stationParameter         :
            KEYWORD_STATION SPACE stationName;

stationName
            :           STATION_NAME;

STATION_NAME
            :           LETTER LETTER LETTER DIGIT;

KEYWORD_STATION        :           'STATION';
LETTER                     :           'a'..'z' | 'A'..'Z';
DIGIT             :           '0'..'9';
SPACE                       :           ' ';
-------------------------

On Wed, Dec 1, 2010 at 2:18 PM, COUJOULOU, Philippe <philippe.coujoulou at airbus.com> wrote:
Dear all,

I am trying to parse a message that contains parameter values of the form <PARAM_NAME> <VALUE>, for instance "STATION EST1".
Here is a very simple extract of my grammar for one of these parameters (the one given in the above example):

grammar test;

KEYWORD_STATION :       'STATION';
DIGIT    :        '0'..'9';
LETTER  :        'a'..'z' | 'A'..'Z';
SPACE   :       ' ';

stationParameter        :       KEYWORD_STATION SPACE stationName;
stationName     :       LETTER LETTER LETTER DIGIT;

The point is that when I try to parse my example message (STATION EST1), I get a MismatchedTokenException at the point where the parser attempts to read the last "ST1". After some analysis, I understood that the lexer generated the tokens KEYWORD_STATION SPACE LETTER for the string "STATION E" and then attempted to match the remaining "ST1" against KEYWORD_STATION, but failed to complete the match.

At this point, I would expect the lexer to backtrack to the beginning of 'ST1' and then match it with LETTER LETTER DIGIT, but it doesn't.
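
The behavior expected here can be sketched with a small hand-written tokenizer in Python. This is not how the ANTLR-generated lexer works; it only illustrates the fallback being asked for: try the full keyword at the current position and, if it does not match completely, emit a single-character token instead. The `tokenize` function and its `keywords` parameter are hypothetical.

```python
def tokenize(text, keywords=("STATION",)):
    """Tokenize `text`, falling back to one-character tokens when no keyword fits."""
    tokens = []
    i = 0
    while i < len(text):
        for kw in keywords:
            # Only accept a keyword if it matches completely at position i.
            if text.startswith(kw, i):
                tokens.append(("KEYWORD_" + kw, kw))
                i += len(kw)
                break
        else:
            # No keyword matched: classify the single character instead.
            ch = text[i]
            if ch.isascii() and ch.isalpha():
                kind = "LETTER"
            elif ch.isascii() and ch.isdigit():
                kind = "DIGIT"
            elif ch == " ":
                kind = "SPACE"
            else:
                raise ValueError(f"unexpected character {ch!r} at offset {i}")
            tokens.append((kind, ch))
            i += 1
    return tokens
```

Applied to "STATION EST1", this yields KEYWORD_STATION, SPACE, LETTER, LETTER, LETTER, DIGIT, which is the token sequence that the stationParameter rule expects.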

I have tried various combinations of the "backtrack", "memoize" and "k" options without any success. I must have missed something. (In case it helps, I am using ANTLRWorks 1.4.)

Could you please tell me how to make the lexer backtrack and try other alternatives when a keyword of my language is not matched exactly?

Thanks in advance for your help.

Best Regards,

Philippe Coujoulou.

The information in this e-mail is confidential. The contents may not be disclosed or used by anyone other than the addressee. Access to this e-mail by anyone else is unauthorised.
If you are not the intended recipient, please notify Airbus immediately and delete this e-mail.
Airbus cannot accept any responsibility for the accuracy or completeness of this e-mail as it has been sent over public networks. If you have any concerns over the content of this message or its Accuracy or Integrity, please contact Airbus immediately.
All outgoing e-mails from Airbus are checked using regularly updated virus scanning software but you should take whatever measures you deem to be appropriate to ensure that this message and any attachments are virus free.

List: http://www.antlr.org/mailman/listinfo/antlr-interest
Unsubscribe: http://www.antlr.org/mailman/options/antlr-interest/your-email-address
