[antlr-interest] Antlr lexer does not try other possible matches when it fails to match a token

Wed Dec 1 09:39:56 PST 2010

Greetings!

On Wed, 2010-12-01 at 16:09 +0100, COUJOULOU, Philippe wrote:
> Sorry, the code I posted was not correct. Here is the correct code I wanted to copy in my previous message (with xxxParameter parser rules referring to correct parameter value rule):
> 
> grammar test;
> 
> listOfParameters           :           parameterDef (CRLF parameterDef)* EOF;
> 
> parameterDef    : stationParameter|addressParameter|locationParameter|sessionParameter|providerParameter|codeParameter|dateParameter|timeParameter;
> 
> stationParameter           :           STATION SPACE stationName;
> stationName                  :           LETTER LETTER LETTER DIGIT;
> 
> addressParameter         :           ADDRESS SPACE address;
> address                                    :           (LETTER|DIGIT) (LETTER|DIGIT) (LETTER|DIGIT) (LETTER|DIGIT) (LETTER|DIGIT) (LETTER|DIGIT) (LETTER|DIGIT);
> 
> locationParameter         :           LOCATION SPACE location;
> location                        :           LETTER LETTER LETTER LETTER;
> 
> sessionParameter          :           SESSION SPACE session;
> session                         :           LETTER LETTER DIGIT DIGIT DIGIT DIGIT;
> 
> providerParameter         :           PROVIDER SPACE provider;
> provider                        :           LETTER LETTER LETTER;
> 
> codeParameter              :           CODE SPACE code;
> code                             :           LETTER (LETTER|DIGIT) (LETTER|DIGIT);
> 
> dateParameter               :           DATE SPACE stationName;

I assume that stationName here should be date.

> date                              :           DIGIT DIGIT DIGIT DIGIT DIGIT DIGIT;
> 
> timeParameter               :           TIME SPACE time;
> time                              :           (DIGIT DIGIT DIGIT DIGIT) | (DASH DASH DASH DASH) | (SPACE SPACE SPACE SPACE);
> 
> 
> STATION           :           'STATION';
> ADDRESS        :           'ADDRESS';
> LOCATION:       'LOCATION';
> SESSION          :           'SESSION';
> PROVIDER:      'PROVIDER';
> CODE   :           'CODE';
> DATE   :           'DATE';
> TIME     :           'TIME';
> LETTER            :           'a'..'z' | 'A'..'Z';
> DIGIT    :           '0'..'9';
> DASH   :           '-';
> SPACE :           ' ';
> CRLF : '\r'? '\n';
> 
> 

I do not think that ANTLR v3 lexers are able to backtrack in the fashion
you want, e.g. on seeing the ST in EST1, to issue two LETTER tokens
rather than insisting on finding a STATION token.
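
To make the failure concrete, here is a toy Java model of a lexer that
commits to a keyword as soon as its lookahead predicts one. It is not
ANTLR's generated code: the class name, the method name and the
two-character lookahead on "ST" are assumptions of mine, chosen only to
reproduce the token sequence Philippe reported.

```java
import java.util.ArrayList;
import java.util.List;

public class CommittingLexerDemo {

    // Toy model only: once the lookahead "ST" predicts the keyword, the
    // lexer commits to STATION and never retries the text as LETTER tokens.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (input.startsWith("ST", i)) {
                // Prediction step (assumed lookahead depth of two characters).
                if (input.startsWith("STATION", i)) {
                    tokens.add("STATION");
                    i += "STATION".length();
                } else {
                    // No backtracking: record the failure and stop.
                    tokens.add("ERROR@" + i);
                    return tokens;
                }
            } else if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')) {
                tokens.add("LETTER");
                i++;
            } else if (c >= '0' && c <= '9') {
                tokens.add("DIGIT");
                i++;
            } else if (c == ' ') {
                tokens.add("SPACE");
                i++;
            } else {
                tokens.add("ERROR@" + i);
                return tokens;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("STATION EST1")); // [STATION, SPACE, LETTER, ERROR@9]
        System.out.println(tokenize("STATION ABC2")); // [STATION, SPACE, LETTER, LETTER, LETTER, DIGIT]
    }
}
```

The first input stops at offset 9, the start of "ST1", because the
model never falls back to LETTER once it has chosen the keyword rule.
A real ANTLR lexer would, as far as I know, report the error and try to
resume rather than stop, but the tokens before the failure are the same
as in the original report.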

Three solutions come to mind, but perhaps none of them will be
appropriate for you, since they are based solely upon the fragment of
your grammar that you have posted and therefore might not work out for
the whole.

1) Do not use ANTLR. Just divide the line at the first SPACE, look up
the first portion in a keyword hashmap, and dispatch to the proper code
to handle the second portion.

2) Move each parameter to the Lexer and let the Lexer work out the
STATION vs E S T 1 issues. See the attached AllLexer.g for a tested
example of what I am suggesting here.

3) Move everything into the Parser, i.e. the Lexer just issues single
character tokens, and the Parser works out how to glue things together.
See the attached NoLexer.g for a tested example of what I am suggesting
here.
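
For solution 1, a minimal Java sketch might look like the following.
The class and method names are hypothetical, and the regular expressions
are my transcription of the value rules in the posted grammar (with the
dateParameter rule read as referring to date), so treat them as
assumptions to check against the full message format.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class KeywordDispatchDemo {

    // Keyword -> expected value format, transcribed from the parser rules
    // of the posted grammar (an assumption, not a complete specification).
    static final Map<String, Pattern> FORMATS = new HashMap<>();
    static {
        FORMATS.put("STATION",  Pattern.compile("[A-Za-z]{3}[0-9]"));
        FORMATS.put("ADDRESS",  Pattern.compile("[A-Za-z0-9]{7}"));
        FORMATS.put("LOCATION", Pattern.compile("[A-Za-z]{4}"));
        FORMATS.put("SESSION",  Pattern.compile("[A-Za-z]{2}[0-9]{4}"));
        FORMATS.put("PROVIDER", Pattern.compile("[A-Za-z]{3}"));
        FORMATS.put("CODE",     Pattern.compile("[A-Za-z][A-Za-z0-9]{2}"));
        FORMATS.put("DATE",     Pattern.compile("[0-9]{6}"));
        FORMATS.put("TIME",     Pattern.compile("[0-9]{4}|-{4}| {4}"));
    }

    // Split one line at the first space and check the value against the
    // format registered for its keyword.
    static boolean isValidLine(String line) {
        int sp = line.indexOf(' ');
        if (sp < 0) {
            return false;
        }
        String keyword = line.substring(0, sp);
        String value = line.substring(sp + 1);
        Pattern format = FORMATS.get(keyword);
        return format != null && format.matcher(value).matches();
    }

    // A message is a list of parameter lines separated by CR?LF.
    static boolean isValidMessage(String message) {
        for (String line : message.split("\r?\n", -1)) {
            if (!isValidLine(line)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValidLine("STATION EST1"));
        System.out.println(isValidMessage("STATION ABC2\nADDRESS 1234567"));
    }
}
```

Because the keyword is separated from the value before any matching
happens, a value such as EST1 or STA1 is never confused with the start
of the STATION keyword. In real code the map values would more likely be
handlers than patterns, so that each parameter can be stored as well as
checked.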

Hope this helps
   -jbb

> 
> From: COUJOULOU, Philippe
> Sent: 01 December 2010 16:04
> To: 'Nick Vlassopoulos'
> Cc: antlr-interest at antlr.org
> Subject: RE: [antlr-interest] Antlr lexer does not try other possible matches when it fails to match a token
> 
> 
> Thanks for your quick response Nick.
> 
> You are right, it works with the example but I am afraid this is not feasible with my complete grammar.
> 
>  If I do that for all my possible parameter values (the format depends on the preceding parameter name), I would have a lot of lexer rules to sort out and that would for sure be conflicting:
> 
> STATION_NAME           :           LETTER LETTER LETTER DIGIT;
> ADDRESS                    :           (LETTER|DIGIT) (LETTER|DIGIT) (LETTER|DIGIT) (LETTER|DIGIT) (LETTER|DIGIT) (LETTER|DIGIT) (LETTER|DIGIT);
> LOCATION                    :           LETTER LETTER LETTER LETTER;
> SESSION                      :           LETTER LETTER DIGIT DIGIT DIGIT DIGIT;
> PROVIDER                   :           LETTER LETTER LETTER;
> CODE                           :           LETTER (LETTER|DIGIT) (LETTER|DIGIT);
> DATE                           :           DIGIT DIGIT DIGIT DIGIT DIGIT DIGIT;
> TIME                             :           (DIGIT DIGIT DIGIT DIGIT) | (DASH DASH DASH DASH) | (SPACE SPACE SPACE SPACE);
> ...
> 
> 
> I think it is much preferable to have the lexer returning a sequence of DIGIT and LETTERS (except for param names), and to specify what is the expected sequence for a given parameter at parsing level. Something like that (but again this is an extract):
> 
> grammar test;
> 
> listOfParameters           :           parameterDef (CRLF parameterDef)* EOF;
> 
> parameterDef                : stationParameter|addressParameter|locationParameter|sessionParameter|providerParameter|codeParameter|dateParameter|timeParameter;
> 
> stationParameter           :           STATION SPACE stationName;
> stationName                  :           LETTER LETTER LETTER DIGIT;
> 
> addressParameter         :           ADDRESS SPACE stationName;
> adress                          :           (LETTER|DIGIT) (LETTER|DIGIT) (LETTER|DIGIT) (LETTER|DIGIT) (LETTER|DIGIT) (LETTER|DIGIT) (LETTER|DIGIT);
> 
> locationParameter         :           LOCATION SPACE stationName;
> location                        :           LETTER LETTER LETTER LETTER;
> 
> sessionParameter          :           SESSION SPACE stationName;
> session                         :           LETTER LETTER DIGIT DIGIT DIGIT DIGIT;
> 
> providerParameter         :           PROVIDER SPACE stationName;
> provider                        :           LETTER LETTER LETTER;
> 
> codeParameter              :           CODE SPACE stationName;
> code                             :           LETTER (LETTER|DIGIT) (LETTER|DIGIT);
> 
> dateParameter               :           DATE SPACE stationName;
> date                              :           DIGIT DIGIT DIGIT DIGIT DIGIT DIGIT;
> 
> timeParameter               :           TIME SPACE stationName;
> time                              :           (DIGIT DIGIT DIGIT DIGIT) | (DASH DASH DASH DASH) | (SPACE SPACE SPACE SPACE);
> 
> 
> STATION           :           'STATION';
> ADDRESS        :           'ADDRESS';
> LOCATION:       'LOCATION';
> SESSION          :           'SESSION';
> PROVIDER:      'PROVIDER';
> CODE   :           'CODE';
> DATE   :           'DATE';
> TIME     :           'TIME';
> LETTER            :           'a'..'z' | 'A'..'Z';
> DIGIT    :           '0'..'9';
> DASH   :           '-';
> SPACE :           ' ';
> CRLF : '\r'? '\n';
> 
> 
> 
> 
> 
> 
> From: Nick Vlassopoulos [mailto:nvlassopoulos at gmail.com]
> Sent: 01 December 2010 15:10
> To: COUJOULOU, Philippe
> Cc: antlr-interest at antlr.org
> Subject: Re: [antlr-interest] Antlr lexer does not try other possible matches when it fails to match a token
> 
> Hello Philippe,
> 
> Although I am not an expert, I think you should let the lexer sort out
> the "3 letters 1 digit" in the station name. Alternatively, you could probably
> add the station name as an identifier and check if it is in the correct format
> after parsing it.
> 
> Without being sure if it is a good solution, the following seems to work:
> 
> Best regards,
> 
> Nikos
> 
> -------------------------
> grammar Stations;
> 
> stationParameter         :
>             KEYWORD_STATION SPACE stationName;
> 
> stationName
>             :           STATION_NAME;
> 
> STATION_NAME
>             :           LETTER LETTER LETTER DIGIT;
> 
> KEYWORD_STATION        :           'STATION';
> LETTER                     :           'a'..'z' | 'A'..'Z';
> DIGIT             :           '0'..'9';
> SPACE                       :           ' ';
> -------------------------
> 
> 
> On Wed, Dec 1, 2010 at 2:18 PM, COUJOULOU, Philippe <philippe.coujoulou at airbus.com> wrote:
> Dear all,
> 
> I am trying to parse a message that contains parameters values like <PARAM_NAME> <VALUE>, for instance "STATION EST1".
> Here is a very simple extract of my grammar for one of these parameters (the one given in the above example):
> 
> grammar test;
> 
> KEYWORD_STATION :       'STATION';
> DIGIT    :        '0'..'9';
> LETTER  :        'a'..'z' | 'A'..'Z';
> SPACE   :       ' ';
> 
> stationParameter        :       KEYWORD_STATION SPACE stationName;
> stationName     :       LETTER LETTER LETTER DIGIT;
> 
> 
> The point is that when I try to parse my example message (STATION EST1), I get a MismatchedTokenException at the point where the parser attempts to read the last "ST1". After some analysis, I understood that the lexer generated the tokens KEYWORD_STATION SPACE LETTER for the string "STATION E" and then attempted to match the remaining "ST1" with KEYWORD_STATION but failed to complete it.
> 
> At this point, I would expect the lexer to backtrack to the beginning of 'ST1' and then match it with LETTER LETTER DIGIT, but it doesn't.
> 
> I have tried various combinations of the "backtrack", "memoize" and "k" options without any success. I must have missed something. (Should it help, I use ANTLRWorks 1.4.)
> 
> Could you please tell me how to make the lexer backtrack and try other alternatives when a keyword of my language is not exactly matched?
> 
> Thanks in advance for your help.
> 
> Best Regards,
> 
> Philippe Coujoulou.

-------------- next part --------------
grammar AllLexer;

options {
   output = AST;
   ASTLabelType = CommonTree;
}

@members {
   // test inputs: each string is one complete message
   private static final String [] x = new String[] {
      "STATION ABC2\nADDRESS 1234567","STATION STA1"
   };

   public static void main(String [] args) {
      for( int i = 0; i < x.length; ++i ) {
         try {
            System.out.println("about to parse:`"+x[i]+"`");
            AllLexerLexer lexer =
               new AllLexerLexer(new ANTLRStringStream(x[i]));
            CommonTokenStream tokens = new CommonTokenStream(lexer);
            System.out.println("tokens:"+tokens.toString());

            AllLexerParser parser = new AllLexerParser(tokens);
            AllLexerParser.listOfParameters_return p_result =
               parser.listOfParameters();

            CommonTree ast = p_result.tree;
            if( ast == null ) {
               System.out.println("resultant tree: is NULL");
            } else {
               System.out.println("resultant tree: " + ast.toStringTree());
            }
            System.out.println();
         } catch(Exception e) {
            e.printStackTrace();
         }
      }
   }
}

listOfParameters : parameterDef (CRLF parameterDef)* EOF;

parameterDef
   : StationParameter
   | AddressParameter
   | LocationParameter
   | SessionParameter
   | ProviderParameter
   | CodeParameter
   | DateParameter
   | TimeParameter;

StationParameter : STATION SPACE StationName;
fragment StationName : LETTER LETTER LETTER DIGIT;

AddressParameter : ADDRESS SPACE Address;
fragment Address : L_or_D L_or_D L_or_D L_or_D L_or_D L_or_D L_or_D;

LocationParameter : LOCATION SPACE Location;
fragment Location : LETTER LETTER LETTER LETTER;

SessionParameter : SESSION SPACE Session;
fragment Session : LETTER LETTER DIGIT DIGIT DIGIT DIGIT;

ProviderParameter : PROVIDER SPACE Provider;
fragment Provider : LETTER LETTER LETTER;

CodeParameter : CODE SPACE Code;
fragment Code : LETTER L_or_D L_or_D;

DateParameter : DATE SPACE Date;
fragment Date : DIGIT DIGIT DIGIT DIGIT DIGIT DIGIT;

TimeParameter : TIME SPACE Time;
fragment Time : (DIGIT DIGIT DIGIT DIGIT) | (DASH DASH DASH DASH) | (SPACE SPACE SPACE SPACE);

fragment L_or_D  : LETTER | DIGIT ;

fragment STATION : 'STATION';
fragment ADDRESS : 'ADDRESS';
fragment LOCATION: 'LOCATION';
fragment SESSION : 'SESSION';
fragment PROVIDER: 'PROVIDER';
fragment CODE    : 'CODE';
fragment DATE    : 'DATE';
fragment TIME    : 'TIME';
fragment LETTER  : 'a'..'z' | 'A'..'Z';
fragment DIGIT   : '0'..'9';
fragment DASH    : '-';
fragment SPACE   : ' ';

CRLF : '\r'? '\n';

-------------- next part --------------
grammar NoLexer;

options {
   output = AST;
   ASTLabelType = CommonTree;
}

@members {
   // test inputs: each string is one complete message
   private static final String [] x = new String[] {
      "STATION ABC2\nADDRESS 1234567","STATION STA1"
   };

   public static void main(String [] args) {
      for( int i = 0; i < x.length; ++i ) {
         try {
            System.out.println("about to parse:`"+x[i]+"`");
            NoLexerLexer lexer =
               new NoLexerLexer(new ANTLRStringStream(x[i]));
            CommonTokenStream tokens = new CommonTokenStream(lexer);
            System.out.println("tokens:"+tokens.toString());

            NoLexerParser parser = new NoLexerParser(tokens);
            NoLexerParser.listOfParameters_return p_result =
               parser.listOfParameters();

            CommonTree ast = p_result.tree;
            if( ast == null ) {
               System.out.println("resultant tree: is NULL");
            } else {
               System.out.println("resultant tree: " + ast.toStringTree());
            }
            System.out.println();
         } catch(Exception e) {
            e.printStackTrace();
         }
      }
   }
}

listOfParameters : parameterDef (CRLF parameterDef)* EOF;

parameterDef
   : stationParameter
   | addressParameter
   | locationParameter
   | sessionParameter
   | providerParameter
   | codeParameter
   | dateParameter
   | timeParameter;

stationParameter : stationKeyword SPACE stationName;
stationName : letter letter letter digit;

addressParameter : addressKeyword SPACE addressValue;
addressValue : l_or_d l_or_d l_or_d l_or_d l_or_d l_or_d l_or_d;

locationParameter : locationKeyword SPACE locationValue;
locationValue : letter letter letter letter;

sessionParameter : sessionKeyword SPACE sessionValue;
sessionValue : letter letter digit digit digit digit;

providerParameter : providerKeyword SPACE providerValue;
providerValue : letter letter letter;

codeParameter : codeKeyword SPACE codeValue;
codeValue : letter l_or_d l_or_d;

dateParameter : dateKeyword SPACE dateValue;
dateValue : digit digit digit digit digit digit;

timeParameter : timeKeyword SPACE timeValue;
timeValue : (digit digit digit digit) | (DASH DASH DASH DASH) | (SPACE SPACE SPACE SPACE);

l_or_d  : letter | digit ;

digit   : ZERO | ONE | TWO | THREE | FOUR | FIVE | SIX | SEVEN | EIGHT | NINE ;

letter  : A|B|C|D|E|F|G|H|I|J|K|L|M|N|O|P|Q|R|S|T|U|V|W|X|Y|Z ;

stationKeyword : S T A T I O N; //NB: accepts "StaTiON" as a keyword
addressKeyword : A D D R E S S;
locationKeyword: L O C A T I O N;
sessionKeyword : S E S S I O N;
providerKeyword: P R O V I D E R;
codeKeyword    : C O D E;
dateKeyword    : D A T E;
timeKeyword    : T I M E;

// the following makes the lexer essentially case insensitive.

// if case is important, replace the A rule with 2 rules A_Upper:'A';
// and A_Lower:'a'; and appropriately update the parser. repeat for
// all the alphabet.

A : 'a' | 'A' ;
B : 'b' | 'B' ;
C : 'c' | 'C' ;
D : 'd' | 'D' ;
E : 'e' | 'E' ;
F : 'f' | 'F' ;
G : 'g' | 'G' ;
H : 'h' | 'H' ;
I : 'i' | 'I' ;
J : 'j' | 'J' ;
K : 'k' | 'K' ;
L : 'l' | 'L' ;
M : 'm' | 'M' ;
N : 'n' | 'N' ;
O : 'o' | 'O' ;
P : 'p' | 'P' ;
Q : 'q' | 'Q' ;
R : 'r' | 'R' ;
S : 's' | 'S' ;
T : 't' | 'T' ;
U : 'u' | 'U' ;
V : 'v' | 'V' ;
W : 'w' | 'W' ;
X : 'x' | 'X' ;
Y : 'y' | 'Y' ;
Z : 'z' | 'Z' ;

ZERO  : '0' ;
ONE   : '1' ;
TWO   : '2' ;
THREE : '3' ;
FOUR  : '4' ;
FIVE  : '5' ;
SIX   : '6' ;
SEVEN : '7' ;
EIGHT : '8' ;
NINE  : '9' ;

DASH    : '-';
SPACE   : ' ';

CRLF : '\r'? '\n';