[antlr-interest] lexing expression ('a'..'z')+ not matching single character input

Wed Dec 13 09:53:16 PST 2006

Thanks to everyone for the replies.

At the risk of permanently offending the entire mailing list, I'll include 
the grammar file in its entirety (since I feel it's far more likely that I'm 
missing something than that there is a genuine bug in ANTLR).

The input text is a sequence format for carbohydrate structures; some 
samples follow. "Identifiers" in this grammar are always lower-case and 
follow the pattern "('a'..'z')+ ('-' ('a'..'z')+)*".
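
As a rough illustration of the identifier shape described above, here is a minimal sketch using java.util.regex. The class IdentifierShape and the method isValidName are hypothetical names invented for this example; they are not part of the grammar or of ANTLR. The point is only that a one-letter name such as "n" is a legal identifier under this pattern.

```java
import java.util.regex.Pattern;

final class IdentifierShape {
    // Hypothetical helper, not part of the posted grammar: one or more
    // lower-case runs separated by single hyphens.
    static final Pattern NAME = Pattern.compile("[a-z]+(-[a-z]+)*");

    static boolean isValidName(String s) {
        return NAME.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = { "dgro-dgal", "glycolyl", "n", "n-acetyl", "N-acetyl", "n-" };
        for (String s : samples) {
            System.out.println(s + " -> " + isValidName(s));
        }
    }
}
```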

These sequences parse without problems:

"RES 1b:a-dara-hex-2:5|2:keto;"

"RES 1b:a-dara-hex-1:5|2:d|6:d;"

"RES 1b:a-dglc-hex-1:5;2s:lactone; LIN 1:1d(3-6)2o;"

"RES 1b:a-dgro-dgal-non-2:6|1:a|2:keto|3:d;1s:glycolyl; LIN 1:1o(5-1)2d;"

These sequences have ids that contain single characters and fail to parse. 
I've included the error and a javac-style context marker '^'.

org.eurocarbdb.sugar.seq.SequenceFormatException: expecting a residue name/identifier, found 'n'

RES 1b:a-drib-hex-1:5|3:d;2s:n;3s:n;LIN 1:1d(2-1)2n;2:1d(6-1)3n;

                             ^

org.eurocarbdb.sugar.seq.SequenceFormatException: expecting a residue name/identifier, found 'n'

RES 1b:a-dgal-hex-1:5;2s:n;4s:thiol;LIN 1:1d(3-1)2n;2:1d(4-1)3n;

                         ^

org.eurocarbdb.sugar.seq.SequenceFormatException: expecting a residue name/identifier, found 'n'

RES 1b:a-dgro-dgal-non-2:6|1:a|2:keto|3:d;1s:n-glycolyl;LIN 1:1o(5-1)2d;

                                             ^

org.eurocarbdb.sugar.seq.SequenceFormatException: expecting a residue name/identifier, found 'n'

RES1b:b-dgal-hex-1:5;2s:n-acetyl;3b:a-lgal-hex-1:5|6:d;4b:b-dgal-hex-1:5;LIN1:1d(2-1)2n;2:1o(3-1)3d;3:1o(4-1)4d;

                        ^
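
For readers unfamiliar with the javac-style marker used above, here is a minimal sketch of how such a context line could be produced. The class ContextMarker and its methods are hypothetical, and the sketch assumes the error reports a 0-based column; the real SequenceFormatException may expose the position differently.

```java
final class ContextMarker {
    // Hypothetical helper: builds a line of spaces ending in '^' so that the
    // caret sits under the character at the given 0-based column.
    static String markerLine(int column) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < column; i++) {
            sb.append(' ');
        }
        return sb.append('^').toString();
    }

    // Returns the sequence followed by its marker line.
    static String withMarker(String sequence, int column) {
        return sequence + "\n" + markerLine(column);
    }

    public static void main(String[] args) {
        String seq = "RES 1b:a-drib-hex-1:5|3:d;2s:n;3s:n;LIN 1:1d(2-1)2n;2:1d(6-1)3n;";
        // Column 29 (0-based) is the first single-character substituent 'n'.
        System.out.println(withMarker(seq, 29));
    }
}
```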

The grammar:

/*    glycoct_grammar.g -- a grammar for carbohydrate sequences in GlycoCT syntax    */

header //  <-- this section appears at the top of the auto-generated parser
{   
package org.eurocarbdb.sugar.seq.grammar; 
}

/* class GlycoctParser *//*****************************************************
*
* This class defines an LLk parser based on ANTLR (http://antlr.org) syntax 
* rules for parsing carbohydrate sequences in GlycoCT syntax, according 
* to the syntax rules described (TODO: hassle ppl for a definitive syntax reference link). 
*
*
* This class' superclass provides the majority of 
* the semantic action code that is called from within this grammar. This
* is in order to keep the grammar as clear as possible and to facilitate
* re-targeting of this grammar to other languages than Java (at time of 
* writing ANTLR also supports C++, python, C#).
*
*
* Note that the source code for this class has been auto-generated by ANTLR.
*
*
* @see GlycoctLexer
* @see GlycoctParserAdaptor
* @see ParserAdaptor
* @see glycoct_grammar.g
*
* @author mjh [matt at ebi.ac.uk]
*/
class GlycoctParser extends Parser("org.eurocarbdb.sugar.seq.grammar.GlycoctParserAdaptor");

//~~~  ANTLR options  ~~~
options {
    k=2;                         /* lookahead */
    codeGenDebug=false;            /* a debugging setting */
    analyzerDebug=false;        /* a debugging setting */
    defaultErrorHandler=false;  /* needs to be false to propagate exceptions */
}

//~~~ start class init section ~~~  
//  this section is inserted directly into the top of the generated class
//  right after the class declaration. It can contain any valid (Java) code. 
{
    /* empty */
}
//~~~ end class init section ~~~  

//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ GRAMMAR ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
//                                              |                              |
//          grammar-specification               |      actions for grammar     |
//            written in antlr                  |        written in java       |
//                                              |                              |
//                                              |                              |

/**    Toplevel rule defining a sugar sequence.  */
sugar        
        :   res_section        /*  residues  */
            (lin_section)?     /*  linkages  */

//TODO:     (pro_section)?     /*  heterogeneity due to uncertainty  */
//TODO:     (rep_section)?     /*  repeats   */
//TODO: // STA section too incompletely defined grammatically - omitted
//TODO      sta_section        /*  heterogeneity due to statistical distribution, eg GAGs */              
            EOF
        ; 

//~~~  SECTIONS  ~~~//

res_section
        :   RES 
            (residue)+
        ;

lin_section
        :   LIN
            (linkage)+
        ;

pro_section
        :   PRO
            (linkage)+
        ;

rep_section
        :   REP
            (sugar)+
        ;

//~~~  RES SECTION  ~~~//

/** A numbered residue entry in the 'RES' section. */
residue
        :   INTEGER
            residue_specification
            SEMICOLON
        ;

/** A residue, which may be either a monosaccharide, a substituent, 
*   or one of the other types specified by GlycoCT (INCHI, freetext)
*/
residue_specification
        :   monosac_specification
        |   substit_specification
        //|   inchi_specification // TODO later
        ;

monosac_specification
        :
            "b"   // for "basetype" 
            COLON
            monosaccharide
        ;

substit_specification
        :
            "s"   // for "substituent"
            COLON
            substituent
        ;

/** A monosaccharide, in GlycoCT format */
monosaccharide
        :   a:anomer                            
            HYPHEN
            monosaccharide_name
            HYPHEN
            c:monosac_superclass
            HYPHEN
            monosac_ring_closure              
            (monosac_modifications)*
        ;

//  stem-type
monosaccharide_name
        :   n:IDENTIFIER                    
            ( 
                HYPHEN 
                x:IDENTIFIER                    {   n.setText( n.getText() + "-" + x.getText() );  } 
            )*                                  {   addResidue( createResidueToken( n ) );  }                                      
        ; 

// A substituent (ie: non-monosaccharide)
substituent
        :   n:IDENTIFIER                        
            (
                HYPHEN 
                x:IDENTIFIER                    {   n.setText( n.getText() + "-" + x.getText() );  }
            )*                                  {   addResidue( createResidueToken( n ) );  }
        ;

// Name/type given to the basic monosaccharide sans mods, eg: glc 
monosac_stemtype
        :
            //HYPHEN
            //stereo
            s:IDENTIFIER // have to check for stereo separately here
        ;

// Ring size/configuration, eg: hex 
monosac_superclass
        :   //IDENTIFIER
            "hex"       // hexose (6)
        |   "pen"       // pentose (5)
        |   "hept"      // heptose (7)
        |   "non"       // nonulose (9)
        ;

monosac_ring_closure
        :
            terminus_position                   
            COLON
            terminus_position                   
        ;

terminus_position
        :   
            t:INTEGER                          
        |   u:UNKNOWN                     
        ;

monosac_modifications
        :   
            PIPE
            INTEGER 
            ( COMMA INTEGER )? // syntax for alkenes is '|2,3:en' 
            COLON
            monosac_modification
        ;

monosac_modification
        :   
        (   "d"         //  deoxygenation  
        |   "keto"      //  a carbonyl group 
        |   "en"        //  double-bond       
        |   "enx"       //  double-bond?   
        |   "a"         //  acidic group    
        |   "aldi"      //  reduced C1 carbonyl
        |   "sp2"       //  outgoing linkage with double bond 
        |   "geminal"   //  2 OH at one backbone carbon 
        )
        ;

monosac_type_identifier
        :   
        (   "b"     //  a base type  
        |   "s"     //  a substituent 
        |   "n"     //  other chemically defined entity (freetext) 
        |   "i"     //  INCHI-encoded non-basetype, non-substituent 
        |   "r"     //  repeating unit 

//  ERROR in the specification: 's' is duplicated
//        |   "s"     //  statistical unit
        )
        ;

//~~~  LIN SECTION  ~~~//

linkage                                         {   Token nrtt, rtt;  }
        :
            //  linkage numbering 
            i:INTEGER
            COLON

            //  non-reducing residue id 
            nrti:INTEGER
            nrtt=linkage_type_identifier

            //  then the actual linkage             
            LPARENTHESIS
            lnrt:INTEGER
            HYPHEN
            lrt:INTEGER
            RPARENTHESIS

            //  then the reducing residue id
            rti:INTEGER
            rtt=linkage_type_identifier

            //  end of linkage
            SEMICOLON                           {   addLinkage( i, nrti, nrtt, lnrt, lrt, rti, rtt );  }
        ;

/* inlined to avoid having to pass tokens 
linkage_specification
        :
            linkage_terminus_specification
            linkage_terminii
            linkage_terminus_specification
        ;

linkage_terminus_specification
        :
            INTEGER
            linkage_type_identifier
        ;

linkage_terminii
        :
            LPARENTHESIS
            INTEGER
            HYPHEN
            INTEGER
            RPARENTHESIS
        ;
*/

linkage_type_identifier returns [Token t]
        :  
        (   "o"     //  loss of H from OH 
        |   "h"     //  loss of H
        |   "d"     //  loss of OH 
        |   "n"     //  linkage to non-monosac/repeat
        |   "r"     //  prochiral H-atom removed, resulting in R-configuration 
        |   "s"     //  prochiral H-atom removed, resulting in S-configuration 
        )                                       {   t = LT(1);  }
        ;    

anomer                    
        :   "a"       /*  alpha       */
        |   "b"       /*  beta        */
        |   "o"       /*  open-chain  */
        |   "x"       /*  unknown     */        
        ;

stereo
        :
        (   "d"       /*  dextro  */
        |   "l"       /*  levo    */
        |   "x"       /*  unknown */
        )
        ;

//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~  LEXER  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~//
/**
*
*   This class implements a lexer/scanner for carbohydrate
*   sequences in Glycoct syntax. This class was auto-generated from
*   the ANTLR lexer grammar in glycoct_grammar.g.
*
*   @see GlycoctParser
*   @see glycoct_grammar.g
*
*   @author mjh [matt at ebi.ac.uk]
*/
class GlycoctLexer extends Lexer;

options {
    k=3;        /*  lookahead  */
    testLiterals=true;
}

//~~~~~~~~~~~~~~~~~~~~  token separators & delimiters  ~~~~~~~~~~~~~~~~~~~~~~//

COLON
        options { paraphrase="a colon separator ':'"; }
        :   ':'
        ;

COMMA                
        options { paraphrase="a comma ','"; }
        :     ','
        ;

HYPHEN            
        options { paraphrase="a hyphen '-'"; }
        :     '-' 
        ;

PIPE                
        options { paraphrase="a residue substitution delimiter '|'"; }
        :     '|'
        ;

SEMICOLON
        options { paraphrase="a residue/linkage token separator ';'"; }
        :   ';'
        ;

LPARENTHESIS
        options { paraphrase="a linkage start delimiter '('"; }
        :   '('
        ;

RPARENTHESIS
        options { paraphrase="a linkage end delimiter ')'"; }
        :   ')'
        ;

//~~~~~~~~~~~~~~~~~~~~~~~~~~~ identifiers ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~//

INTEGER
        options { paraphrase="a positive integer or zero"; }
        :     ('1'..'9')  ('0'..'9')*  
        |   '0'
        ;

IDENTIFIER                    
        options { paraphrase="a residue name/identifier"; }
        :     ('a'..'z')+ 
        ;

//~~~~~~~~~~~~~~~~~~~~~~~  section type identifiers  ~~~~~~~~~~~~~~~~~~~~~~~~//

RES
        options { paraphrase="a RES (residue) section start identifier"; }
        :   "RES"
        ;

LIN     
        options { paraphrase="a LIN (linkage) section start identifier"; }
        :   "LIN" 
        ;

PRO     
        options { paraphrase="a PRO (heterogeneity due to uncertainty) section start identifier"; }
        :   "PRO"
        ;

REP     
        options { paraphrase="a REP (repeat) section start identifier"; }
        :   "REP"
        ;

STA 
        options { paraphrase="a STA (heterogeneity due to a statistical distribution) section start identifier"; }
        :   "STA"
        ;

ISO
        options { paraphrase="an ISO (isotope) section start identifier"; }
        :   "ISO"
        ;

AGL
        options { paraphrase="an AGL (aglycon) section start identifier"; }
        :   "AGL"
        ;

CR
        : (( '\r' '\n' ) | '\n')                {   newline(); $setType( Token.SKIP );  }
        ;

WS  
        : (' '| '\t' )                          {   $setType( Token.SKIP );  }
        ;
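
One possible explanation worth checking, offered as an assumption rather than a confirmed diagnosis: the lexer above sets testLiterals=true, and the parser rules use string literals such as "n", "s", "d" and "a". In ANTLR 2, literals referenced in the parser are entered into the lexer's literals table, and with testLiterals enabled the text of a matched token is looked up in that table and the token type is changed on a hit. If that is what happens here, the lexer does match "n" with ('a'..'z')+, but hands the parser the literal token for "n" instead of IDENTIFIER, so the 'substituent' rule rejects it. The sketch below simulates that lookup in plain Java; the class name, token type numbers and literal list are all hypothetical and are not taken from the generated GlycoctLexer.

```java
import java.util.HashMap;
import java.util.Map;

final class LiteralsTableSketch {
    static final int IDENTIFIER = 1;        // hypothetical token type numbers
    static final int FIRST_LITERAL = 100;

    // A subset of the string literals that appear in the parser rules above.
    static final Map<String, Integer> LITERALS = new HashMap<>();
    static {
        String[] parserLiterals = {
            "a", "b", "d", "h", "i", "l", "n", "o", "r", "s", "x",
            "hex", "pen", "hept", "non", "keto", "en", "enx", "aldi", "geminal"
        };
        for (int i = 0; i < parserLiterals.length; i++) {
            LITERALS.put(parserLiterals[i], FIRST_LITERAL + i);
        }
    }

    // Simulates the IDENTIFIER rule followed by an optional literals lookup.
    static int tokenType(String text, boolean testLiterals) {
        if (!text.matches("[a-z]+")) {
            throw new IllegalArgumentException("not an IDENTIFIER: " + text);
        }
        if (testLiterals) {
            Integer literalType = LITERALS.get(text);
            if (literalType != null) {
                return literalType;         // token type is remapped to the literal
            }
        }
        return IDENTIFIER;
    }

    public static void main(String[] args) {
        for (String text : new String[] { "glycolyl", "lactone", "thiol", "n" }) {
            boolean isIdentifier = tokenType(text, true) == IDENTIFIER;
            System.out.println(text + " -> " + (isIdentifier ? "IDENTIFIER" : "literal token"));
        }
    }
}
```

Under this assumption, names like "glycolyl" and "lactone" stay IDENTIFIER because they are not parser literals, while "n" does not, which would be consistent with the failures listed earlier. A parser rule that must accept such names would then need to allow the corresponding literal tokens as well as IDENTIFIER.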

Terence Parr wrote:
> Hi.  What is the error message?  Note you'll need to have A..Z in 
> IDENTIFIER if it is to match the keywords (upper case you have).
>
> Ter
> On Dec 13, 2006, at 3:03 AM, Matt Harrison wrote:
>
>>
>> Unfortunately, it doesn't. For some bizarre reason, ('a'..'z')+ 
>> stubbornly refuses to match any single alphabetic character, 
>> regardless of context; that is, I can call the rule 'substituent' 
>> below directly with a single character of input and it doesn't match, 
>> nor will it match if a single character 'substituent' occurs in the 
>> middle of a token stream.
>>
>> Perhaps a bug in ANTLR? Surely this has got to be due to something 
>> else I am missing due to my inexperience with ANTLR, but I can't for 
>> the life of me discern what.
>>
>> cheers,
>> Matt Harrison
>>
>> ps: "identifiers" for my particular parsing problem are only 
>> lower-case, and indeed, allowing upper-case ids introduces 
>> non-determinism with all of the constant upper-case keywords defined 
>> elsewhere in the lexer.
>>
>> Vinay Veeramachaneni wrote:
>>> Hi,
>>>  Your grammar seems to be fine. You should consider including the 
>>> uppercase letters in identifiers too.
>>>  IDENTIFIER   options { paraphrase="a residue name/identifier"; }
>>>
>>>        :     ('a'..'z' | 'A'..'Z')+ ;
>>>
>>> This should solve the problem.
>>>  Regards,
>>> Vinay
>>>
>>>  On 12/12/06, Matt Harrison <matt at ebi.ac.uk> wrote:
>>>
>>>     Salute, fellow antlers.
>>>
>>>     I'm a recent convert to the world of language recognition/parsing
>>>     using ANTLR, although I have used Perl/Python for "simple" parsing
>>>     tasks for many, many man-months.
>>>
>>>     I am having trouble diagnosing why the (common) lexer expression
>>>     "('a'..'z')+" is not matching any single-character input (eg: "n")
>>>     in my grammar. Are there any situations under which this expression
>>>     should not match a single character in the range 'a' - 'z'?
>>>
>>>     Thanks for your time.
>>>     Matt
>>>
>>>     ~~~
>>>     The offending parser rule is as follows:
>>>
>>>     substituent
>>>
>>>            :   IDENTIFIER
>>>
>>>                (HYPHEN IDENTIFIER)*
>>>
>>>            ;
>>>
>>>
>>>     The lexer is pretty basic:
>>>
>>>     class FooBarLexer extends Lexer;
>>>
>>>     options {
>>>
>>>        k=3;        /*  lookahead  */
>>>
>>>     }
>>>
>>>     //~~~~~~~~~~~~~~~~~~~~  token separators & delimiters  ~~~~~~~~~~~~~~~~~~~~~~//
>>>
>>>
>>>
>>>     COLON
>>>
>>>            options { paraphrase="a colon separator"; }
>>>
>>>            :   ':'
>>>
>>>            ;
>>>
>>>
>>>
>>>     COMMA
>>>
>>>            options { paraphrase="a comma"; }
>>>
>>>            :     ','
>>>
>>>            ;
>>>
>>>     HYPHEN
>>>
>>>            options { paraphrase="an internal linkage delimiter '-'"; }
>>>
>>>            :     '-'
>>>
>>>            ;
>>>
>>>     PIPE
>>>
>>>            options { paraphrase="a residue substitution delimiter"; }
>>>
>>>            :     '|'
>>>
>>>            ;
>>>
>>>     SEMICOLON
>>>
>>>            options { paraphrase="a residue/linkage token separator"; }
>>>
>>>            :   ';'
>>>
>>>            ;
>>>
>>>
>>>
>>>     LPARENTHESIS
>>>
>>>            options { paraphrase="a linkage delimiter"; }
>>>
>>>            :   '('
>>>
>>>            ;
>>>
>>>
>>>
>>>     RPARENTHESIS
>>>
>>>            options { paraphrase="a linkage delimiter"; }
>>>
>>>            :   ')'
>>>
>>>            ;
>>>
>>>
>>>     //~~~~~~~~~~~~~~~~~~~~~~~~~~~ identifiers
>>>     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~//
>>>
>>>     INTEGER
>>>
>>>            options { paraphrase="a positive integer or zero"; }
>>>
>>>            :     ('1'..'9')  ('0'..'9')*
>>>
>>>            |   '0'
>>>
>>>            ;
>>>
>>>
>>>
>>>     IDENTIFIER
>>>
>>>            options { paraphrase="a residue name/identifier"; }
>>>
>>>            :     ('a'..'z')+
>>>
>>>            ;
>>>
>>>     //~~~~~~~~~~~~~~~~~~~~~~~  section type identifiers  ~~~~~~~~~~~~~~~~~~~~~~~~//
>>>
>>>     RES
>>>
>>>            options { paraphrase="a RES (residue) section start identifier"; }
>>>
>>>            :   "RES"
>>>
>>>            ;
>>>
>>>
>>>
>>>     LIN
>>>
>>>            options { paraphrase="a LIN (linkage) section start identifier"; }
>>>
>>>            :   "LIN"
>>>
>>>            ;
>>>
>>>
>>>
>>>     PRO
>>>
>>>            options { paraphrase="a PRO (heterogeneity due to uncertainty) section start identifier"; }
>>>
>>>            :   "PRO"
>>>
>>>            ;
>>>
>>>
>>>
>>>     REP
>>>
>>>            options { paraphrase="a REP (repeat) section start identifier"; }
>>>
>>>            :   "REP"
>>>
>>>            ;
>>>
>>>
>>>
>>>     STA
>>>
>>>            options { paraphrase="a STA (heterogeneity due to a statistical distribution) section start identifier"; }
>>>
>>>            :   "STA"
>>>
>>>            ;
>>>
>>>
>>>
>>>     ISO
>>>
>>>            options { paraphrase="an ISO (isotope) section start identifier"; }
>>>
>>>            :   "ISO"
>>>
>>>            ;
>>>
>>>
>>>
>>>     AGL
>>>
>>>            options { paraphrase="an AGL (aglycon) section start identifier"; }
>>>
>>>            :   "AGL"
>>>
>>>            ;
>>>
>>>
>>>
>>>     CR
>>>
>>>            : ( '\r' '\n' )
>>>
>>>            | '\n'                                  {   newline(); $setType( Token.SKIP );  }
>>>
>>>            ;
>>>
>>>
>>>
>>>     WS
>>>
>>>            : (' '| '\t' )                          {   $setType( Token.SKIP );  }
>>>
>>>            ;
>>>
>>>
>>>
>>>
>>>     --
>>>     Dr Matt Harrison
>>>     BTech (Biotech) Hons PhD
>>>     Glycobiology Bioinformatician
>>>     European Bioinformatics Institute UK
>>>     http://www.ebi.ac.uk   +44 (0)1223 492533
>>>
>>>