[antlr-interest] lexing expression ('a'..'z')+ not matching single character input
Matt Harrison
matt at ebi.ac.uk
Wed Dec 13 09:53:16 PST 2006
Thanks to everyone for the replies.
At risk of permanently offending the entire mailing list, I'll include
the grammar file in its entirety (since I feel it's far more likely I'm
missing something than there is a genuine bug in ANTLR).
The input text is a sequence format for carbohydrate structures, some
samples follow. "Identifiers" in this grammar are always lower-case, and
adhere to the grammar "('a'..'z')+ ('-' ('a'..'z')+)*".
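As a quick sanity check of that identifier shape outside of ANTLR, here is a minimal Java sketch. The class and method names are hypothetical and not part of the grammar below; it only tests the shape of an id with java.util.regex and says nothing about how the generated lexer behaves.

```java
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of glycoct_grammar.g. */
class IdentifierShape {
    // Lower-case runs separated by single hyphens:
    // ('a'..'z')+ ('-' ('a'..'z')+)*
    private static final Pattern ID = Pattern.compile("[a-z]+(-[a-z]+)*");

    /** True if the whole string has the identifier shape described above. */
    static boolean isIdentifier(String s) {
        return ID.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = { "n", "n-acetyl", "lactone", "dgro-dgal", "N", "a-", "" };
        for (String s : samples) {
            System.out.println("'" + s + "' -> " + isIdentifier(s));
        }
    }
}
```

By this definition a single character such as "n" is a perfectly valid id, which is why the failures below are surprising.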
These sequences parse without problems:
"RES 1b:a-dara-hex-2:5|2:keto;"
"RES 1b:a-dara-hex-1:5|2:d|6:d;"
"RES 1b:a-dglc-hex-1:5;2s:lactone; LIN 1:1d(3-6)2o;"
"RES 1b:a-dgro-dgal-non-2:6|1:a|2:keto|3:d;1s:glycolyl; LIN 1:1o(5-1)2d;"
These sequences contain single-character ids and fail to parse.
For each one I've included the error and a javac-style context marker '^'.
org.eurocarbdb.sugar.seq.SequenceFormatException: expecting a residue name/identifier, found 'n'
RES 1b:a-drib-hex-1:5|3:d;2s:n;3s:n;LIN 1:1d(2-1)2n;2:1d(6-1)3n;
^
org.eurocarbdb.sugar.seq.SequenceFormatException: expecting a residue name/identifier, found 'n'
RES 1b:a-dgal-hex-1:5;2s:n;4s:thiol;LIN 1:1d(3-1)2n;2:1d(4-1)3n;
^
org.eurocarbdb.sugar.seq.SequenceFormatException: expecting a residue name/identifier, found 'n'
RES 1b:a-dgro-dgal-non-2:6|1:a|2:keto|3:d;1s:n-glycolyl;LIN 1:1o(5-1)2d;
^
org.eurocarbdb.sugar.seq.SequenceFormatException: expecting a residue name/identifier, found 'n'
RES1b:b-dgal-hex-1:5;2s:n-acetyl;3b:a-lgal-hex-1:5|6:d;4b:b-dgal-hex-1:5;LIN1:1d(2-1)2n;2:1o(3-1)3d;3:1o(4-1)4d;
^
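For readers unfamiliar with the javac-style marker, a minimal Java sketch of how such a marker line can be produced follows. The helper name is hypothetical, and the column used in the example (the first single-character 'n' after "2s:") is an assumption about where the error was reported, not something taken from the real exception.

```java
/** Hypothetical helper for illustration; not the code used by SequenceFormatException. */
class ContextMarker {
    /** Returns the input plus a second line with '^' under the given 0-based column. */
    static String mark(String input, int column) {
        if (column < 0 || column > input.length()) {
            throw new IllegalArgumentException("column out of range: " + column);
        }
        StringBuilder sb = new StringBuilder(input).append('\n');
        for (int i = 0; i < column; i++) {
            sb.append(' ');   // pad up to the offending column
        }
        return sb.append('^').toString();
    }

    public static void main(String[] args) {
        String seq = "RES 1b:a-dgal-hex-1:5;2s:n;4s:thiol;LIN 1:1d(3-1)2n;2:1d(4-1)3n;";
        // Assumed error position: the 'n' that follows "2s:"
        int column = seq.indexOf("2s:n") + 3;
        System.out.println(mark(seq, column));
    }
}
```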
The grammar:
/* glycoct_grammar.g -- a grammar for carbohydrates in IUPAC nomenclature */
header // <-- this section appears at the top of the auto-generated parser
{
package org.eurocarbdb.sugar.seq.grammar;
}
/* class GlycoctParser *//*****************************************************
*<p>
* This class defines an LLk parser based on ANTLR (http://antlr.org) syntax
* rules for parsing carbohydrate sequences in GlycoCT syntax, according
* to the syntax rules described (TODO: hassle ppl for a definitive syntax reference link).
*</p>
*<p>
* This class' superclass provides the majority of
* the semantic action code that is called from within this grammar. This
* is in order to keep the grammar as clear as possible and to facilitate
* re-targeting of this grammar to other languages than Java (at time of
* writing ANTLR also supports C++, python, C#).
*</p>
*<p>
* Note that the source code for this class has been auto-generated by ANTLR.
*</p>
*
* @see GlycoctLexer
* @see GlycoctParserAdaptor
* @see ParserAdaptor
* @see glycoct_grammar.g
*
* @author mjh [matt at ebi.ac.uk]
*/
class GlycoctParser extends Parser("org.eurocarbdb.sugar.seq.grammar.GlycoctParserAdaptor");
//~~~ ANTLR options ~~~
options {
k=2; /* lookahead */
codeGenDebug=false; /* a debugging setting */
analyzerDebug=false; /* a debugging setting */
defaultErrorHandler=false; /* needs to be false to propagate exceptions */
}
//~~~ start class init section ~~~
// this section is inserted directly into the top of the generated class
// right after the class declaration. It can contain any valid (Java) code.
{
/* empty */
}
//~~~ end class init section ~~~
//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ GRAMMAR ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
// | |
// grammar-specification | actions for grammar |
// written in antlr | written in java |
// | |
// | |
/** Toplevel rule defining a sugar sequence. */
sugar
: res_section /* residues */
(lin_section)? /* linkages */
//TODO: (pro_section)? /* heterogeneity due to uncertainty */
//TODO: (rep_section)? /* repeats */
//TODO: // STA section too incompletely defined grammatically - omitted
//TODO sta_section /* heterogeneity due to statistical distribution, eg GAGs */
EOF
;
//~~~ SECTIONS ~~~//
res_section
: RES
(residue)+
;
lin_section
: LIN
(linkage)+
;
pro_section
: PRO
(linkage)+
;
rep_section
: REP
(sugar)+
;
//~~~ RES SECTION ~~~//
/** A numbered residue entry in the 'RES' section. */
residue
: INTEGER
residue_specification
SEMICOLON
;
/** A residue, which may be either a monosaccharide, a substituent,
* or one of the other types specified by GlycoCT (INCHI, freetext)
*/
residue_specification
: monosac_specification
| substit_specification
//| inchi_specification // TODO later
;
monosac_specification
:
"b" // for "basetype"
COLON
monosaccharide
;
substit_specification
:
"s" // for "substituent"
COLON
substituent
;
/** A monosaccharide, in GlycoCT format */
monosaccharide
: a:anomer
HYPHEN
monosaccharide_name
HYPHEN
c:monosac_superclass
HYPHEN
monosac_ring_closure
(monosac_modifications)*
;
// stem-type
monosaccharide_name
: n:IDENTIFIER
(
HYPHEN
x:IDENTIFIER { n.setText( n.getText() + "-" + x.getText() ); }
)* { addResidue( createResidueToken( n ) ); }
;
/** A substituent (ie: non-monosaccharide) */
substituent
: n:IDENTIFIER
(
HYPHEN
x:IDENTIFIER { n.setText( n.getText() + "-" + x.getText() ); }
)* { addResidue( createResidueToken( n ) ); }
;
// Name/type given to the basic monosaccharide sans mods, eg: glc
monosac_stemtype
:
//HYPHEN
//stereo
s:IDENTIFIER // have to check for stereo separately here
;
// Ring size/configuration, eg: hex
monosac_superclass
: //IDENTIFIER
"hex" // hexose (6)
| "pen" // pentose (5)
| "hept" // heptose (7)
| "non" // nonulose (9)
;
monosac_ring_closure
:
terminus_position
COLON
terminus_position
;
terminus_position
:
t:INTEGER
| u:UNKNOWN
;
monosac_modifications
:
PIPE
INTEGER
( COMMA INTEGER )? // syntax for alkenes is '|2,3:en'
COLON
monosac_modification
;
monosac_modification
:
( "d" // deoxygenation
| "keto" // a carbonyl group
| "en" // double-bond
| "enx" // double-bond?
| "a" // acidic group
| "aldi" // reduced C1 carbonyl
| "sp2" // outgoing linkage with double bond
| "geminal" // 2 OH at one backbone carbon
)
;
monosac_type_identifier
:
( "b" // a base type
| "s" // a substituent
| "n" // other chemically defined entity (freetext)
| "i" // INCHI-encoded non-basetype, non-substituent
| "r" // repeating unit
// ERROR in the specification: 's' is duplicated
// | "s" // statistical unit
)
;
//~~~ LIN SECTION ~~~//
linkage { Token nrtt, rtt; }
:
// linkage numbering
i:INTEGER
COLON
// non-reducing residue id
nrti:INTEGER
nrtt=linkage_type_identifier
// then the actual linkage
LPARENTHESIS
lnrt:INTEGER
HYPHEN
lrt:INTEGER
RPARENTHESIS
// then the reducing residue id
rti:INTEGER
rtt=linkage_type_identifier
// end of linkage
SEMICOLON { addLinkage( i, nrti, nrtt, lnrt, lrt, rti, rtt ); }
;
/* inlined to avoid having to pass tokens
linkage_specification
:
linkage_terminus_specification
linkage_terminii
linkage_terminus_specification
;
linkage_terminus_specification
:
INTEGER
linkage_type_identifier
;
linkage_terminii
:
LPARENTHESIS
INTEGER
HYPHEN
INTEGER
RPARENTHESIS
;
*/
linkage_type_identifier returns [Token t]
:
( "o" // loss of H from OH
| "h" // loss of H
| "d" // loss of OH
| "n" // linkage to non-monosac/repeat
| "r" // prochiral H-atom removed, resulting in R-configuration
| "s" // prochiral H-atom removed, resulting in S-configuration
) { t = LT(1); }
;
anomer
: "a" /* alpha */
| "b" /* beta */
| "o" /* open-chain */
| "x" /* unknown */
;
stereo
:
( "d" /* dextro */
| "l" /* levo */
| "x" /* unknown */
)
;
//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ LEXER ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~//
/**
*
* This class implements a lexer/scanner for carbohydrate
* sequences in Glycoct syntax. This class was auto-generated from
* the ANTLR lexer grammar in glycoct_grammar.g.
*
* @see GlycoctParser
* @see glycoct_grammar.g
*
* @author mjh [matt at ebi.ac.uk]
*/
class GlycoctLexer extends Lexer;
options {
k=3; /* lookahead */
testLiterals=true;
}
//~~~~~~~~~~~~~~~~~~~~ token separators & delimiters ~~~~~~~~~~~~~~~~~~~~~~//
COLON
options { paraphrase="a colon separator ':'"; }
: ':'
;
COMMA
options { paraphrase="a comma ','"; }
: ','
;
HYPHEN
options { paraphrase="a hyphen '-'"; }
: '-'
;
PIPE
options { paraphrase="a residue substitution delimiter '|'"; }
: '|'
;
SEMICOLON
options { paraphrase="a residue/linkage token separator ';'"; }
: ';'
;
LPARENTHESIS
options { paraphrase="a linkage start delimiter '('"; }
: '('
;
RPARENTHESIS
options { paraphrase="a linkage end delimiter ')'"; }
: ')'
;
//~~~~~~~~~~~~~~~~~~~~~~~~~~~ identifiers ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~//
INTEGER
options { paraphrase="a positive integer or zero"; }
: ('1'..'9') ('0'..'9')*
| '0'
;
IDENTIFIER
options { paraphrase="a residue name/identifier"; }
: ('a'..'z')+
;
//~~~~~~~~~~~~~~~~~~~~~~~ section type identifiers ~~~~~~~~~~~~~~~~~~~~~~~~//
RES
options { paraphrase="a RES (residue) section start identifier"; }
: "RES"
;
LIN
options { paraphrase="a LIN (linkage) section start identifier"; }
: "LIN"
;
PRO
options { paraphrase="a PRO (heterogeneity due to uncertainty) section start identifier"; }
: "PRO"
;
REP
options { paraphrase="a REP (repeat) section start identifier"; }
: "REP"
;
STA
options { paraphrase="a STA (heterogeneity due to a statistical distribution) section start identifier"; }
: "STA"
;
ISO
options { paraphrase="an ISO (isotope) section start identifier"; }
: "ISO"
;
AGL
options { paraphrase="an AGL (aglycon) section start identifier"; }
: "AGL"
;
CR
: (( '\r' '\n' ) | '\n') { newline(); $setType( Token.SKIP ); }
;
WS
: (' '| '\t' ) { $setType( Token.SKIP ); }
;
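One thing that may be worth checking, offered as an assumption rather than a diagnosis confirmed in this thread: with testLiterals=true, an ANTLR 2 lexer looks up the text of each matched token in its literals table, and the quoted strings used in the parser rules above ("b", "s", "n", "d", "keto" and so on) are entered in that table. If that applies here, an input of "n" would reach the parser as the literal token for "n" rather than as IDENTIFIER. The plain-Java sketch below imitates that lookup without using ANTLR at all; the class name and token type numbers are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical simulation of a literals-table lookup; this is not ANTLR code. */
class LiteralRetyping {
    static final int IDENTIFIER = 1;      // made-up token type numbers
    static final int LITERAL_BASE = 100;

    // A subset of the quoted strings that appear in the parser rules above
    static final Map<String, Integer> LITERALS = new HashMap<>();
    static {
        String[] words = { "b", "s", "n", "d", "a", "o", "h", "r", "x", "keto", "hex" };
        for (int i = 0; i < words.length; i++) {
            LITERALS.put(words[i], LITERAL_BASE + i);
        }
    }

    /** Token type for text already matched by ('a'..'z')+, after the literal test. */
    static int tokenType(String text) {
        Integer literal = LITERALS.get(text);
        return literal != null ? literal : IDENTIFIER;
    }

    public static void main(String[] args) {
        for (String text : new String[] { "n", "thiol", "lactone", "keto" }) {
            String kind = tokenType(text) == IDENTIFIER ? "IDENTIFIER" : "literal";
            System.out.println(text + " -> " + kind);
        }
    }
}
```

Under this assumption the ids in the working samples (dara, dglc, lactone, glycolyl, thiol) do not collide with any quoted string in the parser, while "n" does, which would be consistent with the failures listed earlier.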
Terence Parr wrote:
> Hi. What is the error message? Note you'll need to have A..Z in
> IDENTIFIER if it is to match the keywords (upper case you have).
>
> Ter
> On Dec 13, 2006, at 3:03 AM, Matt Harrison wrote:
>
>>
>> Unfortunately, it doesn't. For some bizarre reason, ('a'..'z')+
>> stubbornly refuses to match any single alphabetic character,
>> regardless of context; that is, I can call the rule 'substituent'
>> below directly with a single character of input and it doesn't match,
>> nor will it match if a single character 'substituent' occurs in the
>> middle of a token stream.
>>
>> Perhaps a bug in ANTLR? Surely this has got to be due to something
>> else I am missing due to my inexperience with ANTLR, but I can't for
>> the life of me discern what.
>>
>> cheers,
>> Matt Harrison
>>
>> ps: "identifiers" for my particular parsing problem are only
>> lower-case, and indeed, allowing upper-case ids introduces
>> non-determinism with all of the constant upper-case keywords defined
>> elsewhere in the lexer.
>>
>> Vinay Veeramachaneni wrote:
>>> Hi,
>>> Your grammar seems to be fine. You should consider including
>>> uppercase letters in identifiers too.
>>> IDENTIFIER options { paraphrase="a residue name/identifier"; }
>>>
>>> : ('a'..'z' | 'A'..'Z')+ ;
>>>
>>> This must solve the problem.
>>> Regards,
>>> Vinay
>>>
>>> On 12/12/06, Matt Harrison <matt at ebi.ac.uk> wrote:
>>>
>>> Salute, fellow antlers.
>>>
>>> I'm a recent convert to the world of language recognition/parsing
>>> using
>>> ANTLR, although I have used Perl/Python for "simple" parsing
>>> tasks for
>>> many many man-months.
>>>
>>> I am having trouble diagnosing why the (common) lexer expression
>>> "('a'..'z')+" is not matching any single character input (eg: "n")
>>> in my
>>> grammar. Are there any situations under which this expression
>>> should not
>>> match a single character in the range 'a' - 'z'?
>>>
>>> Thanks for your time.
>>> Matt
>>>
>>> ~~~
>>> The offending parser rule is as follows:
>>>
>>> substituent
>>> : IDENTIFIER
>>> (HYPHEN IDENTIFIER)*
>>> ;
>>>
>>> The lexer is pretty basic:
>>>
>>> class FooBarLexer extends Lexer;
>>> options {
>>> k=3; /* lookahead */
>>> }
>>>
>>> //~~~~~~~~~~~~~~~~~~~~ token separators & delimiters ~~~~~~~~~~~~~~~~~~~~~~//
>>> COLON
>>> options { paraphrase="a colon separator"; }
>>> : ':'
>>> ;
>>> COMMA
>>> options { paraphrase="a comma"; }
>>> : ','
>>> ;
>>> HYPHEN
>>> options { paraphrase="an internal linkage delimiter '-'"; }
>>> : '-'
>>> ;
>>> PIPE
>>> options { paraphrase="a residue substitution delimiter"; }
>>> : '|'
>>> ;
>>> SEMICOLON
>>> options { paraphrase="a residue/linkage token separator"; }
>>> : ';'
>>> ;
>>> LPARENTHESIS
>>> options { paraphrase="a linkage delimiter"; }
>>> : '('
>>> ;
>>> RPARENTHESIS
>>> options { paraphrase="a linkage delimiter"; }
>>> : ')'
>>> ;
>>>
>>> //~~~~~~~~~~~~~~~~~~~~~~~~~~~ identifiers ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~//
>>> INTEGER
>>> options { paraphrase="a positive integer or zero"; }
>>> : ('1'..'9') ('0'..'9')*
>>> | '0'
>>> ;
>>> IDENTIFIER
>>> options { paraphrase="a residue name/identifier"; }
>>> : ('a'..'z')+
>>> ;
>>>
>>> //~~~~~~~~~~~~~~~~~~~~~~~ section type identifiers ~~~~~~~~~~~~~~~~~~~~~~~~//
>>> RES
>>> options { paraphrase="a RES (residue) section start identifier"; }
>>> : "RES"
>>> ;
>>> LIN
>>> options { paraphrase="a LIN (linkage) section start identifier"; }
>>> : "LIN"
>>> ;
>>> PRO
>>> options { paraphrase="a PRO (heterogeneity due to uncertainty) section start identifier"; }
>>> : "PRO"
>>> ;
>>> REP
>>> options { paraphrase="a REP (repeat) section start identifier"; }
>>> : "REP"
>>> ;
>>> STA
>>> options { paraphrase="a STA (heterogeneity due to a statistical distribution) section start identifier"; }
>>> : "STA"
>>> ;
>>> ISO
>>> options { paraphrase="an ISO (isotope) section start identifier"; }
>>> : "ISO"
>>> ;
>>> AGL
>>> options { paraphrase="an AGL (aglycon) section start identifier"; }
>>> : "AGL"
>>> ;
>>> CR
>>> : ( '\r' '\n' )
>>> | '\n' { newline(); $setType( Token.SKIP ); }
>>> ;
>>> WS
>>> : (' '| '\t' ) { $setType( Token.SKIP ); }
>>> ;
>>>
>>> --
>>> Dr Matt Harrison
>>> BTech (Biotech) Hons PhD
>>> Glycobiology Bioinformatician
>>> European Bioinformatics Institute UK
>>> http://www.ebi.ac.uk +44 (0)1223 492533
>>>
>>>