[antlr-interest] lexer state and SMILES strings

Wed Nov 7 10:05:37 PST 2007

Andrew,

I'm not loving the language, but I guess you can't change it. :(

My first recommendation would be to use antlr2 -- antlr3 wants to be 
able to support all that stuff, but can't at the moment.

Next: go ahead and use the parser as a lexer, and then use a tree parser 
to handle the output. Don't get hung up on the names, just work with 
what works. For parsing the insides of square brackets, have a look at 
the book's section on emitting multiple tokens. You'll be in 
hell^W^Wwriting your own token stream class, but you can use the opening 
brace to build a mini-lexer withing the lexing subsystem.

=Austin

Andrew Dalke wrote:
> Hi all,
>
>   I'm parsing a format called SMILES.  It's a notation for
> writing molecules as a single line of ASCII.  Details at
>
> http://daylight.com/dayhtml/doc/theory/theory.smiles.html
> http://en.wikipedia.org/wiki/Simplified_molecular_input_line_entry_specification 
>
>
> I've written parsers for it using other parser generators
> but was hoping to use ANTLR for better cross-language output
> and simplifications I can do using tree grammars.
>
> My previous parsers were state dependent, and use only 1
> character lookahead.  This is easily handled by most of
> the parsers I know about for this format, which are
> hand-written using a bunch of switch statements.
>
> I've looked through the back archives and found several people
> asking about how to change the lexer state, but I didn't
> understand what to do from the responses I found.  I would
> appreciate some advice or pointers to existing code which
> does something like switching lexers, or adding lexer state, in ANTLR.
>
> =======
>
> That was the summary.  Here's the tedious details.
>
> The two main responses in the mailing list are of the form:
>
>   - create two lexers and have the parser switch between them
>
>   - the more provocative statement of Jim Idle:
>
>     Perhaps, someday, the world can require that any 'new' [insert
>     post-modernistic interpolation of 'reality' here ;-)] languages
>    have to have a sane ANTLR parser as proof of their ability not
>    to drive anyone with any sense mad  ... <duck>
>
> As he then dissed Python ("shoot anyone deciding that
> indenting should be lexically/grammatically significant on
> the vague justification that it makes people format code -
> something that a good programmer should do naturally anyway.")
> I feel comfortable not seeing the crouching duck. :)
>
>
> I couldn't find information how to generate multiple lexers.
> There's some old documentation for that in ANTLR 2's
> doc/streams.html but I see nothing in the ANTLR book which
> talks about that possibility.  (In general the index is
> lacking.  Eg, there's no index entry for 'channels',
> which does exist in the book, so I don't know if the
> lack of information about 'streams' means it doesn't exist.)
>
>
> I would like some advice, suggestions, pointers to working
> code, etc.
>
>
> So that you can see my problem, here are a few examples
> of SMILES strings and why I use lexer states to handle it.
> I'm deliberately lying about the grammar because the full
> details depends on chemistry that isn't relevant.  (Ie,
> I'm not going to talk about "aromaticity".)
>
> Here's a simple atom:
>    C     this is methane, a carbon with the default of 4 hydrogens
>   [CH4]  this is also methane, a carbon with 4 explicit hydrogens
>
>   [13C]  this is a carbon 13 atom with zero hydrogens.
>            (Carbon has many isotopes.  The most common is C-12.
>             The ratio between C-12 and radioactive C-14 is used
>             for radiocarbon dating.)
>
>   [C-]  this is a carbon atom with zero hydrogens and a charge of -1
>   [C--]  this is a carbon atom with zero hydrogens and a charge of -2
>   [C-2]  this is a carbon atom with zero hydrogens and a charge of -2
>
>
>   C-C    this is carbon bonded to another carbon through a single bond.
>           Each atom has 3 hydrogens (with the implicit rules,
>           the default valence of carbon is 4, the bond takes
>           1 electron from each, so there are 3 electrons remaining)
>
>   C=C   this is a carbon bonded to another carbon through a double bond
>           Each atom has 2 hydrogens.
>
> You can see at this point that there is ambiguity between "-"
> meaning "single bond" and "-" meaning "charge of -1".  The
> bond notation only occurs outside of the '[]' notation, and
> the charge notation only occurs inside of '[]'.
>
>   CC     this is another way of writing "C-C".  If two atoms are
>           side by side then the default rule is to assume it's
>           a single bond.
>
>   CCCCCC  This is hexane.  6 carbon atoms in a chain, with a single
>         bond between successive atoms
>
>   C1CCCCC1  This is a ring of carbons.  The '1' connects the first
>         carbon atom to the last, making a ring.  These are called
>         ring closures but they don't necessarily close rings.
>
>   C9CCCCC9  This is another way to describe the same ring.
>         I can use the digits '0'..'9'.
>
>   C%43CCCCC%43  The notation '%' DIGIT DIGIT is for ring closure
>         numbers from 10 to 99, inclusive.
>
> This last SMILES string should tokenize to
>   ATOM "C"
>   RING "43"
>   ATOM "C"
>   ATOM "C"
>   ATOM "C"
>   ATOM "C"
>   ATOM "C"
>   RING "43"
>
>
>   C.C    This is two methanes.  The "." is a way to say that
>         successive atoms are not bonded together.
>
>   C1.C1  This is another way to write "CC".  The ring closure in
>           this case connects the two carbons together, even
>           though there is no explicit connection via the '.'
>
>    C1.C12.C23.C34.C45.C5  This is the same as "CCCCCC".  The
>       ring closures connect the atoms despite the "."s that
>       are supposed to separate them.
>
> This last case should tokenize to:
>    ATOM "C"
>    RING "1"
>    DOT
>    ATOM "C"
>    RING "1"
>    RING "2"
>    DOT
>    ATOM "C"
>    RING "2"
>    RING "3"
>    DOT
>    ...
>
>
> Finally,
>
>    C-1.C-1  is the same as "C-C", which is the same as "CC".
>        This specifies the bond type ("-" or "=" for single
>        or double) for the bond created by the ring closure.
>        The default is single.
>
>    C=1.C=1  is the same as "C=C".
>
>
>    C-67-8[C-].[14C-1]-6.[12C-2]7.[CH2-3]-8
>
> This is very hairy, and should okenize somewhat as:
>
>   ATOM "C" with default hydrogen rules
>     closure ("-", "6")
>     closure (<default>, "7")
>     closure ("-", "8")
>   ATOM "C" with no hydrogens and a charge of -1
>   DOT
>   ATOM "C" with no hydrogens, isotope number 14 and charge of -1
>     closure ("-", "6")
>   DOT
>   ATOM "C" with no hydrogens, isotope number 12 and charge of -2
>     closure (<default>, "7")
>   DOT
>   ATOM "C" with two hydrogens and a charge of -3)
>      closure ("-", "8")
>
> You can see my problem.  Depending on where I am,
> "-2" is parsed as "charge of -2" or "single bond ring closure
> with identifer 2".
>
> If I use '0'...'9'+ to tokenize the isotope field, like in [12C],
> then it will also parse the 12 in
>
>   C12.C1.C2
>
> If I instead let the parser do everything, so the lexer
> items are '-', and '0'..'9', then it will be annoyingly
> tedious in the parser to have to reassemble the characters,
> because  '[238U]' will be parsed using
>
> atom_in_brackets: '[' isotope? SYMBOL hydrogen? charge? ']';
> isotope: DIGIT*; // will need to combine the tokens to get the value
> SYMBOL: 'A'..'Z' 'a'..'z'? ;
> hydrogen: 'H' DIGIT*;  // again, will combine the tokens
> charge: MINUS DIGIT*
>       | PLUS DIGIT* ;
>
> atom_list: (atom closure*)*;
> closure: DIGIT | '%' DIGIT DIGIT;
>
>
> I'm pretty sure this will work, but it ends up pushing all
> of the work into the parser.  That seems like a lot of
> waste because I'm basically using the parser as a lexer.
> While in a yacc-style parser it's very easy to define the
> two or three lexer states I need.  When I see '[' I switch
> to a "parsing the inside of []s" state, and when I see
> the closing ']' I switch back.
>
> What should I do?
>
>                 Andrew
>                 dalke at dalkescientific.com
>
>
>
>