[antlr-interest] lexer state and SMILES strings
Austin Hastings
Austin_Hastings at Yahoo.com
Wed Nov 7 10:05:37 PST 2007
Andrew,
I'm not loving the language, but I guess you can't change it. :(
My first recommendation would be to use antlr2 -- antlr3 wants to be
able to support all that stuff, but can't at the moment.
Next: go ahead and use the parser as a lexer, and then use a tree parser
to handle the output. Don't get hung up on the names, just work with
what works. For parsing the insides of square brackets, have a look at
the book's section on emitting multiple tokens. You'll be in
hell^W^Wwriting your own token stream class, but you can use the opening
brace to build a mini-lexer withing the lexing subsystem.
=Austin
Andrew Dalke wrote:
> Hi all,
>
> I'm parsing a format called SMILES. It's a notation for
> writing molecules as a single line of ASCII. Details at
>
> http://daylight.com/dayhtml/doc/theory/theory.smiles.html
> http://en.wikipedia.org/wiki/Simplified_molecular_input_line_entry_specification
>
>
> I've written parsers for it using other parser generators
> but was hoping to use ANTLR for better cross-language output
> and simplifications I can do using tree grammars.
>
> My previous parsers were state dependent, and use only 1
> character lookahead. This is easily handled by most of
> the parsers I know about for this format, which are
> hand-written using a bunch of switch statements.
>
> I've looked through the back archives and found several people
> asking about how to change the lexer state, but I didn't
> understand what to do from the responses I found. I would
> appreciate some advice or pointers to existing code which
> does something like switching lexers, or adding lexer state, in ANTLR.
>
> =======
>
> That was the summary. Here's the tedious details.
>
> The two main responses in the mailing list are of the form:
>
> - create two lexers and have the parser switch between them
>
> - the more provocative statement of Jim Idle:
>
> Perhaps, someday, the world can require that any 'new' [insert
> post-modernistic interpolation of 'reality' here ;-)] languages
> have to have a sane ANTLR parser as proof of their ability not
> to drive anyone with any sense mad ... <duck>
>
> As he then dissed Python ("shoot anyone deciding that
> indenting should be lexically/grammatically significant on
> the vague justification that it makes people format code -
> something that a good programmer should do naturally anyway.")
> I feel comfortable not seeing the crouching duck. :)
>
>
> I couldn't find information how to generate multiple lexers.
> There's some old documentation for that in ANTLR 2's
> doc/streams.html but I see nothing in the ANTLR book which
> talks about that possibility. (In general the index is
> lacking. Eg, there's no index entry for 'channels',
> which does exist in the book, so I don't know if the
> lack of information about 'streams' means it doesn't exist.)
>
>
> I would like some advice, suggestions, pointers to working
> code, etc.
>
>
> So that you can see my problem, here are a few examples
> of SMILES strings and why I use lexer states to handle it.
> I'm deliberately lying about the grammar because the full
> details depends on chemistry that isn't relevant. (Ie,
> I'm not going to talk about "aromaticity".)
>
> Here's a simple atom:
> C this is methane, a carbon with the default of 4 hydrogens
> [CH4] this is also methane, a carbon with 4 explicit hydrogens
>
> [13C] this is a carbon 13 atom with zero hydrogens.
> (Carbon has many isotopes. The most common is C-12.
> The ratio between C-12 and radioactive C-14 is used
> for radiocarbon dating.)
>
> [C-] this is a carbon atom with zero hydrogens and a charge of -1
> [C--] this is a carbon atom with zero hydrogens and a charge of -2
> [C-2] this is a carbon atom with zero hydrogens and a charge of -2
>
>
> C-C this is carbon bonded to another carbon through a single bond.
> Each atom has 3 hydrogens (with the implicit rules,
> the default valence of carbon is 4, the bond takes
> 1 electron from each, so there are 3 electrons remaining)
>
> C=C this is a carbon bonded to another carbon through a double bond
> Each atom has 2 hydrogens.
>
> You can see at this point that there is ambiguity between "-"
> meaning "single bond" and "-" meaning "charge of -1". The
> bond notation only occurs outside of the '[]' notation, and
> the charge notation only occurs inside of '[]'.
>
> CC this is another way of writing "C-C". If two atoms are
> side by side then the default rule is to assume it's
> a single bond.
>
> CCCCCC This is hexane. 6 carbon atoms in a chain, with a single
> bond between successive atoms
>
> C1CCCCC1 This is a ring of carbons. The '1' connects the first
> carbon atom to the last, making a ring. These are called
> ring closures but they don't necessarily close rings.
>
> C9CCCCC9 This is another way to describe the same ring.
> I can use the digits '0'..'9'.
>
> C%43CCCCC%43 The notation '%' DIGIT DIGIT is for ring closure
> numbers from 10 to 99, inclusive.
>
> This last SMILES string should tokenize to
> ATOM "C"
> RING "43"
> ATOM "C"
> ATOM "C"
> ATOM "C"
> ATOM "C"
> ATOM "C"
> RING "43"
>
>
> C.C This is two methanes. The "." is a way to say that
> successive atoms are not bonded together.
>
> C1.C1 This is another way to write "CC". The ring closure in
> this case connects the two carbons together, even
> though there is no explicit connection via the '.'
>
> C1.C12.C23.C34.C45.C5 This is the same as "CCCCCC". The
> ring closures connect the atoms despite the "."s that
> are supposed to separate them.
>
> This last case should tokenize to:
> ATOM "C"
> RING "1"
> DOT
> ATOM "C"
> RING "1"
> RING "2"
> DOT
> ATOM "C"
> RING "2"
> RING "3"
> DOT
> ...
>
>
> Finally,
>
> C-1.C-1 is the same as "C-C", which is the same as "CC".
> This specifies the bond type ("-" or "=" for single
> or double) for the bond created by the ring closure.
> The default is single.
>
> C=1.C=1 is the same as "C=C".
>
>
> C-67-8[C-].[14C-1]-6.[12C-2]7.[CH2-3]-8
>
> This is very hairy, and should okenize somewhat as:
>
> ATOM "C" with default hydrogen rules
> closure ("-", "6")
> closure (<default>, "7")
> closure ("-", "8")
> ATOM "C" with no hydrogens and a charge of -1
> DOT
> ATOM "C" with no hydrogens, isotope number 14 and charge of -1
> closure ("-", "6")
> DOT
> ATOM "C" with no hydrogens, isotope number 12 and charge of -2
> closure (<default>, "7")
> DOT
> ATOM "C" with two hydrogens and a charge of -3)
> closure ("-", "8")
>
> You can see my problem. Depending on where I am,
> "-2" is parsed as "charge of -2" or "single bond ring closure
> with identifer 2".
>
> If I use '0'...'9'+ to tokenize the isotope field, like in [12C],
> then it will also parse the 12 in
>
> C12.C1.C2
>
> If I instead let the parser do everything, so the lexer
> items are '-', and '0'..'9', then it will be annoyingly
> tedious in the parser to have to reassemble the characters,
> because '[238U]' will be parsed using
>
> atom_in_brackets: '[' isotope? SYMBOL hydrogen? charge? ']';
> isotope: DIGIT*; // will need to combine the tokens to get the value
> SYMBOL: 'A'..'Z' 'a'..'z'? ;
> hydrogen: 'H' DIGIT*; // again, will combine the tokens
> charge: MINUS DIGIT*
> | PLUS DIGIT* ;
>
> atom_list: (atom closure*)*;
> closure: DIGIT | '%' DIGIT DIGIT;
>
>
> I'm pretty sure this will work, but it ends up pushing all
> of the work into the parser. That seems like a lot of
> waste because I'm basically using the parser as a lexer.
> While in a yacc-style parser it's very easy to define the
> two or three lexer states I need. When I see '[' I switch
> to a "parsing the inside of []s" state, and when I see
> the closing ']' I switch back.
>
> What should I do?
>
> Andrew
> dalke at dalkescientific.com
>
>
>
>
More information about the antlr-interest
mailing list