[antlr-interest] Baffled using Antlr to parse custom markup language

Thu Sep 24 10:38:41 PDT 2009

ANTLR is stumping me. Perhaps that's because I'm trying to use it to
parse a pretty gnarly markup language, but maybe not...

Between the [BEGIN] and [END] markers below is a sample of the markup
I'm dealing with.

[BEGIN]
.Extract field for no. 16 can be added as the last line of an entry,
in the form: *e 16
FOR NO. 17, ADD 17: NO COMMA. E.G. *e 16 17 or *e 17 (DO NOT USE THE
*v FIELD OR THE *d FIELD)

!!!!!!!!!!!REPEAT, NO COMMA!!!!!!!!!!!!!!

*a \\Albert, John\\ (b. 1912/13). Gardener. George Robert Fox's
gardener at The Vicarage, \circa\ 1941. (Census returns 1841 (Public Record
Office HO137/916/8).)
*b expanded by EL
*c
*v 2, 3
*e 18 19
[END]

I'll use regex notation below to help me describe what the above markup means.

Everything up to the first

^\*a

is a note. It needs to be translated so that single newlines are
ignored, while any run of two or more newlines becomes exactly two newlines.
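
To pin down the note rule, here is a minimal sketch in plain Python regular expressions (not ANTLR). The function names split_note and normalize_note are made up for illustration, and it assumes that "ignored" means a single newline is replaced by a space and that newlines at the very start and end of the note can be dropped.

```python
import re

def split_note(source):
    """Return (note, rest): the text before the first line that starts
    with '*a', and everything from that line onwards."""
    match = re.search(r"^\*a", source, flags=re.MULTILINE)
    if match is None:
        # No '*a' line at all: treat the whole input as a note.
        return source, ""
    return source[:match.start()], source[match.start():]

def normalize_note(note):
    """Collapse runs of 2+ newlines to exactly two newlines and
    unwrap single newlines (assumption: replaced by one space)."""
    paragraphs = re.split(r"\n{2,}", note.strip("\n"))
    return "\n\n".join(p.replace("\n", " ") for p in paragraphs)
```

For example, calling split_note on the whole input and passing the first element of the result to normalize_note would yield the note with each wrapped paragraph on one line and paragraphs separated by a single blank line.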

Everything from a

^\*a

to a

(^\*[a-z]|\n{2,})

exclusive (the terminator itself is not part of the field), is a STAR_A
field. The same rule applies to the STAR_B, STAR_C, STAR_V and STAR_E fields.
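
The field rule can be sketched the same way, again with plain Python regular expressions rather than an ANTLR grammar. The name parse_fields and the (name, text) tuple layout are hypothetical. The sketch also assumes that whitespace around each field's text can be stripped, and that any text after a blank line that does not start a new field is simply skipped.

```python
import re

# A field starts at a line beginning with '*' plus a lowercase letter and
# runs up to (but not including) the next such line, a run of two or more
# newlines, or the end of the input.
FIELD_RE = re.compile(
    r"^\*([a-z])(.*?)(?=^\*[a-z]|\n{2,}|\Z)",
    flags=re.MULTILINE | re.DOTALL,
)

def parse_fields(source):
    """Return a list of (field_name, text) tuples, e.g. ('STAR_A', '...')."""
    start = re.search(r"^\*a", source, flags=re.MULTILINE)
    if start is None:
        return []
    entry = source[start.start():]  # skip the leading note
    return [
        ("STAR_" + m.group(1).upper(), m.group(2).strip())
        for m in FIELD_RE.finditer(entry)
    ]
```

An empty field such as the bare *c line in the sample comes back with an empty string as its text.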

How on earth would I achieve this in ANTLR?

NB. Ideally, my target language would be Python; failing that, Java.
But at the moment, I'm having trouble even getting close to parsing
this reliably in ANTLRWorks, so any help would be appreciated.

Many thanks in advance,

Sam