[antlr-interest] Baffled using Antlr to parse custom markup language

Sam Kuper sam.kuper at uclmail.net
Thu Sep 24 10:38:41 PDT 2009


ANTLR is stumping me. Perhaps it's because I'm trying to use it to
parse a pretty gnarly markup language, but maybe not...

Between the [BEGIN] and [END] markers below is a sample of the markup
I'm dealing with.

[BEGIN]
.Extract field for no. 16 can be added as the last line of an entry,
in the form: *e 16
FOR NO. 17, ADD 17: NO COMMA. E.G. *e 16 17 or *e 17 (DO NOT USE THE
*v FIELD OR THE *d FIELD)


!!!!!!!!!!!REPEAT, NO COMMA!!!!!!!!!!!!!!


*a \\Albert, John\\ (b. 1912/13). Gardener. George Robert Fox's
gardener at The Vicarage, \circa\ 1941. (Census returns 1841 (Public Record
Office HO137/916/8).)
*b expanded by EL
*c
*v 2, 3
*e 18 19
[END]

I'll use regex notation below to help me describe what the above markup means.

Everything up to the first

^\*a

is a note and needs to be translated such that single newlines are
ignored but two or more newlines are translated to a pair of newlines.
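
To make the note rule concrete, here is a rough sketch in plain Python
regexes (not ANTLR). The function name normalize_note is made up for
illustration, and I'm assuming that "ignored" means a single newline is
replaced by a space so that the words on either side don't run together.

```python
import re

def normalize_note(text):
    """Return the note portion of `text` with its newlines normalised."""
    # The note is everything before the first line that starts with "*a".
    first_field = re.search(r"^\*a", text, flags=re.MULTILINE)
    note = text[:first_field.start()] if first_field else text

    # Assumption: a lone newline becomes a space; a run of two or more
    # newlines collapses to exactly two.
    def replace_run(match):
        return "\n\n" if len(match.group()) >= 2 else " "

    return re.sub(r"\n+", replace_run, note)
```

Note that a line inside the note may itself start with something like
"*v" (as in the sample above), which is why only "*a" is used to find
the end of the note.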

Everything from a

^\*a

to a

(^\*[a-z]|\n{2,})

exclusive, is a STAR_A field. Similarly for STAR_B, STAR_C, STAR_V and
STAR_E fields.
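
For reference, the field rule can be sketched in plain Python regexes
(again, not ANTLR) roughly as follows. The names FIELD and split_entry
are hypothetical, and I'm assuming the end of the input also terminates
a field, which the rule above doesn't say explicitly.

```python
import re

# A field starts at "*<letter>" at the beginning of a line and runs up to,
# but not including, the next such marker, a run of 2+ newlines, or the
# end of the input (the last case is an assumption).
FIELD = re.compile(
    r"^\*([a-z])[ \t]*(.*?)(?=^\*[a-z]|\n{2,}|\Z)",
    re.MULTILINE | re.DOTALL,
)

def split_entry(text):
    """Return a list of (field_name, field_text) pairs, e.g. ("STAR_A", "...")."""
    # Skip the leading note: fields only begin at the first "*a" line.
    start = re.search(r"^\*a", text, flags=re.MULTILINE)
    if start is None:
        return []
    entry = text[start.start():]
    return [
        ("STAR_" + m.group(1).upper(), m.group(2).strip())
        for m in FIELD.finditer(entry)
    ]
```

One caveat with this sketch: once inside an entry, any continuation line
that happens to begin with "*" plus a lowercase letter is taken as a new
field, which is exactly the sort of ambiguity that makes this markup hard
to tokenise.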

How on earth would I achieve this in ANTLR?

NB. Ideally, my target language would be Python; failing that, Java.
But at the moment, I'm having trouble even getting close to parsing
this reliably in ANTLRWorks, so any help would be appreciated.

Many thanks in advance,

Sam

