[antlr-interest] Baffled using Antlr to parse custom markup language

Fri Sep 25 04:22:48 PDT 2009

2009/9/24 Sam Kuper <sam.kuper at uclmail.net>:
> Antlr is stumping me. Perhaps it's because I'm trying to use it to
> parse a pretty gnarly markup language, but maybe not...
>
> Between the [BEGIN] and [END] markers below is a sample of the markup
> I'm dealing with.
>
> [BEGIN]
> .Extract field for no. 16 can be added as the last line of an entry,
> in the form: *e 16
> FOR NO. 17, ADD 17: NO COMMA. E.G. *e 16 17 or *e 17 (DO NOT USE THE
> *v FIELD OR THE *d FIELD)
>
>
> !!!!!!!!!!!REPEAT, NO COMMA!!!!!!!!!!!!!!
>
>
> *a \\Albert, John\\ (b. 1912/13). Gardener. George Robert Fox's
> gardener at The Vicarage, \circa\ 1941. (Census returns 1841 (Public Record
> Office HO137/916/8).)
> *b expanded by EL
> *c
> *v 2, 3
> *e 18 19
> [END]
>
> I'll use regex notation below to help me describe what the above markup means.
>
> Everything up to the first
>
> ^\*a
>
> is a note and needs to be translated such that single newlines are
> ignored but two or more newlines are translated to a pair of newlines.

Even this seems to be incredibly difficult with ANTLR. Essentially, I
want to "stop on match" where the match is (in regex notation): ^\*a
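
To pin down the target behaviour independently of ANTLR, here is a minimal sketch in plain Java using java.util.regex. It is not an ANTLR solution. The class and method names are hypothetical, and reading "ignored" as "deleted" for single newlines is an assumption (replacing them with a space may be what is really wanted).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, plain Java only: it states the desired result, not how to get it from ANTLR.
class NotesSplitter {
    // The regex ^\*a with MULTILINE, so '^' matches at the start of every line.
    private static final Pattern ENTRY_START = Pattern.compile("^\\*a", Pattern.MULTILINE);
    private static final Pattern NEWLINE_RUN = Pattern.compile("\n+");

    /** Returns everything before the first line that begins with "*a" (the whole input if there is none). */
    static String notesBefore(String input) {
        Matcher m = ENTRY_START.matcher(input);
        return m.find() ? input.substring(0, m.start()) : input;
    }

    /** Deletes single newlines (assumption) and collapses runs of two or more into exactly two. */
    static String normalizeNewlines(String notes) {
        Matcher m = NEWLINE_RUN.matcher(notes);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            m.appendReplacement(out, m.group().length() >= 2 ? "\n\n" : "");
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String sample = "first note line\nsecond note line\n\n\n*a \\\\Albert, John\\\\\n*b expanded by EL\n";
        System.out.print(normalizeNewlines(notesBefore(sample)));
    }
}
```

A "*a" in the middle of a line (as in "in the form: *e 16" style text) does not end the notes here; only one at the start of a line does.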

But even leaving aside the apparent impossibility of recognising
start-of-line in ANTLR, various approaches I've tried, such as this:

grammar name_reg;
options {
	language=Java;
}
name_reg	: notes? entry* EOF;
notes		: (~'*a')*;
entry		: '*a';
ASCII		: ' '..'~';

fail to stop on encountering '*a' and instead just strip the '*' and
put the rest into 'notes'.
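
Stated without regexes, the rule needs two things at each character: whether the position is at the start of a line, and a two-character lookahead. The hand-written scan below is only a sketch of that check in plain Java, again not ANTLR; the names are hypothetical and only '\n' is treated as a line break.

```java
// Hypothetical character-level scan: start-of-line state plus two characters of lookahead.
class EntryScanner {
    /** Returns the index of the first "*a" that sits at the start of a line, or -1 if there is none. */
    static int firstEntryIndex(String input) {
        boolean atLineStart = true; // the very first character counts as start of line
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (atLineStart && c == '*' && i + 1 < input.length() && input.charAt(i + 1) == 'a') {
                return i;
            }
            atLineStart = (c == '\n'); // assumption: '\n' is the only line terminator
        }
        return -1;
    }

    public static void main(String[] args) {
        String sample = "in the form: *e 16\n*a \\\\Albert, John\\\\";
        int idx = firstEntryIndex(sample);
        System.out.println(idx < 0 ? sample : sample.substring(0, idx));
    }
}
```

Everything before the returned index is the notes; everything from it onward is the first entry.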

So please could someone help by telling me how I can make ANTLR
capture everything up to a given sequence of characters?

Many thanks,

Sam