[antlr-interest] Recognising XML in a grammar

Tue Sep 5 05:27:11 PDT 2006

Hi,

On 8/31/06, Timothy Washington <timothyjwashington at yahoo.ca> wrote:
> Hello there. I am new to ANTLR and parser generators
> in general, so I hope you'll forgive what might seem a
> simple question. I want to know how my parser can
> recognise an XML block inside of my grammar.

There are several ways to solve this.

> GRAMMAR
> I want to take as an example, the xml grammar file
> '$ANTLR_2.7.6/examples/java/xml/xml.g' in antlr. I'm
> writing a grammar that can contain xml (with
> namespaces and declarations) as a token. So a command
> could look like this for example:
> create
>         (entry
>                 (
>                         <?xml version='1.0' encoding='UTF-8'?>
>                         <debit xmlns='com/interrupt/bookkeeping/account'
> amount='100.00'>,
>                         <?xml version='1.0' encoding='UTF-8'?>
>                         <credit xmlns='com/interrupt/bookkeeping/account'
> amount='100.00'>
>                 )
>         )

> IMPORTING .g FILES
> I want to write a grammar that recognises all the
> tokens in this command, including the raw XML. How
> could I use the grammar definitions in 'xml.g', in my
> own grammar file. For starters, I believe you use the
> 'importVocab' grammar option.
> Class MyParser extends Parser
> options { importVocab=V; }

importVocab is only used to carry around token definitions (i.e. the
mapping from a piece of text to the unique number that is used in
subsequent (tree) parse phases). It does not pull the rules of xml.g
into your grammar.

For what you are trying to do, you had better look at the token stream
multiplexing examples (as one possible solution).
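
To give an idea of what multiplexing means, here is a minimal sketch in
plain Java. It does not use ANTLR at all (ANTLR 2 has its own
TokenStreamSelector class for this job); every class and method name
below is made up for illustration. Two toy lexers share one input, and
a stack decides which of them produces the next token.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch: all names are invented and are not ANTLR's API.
final class MultiplexSketch {
    /** Shared cursor, so both lexers consume the same character stream. */
    static final class Input {
        final String text;
        int pos = 0;
        Input(String text) { this.text = text; }
        boolean atEnd() { return pos >= text.length(); }
        char peek() { return text.charAt(pos); }
    }

    interface Lexer { String next(Input in, Deque<Lexer> stack); }

    /** Lexer for the command language: words and punctuation. */
    static final class CommandLexer implements Lexer {
        public String next(Input in, Deque<Lexer> stack) {
            while (!in.atEnd() && Character.isWhitespace(in.peek())) in.pos++;
            if (in.atEnd()) return null;
            char c = in.peek();
            if (c == '<') {                 // XML starts: switch lexers
                stack.push(new XmlLexer());
                return stack.peek().next(in, stack);
            }
            if (Character.isLetter(c)) {
                int start = in.pos;
                while (!in.atEnd() && Character.isLetter(in.peek())) in.pos++;
                return "WORD:" + in.text.substring(start, in.pos);
            }
            in.pos++;
            return "PUNCT:" + c;
        }
    }

    /** Lexer for the inside of one tag; switches back after '>'. */
    static final class XmlLexer implements Lexer {
        public String next(Input in, Deque<Lexer> stack) {
            while (!in.atEnd() && Character.isWhitespace(in.peek())) in.pos++;
            if (in.atEnd()) return null;
            char c = in.peek();
            int start = in.pos;
            if (c == '\'') {                // quoted attribute value
                in.pos++;
                while (!in.atEnd() && in.peek() != '\'') in.pos++;
                if (!in.atEnd()) in.pos++;  // consume the closing quote
                return "XML_VALUE:" + in.text.substring(start, in.pos);
            }
            if (Character.isLetter(c)) {
                while (!in.atEnd() && Character.isLetter(in.peek())) in.pos++;
                return "XML_NAME:" + in.text.substring(start, in.pos);
            }
            in.pos++;
            if (c == '>') stack.pop();      // tag finished: back to commands
            return "XML_PUNCT:" + c;
        }
    }

    /** The "selector": always asks the lexer on top of the stack. */
    static List<String> tokenize(String text) {
        Input in = new Input(text);
        Deque<Lexer> stack = new ArrayDeque<>();
        stack.push(new CommandLexer());
        List<String> tokens = new ArrayList<>();
        String tok;
        while ((tok = stack.peek().next(in, stack)) != null) tokens.add(tok);
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("create (<debit amount='100.00'>)"));
    }
}
```

The parser sitting on top only ever sees one token stream; it does not
need to know that '<' and '>' made the stack switch lexers underneath.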

> RECOGNISING XML BLOCKS
> But what I really want to know is how my parser can
> recognise a block of XML inside of my command. With
> the said 'xml.g' grammar, I can recognise start and
> end tags and cdata and so on. But I just want to
> recognise an entire XML block and pass it as a token
> to some command.

I guess it's probably best to first determine what lexing strategy is
best for what you want to do... Do you want to have:

1) one lexer that can cut up the character stream into all possible
types of tokens (for your normal grammar *and* for all the tokens that
can occur in XML). This may become tricky or perform badly due to
conflicts between token types.

2) one lexer for your normal grammar tokens and one lexer that can
tokenize XML, with token stream multiplexing to switch between them
(see the example for dealing with javadoc comments).

3) one lexer that cuts up the character stream into the tokens for
your normal grammar and passes XML chunks as one big XML token to your
parser. (This could become a variant of solution 2.)
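
Option 3 could look roughly like the sketch below, again in plain Java
with made-up names rather than anything generated by ANTLR. It makes
some loud assumptions: the XML is well formed, every element is closed
or self-closing (unlike the unclosed <debit ...> in your example), and
no '<' or '>' appears inside attribute values, comments or CDATA.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of strategy 3; names are invented, not ANTLR's API.
// Assumes well-formed XML with closed or self-closing elements and no
// '<' or '>' inside attribute values, comments, or CDATA sections.
final class XmlChunkLexer {
    static List<String> tokenize(String s) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (Character.isWhitespace(c)) { i++; continue; }
            if (c == '<') {
                // Hand the whole XML block to the parser as one token.
                int end = endOfXmlChunk(s, i);
                out.add("XML:" + s.substring(i, end));
                i = end;
            } else if (Character.isLetter(c)) {
                int start = i;
                while (i < s.length() && Character.isLetter(s.charAt(i))) i++;
                out.add("WORD:" + s.substring(start, i));
            } else {
                out.add("PUNCT:" + c);
                i++;
            }
        }
        return out;
    }

    /** Returns the index just past the first complete element at 'from'. */
    static int endOfXmlChunk(String s, int from) {
        int depth = 0;
        int i = from;
        while (i < s.length()) {
            int close = s.indexOf('>', i);
            if (close < 0) return s.length();   // unterminated: take the rest
            String tag = s.substring(i, close + 1);
            boolean isElement = !tag.startsWith("<?");
            if (tag.startsWith("</")) {
                depth--;                        // closing tag
            } else if (isElement && !tag.endsWith("/>")) {
                depth++;                        // opening tag
            }                                   // "<?...?>" and "<.../>" keep depth
            i = close + 1;
            if (isElement && depth == 0) return i;
            int next = s.indexOf('<', i);       // skip text up to the next tag
            if (next < 0) return s.length();
            i = next;
        }
        return s.length();
    }

    public static void main(String[] args) {
        System.out.println(tokenize(
            "create (entry (<?xml version='1.0'?><debit amount='100.00'/>))"));
    }
}
```

The lexer only has to find where a chunk ends; what is inside the chunk
is left for a later step, which keeps the token types of the two
languages from clashing.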

After deciding what you're going to do with the lexer you should think
of what you want to do in the parser.

For lexer solutions 1 and 2 you would probably end up with a single
'deal with everything' parser.

For lexer solution 3 you might start building an AST and, when you
encounter an XML token, parse its contents inline and maybe insert the
generated AST into the AST you're generating (e.g. have a complete
xml lexer/parser/treebuilder that you call from action code). Then in a
subsequent parse phase you might grab the combined AST and do 'your
stuff'.
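
As a rough illustration of the 'parse the XML token inline' idea, the
sketch below uses the XML parser that ships with the JDK instead of an
ANTLR-generated one, and a tiny made-up Ast class instead of ANTLR's
own AST types. Both substitutions are assumptions made to keep the
example self-contained; the point is only that the text of one XML
token becomes a subtree that is grafted into the command's tree.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch; Ast and its methods are invented for illustration.
final class InlineXmlAst {
    /** Minimal tree node standing in for the parser's real AST type. */
    static final class Ast {
        final String label;
        final List<Ast> children = new ArrayList<>();
        Ast(String label) { this.label = label; }
        Ast add(Ast child) { children.add(child); return this; }
        @Override public String toString() {
            if (children.isEmpty()) return label;
            StringBuilder sb = new StringBuilder("(").append(label);
            for (Ast c : children) sb.append(' ').append(c);
            return sb.append(')').toString();
        }
    }

    /** Parses the text of one XML token and returns it as an Ast subtree. */
    static Ast parseXmlToken(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            return convert(root);
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML: " + xml, e);
        }
    }

    /** Copies element name, attributes and child elements into Ast nodes. */
    private static Ast convert(Element e) {
        Ast node = new Ast(e.getTagName());
        NamedNodeMap attrs = e.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            Node a = attrs.item(i);
            node.add(new Ast("@" + a.getNodeName() + "=" + a.getNodeValue()));
        }
        NodeList kids = e.getChildNodes();
        for (int i = 0; i < kids.getLength(); i++) {
            if (kids.item(i) instanceof Element) {
                node.add(convert((Element) kids.item(i)));
            }
        }
        return node;
    }

    public static void main(String[] args) {
        // What action code might do on seeing an XML token inside 'entry'.
        Ast entry = new Ast("entry").add(parseXmlToken("<debit amount='100.00'/>"));
        System.out.println(new Ast("create").add(entry));
    }
}
```

A later tree-walking phase then sees one combined tree and does not
care which parser produced which part of it.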

Well, there are more ways to cut the parsing problem you have into
manageable chunks; the above is just a start. If you try to do too much
in one go you will probably end up with something unmanageable, so I'd
recommend cutting things up at some point.

Cheers,

Ric