[antlr-interest] Recognising XML in a grammar

Ric Klaren ric.klaren at gmail.com
Tue Sep 5 08:14:51 PDT 2006


Hi,

On 9/5/06, Timothy Washington <timothyjwashington at yahoo.ca> wrote:
> Hey there Ric, thanks for responding. Your option
> number 3 is what I'm after. The software that I am
> writing will have another tool that takes that XML
> chunk and deals with it. So really, I just want to
> pass the XML as a string to my application.

Note that in what you're making now you have to rebuild the original
XML string by concatenating the tokens...

Also, you say you'd like to do option 3 that I presented, but it seems
you're implementing option 1. You seem to be mixing up lexing and
parsing (this is actually pretty normal when you're new to
antlr/parsing).

If you want to get one easy string in your parser for a chunk of
complete XML then you'll have to do this in the lexer. You can
probably use chunks from the original XML lexer and just count open
and close tags until you have a complete chunk of XML (I assume you
don't have to validate the XML input at this stage).

First try to get a lexer running that can deal with your input and
delivers the chunks you want, e.g. only tokens from your language and
say some XMLTOKEN that contains a complete chunk of XML. After that
it will be easy to deal with comma-delimited XMLTOKENs.
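
To make the tag-counting idea concrete, here is a rough hand-written
sketch of such a lexer. It is plain Python rather than an ANTLR lexer
grammar, purely to show the logic; the names scan_xml and tokenize and
the keyword table are made up for the illustration. It assumes the XML
contains no comments or CDATA sections and does no validation.

```python
import re

# Token names follow this thread; the keyword and punctuation tables are
# assumptions made for the sketch.
KEYWORDS = {"create": "CREATE", "entry": "ENTRY"}
PUNCT = {"(": "LPAREN", ")": "RPAREN", ",": "COMMA"}

# One tag: '<' ... '>', skipping quoted attribute values so that a '>'
# inside quotes does not end the tag. Comments and CDATA are not handled.
TAG = re.compile(r"""<(?:[^>'"]|'[^']*'|"[^"]*")*>""")
WORD = re.compile(r"[A-Za-z_]\w*")


def scan_xml(text, pos):
    """Return the index just past one complete XML chunk starting at pos."""
    depth = 0
    while True:
        m = TAG.search(text, pos)
        if m is None:
            raise ValueError("unterminated XML chunk")
        tag, pos = m.group(), m.end()
        if tag.startswith("<?"):
            continue                  # <?xml ...?> does not change the depth
        if tag.startswith("</"):
            depth -= 1                # close tag
        elif not tag.endswith("/>"):
            depth += 1                # open tag; <self-closing/> leaves depth alone
        if depth < 0:
            raise ValueError("close tag without matching open tag")
        if depth == 0:
            return pos                # root element is complete


def tokenize(text):
    """Split the input into (type, text) pairs, one XMLTOKEN per XML chunk."""
    tokens, pos = [], 0
    while pos < len(text):
        ch = text[pos]
        if ch.isspace():
            pos += 1
        elif ch in PUNCT:
            tokens.append((PUNCT[ch], ch))
            pos += 1
        elif ch == "<":
            end = scan_xml(text, pos)
            tokens.append(("XMLTOKEN", text[pos:end]))
            pos = end
        else:
            m = WORD.match(text, pos)
            word = m.group() if m else None
            if word not in KEYWORDS:
                raise ValueError(f"unexpected input at offset {pos}")
            tokens.append((KEYWORDS[word], word))
            pos = m.end()
    tokens.append(("EOF", ""))
    return tokens


if __name__ == "__main__":
    sample = "create (entry (<?xml version='1.0'?> <debit amount='100.00'/>))"
    for kind, value in tokenize(sample):
        print(kind, repr(value))
```

The point is only that the decision "where does the XML end" is made
while scanning characters, so the parser never sees individual tags.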

If I take your earlier example:

create
         (entry
                 (
                         <?xml version='1.0' encoding='UTF-8'?>
                         <debit xmlns='com/interrupt/bookkeeping/account'
 amount='100.00'/>,
                        <?xml version='1.0' encoding='UTF-8'?>
                         <credit xmlns='com/interrupt/bookkeeping/account'
 amount='100.00'/>
                 )
         )

(I changed the <debit xml... 00'> to <debit xml... 00'/>; I assume
that was a mistake.)

From how you explain things I'd expect to get the following tokens from
the lexer:

CREATE   - with text 'create' (assuming you handle this as a keyword)
LPAREN - with text '('
ENTRY - with text 'entry' (assuming you handle this as a keyword)
LPAREN - with text '('
XMLTOKEN - with text '<?xml version='1.0' encoding='UTF-8'?>
                         <debit xmlns='com/interrupt/bookkeeping/account'
 amount='100.00'/>'
COMMA - with text ','
XMLTOKEN - with text '<?xml version='1.0' encoding='UTF-8'?>
                         <credit xmlns='com/interrupt/bookkeeping/account'
 amount='100.00'/>'
RPAREN - with text ')'
RPAREN - with text ')'
EOF

Then in the parser you'd have a rule:

create_cmd: CREATE LPAREN ENTRY LPAREN
  XMLTOKEN (COMMA XMLTOKEN)*
RPAREN RPAREN;

Inside the action code in this rule you could then just call the
getText() method on the XMLTOKEN tokens to access the XML as a string
and pass it to another stage.
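
As a rough illustration of what that rule plus its action code boils
down to, here is a hand-written equivalent in plain Python (not
generated ANTLR code). It assumes the lexer hands over a list of
(type, text) pairs ending in an EOF token; the function name and the
tuple layout are made up for the sketch.

```python
def create_cmd(tokens):
    """create_cmd: CREATE LPAREN ENTRY LPAREN
                   XMLTOKEN (COMMA XMLTOKEN)*
                   RPAREN RPAREN

    Returns the text of every XMLTOKEN, i.e. the XML strings that would be
    passed on to the next stage.
    """
    pos = 0

    def peek():
        # Treat running off the end of the list as EOF.
        return tokens[pos][0] if pos < len(tokens) else "EOF"

    def match(expected):
        nonlocal pos
        if peek() != expected:
            raise SyntaxError(f"expected {expected}, got {peek()} at token {pos}")
        text = tokens[pos][1]
        pos += 1
        return text

    for kind in ("CREATE", "LPAREN", "ENTRY", "LPAREN"):
        match(kind)

    chunks = [match("XMLTOKEN")]          # the "action": keep the token text
    while peek() == "COMMA":              # (COMMA XMLTOKEN)*
        match("COMMA")
        chunks.append(match("XMLTOKEN"))

    match("RPAREN")
    match("RPAREN")
    return chunks


if __name__ == "__main__":
    demo = [
        ("CREATE", "create"), ("LPAREN", "("), ("ENTRY", "entry"), ("LPAREN", "("),
        ("XMLTOKEN", "<debit amount='100.00'/>"), ("COMMA", ","),
        ("XMLTOKEN", "<credit amount='100.00'/>"),
        ("RPAREN", ")"), ("RPAREN", ")"), ("EOF", ""),
    ]
    for xml in create_cmd(demo):
        print(xml)
```

Note there is no string concatenation anywhere: each XML chunk arrives
as one token and is used as is.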

With the solution you seem to be following now you'd have to
concatenate a bunch of tags etc. together back to one string.

> class BookkeepingParser extends Parser;
> options {
>         k=10;
>         importVocab=BookkeepingLexer;
> }

As an aside, k=10 is generally speaking pretty high. Normally you'd
have something like k=2 or k=3, depending a bit on the grammar. You can
fix nondeterminism with predicates. Once you get things running you can
refactor your grammar to get rid of the worst predicates.

Cheers,

Ric

