[antlr-interest] Recognising XML in a grammar

Fri Sep 8 10:13:03 PDT 2006

Hey there Ric, 

I can't seem to get a proper recursive lexical rule
going for a block of XML. I want to recognise a block
of XML in my lexer. I'm able to recognise nested XML
nodes with the 'TOKEN_LITERAL' lexical rule, but the
lexer doesn't know when to end a document, or match
root XML nodes (I'm using the lexical rule in fig. 2).
The parser rule I'm using is in fig. 1. Has anyone
written an XML lexical grammar that I can compare
with? 

expr: TOKEN_LITERAL ( DELIMITER TOKEN_LITERAL )*
    { System.out.println( "EXPRESSION >>> Parser" ); }
    ;

fig. 1

DELIMITER:      ','; 

TOKEN_LITERAL:
    (
        (PI)? (WS)?
        (
            ( tag:STARTTAG
                ( WS | PI | COMMENT | CDATABLOCK )*
                (TOKEN_LITERAL)*
              ENDTAG )
            { System.out.println("      TOKEN LITERAL [" + tag.getText() + "]"); }
            |
            (tag2:EMPTYTAG)
            { System.out.println("     TOKEN LITERAL [" + tag2.getText() + "]"); }
        )
    )
    ;

fig.2 

Cheers. Tim. 

--- Ric Klaren <ric.klaren at gmail.com> wrote:

> Hi,
> 
> On 9/5/06, Timothy Washington
> <timothyjwashington at yahoo.ca> wrote:
> > Hey there Ric, thanks for responding. Your option
> > number 3 is what I'm after. The software that I am
> > writing will have another tool that takes that XML
> > chunk and deals with it. So really, I just want to
> > pass the XML as a string to my application.
> 
> Note that in what you're making now you have to rebuild the original
> XML string by concatenating the tokens...
> 
> Also, you say you'd like to do the option 3 I presented, but it seems
> you're implementing option 1. You seem to be mixing up lexing and
> parsing (this is actually quite normal when you're new to
> ANTLR/parsing).
> 
> If you want to get one easy string in your parser for a chunk of
> complete XML, then you'll have to do this in the lexer. You can
> probably use chunks from the original XML lexer and just count open
> and close tags until you have a complete chunk of XML (I assume you
> don't have to validate the XML input at this stage).
> 
> First try to get a lexer running that can deal with your input and
> delivers the chunks you want, e.g. only tokens from your language and,
> say, an XMLTOKEN that contains a complete chunk of XML. After that it
> will be easy to deal with comma-delimited XMLTOKENs.
> 
> If I take your earlier example:
> 
> create
>          (entry
>                  (
>                          <?xml version='1.0' encoding='UTF-8'?>
>                          <debit xmlns='com/interrupt/bookkeeping/account' amount='100.00'/>,
>                          <?xml version='1.0' encoding='UTF-8'?>
>                          <credit xmlns='com/interrupt/bookkeeping/account' amount='100.00'/>
>                  )
>          )
> 
> (I changed the <debit xml... 00'> to <debit xml... 00'/>; I assume
> the missing '/' was a mistake.)
> 
> From how you explain things, I'd expect to get the following tokens
> from the lexer:
> 
> CREATE   - with text 'create' (assuming you handle this as a keyword)
> LPAREN   - with text '('
> ENTRY    - with text 'entry' (assuming you handle this as a keyword)
> LPAREN   - with text '('
> XMLTOKEN - with text '<?xml version='1.0' encoding='UTF-8'?>
>                          <debit xmlns='com/interrupt/bookkeeping/account' amount='100.00'/>'
> COMMA    - with text ','
> XMLTOKEN - with text '<?xml version='1.0' encoding='UTF-8'?>
>                          <credit xmlns='com/interrupt/bookkeeping/account' amount='100.00'/>'
> RPAREN   - with text ')'
> RPAREN   - with text ')'
> EOF
> 
> Then in the parser you'd have a rule:
> 
> create_cmd: CREATE LPAREN ENTRY LPAREN
>   XMLTOKEN (COMMA XMLTOKEN)*
>   RPAREN RPAREN;
> 
> Inside the action code in this rule you could then just use the
> getText() method on the XMLTOKENs to access the XML as a string and
> pass it to another stage.
> 
> With the solution you seem to be following now you'd have to
> concatenate a bunch of tags etc. back together into one string.
> 
> > class BookkeepingParser extends Parser;
> > options {
> >         k=10;
> >         importVocab=BookkeepingLexer;
> > }
> 
> As an aside, k=10 is generally pretty high. Normally you'd have
> something like k=2 or 3, depending on the grammar. You can fix
> nondeterminism with predicates. Once you get things running you can
> refactor your grammar to get rid of the worst predicates.
> 
> Cheers,
> 
> Ric
> 
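
The "count open and close tags" idea in the quoted reply can be tried out
in plain Java before wiring it into a lexer rule. What follows is only a
rough sketch under stated assumptions, not the XML lexer Ric refers to:
the names XmlChunkScanner, scanChunk and splitChunks are made up for this
illustration, nothing is validated, and a DOCTYPE declaration is not
handled. It skips processing instructions, comments and CDATA blocks,
tracks element depth, and stops once the root element is closed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of ANTLR.
class XmlChunkScanner {

    // Returns the index just past one complete XML chunk that begins at
    // 'start', or -1 if the input is not a complete chunk.
    static int scanChunk(String s, int start) {
        int depth = 0;
        int i = start;
        while (i >= 0 && i < s.length()) {
            if (s.startsWith("<?", i)) {                 // processing instruction
                i = skipPast(s, i + 2, "?>");
            } else if (s.startsWith("<!--", i)) {        // comment
                i = skipPast(s, i + 4, "-->");
            } else if (s.startsWith("<![CDATA[", i)) {   // CDATA block
                i = skipPast(s, i + 9, "]]>");
            } else if (s.charAt(i) == '<') {             // start, end or empty tag
                boolean closing = s.startsWith("</", i);
                int end = findTagEnd(s, i + 1);
                if (end < 0) return -1;
                boolean empty = s.charAt(end - 1) == '/';
                if (closing) {
                    if (depth == 0) return -1;           // end tag with nothing open
                    depth--;
                } else if (!empty) {
                    depth++;
                }
                i = end + 1;
                if (depth == 0) return i;                // root element is closed
            } else if (depth == 0 && !Character.isWhitespace(s.charAt(i))) {
                return -1;                               // stray text outside the root
            } else {
                i++;                                     // character data or whitespace
            }
        }
        return -1;                                       // ran out of input
    }

    // Index just past 'terminator', searching from 'from'; -1 if absent.
    static int skipPast(String s, int from, String terminator) {
        int at = s.indexOf(terminator, from);
        return at < 0 ? -1 : at + terminator.length();
    }

    // Index of the '>' that ends a tag, ignoring '>' inside quoted
    // attribute values; -1 if the tag is never closed.
    static int findTagEnd(String s, int from) {
        char quote = 0;
        for (int i = from; i < s.length(); i++) {
            char c = s.charAt(i);
            if (quote != 0) {
                if (c == quote) quote = 0;
            } else if (c == '\'' || c == '"') {
                quote = c;
            } else if (c == '>') {
                return i;
            }
        }
        return -1;
    }

    // Splits comma-delimited XML chunks, one string per chunk.
    static List<String> splitChunks(String s) {
        List<String> chunks = new ArrayList<>();
        int i = 0;
        while (true) {
            while (i < s.length() && Character.isWhitespace(s.charAt(i))) i++;
            int end = scanChunk(s, i);
            if (end < 0) throw new IllegalArgumentException("incomplete XML chunk at offset " + i);
            chunks.add(s.substring(i, end));
            i = end;
            while (i < s.length() && Character.isWhitespace(s.charAt(i))) i++;
            if (i == s.length()) return chunks;
            if (s.charAt(i) != ',') throw new IllegalArgumentException("expected ',' at offset " + i);
            i++;
        }
    }

    public static void main(String[] args) {
        String input = "<?xml version='1.0' encoding='UTF-8'?>\n"
            + "<debit xmlns='com/interrupt/bookkeeping/account' amount='100.00'/>,\n"
            + "<?xml version='1.0' encoding='UTF-8'?>\n"
            + "<credit xmlns='com/interrupt/bookkeeping/account' amount='100.00'/>";
        for (String chunk : splitChunks(input)) {
            System.out.println("XMLTOKEN: " + chunk);
        }
    }
}
```

Each string returned by splitChunks corresponds to what a single XMLTOKEN
would carry, so the parser side never has to glue tags back together. In
the lexer itself the same depth counter would presumably live in the
action code of one rule, which is an assumption about how the grammar
would be arranged rather than something the thread spells out.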
