[antlr-interest] Recognising XML in a grammar
Timothy Washington
timothyjwashington at yahoo.ca
Fri Sep 8 10:13:03 PDT 2006
Hey there Ric,
I can't seem to get a proper recursive lexical rule
going for a block of XML. I want to recognise a block
of XML in my lexer. I'm able to recognise nested XML
nodes with the 'TOKEN_LITERAL' lexical rule, but the
lexer doesn't know when to end a document, or match
root XML nodes (I'm using the lexical rule in fig. 2).
The parser rule I'm using is in fig. 1. Has anyone
written an XML lexical grammar that I can compare
with?
expr: TOKEN_LITERAL ( DELIMITER TOKEN_LITERAL )*
{ System.out.println( "EXPRESSION >>> Parser" );
};
fig. 1
DELIMITER: ',';
TOKEN_LITERAL:
(
  (PI)? (WS)?
  (
    ( tag:STARTTAG
      ( WS | PI | COMMENT | CDATABLOCK )*
      (TOKEN_LITERAL)*
      ENDTAG ) { System.out.println(" TOKEN LITERAL ["+ tag.getText() +"]"); }
    |
    (tag2:EMPTYTAG) { System.out.println(" TOKEN LITERAL ["+ tag2.getText() +"]"); }
  )
)
;
fig. 2
Cheers. Tim.
--- Ric Klaren <ric.klaren at gmail.com> wrote:
> Hi,
>
> On 9/5/06, Timothy Washington
> <timothyjwashington at yahoo.ca> wrote:
> > Hey there Ric, thanks for responding. Your option
> > number 3 is what I'm after. The software that I am
> > writing will have another tool that takes that XML
> > chunk and deals with it. So really, I just want to
> > pass the XML as a string to my application.
>
> Note that in what you're making now you have to
> rebuild the original
> XML string by concatenating the tokens...
>
> Also, you say you'd like to do the option 3 I presented, but it seems
> you're implementing option 1. It seems you're mixing up lexing and
> parsing (this is actually pretty normal when you're new to
> antlr/parsing).
>
> If you want to get one easy string in your parser for a chunk of
> complete XML then you'll have to do this in the lexer. You can
> probably use chunks from the original XML lexer and just count open
> and close tags until you've got a complete chunk of XML (I assume you
> don't have to validate the XML input at this stage).
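
For what it's worth, here is a minimal Java sketch of the tag-counting idea described above, written as plain string scanning rather than as ANTLR lexer rules. The class and method names (XmlChunkScanner, chunkEnd, skipPast, tagClose) are made up for illustration and are not part of ANTLR. It assumes the chunk starts at a '<', does no validation, and only skips simple DOCTYPE declarations without an internal subset.

```java
// Hypothetical helper (not part of ANTLR): locates the end of one XML chunk
// by counting open and close tags. It does not validate the XML.
final class XmlChunkScanner {

    /** Returns the index just past the chunk starting at 'start', or -1 if it never closes. */
    static int chunkEnd(String s, int start) {
        int depth = 0;
        int i = start;
        while (i >= 0 && i < s.length()) {
            if (s.charAt(i) != '<') {
                i++;                                   // text or whitespace between tags
            } else if (s.startsWith("<?", i)) {
                i = skipPast(s, i + 2, "?>");          // processing instruction
            } else if (s.startsWith("<!--", i)) {
                i = skipPast(s, i + 4, "-->");         // comment
            } else if (s.startsWith("<![CDATA[", i)) {
                i = skipPast(s, i + 9, "]]>");         // CDATA block
            } else if (s.startsWith("<!", i)) {
                i = skipPast(s, i + 2, ">");           // other declaration, e.g. a simple DOCTYPE
            } else {
                int close = tagClose(s, i);
                if (close < 0) return -1;              // tag never terminated
                boolean closing = s.startsWith("</", i);
                boolean empty = !closing && s.charAt(close - 1) == '/';
                if (closing) depth--;
                else if (!empty) depth++;
                i = close + 1;
                if (depth <= 0) return i;              // root closed, or root was an empty tag
            }
        }
        return -1;                                     // ran out of input before the root closed
    }

    /** Index just past the next 'terminator' at or after 'from', or -1 if absent. */
    private static int skipPast(String s, int from, String terminator) {
        int end = s.indexOf(terminator, from);
        return end < 0 ? -1 : end + terminator.length();
    }

    /** Index of the '>' that ends the tag at 'from', ignoring '>' inside quoted attribute values. */
    private static int tagClose(String s, int from) {
        char quote = 0;
        for (int i = from; i < s.length(); i++) {
            char c = s.charAt(i);
            if (quote != 0) {
                if (c == quote) quote = 0;
            } else if (c == '"' || c == '\'') {
                quote = c;
            } else if (c == '>') {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String input = "<?xml version='1.0' encoding='UTF-8'?>\n"
                + "<debit xmlns='com/interrupt/bookkeeping/account' amount='100.00'/>,\n"
                + "<credit xmlns='com/interrupt/bookkeeping/account' amount='100.00'/>";
        int end = chunkEnd(input, 0);
        System.out.println("first chunk: " + input.substring(0, end));
    }
}
```

The scan stops as soon as the depth returns to zero, so a leading <?xml ...?> declaration ends up in the same chunk as the element that follows it, and the comma after the chunk is left for the next token.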
>
> First try to get a lexer running that can deal with your input and
> delivers the chunks you want, e.g. only tokens from your language
> plus, say, some XMLTOKEN that contains a complete chunk of XML. After
> that it will be easy to deal with comma-delimited XMLTOKENs.
>
> If I take your earlier example:
>
> create
> (entry
> (
> <?xml version='1.0'
> encoding='UTF-8'?>
> <debit
> xmlns='com/interrupt/bookkeeping/account'
> amount='100.00'/>,
> <?xml version='1.0'
> encoding='UTF-8'?>
> <credit
> xmlns='com/interrupt/bookkeeping/account'
> amount='100.00'/>
> )
> )
>
> (I changed the <debit xml... 00'> to <debit xml... 00'/>; I assume
> the missing slash was a mistake.)
>
> From how you explain things I'd expect to get the following tokens
> from the lexer:
>
> CREATE - with text 'create' (assuming you handle
> this as a keyword)
> LPAREN - with text '('
> ENTRY - with text 'entry' (assuming you handle this
> as a keyword)
> LPAREN - with text '('
> XMLTOKEN - with text '<?xml version='1.0'
> encoding='UTF-8'?>
> <debit
> xmlns='com/interrupt/bookkeeping/account'
> amount='100.00'/>'
> COMMA - with text ','
> XMLTOKEN - with text '<?xml version='1.0'
> encoding='UTF-8'?>
> <credit
> xmlns='com/interrupt/bookkeeping/account'
> amount='100.00'/>'
> RPAREN - with text ')'
> RPAREN - with text ')'
> EOF
>
> Then in the parser you'd have a rule:
>
> create_cmd: CREATE LPAREN ENTRY LPAREN
> XMLTOKEN (COMMA XMLTOKEN)*
> RPAREN RPAREN;
>
> Inside the action code in this rule you could then just use the
> getText() method on the XMLTOKEN tokens to access the XML as a string
> and pass it to another stage.
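
To make the create_cmd rule and the getText() step concrete, here is a rough Java sketch that walks a token list by hand in the same order as the rule and collects the text of each XMLTOKEN. It is not what ANTLR generates. TokType, Tok and CreateCmdSketch are hypothetical stand-ins invented for this example, and the token list is assumed to come from a lexer that already delivers whole XML chunks.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for the lexer's token types and tokens (NOT ANTLR classes).
enum TokType { CREATE, ENTRY, LPAREN, RPAREN, COMMA, XMLTOKEN, EOF }

final class Tok {
    final TokType type;
    final String text;
    Tok(TokType type, String text) { this.type = type; this.text = text; }
    String getText() { return text; }
}

// Hand-written equivalent of:
//   create_cmd: CREATE LPAREN ENTRY LPAREN XMLTOKEN (COMMA XMLTOKEN)* RPAREN RPAREN;
final class CreateCmdSketch {
    private final List<Tok> tokens;
    private int pos = 0;

    CreateCmdSketch(List<Tok> tokens) { this.tokens = tokens; }

    /** Consumes the next token if it has the expected type, otherwise fails. */
    private Tok match(TokType expected) {
        Tok t = pos < tokens.size() ? tokens.get(pos) : new Tok(TokType.EOF, "");
        if (t.type != expected) {
            throw new IllegalStateException(
                    "expected " + expected + " but found " + t.type + " at token " + pos);
        }
        pos++;
        return t;
    }

    private boolean lookingAt(TokType type) {
        return pos < tokens.size() && tokens.get(pos).type == type;
    }

    /** Parses one create command and returns the XML strings it contains. */
    List<String> createCmd() {
        List<String> xml = new ArrayList<>();
        match(TokType.CREATE);
        match(TokType.LPAREN);
        match(TokType.ENTRY);
        match(TokType.LPAREN);
        xml.add(match(TokType.XMLTOKEN).getText());      // first XML chunk
        while (lookingAt(TokType.COMMA)) {               // (COMMA XMLTOKEN)*
            match(TokType.COMMA);
            xml.add(match(TokType.XMLTOKEN).getText());
        }
        match(TokType.RPAREN);
        match(TokType.RPAREN);
        return xml;
    }

    public static void main(String[] args) {
        List<Tok> toks = new ArrayList<>();
        toks.add(new Tok(TokType.CREATE, "create"));
        toks.add(new Tok(TokType.LPAREN, "("));
        toks.add(new Tok(TokType.ENTRY, "entry"));
        toks.add(new Tok(TokType.LPAREN, "("));
        toks.add(new Tok(TokType.XMLTOKEN, "<debit amount='100.00'/>"));
        toks.add(new Tok(TokType.COMMA, ","));
        toks.add(new Tok(TokType.XMLTOKEN, "<credit amount='100.00'/>"));
        toks.add(new Tok(TokType.RPAREN, ")"));
        toks.add(new Tok(TokType.RPAREN, ")"));
        for (String chunk : new CreateCmdSketch(toks).createCmd()) {
            System.out.println(chunk);                   // each chunk is ready to hand to the next stage
        }
    }
}
```

Because each XML chunk arrives as a single token, the parser never has to reassemble tags into a string.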
>
> With the solution you seem to be following now you'd
> have to
> concatenate a bunch of tags etc. together back to
> one string.
>
> > class BookkeepingParser extends Parser;
> > options {
> > k=10;
> > importVocab=BookkeepingLexer;
> > }
>
> As an aside, k=10 is generally speaking pretty high. Normally you'd
> have something like k=2 or 3, depending on the grammar. You can fix
> nondeterminism with predicates. Once you get things running you can
> refactor your grammar to get rid of the worst predicates.
>
> Cheers,
>
> Ric
>