[antlr-interest] Recognising XML in a grammar

Tue Sep 5 07:15:13 PDT 2006

--- Ric Klaren <ric.klaren at gmail.com> wrote:

> Hi,
> 
...
> 
> 3) one lexer that cuts up the character stream into
> the tokens for
> your normal grammar and passes XML chunks as one big
> XML token to your
> parser. (this could become a variant of solution 2)
> 
> After deciding what you're going to do with the
> lexer you should think
> of what you want to do in the parser.
> 
...
> 
> For lexer solution 3 you might start building an AST
> and when you
> encounter an XML token parse its contents inline
> and maybe insert the
> generated AST into the AST you're generating. (e.g.
> have a complete
> xml lexer/parser/treebuilder that you call from
> action code) Then in a
> subsequent parse phase you might grab the combined
> AST and do 'your
> stuff'.
> 
> Well there's more ways in which you can make
> manageable chunks of the
> parsing problem you have. The above is just a start
> there's more ways
> of cutting up the complexity. If you try to do too
> much in one go you
> will probably end up with something unmanageable. So
> I'd recommend
> cutting things up at some point.
> 
> Cheers,
> 
> Ric
> 

Hey there Ric, thanks for responding. Your option
number 3 is what I'm after. The software I am writing
has another tool that takes the XML chunk and deals
with it, so really I just want to pass the XML as a
string to my application. But the grammar has to
recognise a complete XML document (with or without
the declaration), as a list of XML documents can be
passed to it in one command (comma is the delimiter).
I've come pretty far using the xml.g example. The only
remaining thing is to distinguish between one XML
document and many. For example, in fig.1 the parser
recognises 0 or more STARTTAG, ENDTAG, et al. between
a start tag and an end tag, so it can't tell where one
document ends and the next begins.
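
Since the XML only has to reach the application as a string, the delimiting can also be sketched outside the grammar. The following is a minimal Java sketch, not the xml.g approach: it cuts a comma-separated list into documents by tracking element depth, so commas inside comments, CDATA sections, processing instructions and quoted attribute values are not treated as delimiters. The class and method names (XmlListSplitter, split) are hypothetical, the input is assumed to be well formed, and DOCTYPE declarations with an internal subset are not handled.

```java
import java.util.ArrayList;
import java.util.List;

final class XmlListSplitter {

    /** Splits "doc1, doc2, ..." at commas that sit outside every element. */
    static List<String> split(String input) {
        List<String> docs = new ArrayList<>();
        int depth = 0;   // number of elements currently open
        int start = 0;   // index where the current document begins
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (input.startsWith("<!--", i)) {
                i = skipPast(input, i + 4, "-->");          // comment: may contain commas
            } else if (input.startsWith("<![CDATA[", i)) {
                i = skipPast(input, i + 9, "]]>");          // CDATA block
            } else if (input.startsWith("<?", i)) {
                i = skipPast(input, i + 2, "?>");           // XML declaration or PI
            } else if (c == '<') {
                boolean closing = input.startsWith("</", i);
                boolean declaration = input.startsWith("<!", i);
                int end = endOfTag(input, i);               // index of the closing '>'
                boolean empty = end > i && input.charAt(end - 1) == '/';
                if (closing) {
                    depth--;
                } else if (!declaration && !empty) {
                    depth++;                                // a real start tag opens a level
                }
                i = end + 1;
            } else if (c == ',' && depth == 0) {
                docs.add(input.substring(start, i).trim()); // top-level comma: document ends
                start = i + 1;
                i++;
            } else {
                i++;
            }
        }
        String last = input.substring(start).trim();
        if (!last.isEmpty()) {
            docs.add(last);
        }
        return docs;
    }

    /** Returns the index just past the terminator, or the input length if it is missing. */
    private static int skipPast(String s, int from, String terminator) {
        int end = s.indexOf(terminator, from);
        return end < 0 ? s.length() : end + terminator.length();
    }

    /** Finds the '>' that ends the tag starting at 'from', ignoring '>' inside quotes. */
    private static int endOfTag(String s, int from) {
        char quote = 0;
        for (int i = from + 1; i < s.length(); i++) {
            char c = s.charAt(i);
            if (quote != 0) {
                if (c == quote) quote = 0;
            } else if (c == '\'' || c == '"') {
                quote = c;
            } else if (c == '>') {
                return i;
            }
        }
        return s.length() - 1;
    }
}
```

Each returned string is one complete document (declaration included) that can be handed to the other tool unchanged.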

** What I want is something like fig.2, where the
parser recognises nested XML elements. I've had some
success with fig.2, i.e. it recognises nested XML (see
fig.3). I just need to figure out how to delimit one
XML document from another without errors
(nondeterminisms). So it seems to be working, but
there's still a nondeterminism warning (fig.3). I
'think' that only empty tags need to be recognised as a
complete node (xml.g just defines them as part of the
start tag).

class BookkeepingParser extends Parser;
options {
	k=10;
	importVocab=BookkeepingLexer; 

}

token_literal: ((PI)? 
		( STARTTAG 
				( WS | PI | STARTTAG | ENDTAG | COMMENT | CDATABLOCK )* 
			ENDTAG)) { System.out.println("TOKEN LITERAL"); };
fig.1

...
token_literal: ((PI)? 
		( STARTTAG 
				( WS | PI | COMMENT | CDATABLOCK )* 
				(token_literal)*
			(ENDTAG)?)) { System.out.println("TOKEN LITERAL"); };
fig.2
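
One way to picture the separation between a document and a nested element is a small hand-written recogniser over the token types. This is only a sketch of the rule shape, written in plain Java rather than as ANTLR rules: PI, STARTTAG, ENDTAG, COMMENT, CDATABLOCK and WS are the token names from the grammar above, while EMPTYTAG, PCDATA and DELIMITER are assumed names, and the class XmlListRecognizer is hypothetical. The idea is that an element is either an empty tag or a start tag with a mandatory end tag, and the comma is only accepted between whole documents.

```java
import java.util.List;

final class XmlListRecognizer {
    // PI, STARTTAG, ENDTAG, COMMENT, CDATABLOCK and WS come from the grammar above;
    // EMPTYTAG, PCDATA and DELIMITER are assumed token names.
    enum Tok { PI, STARTTAG, ENDTAG, EMPTYTAG, COMMENT, CDATABLOCK, WS, PCDATA, DELIMITER }

    private final List<Tok> tokens;
    private int pos = 0;

    private XmlListRecognizer(List<Tok> tokens) {
        this.tokens = tokens;
    }

    /** xml_list : document ( DELIMITER document )* EOF ; returns the number of documents. */
    static int countDocuments(List<Tok> tokens) {
        XmlListRecognizer r = new XmlListRecognizer(tokens);
        int count = 0;
        r.document();
        count++;
        while (r.peek() == Tok.DELIMITER) {
            r.pos++;
            r.document();
            count++;
        }
        if (r.peek() != null) {
            throw r.error("end of input");
        }
        return count;
    }

    /** document : (WS)* (PI)? (WS)* element (WS)* */
    private void document() {
        skipWs();
        if (peek() == Tok.PI) pos++;
        skipWs();
        element();
        skipWs();
    }

    /** element : EMPTYTAG | STARTTAG ( WS | PI | COMMENT | CDATABLOCK | PCDATA | element )* ENDTAG */
    private void element() {
        if (peek() == Tok.EMPTYTAG) {
            pos++;                       // an empty tag is a complete node by itself
            return;
        }
        expect(Tok.STARTTAG);
        while (true) {
            Tok t = peek();
            if (t == Tok.STARTTAG || t == Tok.EMPTYTAG) {
                element();               // nested element
            } else if (t == Tok.WS || t == Tok.PI || t == Tok.COMMENT
                    || t == Tok.CDATABLOCK || t == Tok.PCDATA) {
                pos++;                   // ordinary content
            } else {
                break;
            }
        }
        expect(Tok.ENDTAG);              // mandatory, so a document cannot run into the next one
    }

    private Tok peek() {
        return pos < tokens.size() ? tokens.get(pos) : null;
    }

    private void skipWs() {
        while (peek() == Tok.WS) pos++;
    }

    private void expect(Tok t) {
        if (peek() != t) throw error(t.toString());
        pos++;
    }

    private IllegalStateException error(String expected) {
        return new IllegalStateException(
                "token " + pos + ": expecting " + expected + ", found " + peek());
    }
}
```

Like the grammar, this only checks that start and end tags balance; it does not compare tag names.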

ANTLR Parser Generator   Version 2.7.6 (2005-12-22)   1989-2005
ANTLR Parser Generator   Version 2.7.6 (2005-12-22)   1989-2005
grammar/bookkeeping.parser.g:17: warning:nondeterminism upon
grammar/bookkeeping.parser.g:17:     k==1:PI
grammar/bookkeeping.parser.g:17:     k==2:PI,COMMENT,ENDTAG,STARTTAG,CDATABLOCK,WS
...
grammar/bookkeeping.parser.g:17:     k==9:PI,COMMENT,ENDTAG,STARTTAG,CDATABLOCK,WS
grammar/bookkeeping.parser.g:17:     k==10:PI,COMMENT,ENDTAG,STARTTAG,CDATABLOCK,WS
grammar/bookkeeping.parser.g:17:     between alt 2 and exit branch of block
BookkeepingMain INPUT length[1] / content[<?xml
version='1.0' encoding='UTF-8'?>
<bookkeeping
xmlns:account='com/interrupt/bookkeeping/account'
xmlns:journal='com/interrupt/bookkeeping/journal'
xmlns='com/interrupt/bookkeeping' id='' >  <!-- 1.
account types are: asset, liability, expense, revenue
2. each account has a given counter weight
<account:account type='asset'               id=''
name='' counterWeight='debit' /> <account:account
type='expense'     id=' name=' counterWeight='debit'
/> <account:account type='liability'   id='' name=''
counterWeight='credit' <account:account type='revenue'
    id='' name='' counterWeight='credit' /> -->
<account:accounts id='a1' ><!-- each account can have
a debit / credit to it. These are duplicated from
journal entries.  **Therefore adding entries to a
<journal/> or <transaction/> will also add the
corresponding debit/credit to the corresponding
account.  ** Adding <entry/> to a <transaction/> will
also add the entry to the corresponding <journal/>
Also 'debits/credits' in all <account/> should match
with 'debits/credits' in all
<journal><enrtry/></journal> --> 
<account:account id='1' name='office equipment'
type='asset' counterWeight='debit' ><account:debit
id='' amount='10.00' entryid='' accountid='1'
/></account:account><account:account id='2' name='tax'
type='expense' counterWeight='debit' ><account:debit
id='' amount='1.50' entryid='' accountid='2'
/></account:account><account:account id='3'
name='bank' type='asset' counterWeight='debit'
><account:credit id='' amount='11.50' entryid=''
accountid='3' /></account:account><journal:journals
id='' ><journal:journal id='j1' name='generalledger'
type='' balance=''><journal:entries id='' 
><journal:entry id='e1' entrynum='' state=''
journalid='' date='' ><account:debit id=''
amount='10.00' entryid='' accountid='1'
/><account:debit id='' amount='1.50' entryid=''
accountid='2' /><account:credit id='' amount='11.50'
entryid='' accountid='3'
/></journal:entry></journal:entries></journal:journal></journal:journals></account:accounts></bookkeeping>,
<anotherXml/>]	//*** input is a LIST of XML separated
by comma 

ATTRIBUTE: version=1.0
ATTRIBUTE: encoding=UTF-8
XMLDECL: xml

ATTRIBUTE: xmlns:account=com/interrupt/bookkeeping/account
ATTRIBUTE: xmlns:journal=com/interrupt/bookkeeping/journal
ATTRIBUTE: xmlns=com/interrupt/bookkeeping
ATTRIBUTE: id=
STARTTAG: bookkeeping

COMMENT:  1. account types are: asset, liability,
expense, revenue 2. each account has a given counter
weight <account:account type='asset'              
id='' name='' counterWeight='debit' />
<account:account type='expense'     id=' name='
counterWeight='debit' /> <account:account
type='liability'   id='' name=''
counterWeight='credit' <account:account type='revenue'
    id='' name='' counterWeight='credit' /> 	//***
COMMENT has commas, but NO DELIMITER

...

EMTYTAG: account:credit
ENDTAG: journal:entry
TOKEN LITERAL
ENDTAG: journal:entries
TOKEN LITERAL
ENDTAG: journal:journal
TOKEN LITERAL
ENDTAG: journal:journals
TOKEN LITERAL
ENDTAG: account:accounts
TOKEN LITERAL
ENDTAG: bookkeeping
TOKEN LITERAL
DELIMITER ','			//*** DELIMITER HERE 
line 1:2026: expecting ENDTAG, found ','

WHITE SPACE ' '
EMTYTAG: anotherXml		//*** NEXT XML DOCUMENT HERE 
line 1:2041: expecting ENDTAG, found 'null'
line 1:2041: expecting ENDTAG, found 'null'
line 1:2041: expecting ENDTAG, found 'null'
line 1:2041: expecting ENDTAG, found 'null'
line 1:2041: expecting ENDTAG, found 'null'
line 1:2041: expecting ENDTAG, found 'null'
fig.3 

Tim
