[antlr-interest] Recognising XML in a grammar

Tue Sep 5 10:51:26 PDT 2006

--- Ric Klaren <ric.klaren at gmail.com> wrote:

> Hi,
> 
> On 9/5/06, Timothy Washington
> <timothyjwashington at yahoo.ca> wrote:
> > Hey there Ric, thanks for responding. Your option
> > number 3 is what I'm after. The software that I am
> > writing will have another tool that takes that XML
> > chunk and deals with it. So really, I just want to
> > pass the XML as a string to my application.
> 
> Note that in what you're making now you have to
> rebuild the original
> XML string by concatenating the tokens...
> 
> Also you say you'd like to do option 3 I presented,
> but you're
> implementing option 1 it seems. It seems you mix up
> lexing and parsing
> (this is actually pretty normal when you're pretty
> new to
> antlr/parsing).
> 
> If you want to get one easy string in your parser
> for a chunk of
> complete XML then you'll have to do this in the
> lexer. You can
> probably use chunks from the original xml lexer and
> just count open
> and close tags until you've got a complete chunk of
> xml (I assume you
> don't have to validate the XML input at this stage).

Yeah, you nailed it Ric, I am new to parser
generators. I actually tried to write my own parser
before realising that I'd better use a packaged
solution. I actually got pretty far along - I was
using an interpreter pattern on an input text stream.
My difficulty came in breaking the input into
expressions recursively (and executing them as such).

On that note, I can see how the 'token_literal' rule
defined in the Parser can be put in the lexer, if I'm
understanding you correctly. This is so as not to mix
up parsing and lexing. So I don't have to validate the
XML, but it has to be well-formed so that the lexer
can recognize it as input. 
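
To make the tag-counting idea concrete, here is a minimal sketch in Python rather than ANTLR lexer syntax. The helper names `find_tag_end` and `scan_xml_chunk` are made up for illustration. It assumes the chunk starts at a `<`, treats `<?...?>` as a prolog that does not change the nesting depth, and ignores comments and CDATA sections. It only counts open and close tags, so it does not validate anything:

```python
def find_tag_end(text, pos):
    """Return the index of the '>' that closes the tag starting at text[pos].

    A '>' inside a quoted attribute value is skipped.
    """
    quote = None
    i = pos + 1
    while i < len(text):
        ch = text[i]
        if quote:
            if ch == quote:
                quote = None
        elif ch in "'\"":
            quote = ch
        elif ch == ">":
            return i
        i += 1
    raise ValueError("unterminated tag")


def scan_xml_chunk(text, start):
    """Return the index just past one complete XML chunk starting at text[start]."""
    pos = start
    depth = 0
    while True:
        if depth == 0:
            # Before the root element only whitespace may separate tags.
            while pos < len(text) and text[pos].isspace():
                pos += 1
        else:
            # Inside an element: skip character data up to the next tag.
            pos = text.find("<", pos)
            if pos == -1:
                raise ValueError("unclosed element")
        if pos >= len(text) or text[pos] != "<":
            raise ValueError("expected a tag")
        end = find_tag_end(text, pos)
        tag = text[pos:end + 1]
        pos = end + 1
        if tag.startswith("<?"):
            continue  # prolog: nesting depth is unchanged
        if tag.startswith("</"):
            depth -= 1
        elif not tag.endswith("/>"):
            depth += 1
        if depth < 0:
            raise ValueError("close tag without matching open tag")
        if depth == 0:
            return pos  # the root element has just been closed


source = ("<?xml version='1.0' encoding='UTF-8'?>\n"
          " <debit xmlns='com/interrupt/bookkeeping/account' amount='100.00'/>, rest")
end = scan_xml_chunk(source, 0)
xml_text = source[:end]  # this would become the text of one XML token
```

The text in `source[:end]` is what a single XML token would carry, so nothing has to be concatenated back together later.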

> 
> First try to get a lexer running that can deal with
> your input and
> delivers the chunks you want. E.g. only tokens from
> your language and
> say some XML_TOKEN that contains a complete chunk of
> XML. After that
> it will be easy to deal with comma delimited chunks
> of XML_TOKEN's.

Right right, I see.

> 
> If I take your earlier example:
> 
> create
>          (entry
>                  (
>                          <?xml version='1.0' encoding='UTF-8'?>
>                          <debit xmlns='com/interrupt/bookkeeping/account' amount='100.00'/>,
>                          <?xml version='1.0' encoding='UTF-8'?>
>                          <credit xmlns='com/interrupt/bookkeeping/account' amount='100.00'/>
>                  )
>          )
> 
> (I changed the <debit xml... 00'> to <debit xml...
> 00'/> I assume
> that's a mistake)
> 
> For how you explain things I'd expect to get the
> following tokens from
> the lexer:
> 
> CREATE - with text 'create' (assuming you handle this as a keyword)
> LPAREN - with text '('
> ENTRY - with text 'entry' (assuming you handle this as a keyword)
> LPAREN - with text '('
> XMLTOKEN - with text '<?xml version='1.0' encoding='UTF-8'?>
>                          <debit xmlns='com/interrupt/bookkeeping/account' amount='100.00'/>'
> COMMA - with text ','
> XMLTOKEN - with text '<?xml version='1.0' encoding='UTF-8'?>
>                          <credit xmlns='com/interrupt/bookkeeping/account' amount='100.00'/>'
> RPAREN - with text ')'
> RPAREN - with text ')'
> EOF
> 
> Then in the parser you'd have a rule:
> 
> create_cmd: CREATE LPAREN ENTRY LPAREN
>   XMLTOKEN (COMMA XMLTOKEN)*
> RPAREN RPAREN;
> 
> Inside the action code in this rule then you could
> just use the
> getText() method on the XMLTOKENS to access the XML
> as a string and
> pass it to another stage.

I'm going to try your suggestion and let you know the
results. This is actually what I really wanted to do,
but didn't quite get the lexing/parsing barrier. The
only question I have left is how to attach actions to
a parser rule, but let me look at the docs for that. 
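
For orientation: in an ANTLR grammar an action is target-language code written in braces inside the rule, and it runs when the parser reaches that point. The sketch below is not ANTLR output. It is a hand-written Python stand-in for the `create_cmd` rule quoted above, assuming the lexer already delivers `(type, text)` pairs. The names `parse_create_cmd` and `on_xml` are hypothetical, and the `on_xml(...)` calls sit where the actions would go:

```python
def parse_create_cmd(tokens, on_xml):
    """Hand-written stand-in for the rule

        create_cmd: CREATE LPAREN ENTRY LPAREN
                    XMLTOKEN (COMMA XMLTOKEN)*
                    RPAREN RPAREN;

    tokens is a list of (type, text) pairs ending with ("EOF", "").
    on_xml is called with the text of every XMLTOKEN, like an action would be.
    """
    pos = 0

    def match(expected):
        # Consume one token of the expected type and return its text.
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError(f"expected {expected}, found end of input")
        kind, text = tokens[pos]
        if kind != expected:
            raise SyntaxError(f"expected {expected}, found {kind}")
        pos += 1
        return text

    for kind in ("CREATE", "LPAREN", "ENTRY", "LPAREN"):
        match(kind)
    on_xml(match("XMLTOKEN"))          # "action" for the first XML chunk
    while pos < len(tokens) and tokens[pos][0] == "COMMA":
        match("COMMA")
        on_xml(match("XMLTOKEN"))      # "action" for each further chunk
    match("RPAREN")
    match("RPAREN")
    match("EOF")


tokens = [
    ("CREATE", "create"), ("LPAREN", "("), ("ENTRY", "entry"), ("LPAREN", "("),
    ("XMLTOKEN", "<debit amount='100.00'/>"), ("COMMA", ","),
    ("XMLTOKEN", "<credit amount='100.00'/>"),
    ("RPAREN", ")"), ("RPAREN", ")"), ("EOF", ""),
]
collected = []
parse_create_cmd(tokens, collected.append)  # collected now holds both XML strings
```

Because each XML chunk arrives as one token, the action only has to hand the token text to the next stage.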

> 
> With the solution you seem to be following now you'd
> have to
> concatenate a bunch of tags etc. together back to
> one string.
> 
> > class BookkeepingParser extends Parser;
> > options {
> >         k=10;
> >         importVocab=BookkeepingLexer;
> > }
> 
> As a side note, k=10 is generally speaking pretty high.
> Normally you'd have something like k=2 or 3, depending on the
> grammar. Nondeterminism you can fix with predicates. Once you
> get things running you can refactor your grammar to get rid of
> the worst predicates.

Predicates ( semantic and syntactic ) are something
that I don't really understand yet. I read the antlr
docs, but that doesn't give me a real sense of the
why. For example, I have 2 similar options in my
syntax that look like A. and B. For this example, I
had to put my 'k' value to 7 to distinguish between A
and B. Does a syntactic predicate look like C, so that
now my 'k' or lookahead value only has to be 2 to
distinguish between A and B? 

A. "-entry"
B. "-entrynum" 
C. {"-entr"}
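
As far as the ANTLR 2 docs describe it, a syntactic predicate is a parenthesised grammar fragment followed by `=>`, which the recogniser tries speculatively before committing to an alternative. A semantic predicate is a boolean expression in braces followed by `?`. The rough Python sketch below is only an illustration of the two ideas, not what ANTLR generates. `min_lookahead` and `match_option` are hypothetical helpers: the first shows why a fixed lookahead needs 7 characters to tell A from B, and the second shows the "try the longer alternative first, otherwise fall back" behaviour that a predicate gives you.

```python
ALTERNATIVES = ["-entry", "-entrynum"]


def min_lookahead(a, b):
    """Number of characters that must be inspected to tell a and b apart.

    The position just past a shared prefix counts, because that is where
    one literal ends and the other continues.
    """
    i = 0
    while i < len(a) and i < len(b) and a[i] == b[i]:
        i += 1
    return i + 1


def match_option(text, pos=0):
    """Speculatively try the longer literal first, then fall back.

    Returns the matched literal, or None if no alternative matches.
    """
    for literal in sorted(ALTERNATIVES, key=len, reverse=True):
        if text.startswith(literal, pos):
            return literal
    return None


needed = min_lookahead("-entry", "-entrynum")  # shared prefix is 6 characters long
long_form = match_option("-entrynum 5")
short_form = match_option("-entry foo")
```

The speculative version never needs to know in advance how far to look. It simply attempts the whole of "-entrynum" and backs off to "-entry" if that attempt fails.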

> 
> Cheers,
> 
> Ric
> 

Cheers Ric, 

Tim 
