[antlr-interest] Recognising XML in a grammar
Timothy Washington
timothyjwashington at yahoo.ca
Tue Sep 5 10:51:26 PDT 2006
--- Ric Klaren <ric.klaren at gmail.com> wrote:
> Hi,
>
> On 9/5/06, Timothy Washington
> <timothyjwashington at yahoo.ca> wrote:
> > Hey there Ric, thanks for responding. Your option
> > number 3 is what I'm after. The software that I am
> > writing will have another tool that takes that XML
> > chunk and deals with it. So really, I just want to
> > pass the XML as a string to my application.
>
> Note that in what you're making now you have to
> rebuild the original
> XML string by concatenating the tokens...
>
> Also you say you'd like to do option 3 I presented,
> but you're
> implementing option 1 it seems. It seems you mix up
> lexing and parsing
> (this is actually pretty normal when you're pretty
> new to
> antlr/parsing).
>
> If you want to get one easy string in your parser
> for a chunk of
> complete XML then you'll have to do this in the
> lexer. You can
> probably use chunks from the original xml lexer and
> just count open
> and close tags until you have a complete chunk of
> xml (I assume you
> don't have to validate the XML input at this stage).
Yeah, you nailed it Ric, I am new to parser
generators. I actually tried to write my own parser
before realising that I had better use a packaged
solution. I actually got pretty far along - I was
using an interpreter pattern on an input stream text.
My difficulty came in breaking the input into
expressions recursively (and executing them as such).
On that note, I can see how the 'token_literal' rule
defined in the Parser can be put in the lexer, if I'm
understanding you correctly. This is so as not to mix
up parsing and lexing. So I don't have to validate the
XML, but it has to be well-formed so that the lexer
can recognize it as input.
>
> First try to get a lexer running that can deal with
> your input and
> delivers the chunks you want. E.g. only tokens from
> your language and
> say some XML_TOKEN that contains a complete chunk of
> XML. After that
> it will be easy to deal with comma delimited chunks
> of XML_TOKEN's.
Right right, I see.
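
A minimal sketch of the tag-counting idea, in plain Java rather than as an ANTLR lexer rule. It assumes well-formed input with no comments, CDATA sections or DOCTYPE, and the names XmlChunkScanner, endOfChunk and closingBracket are made up for illustration; nothing here comes from ANTLR itself.

```java
/** Hypothetical helper for illustration; this is not ANTLR-generated code. */
class XmlChunkScanner {

    /**
     * Returns the index just past one complete XML chunk that begins at
     * 'start'. An optional <?...?> prolog is skipped, then open and close
     * tags are counted until the nesting depth drops back to zero.
     */
    static int endOfChunk(String in, int start) {
        int depth = 0;
        int i = start;
        while (i < in.length()) {
            char c = in.charAt(i);
            if (c != '<') {
                // Outside any element only whitespace may separate prolog and root.
                if (depth == 0 && !Character.isWhitespace(c)) {
                    throw new IllegalArgumentException("unexpected '" + c + "' at " + i);
                }
                i++;
                continue;
            }
            if (in.startsWith("<?", i)) {
                // Processing instruction or XML declaration: skip it whole.
                int close = in.indexOf("?>", i);
                if (close < 0) {
                    throw new IllegalArgumentException("unterminated <? at " + i);
                }
                i = close + 2;
                continue;
            }
            int gt = closingBracket(in, i);
            if (in.startsWith("</", i)) {
                depth--;                          // close tag
            } else if (in.charAt(gt - 1) != '/') {
                depth++;                          // open tag (self-closing tags leave depth alone)
            }
            if (depth < 0) {
                throw new IllegalArgumentException("stray close tag at " + i);
            }
            i = gt + 1;
            if (depth == 0) {
                return i;                         // the root element has just ended
            }
        }
        throw new IllegalArgumentException("incomplete XML chunk starting at " + start);
    }

    /** Finds the '>' that ends the tag opened at 'lt', ignoring '>' inside quoted attribute values. */
    static int closingBracket(String in, int lt) {
        char quote = 0;
        for (int i = lt + 1; i < in.length(); i++) {
            char c = in.charAt(i);
            if (quote != 0) {
                if (c == quote) quote = 0;
            } else if (c == '\'' || c == '"') {
                quote = c;
            } else if (c == '>') {
                return i;
            }
        }
        throw new IllegalArgumentException("unterminated tag at " + lt);
    }
}
```

Calling endOfChunk(input, pos) whenever the scanner sees a '<' would give the text for one XML_TOKEN as input.substring(pos, end); scanning of the surrounding language then resumes at end.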
>
> If I take your earlier example:
>
> create
> (entry
> (
> <?xml version='1.0'
> encoding='UTF-8'?>
> <debit
> xmlns='com/interrupt/bookkeeping/account'
> amount='100.00'/>,
> <?xml version='1.0'
> encoding='UTF-8'?>
> <credit
> xmlns='com/interrupt/bookkeeping/account'
> amount='100.00'/>
> )
> )
>
> (I changed the <debit xml... 00'> to <debit xml...
> 00'/> I assume
> that's a mistake)
>
> From how you explain things I'd expect to get the
> following tokens from
> the lexer:
>
> CREATE - with text 'create' (assuming you handle
> this as a keyword)
> LPAREN - with text '('
> ENTRY - with text 'entry' (assuming you handle this
> as a keyword)
> LPAREN - with text '('
> XMLTOKEN - with text '<?xml version='1.0'
> encoding='UTF-8'?>
> <debit
> xmlns='com/interrupt/bookkeeping/account'
> amount='100.00'/>'
> COMMA - with text ','
> XMLTOKEN - with text '<?xml version='1.0'
> encoding='UTF-8'?>
> <credit
> xmlns='com/interrupt/bookkeeping/account'
> amount='100.00'/>'
> RPAREN - with text ')'
> RPAREN - with text ')'
> EOF
>
> Then in the parser you'd have a rule:
>
> create_cmd: CREATE LPAREN ENTRY LPAREN
> XMLTOKEN (COMMA XMLTOKEN)*
> RPAREN RPAREN;
>
> Inside the action code in this rule then you could
> just use the
> getText() method on the XMLTOKENS to access the XML
> as a string and
> pass it to another stage.
I'm going to try your suggestion and let you know the
results. This is actually what I really wanted to do,
but didn't quite get the lexing/parsing barrier. The
only question I have left is how to attach actions to
a parser rule, but let me look at the docs for that.
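
To see what the create_cmd rule amounts to, here is a hand-written Java sketch of the same shape: match the fixed tokens, then collect getText() from each XMLTOKEN. The class CreateCmdSketch and its Token and Type types are hypothetical stand-ins, not the classes ANTLR generates; in a real grammar the collecting would live in action code inside the rule.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical hand-written equivalent of the create_cmd rule; not ANTLR output. */
class CreateCmdSketch {

    enum Type { CREATE, LPAREN, ENTRY, XMLTOKEN, COMMA, RPAREN, EOF }

    /** Stand-in token type with a getText() accessor. */
    static final class Token {
        final Type type;
        final String text;
        Token(Type type, String text) { this.type = type; this.text = text; }
        String getText() { return text; }
    }

    private final List<Token> tokens;
    private int pos = 0;

    CreateCmdSketch(List<Token> tokens) { this.tokens = tokens; }

    /** Consumes the next token if it has the expected type, otherwise fails. */
    private Token match(Type expected) {
        Token t = pos < tokens.size() ? tokens.get(pos) : new Token(Type.EOF, "");
        if (t.type != expected) {
            throw new IllegalStateException(
                "expected " + expected + " but found " + t.type + " at token " + pos);
        }
        pos++;
        return t;
    }

    /** One token of lookahead is enough to decide whether the (COMMA XMLTOKEN)* loop continues. */
    private boolean lookingAt(Type type) {
        return pos < tokens.size() && tokens.get(pos).type == type;
    }

    /**
     * create_cmd: CREATE LPAREN ENTRY LPAREN XMLTOKEN (COMMA XMLTOKEN)* RPAREN RPAREN;
     * Returns the XML strings so they can be passed to another stage.
     */
    List<String> createCmd() {
        List<String> xml = new ArrayList<>();
        match(Type.CREATE);
        match(Type.LPAREN);
        match(Type.ENTRY);
        match(Type.LPAREN);
        xml.add(match(Type.XMLTOKEN).getText());
        while (lookingAt(Type.COMMA)) {
            match(Type.COMMA);
            xml.add(match(Type.XMLTOKEN).getText());
        }
        match(Type.RPAREN);
        match(Type.RPAREN);
        return xml;
    }
}
```

The returned list holds each XML chunk as one untouched string, so nothing has to be concatenated back together from smaller tokens.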
>
> With the solution you seem to be following now you'd
> have to
> concatenate a bunch of tags etc. together back to
> one string.
>
> > class BookkeepingParser extends Parser;
> > options {
> > k=10;
> > importVocab=BookkeepingLexer;
> > }
>
> As a side note, k=10 is generally pretty high.
> Normally you'd
> have something like k=2 or 3, depending on the grammar.
> Non-determinism
> you can fix with predicates. When you get things
> running you can
> refactor your grammar to get rid of the worst
> predicates.
Predicates (semantic and syntactic) are something
that I don't really understand yet. I read the antlr
docs, but they don't give me a real sense of the
why. For example, I have 2 similar options in my
syntax that look like A and B below. For this example, I
had to set my 'k' value to 7 to distinguish between A
and B. Does a syntactic predicate look like C, so that
now my 'k' or lookahead value only has to be 2 to
distinguish between A and B?
A. "-entry"
B. "-entrynum"
C. {"-entr"}
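
One way to look at the A/B problem without a large k is left-factoring: consume the prefix the two options share, then decide by what follows. The plain-Java sketch below only illustrates that idea. It is not ANTLR syntax, it does not show what a syntactic predicate looks like, and OptionMatcher and its names are made up.

```java
/** Hypothetical illustration of left-factoring two options that share a prefix. */
class OptionMatcher {

    enum Option { ENTRY, ENTRYNUM }

    /**
     * Matches "-entry" or "-entrynum" at 'pos'. The shared prefix is consumed
     * first, so only one decision remains: does "num" follow or not?
     */
    static Option match(String input, int pos) {
        String prefix = "-entry";
        if (!input.startsWith(prefix, pos)) {
            throw new IllegalArgumentException("expected " + prefix + " at " + pos);
        }
        int after = pos + prefix.length();
        return input.startsWith("num", after) ? Option.ENTRYNUM : Option.ENTRY;
    }
}
```

Because the choice is made after the common prefix, the amount of lookahead needed no longer grows with the length of "-entry".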
>
> Cheers,
>
> Ric
>
Cheers Ric,
Tim