[antlr-interest] Re: handling line-based data with stanzas

Chris Black cblack0 at yahoo.com
Thu May 13 11:26:42 PDT 2004


--- In antlr-interest at yahoogroups.com, Ric Klaren <klaren at c...> wrote:
> On Tue, May 11, 2004 at 02:59:48PM -0000, Chris Black wrote:
> > I have been writing a few grammars for a few different file formats
> > that are line-based but also are organized into stanzas.
> > Most of these look like:
> > ---
> > header,stuff ican,parse,easily
> > 
> > start stanzatypefoo
> > column header,column header,column header
> > value,value,value
> > start stanzatypebar
> > column header,column header,column header
> > value,value,value
> > ---
> > 
[stuff deleted]
> Also check out
> http://wwwhome.cs.utwente.nl/~klaren/antlr/treebuilding.txt
> 
> It lists some idioms in treebuilding. In general you don't want a
> flat AST. Tag imaginary tokens in front of things you find
> interesting/want to discern easily in the treeparser.

Thanks very much! This link was indeed useful.

> > As an aside, some of this may be due to my seeming inability to match
> > string literals at the parser level. I try to define different stanza
> > rules based on what the stanza header contents are, but I don't seem
> > to be able to do this. I will get an error like:
> > line 18:1: expecting "Data:", found 'Data:'
> > 
> > When my grammar has:
> > matchRule: dataString DELIM FIELD
> > 
> > dataString: "Data:"
> > 
> > I believe this may be because I am importing the token vocab from the
> > shared lexer using importVocab, but I don't know.
> 
> importVocab/exportVocab is a very 'stupid' linear system. Also you
> have to make sure to call antlr in the right order on the input
> files. E.g. the lowest in the export/import chain first then up to
> the last one. See the FAQ entries on import/exportVocab on Jguru.
> 
> You can easily verify consistency via the generated TokenTypes.txt
> files. If some token is known in there with different numbers then
> you ran something in the wrong order (or you got a cycle in your
> import/export chain which is not possible)

I'll look into this. For now I think I am OK not matching literals at
the parser level and instead using .equals in Java statements like in
your examples below. My build process is to build the lexers first
(which are in a separate dir) and then copy the FooTokenTypes.txt file
into the dir that holds the parser and treeparser that importVocab it.
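
A minimal sketch of the consistency check described above, in plain Java.
It assumes each token line in a TokenTypes.txt file ends in "=number"
(for example FIELD=4) and that comment lines start with "//"; the class
and method names are hypothetical and not part of ANTLR.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of ANTLR: compares two token vocabularies.
class TokenTypesCheck {

    // Parse lines of the assumed form NAME=number (or NAME="literal"=number).
    // Comment lines, blank lines and lines without '=' are skipped.
    static Map<String, Integer> parse(List<String> lines) {
        Map<String, Integer> types = new LinkedHashMap<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("//")) continue;
            int eq = line.lastIndexOf('=');
            if (eq < 0) continue; // e.g. the vocabulary name line
            String name = line.substring(0, eq).trim();
            try {
                types.put(name, Integer.parseInt(line.substring(eq + 1).trim()));
            } catch (NumberFormatException e) {
                // Not a token definition line; ignore it.
            }
        }
        return types;
    }

    // Names defined in both vocabularies but mapped to different numbers.
    static List<String> mismatches(Map<String, Integer> a, Map<String, Integer> b) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null && !other.equals(e.getValue())) {
                out.add(e.getKey() + ": " + e.getValue() + " vs " + other);
            }
        }
        return out;
    }

    // Usage: java TokenTypesCheck FooTokenTypes.txt BarTokenTypes.txt
    public static void main(String[] args) throws IOException {
        Map<String, Integer> a = parse(Files.readAllLines(Paths.get(args[0])));
        Map<String, Integer> b = parse(Files.readAllLines(Paths.get(args[1])));
        for (String m : mismatches(a, b)) {
            System.out.println(m);
        }
    }
}
```

Run against the lexer's file and the parser's generated file, any line it
prints would point at a token numbered differently in the two, which is
the symptom of running antlr in the wrong order.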

> > How would someone who is a bit more experienced with ANTLR handle this
> > type of data so that I could walk around the tree and skip stanzas
> > easily? I think I should be doing something with imaginary tokens, but
> > when I experimented with them based on the examples in the
> > distribution it didn't quite seem to work the way I expected.
> 
> In general you'd have a rule that matches something interesting like a
> header or a stanza. Then at the end of the rule you'd insert a tag at
> the top of the generated tree so you can very easily recognize it in
> the subsequent treewalker phase. Something like this in the parser:
> 
> my_header: lbl:"me_is_header" stuff more_stuff 
> 	{
> 		## = #([GENERIC_HEADER, lbl.getText()]);
> 	}
> ;
> 
> Or...:
> 
> my_header: lbl:HEADER_ID stuff more_stuff 
> 	{
> 		if( lbl.getText().equals("foo") )	
> 			## = #([FOO_HEADER, lbl.getText()]);
> 		else if ( lbl.getText().equals("bar") )
> 			## = #([BAR_HEADER, lbl.getText()]);
> 		else
> 			## = #([GENERIC_HEADER, lbl.getText()]);
> 	}
> ;
> 
> Keep in mind that you can turn off tree construction selectively in a
> rule and glue the parts together manually. (You still may have to
> restructure a few rules to get what you want.)

Thanks very much for those code snippets, the first one is what I have
been doing while exploring this on my own, but the second one may
allow me to exclude whole stanzas from the result AST which would be a
Good Thing.
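
To make the idea concrete outside of ANTLR, here is a rough plain-Java
sketch of the same tagging scheme: the header text picks an imaginary
root type via .equals, and a later pass skips whole stanzas by looking
only at that root type. The Node class, the ROW type and the numeric
values are assumptions for illustration; ANTLR's own AST classes are not
used.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java stand-in for a tagged AST; not ANTLR's AST API.
class StanzaTree {
    // Imaginary node types. Names mirror the quoted grammar; values are arbitrary.
    static final int GENERIC_HEADER = 1;
    static final int FOO_HEADER = 2;
    static final int BAR_HEADER = 3;
    static final int ROW = 4; // hypothetical type for data lines

    static class Node {
        final int type;
        final String text;
        final List<Node> children = new ArrayList<>();

        Node(int type, String text) {
            this.type = type;
            this.text = text;
        }
    }

    // Choose the root type from the header text, as the .equals() action does.
    static int typeFor(String headerText) {
        if (headerText.equals("foo")) return FOO_HEADER;
        if (headerText.equals("bar")) return BAR_HEADER;
        return GENERIC_HEADER;
    }

    // Build one tagged subtree: imaginary root on top, rows as children.
    static Node stanza(String headerText, String... rows) {
        Node root = new Node(typeFor(headerText), headerText);
        for (String r : rows) {
            root.children.add(new Node(ROW, r));
        }
        return root;
    }

    // Collect row text from every stanza except those with the skipped root type.
    static List<String> rowsExcept(List<Node> stanzas, int skippedType) {
        List<String> out = new ArrayList<>();
        for (Node s : stanzas) {
            if (s.type == skippedType) continue; // whole stanza skipped in one test
            for (Node row : s.children) {
                out.add(row.text);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Node> stanzas = new ArrayList<>();
        stanzas.add(stanza("foo", "1,2"));
        stanzas.add(stanza("bar", "3,4"));
        System.out.println(rowsExcept(stanzas, BAR_HEADER));
    }
}
```

The point of the tag is that the skip decision needs only the root node,
never the stanza contents.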

> > Does anyone with more expertise using antlr have any advice or a good
> > way of going about parsing stanza-based/line-based data coming from a
> > simplistic lexer that just gives FIELD, DELIM and NEWLINEs? I'd rather
> > not have to put more logic in the lexer, as then I couldn't share the
> > lexer as easily.
> 
> I'd keep the lexer as simple as possible. Then in the parser tag
> interesting bits and in the treeparser try to make sense of it. (or use
> multiple tree parsers) The first link gives an idiom to selectively
> look at branches in an AST.
> 
> This might also be of interest:
> 
> http://www.codetransform.com/filterexample.html
> 
> The general problem seems to be to get the right 'divide' in order to
> 'conquer' ;)
> 
> You could chunk things up pretty roughly in a first stage lexer. Then in
> the subsequent parser call another lexer/parser on the token text of a
> chunk, then dupTree the generated AST into the AST generated for the
> current parser rule (make sure to get the token vocabulary the same
> between the parsers for that). Conceptually the above link might give a
> nicer solution though. Although this may give a load of small
> lexers/parsers that are quite maintainable. And you can easily glue in
> something new.
> 

I think I am going to go with keeping my simple shared lexers that
just output FIELD, DELIM and NEWLINE tokens for now, and then based on
the order of those and my knowledge of the file format, different
parsers will build different trees and use different tree parsers. I'd
really like to avoid doing too much work in the lexer as that seems to
bite me later on in my experience (as an aside, I see a fair number of
ppl new to antlr (myself included) trying to do far too much in the
lexer, perhaps something should go in the docs about that).
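
Roughly what I have in mind, sketched in plain Java rather than as a
grammar. The lex() method is only a stand-in for the shared lexer, and
the assumption that a stanza opens with a single field beginning with
"start " comes from my sample data above; all class and method names are
made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: group a FIELD/DELIM/NEWLINE token stream into stanzas.
class StanzaGrouper {
    enum Type { FIELD, DELIM, NEWLINE }

    static class Token {
        final Type type;
        final String text;

        Token(Type type, String text) {
            this.type = type;
            this.text = text;
        }
    }

    static class Stanza {
        final String kind;
        final List<List<String>> rows = new ArrayList<>();

        Stanza(String kind) {
            this.kind = kind;
        }
    }

    // Stand-in lexer: commas become DELIM, line ends become NEWLINE.
    // Empty fields produce no FIELD token.
    static List<Token> lex(String input) {
        List<Token> out = new ArrayList<>();
        for (String line : input.split("\n")) {
            String[] fields = line.split(",", -1);
            for (int i = 0; i < fields.length; i++) {
                if (i > 0) out.add(new Token(Type.DELIM, ","));
                if (!fields[i].isEmpty()) out.add(new Token(Type.FIELD, fields[i]));
            }
            out.add(new Token(Type.NEWLINE, "\n"));
        }
        return out;
    }

    // A line holding the single field "start X" opens stanza X; later
    // non-empty lines become its rows. Lines before the first stanza
    // (the file header) are ignored here.
    static List<Stanza> group(List<Token> tokens) {
        List<Stanza> stanzas = new ArrayList<>();
        Stanza current = null;
        List<String> row = new ArrayList<>();
        for (Token t : tokens) {
            if (t.type == Type.FIELD) {
                row.add(t.text);
            } else if (t.type == Type.NEWLINE) {
                if (row.size() == 1 && row.get(0).startsWith("start ")) {
                    current = new Stanza(row.get(0).substring(6).trim());
                    stanzas.add(current);
                } else if (current != null && !row.isEmpty()) {
                    current.rows.add(row);
                }
                row = new ArrayList<>();
            }
            // DELIM only separates fields, so there is nothing to record.
        }
        return stanzas;
    }

    public static void main(String[] args) {
        String input = "header,stuff\n\nstart stanzatypefoo\na,b\n1,2\n";
        for (Stanza s : group(lex(input))) {
            System.out.println(s.kind + " " + s.rows);
        }
    }
}
```

All knowledge of the file format sits in group(), so the lexer stays
shareable between formats.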

> Another option is to use your favourite scripting language; it really
> looks like something fit for that ;)

I'm sure I could do something easy in perl, but I need this parser to
integrate with the rest of a program, and I really don't want to throw
another language into the program :)

Another aside: in my reading I had never seen the "##" syntax. It
seems to be the same as "#ruleIamin"; is this the case?

Thanks so much for the response btw, I have been through several
iterations on how to handle stanza-based data and some of the snippets
you posted and/or linked to will certainly help.

Chris



 