[antlr-interest] handling line-based data with stanzas

Thu May 13 03:24:42 PDT 2004

On Tue, May 11, 2004 at 02:59:48PM -0000, Chris Black wrote:
> I have been writing a few grammars for a few different file formats
> that are line-based but also are organized into stanzas.
> Most of these look like:
> ---
> header,stuff ican,parse,easily
> 
> start stanzatypefoo
> column header,column header,column header
> value,value,value
> start stanzatypebar
> column header,column header,column header
> value,value,value
> ---
> 
> The way I approach this now is to have some shared lexers that just
> spit out a TokenStream of FIELD, DELIM and NEWLINE tokens. Then I have
> a parser which imports the exported vocab from a parser, and builds an
> AST. In the parser I usually try to remove tokens I don't really care
> about, like the DELIMs. Then I have a TreeParser which goes through
> the AST and populates some data structures.
> 
> This works ok, but I think I am missing something. Often I want to
> skip entire stanzas, etc. And since the AST is flat (I don't do any
> special imaginary tokens or anything) the tree parser ends up having
> most of the complication. I am now carefully reading through the tree
> building section of the ANTLR documentation, but hoped that this was a
> common/simple enough problem that someone might have some clues.

Also check out http://wwwhome.cs.utwente.nl/~klaren/antlr/treebuilding.txt

It lists some tree-building idioms. In general you don't want a flat AST.
Put imaginary tokens in front of the things you find interesting or want
to discern easily in the tree parser.

> As an aside, some of this may be due to my seeming inability to match
> string literals at the parser level. I try to define different stanza
> rules based on what the stanza header contents are, but I don't seem
> to be able to do this. I will get an error like:
> line 18:1: expecting "Data:", found 'Data:'
> 
> When my grammar has:
> matchRule: dataString DELIM FIELD
> 
> dataString: "Data:"
> 
> I believe this may be because I am importing the token vocab from the
> shared lexer using importVocab, but I don't know.

importVocab/exportVocab is a very 'stupid' linear system. You have to make
sure to run antlr on the grammar files in the right order: the lowest one in
the export/import chain first, then on up to the last importer. See the FAQ
entries on importVocab/exportVocab on jGuru.

You can easily verify consistency via the generated TokenTypes.txt files.
If a token shows up in them with different numbers, then you ran
something in the wrong order (or you have a cycle in your import/export
chain, which cannot work).
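
The check described above can also be scripted. The following is only a
sketch: TokenTypesCheck and its methods are hypothetical helpers, not part of
ANTLR, and it assumes each entry in a TokenTypes.txt file ends in "=<number>"
(for example FIELD=4 or LITERAL_start="start"=7), with the token name coming
before the first '=' or '('. Comment lines and the vocabulary name line are
skipped.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of ANTLR): compares two token vocabularies.
final class TokenTypesCheck {

    // Parse the text of a *TokenTypes.txt file into token name -> token number.
    // Assumption: the number follows the last '=' on the line.
    static Map<String, Integer> parse(String text) {
        Map<String, Integer> types = new LinkedHashMap<>();
        for (String raw : text.split("\\R")) {
            String line = raw.trim();
            int last = line.lastIndexOf('=');
            if (line.isEmpty() || line.startsWith("//") || last < 0) {
                continue; // blank line, header comment, or vocabulary name line
            }
            String number = line.substring(last + 1).trim();
            if (!number.matches("\\d+")) {
                continue; // not a token entry in the assumed format
            }
            int end = line.indexOf('=');
            int paren = line.indexOf('(');
            if (paren >= 0 && paren < end) {
                end = paren; // name is followed by a parenthesised paraphrase
            }
            types.put(line.substring(0, end).trim(), Integer.parseInt(number));
        }
        return types;
    }

    // List every token that both vocabularies define with different numbers.
    static List<String> conflicts(Map<String, Integer> a, Map<String, Integer> b) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null && !other.equals(e.getValue())) {
                out.add(e.getKey() + ": " + e.getValue() + " vs " + other);
            }
        }
        return out;
    }
}
```

Feed it the contents of the lexer's and the parser's TokenTypes.txt files; an
empty result from conflicts() means the tokens the two files share agree on
their numbers.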

> How would someone who is a bit more experienced with ANTLR handle this
> type of data so that I could walk around the tree and skip stanzas
> easily? I think I should be doing something with imaginary tokens, but
> when I experimented with them based on the examples in the
> distribution it didn't quite seem to work the way I expected.

In general you'd have a rule that matches something interesting like a
header or a stanza. Then at the end of the rule you'd insert a tag at the
top of the generated tree so you can very easily recognize it in the
subsequent tree walker phase. Something like this in the parser:

my_header: lbl:"me_is_header" stuff more_stuff
	{
		// put an imaginary GENERIC_HEADER node on top of the tree built so far
		## = #([GENERIC_HEADER, lbl.getText()], ##);
	}
;

Or...:

my_header: lbl:HEADER_ID stuff more_stuff
	{
		// choose the imaginary root token from the header text
		if ( lbl.getText().equals("foo") )
			## = #([FOO_HEADER, lbl.getText()], ##);
		else if ( lbl.getText().equals("bar") )
			## = #([BAR_HEADER, lbl.getText()], ##);
		else
			## = #([GENERIC_HEADER, lbl.getText()], ##);
	}
;
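
To illustrate why the tag helps on the tree walker side, here is a small
plain-Java sketch. It does not use the ANTLR runtime: Node is a hypothetical
stand-in for an AST node, and FILE, FOO_STANZA, BAR_STANZA and ROW are made-up
token types. The point is only that once every stanza has its own root type,
skipping a stanza is a single comparison on that root.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch only: Node is a hypothetical stand-in for an ANTLR AST node, and the
// token type constants are invented for this example.
final class StanzaWalker {
    static final int FILE = 1, FOO_STANZA = 2, BAR_STANZA = 3, ROW = 4;

    static final class Node {
        final int type;
        final String text;
        final List<Node> children;

        Node(int type, String text, Node... children) {
            this.type = type;
            this.text = text;
            this.children = Arrays.asList(children);
        }
    }

    // Collect the text of every ROW under stanzas whose root has the wanted type.
    // Stanzas with any other root type are skipped without visiting their children.
    static List<String> rowsOf(Node file, int wantedType) {
        List<String> rows = new ArrayList<>();
        for (Node stanza : file.children) {
            if (stanza.type != wantedType) {
                continue; // one comparison skips the whole stanza
            }
            for (Node child : stanza.children) {
                if (child.type == ROW) {
                    rows.add(child.text);
                }
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        Node file = new Node(FILE, "file",
            new Node(FOO_STANZA, "stanzatypefoo", new Node(ROW, "1,2,3")),
            new Node(BAR_STANZA, "stanzatypebar", new Node(ROW, "4,5,6")));
        System.out.println(rowsOf(file, FOO_STANZA));
    }
}
```

With a flat AST the walker would instead have to re-discover the stanza
boundaries itself, which is where the complication in the tree parser comes
from.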

Keep in mind that you can turn off tree construction selectively in a rule
and glue the parts together manually. (You may still have to restructure a
few rules to get what you want.)

> Does anyone with more expertise using antlr have any advice or a good
> way of going about parsing stanza-based/line-based data coming from a
> simplistic lexer that just gives FIELD, DELIM and NEWLINEs? I'd rather
> not have to put more logic in the lexer, as then I couldn't share the
> lexer as easily.

I'd keep the lexer as simple as possible. Then tag the interesting bits in
the parser and try to make sense of them in the tree parser (or use
multiple tree parsers). The first link gives an idiom to selectively look at
branches in an AST.

This might also be of interest:

http://www.codetransform.com/filterexample.html

The general problem seems to be to get the right 'divide' in order to
'conquer' ;)

You could chunk things up pretty roughly in a first-stage lexer. Then, in
the subsequent parser, call another lexer/parser on the token text of a
chunk and dupTree the resulting AST into the AST generated for the current
parser rule (make sure the token vocabulary is the same between the
parsers for that). Conceptually the above link might give a nicer solution,
but this approach gives you a load of small lexers/parsers that are
quite maintainable, and you can easily glue in something new.

Another option is to use your favourite scripting language; this really
looks like something fit for that ;)
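
For comparison, a line-oriented reader for the sample format in the question
needs very little code. The sketch below is in Java rather than a scripting
language, only to match the grammar actions earlier in this message.
StanzaReader is a hypothetical name, and the sketch assumes that a stanza
begins with a line starting with "start ", that fields are separated by plain
commas with no quoting, and that each stanza type occurs at most once (a
repeated type would overwrite the earlier one).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical line-based reader for the stanza format shown in the question.
final class StanzaReader {

    // Returns stanza type -> rows, each row being a list of fields.
    // The first row of every stanza holds its column headers.
    static Map<String, List<List<String>>> read(String text) {
        Map<String, List<List<String>>> stanzas = new LinkedHashMap<>();
        List<List<String>> current = null; // stays null until the first "start" line
        for (String line : text.split("\\R")) {
            if (line.trim().isEmpty()) {
                continue; // blank separator line
            }
            if (line.startsWith("start ")) {
                current = new ArrayList<>();
                stanzas.put(line.substring("start ".length()).trim(), current);
            } else if (current != null) {
                // limit -1 keeps trailing empty fields
                current.add(Arrays.asList(line.split(",", -1)));
            }
            // lines before the first stanza (the file header) are ignored here
        }
        return stanzas;
    }
}
```

Skipping a stanza then amounts to not looking up its key in the returned map.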

Cheers,

Ric
-- 
-----+++++*****************************************************+++++++++-------
    ---- Ric Klaren ----- j.klaren at utwente.nl ----- +31 53 4893755  ----
-----+++++*****************************************************+++++++++-------
 "Don't call me stupid." "Oh, right. To call you stupid would be an insult
    to stupid people. I've known sheep that could outwit you! I've worn
              dresses with higher IQs!" --- A Fish Called Wanda
