[antlr-interest] parsing a mix of structured and free-form text

Tue Jun 20 01:43:59 PDT 2006

Hi,

I am trying to use ANTLR to parse files that are mostly structured
(they are generated by another program), but that here and there contain
text that is nearly free-form, since it is essentially user input. Where
this unstructured text appears within the structured text varies, but
its position is defined by the grammar.

Ideally, at the point in parsing when I know that unstructured text
follows, I would like to simply read the right number of characters
(these are fixed-width fields, so I know how many to read), so that
parsing of the structured text can continue.
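
To make the idea concrete, here is a minimal sketch in plain Python (not
ANTLR) of the behaviour I am after. Everything in it is hypothetical: the
token pattern, the NAME width:text; record layout, and the names
OnDemandScanner, read_raw and parse_field are invented for illustration
and are not my real file format.

```python
import re

# Hypothetical token pattern for the structured part:
# a word, an integer, or any single non-space character.
TOKEN_RE = re.compile(r"\s*([A-Za-z_]+|\d+|\S)")


class OnDemandScanner:
    """Produces one token per call and never reads ahead of the parser."""

    def __init__(self, text):
        self.text = text
        self.pos = 0  # offset of the next unread character

    def next_token(self):
        m = TOKEN_RE.match(self.text, self.pos)
        if m is None:
            return None  # nothing but whitespace (or nothing) is left
        self.pos = m.end()
        return m.group(1)

    def read_raw(self, n):
        """Return the next n characters verbatim, bypassing tokenization."""
        chunk = self.text[self.pos:self.pos + n]
        if len(chunk) < n:
            raise ValueError("input ended inside a fixed-width field")
        self.pos += n
        return chunk


def parse_field(scanner):
    """Parse one hypothetical record of the form  NAME width:raw-text;"""
    name = scanner.next_token()
    width = int(scanner.next_token())
    if scanner.next_token() != ":":
        raise ValueError("expected ':' before the free-form text")
    raw = scanner.read_raw(width)  # free-form part, taken as-is
    if scanner.next_token() != ";":
        raise ValueError("expected ';' after the free-form text")
    return name, raw
```

With the input "NAME 10:it's free!;AGE 42", parse_field takes the ten
characters after the colon untouched and normal tokenizing resumes at the
semicolon. This only works because the scanner produces tokens strictly
on demand.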

The problem here is the look-ahead tokens: the lexer runs a bit ahead of
the parser and consumes input characters in advance.

The places where this unstructured text appears are such that there's no 
need to use look-ahead tokens to decide which grammar rule to apply.

I have used ANTLR for some simpler things. I have also used Bison and
Flex before, and there I used Flex states to control grabbing characters
when approaching places with unstructured text. But I'm not familiar
enough with ANTLR to know how to do this, or whether it's possible at all.

I don't know whether it would be possible to get the text of the
look-ahead tokens, discard them, and force the lexer to continue from a
different position in the input stream.
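
As a rough illustration of that last idea, again in plain Python rather
than ANTLR, the sketch below buffers look-ahead tokens together with
their end offsets. When raw text is requested, it throws the buffered
tokens away and rewinds to the end of the last token the parser actually
consumed. BufferedLexer and its methods are hypothetical names, and I am
not claiming ANTLR exposes anything equivalent.

```python
import re
from collections import deque

# Hypothetical token pattern: a word, an integer, or one non-space character.
TOKEN_RE = re.compile(r"\s*([A-Za-z_]+|\d+|\S)")


class BufferedLexer:
    """Lexer that keeps up to `lookahead` tokens beyond the current one."""

    def __init__(self, text, lookahead=2):
        self.text = text
        self.pos = 0            # where the lexer itself has read up to
        self.consumed_end = 0   # end offset of the last token given to the parser
        self.lookahead = lookahead
        self.buffer = deque()   # entries are (token_text, end_offset)

    def _fill(self):
        # Tokenize ahead until the current token plus look-ahead are buffered.
        while len(self.buffer) < self.lookahead + 1:
            m = TOKEN_RE.match(self.text, self.pos)
            if m is None:
                break
            self.buffer.append((m.group(1), m.end()))
            self.pos = m.end()

    def next_token(self):
        self._fill()
        if not self.buffer:
            return None
        token, end = self.buffer.popleft()
        self.consumed_end = end
        return token

    def read_raw(self, n):
        """Drop buffered look-ahead, rewind, and return n raw characters."""
        self.buffer.clear()
        start = self.consumed_end  # rewind past everything lexed too early
        chunk = self.text[start:start + n]
        if len(chunk) < n:
            raise ValueError("input ended inside a fixed-width field")
        self.pos = self.consumed_end = start + n
        return chunk
```

Rewinding to the end of the last consumed token, rather than to the start
of the first look-ahead token, matters because the lexer skips
whitespace: a fixed-width field that begins with a space would otherwise
lose it.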

Any help/hints/ideas are very much welcome.

Milan