[antlr-interest] simple but really messy records

Jim Idle jimi at temporal-wave.com
Fri Feb 22 08:44:35 PST 2008


> -----Original Message-----
> From: brian [mailto:brianchina60221 at gmail.com]
> Sent: Friday, February 22, 2008 2:02 AM
> To: antlr-interest at antlr.org
> Subject: [antlr-interest] simple but really messy records
> 
> I'm trying to decide whether ANTLR is appropriate for something.
> 
> I'm trying to clean up several million one-line fixed-length records
> from an IBM AS/400. They're entered by humans and are very messy, but
> the vast majority do have some structure that can be recognized and
> parsed, say, by regular expressions. I just need to decide which part
> of the line represents a name, which represents a date, etc. I could
> write maybe 100 regular expressions and apply them until I got a match
> to recognize most of them. But I think this does a lot of the same
> work over and over, making it slow. I might be able to write a truly
> horrendous regular expression that doesn't redo as much work, but it'd
> be hard to maintain. I think maybe ANTLR's memoize might help a lot.
> 
> But the only token I can think to make is CHARACTER. Sometimes a date
> looks like 2008-02-22, but something that looks that way isn't
> necessarily a date, so I don't want to make a token for it. Usually if
> you see 'int' or whatever, you can say without hesitation that means a
> certain thing and can generate an appropriate token, but I don't think
> I can here. There aren't any keywords. I basically want to use ANTLR
> as if it was a way of building a regex that would be way too difficult
> to make/maintain by hand.
> 
> So people with a lot of experience, please say whether using ANTLR is
> probably good or probably bad. If probably bad, are there more
> appropriate technologies? Thank you.

I think that this really depends on just what is in the input data. It 
isn't necessarily a bad way to go: you could use backtracking and 
memoize and it might not be too bad. I think you are correct that it 
would be more maintainable, and the horror of writing the 
grammar in the first place might be worth it. Personally, though, I would 
just write a program in a high-level language and give it lots of 
comments. 
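To make the "plain program" suggestion concrete, here is a minimal Python sketch: a prioritized list of record patterns tried in order, with named groups pulling out the fields. The pattern names and field shapes (a dated name record, a name-only record) are invented for illustration; the real AS/400 records would need their own set, most specific first.

```python
import re

# Hypothetical record shapes -- each entry is (pattern_name, compiled regex).
# Named groups identify which part of the line is a date, a name, etc.
PATTERNS = [
    ("dated_name", re.compile(r"(?P<date>\d{4}-\d{2}-\d{2})\s+(?P<name>.+)")),
    ("name_only",  re.compile(r"(?P<name>[A-Za-z ,.'-]+)$")),
    # ... one entry per record shape, ordered most specific first
]

def classify(line):
    """Return (pattern_name, field_dict) for the first matching pattern,
    or (None, {}) if nothing matched and the line needs manual review."""
    for name, rx in PATTERNS:
        m = rx.match(line)
        if m:
            return name, m.groupdict()
    return None, {}

print(classify("2008-02-22 SMITH, JOHN"))
```

Because the patterns are tried sequentially, this is exactly the "100 regexes, slow but maintainable" approach from the question; the comments and named groups are what keep it maintainable.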

You can probably make more tokens than that, even if you just have DIGIT 
and CHAR, but the thing is that if you have to keep looking at the characters 
in CHAR with predicates and so on, then it probably isn't going to be any more 
readable than a regex. If a regex can do it, though, then you can probably 
write a lexer that acts more like flex and tries all the combinations, 
rejecting them when they don't match, along these lines:

TOKEN
	: 'x'
		(
			  ('yyzz')     => 'yyzz'     { $type = XYYZZ; }
			| ('zzzzzzzz') => 'zzzzzzzz' { $type = XZZZZZZZZ; }
			// ... and so on, one predicated alternative per suffix
		)
	;


You might find that more readable than a regex, but you may well need a 
stateful lexer. 

Of course, you may not need a parser at all. You could try the filtering-lexer 
approach (see the fuzzy Java example), with a catch-all ANY : . ; 
rule as the last rule in the filter and some lexer state flags. I think 
that a day or so working on a filtering lexer would tell you whether you will 
gain anything this way, and it is probably worth a try.
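The filtering-lexer idea can be emulated in a few lines of Python, which may help in deciding whether it pays off before writing the grammar. At each position the "interesting" token rules are tried; if none fires, the equivalent of the catch-all ANY rule silently consumes one character. The DATE and WORD rules here are placeholders, not anything from the actual records.

```python
import re

# Placeholder token rules, tried in order at each input position.
RULES = [
    ("DATE", re.compile(r"\d{4}-\d{2}-\d{2}")),
    ("WORD", re.compile(r"[A-Za-z]+")),
]

def fuzzy_tokens(text):
    """Emit (name, lexeme) for every rule that matches; anything else
    falls through to the ANY-style rule and is skipped one char at a time."""
    tokens, pos = [], 0
    while pos < len(text):
        for name, rx in RULES:
            m = rx.match(text, pos)
            if m:
                tokens.append((name, m.group()))
                pos = m.end()
                break
        else:
            # ANY : . ;  -- no rule matched, discard one character
            pos += 1
    return tokens

print(fuzzy_tokens("x 2008-02-22 ??foo"))
```

The appeal is the same as in the ANTLR version: only the recognizable islands in a messy line produce tokens, and the noise between them costs nothing to handle.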

Jim




