[antlr-interest] simple but really messy records

Fri Feb 22 02:01:44 PST 2008

I'm trying to decide whether ANTLR is appropriate for something.

I'm trying to clean up several million one-line fixed-length records
from an IBM AS/400. They're entered by humans and are very messy, but
the vast majority do have some structure that can be recognized and
parsed, say, by regular expressions. I just need to decide which part
of the line represents a name, which represents a date, etc. I could
write maybe 100 regular expressions and apply them one after another
until one matches, which would recognize most records. But I think this
repeats a lot of the same work over and over, making it slow. I might be
able to write a truly horrendous regular expression that doesn't redo as
much work, but it'd be hard to maintain. I think ANTLR's memoization
(the memoize option) might help a lot.
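
To make the "try each regex until one matches" idea concrete, here is a minimal sketch in Python. The two layouts, the field names (name, date) and the character classes are hypothetical; the real list would come from looking at the actual records.

```python
import re

# Hypothetical record layouts, tried in order. Each entry is
# (layout_name, compiled_pattern); the real list might have ~100 entries.
PATTERNS = [
    ("name_then_date",
     re.compile(r"^(?P<name>[A-Z][A-Za-z ,.'-]*?)\s+(?P<date>\d{4}-\d{2}-\d{2})\s*$")),
    ("date_then_name",
     re.compile(r"^(?P<date>\d{4}-\d{2}-\d{2})\s+(?P<name>[A-Z][A-Za-z ,.'-]*?)\s*$")),
]

def parse_record(line):
    """Return (layout_name, fields) for the first layout that matches, else None."""
    for layout, pattern in PATTERNS:
        m = pattern.match(line)
        if m:
            return layout, m.groupdict()
    return None  # unrecognized record; set aside for manual review
```

Every failed pattern rescans the line from the start, which is the repeated work I'm worried about.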

But the only token I can think to make is CHARACTER. Sometimes a date
looks like 2008-02-22, but something that looks that way isn't
necessarily a date, so I don't want to make a token for it. Usually if
you see 'int' or whatever, you can say without hesitation that means a
certain thing and can generate an appropriate token, but I don't think
I can here. There aren't any keywords. I basically want to use ANTLR
as if it were a way of building a regex that would be way too difficult
to make or maintain by hand.
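
For comparison, this is roughly what I mean by building a big regex out of small pieces, sketched in Python rather than ANTLR. The fragments (DATE, NAME, GAP), the field() helper and the field names are all made up for illustration. The point is that the same date-shaped fragment gets its meaning from where it sits in the line, not from its shape alone.

```python
import re

# Hypothetical building blocks, each defined once and reused.
DATE = r"\d{4}-\d{2}-\d{2}"       # date-shaped text, not necessarily a date
NAME = r"[A-Z][A-Za-z ,.'-]*?"    # lazy, so it stops before the next field
GAP = r"\s+"

def field(name, fragment):
    """Wrap a fragment in a named capturing group."""
    return f"(?P<{name}>{fragment})"

# One hypothetical layout: a name followed by two date-shaped fields whose
# meaning (born vs. hired) is decided purely by position.
RECORD = re.compile(
    "^" + field("name", NAME)
    + GAP + field("born", DATE)
    + GAP + field("hired", DATE)
    + r"\s*$"
)

def parse(line):
    """Return the named fields as a dict, or None if the layout doesn't match."""
    m = RECORD.match(line)
    return m.groupdict() if m else None
```

This stays readable for one layout, but I doubt it scales to a hundred of them, which is why I'm asking about a grammar.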

So people with a lot of experience, please say whether using ANTLR is
probably good or probably bad. If probably bad, are there more
appropriate technologies? Thank you.