[antlr-interest] simple but really messy records
Mark Volkmann
r.mark.volkmann at gmail.com
Fri Feb 22 05:06:50 PST 2008
On Fri, Feb 22, 2008 at 4:01 AM, brian <brianchina60221 at gmail.com> wrote:
> I'm trying to decide whether ANTLR is appropriate for something.
>
> I'm trying to clean up several million one-line fixed-length records
> from an IBM AS/400. They're entered by humans and are very messy, but
> the vast majority do have some structure that can be recognized and
> parsed, say, by regular expressions. I just need to decide which part
> of the line represents a name, which represents a date, etc. I could
> write maybe 100 regular expressions and apply them until I got a match
> to recognize most of them. But I think this does a lot of the same
> work over and over, making it slow. I might be able to write a truly
> horrendous regular expression that doesn't redo as much work, but it'd
> be hard to maintain. I think ANTLR's memoization might help a lot.
>
> But the only token I can think to make is CHARACTER. Sometimes a date
> looks like 2008-02-22, but something that looks that way isn't
> necessarily a date, so I don't want to make a token for it. Usually if
> you see 'int' or whatever, you can say without hesitation that means a
> certain thing and can generate an appropriate token, but I don't think
> I can here. There aren't any keywords. I basically want to use ANTLR
> as if it were a way of building a regex that would be way too difficult
> to make or maintain by hand.
>
> So people with a lot of experience, please say whether using ANTLR is
> probably good or probably bad. If probably bad, are there more
> appropriate technologies? Thank you.
This sounds like a job for AWK.
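
AWK programs are lists of pattern/action rules, which is close to the
"apply regular expressions until one matches" idea in the question. As a
rough illustration only, here is the same idea as a minimal Python sketch.
The two layouts, the field names, and the classify() helper are all made
up for the example; the real record formats aren't given in this thread.

```python
import re

# Hypothetical layouts: the actual AS/400 record formats are not known here.
# Each entry is (label, pattern); named groups mark the fields to extract.
LAYOUTS = [
    ("name_then_iso_date",
     r"(?P<name>[A-Za-z][A-Za-z .,'-]*?)\s+(?P<date>\d{4}-\d{2}-\d{2})\s*"),
    ("us_date_then_name",
     r"(?P<date>\d{2}/\d{2}/\d{4})\s+(?P<name>[A-Za-z][A-Za-z .,'-]*?)\s*"),
]

# Compile once up front so the per-record loop only does matching.
COMPILED = [(label, re.compile(pattern)) for label, pattern in LAYOUTS]


def classify(line):
    """Return (label, fields) for the first layout that matches the whole
    line, or None if no layout matches."""
    for label, rx in COMPILED:
        m = rx.fullmatch(line)
        if m:
            return label, m.groupdict()
    return None


if __name__ == "__main__":
    for record in ["SMITH, JOHN   2008-02-22  ", "02/22/2008 JOHN SMITH", "???"]:
        print(repr(record), "->", classify(record))
```

Ordering the list from most to least common layout should keep the
average number of attempts per record low, and records that fall through
to None can be set aside for a closer look.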
--
R. Mark Volkmann
Object Computing, Inc.