[antlr-interest] simple but really messy records

Fri Feb 22 05:06:50 PST 2008

On Fri, Feb 22, 2008 at 4:01 AM, brian <brianchina60221 at gmail.com> wrote:
> I'm trying to decide whether ANTLR is appropriate for something.
>
>  I'm trying to clean up several million one-line fixed-length records
>  from an IBM AS/400. They're entered by humans and are very messy, but
>  the vast majority do have some structure that can be recognized and
>  parsed, say, by regular expressions. I just need to decide which part
>  of the line represents a name, which represents a date, etc. I could
>  write maybe 100 regular expressions and apply them until I got a match
>  to recognize most of them. But I think this does a lot of the same
>  work over and over, making it slow. I might be able to write a truly
>  horrendous regular expression that doesn't redo as much work, but it'd
>  be hard to maintain. I think ANTLR's memoize option might help a lot.
>
>  But the only token I can think to make is CHARACTER. Sometimes a date
>  looks like 2008-02-22, but something that looks that way isn't
>  necessarily a date, so I don't want to make a token for it. Usually if
>  you see 'int' or whatever, you can say without hesitation that means a
>  certain thing and can generate an appropriate token, but I don't think
>  I can here. There aren't any keywords. I basically want to use ANTLR
>  as if it were a way of building a regex that would be far too
>  difficult to make/maintain by hand.
>
>  So people with a lot of experience, please say whether using ANTLR is
>  probably good or probably bad. If probably bad, are there more
>  appropriate technologies? Thank you.

This sounds like a job for AWK.
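
For comparison, the "apply a list of regular expressions until one matches"
approach described in the question can be sketched in a few lines. This is
only an illustration in Python, not AWK: the two record layouts, the pattern
names, and the parse_record helper are all made up for the example, since the
thread does not show the real AS/400 record formats. Each pattern is compiled
once and tried in order, and named groups say which part of the line is the
name and which is the date.

```python
import re

# Hypothetical layouts: the real record formats are not given in the thread.
# Patterns are compiled once and tried in order; the first match wins.
PATTERNS = [
    ("name_then_iso_date",
     re.compile(r"^(?P<name>[A-Za-z ,.'-]+?)\s+(?P<date>\d{4}-\d{2}-\d{2})\s*$")),
    ("us_date_then_name",
     re.compile(r"^(?P<date>\d{2}/\d{2}/\d{4})\s+(?P<name>[A-Za-z ,.'-]+?)\s*$")),
]


def parse_record(line):
    """Return (layout_name, fields) for the first matching pattern, else None."""
    for label, pattern in PATTERNS:
        match = pattern.match(line)
        if match:
            fields = {key: value.strip() for key, value in match.groupdict().items()}
            return label, fields
    return None  # unrecognized record: set it aside for manual review


if __name__ == "__main__":
    for record in ["SMITH, JOHN      2008-02-22   ", "02/22/2008 DOE, JANE", "???"]:
        print(repr(record), "->", parse_record(record))
```

Putting the most common layouts first in the list keeps the average number of
attempts per record low, and records that match nothing can be collected to
decide which pattern to write next.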

-- 
R. Mark Volkmann
Object Computing, Inc.