[antlr-interest] Keywords vs. freeform text

Fri Jun 12 21:28:24 PDT 2009

Hi, 
Hope this isn't too much of a newbie question.

I need to parse a format (EDI) which is basically delimited fields, but some fields must contain standardized code values whereas other fields can contain freeform text.

My question is about lexing versus parsing. Should I define a lexer token for each possible code, or should I just accept a freeform TEXT token and later inspect its text to determine whether it's a valid code?

I currently have a grammar which handles *some* of the more important codes by specifying lexer tokens. E.g.:

ST: 'ST' ;
BFR: 'BFR' ;
N1: 'N1' ;
REF: 'REF' ;
etc...

And a freeform TEXT token:
TEXT: ('a'..'z'|'A'..'Z'|'0'..'9'|' '|':'|'-'|','|'.')+ ;

Then I use a parser rule for those possible fields where *any* text is allowed, even a code:
fieldText    : TEXT | code ;

code    : ST
        | BFR
        | N1
        | REF
etc...

This seems to be working okay for now, but I foresee problems as I expand the grammar to cover all the various codes defined by the EDI standards. For example, some of the codes consist solely of numeric characters, such as '09' or '01' or '12'. Later, I want to add checking for freeform numeric fields, such as those which might contain quantities or arbitrary integers. I think it will start to get ugly if I try to specify lexer tokens and parser rules like this:

CODE_09: '09' ;

NUMERIC: ('0'..'9')+ ;

numericField: NUMERIC | numericCode ;

numericCode: CODE_09 | CODE_01 ... etc.

The core issue is that I need to *sometimes* treat certain fixed sequences of characters (e.g. 'ST' or '09') as special, and sometimes as merely freeform text or numeric values.

I'm fairly new to ANTLR (and parsing/lexing), so I'm not really sure what's a good way to resolve this. Any tips/pointers?

Example input:
ISA*00**00**01*812520286  // Here '01' is a special code which determines the type/format of the following field
SE*01*1052   // Here '01' is simply a numeric value which should be interpreted as an integer.
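
To make the "accept freeform text, decide later" option concrete, here is a rough sketch in Python (not ANTLR) of interpreting each field by its position in the segment rather than by its characters. The FIELD_RULES table and the interpret_segment function are hypothetical names made up for illustration, and the only rules filled in are the two positions from the example input above; the real code lists would come from the EDI standard.

```python
# Hypothetical per-position rules: (segment id, 1-based field index) -> rule.
# Only the two positions from the example input are filled in. The set of
# allowed codes is a placeholder, not the real list from the EDI standard.
FIELD_RULES = {
    ("ISA", 5): ("code", {"01"}),
    ("SE", 1): ("int", None),
}


def interpret_segment(line, delimiter="*"):
    """Split one segment and interpret each field according to its position."""
    fields = line.split(delimiter)
    segment_id, values = fields[0], fields[1:]
    result = []
    for index, raw in enumerate(values, start=1):
        # Positions without a rule are treated as freeform text.
        kind, allowed = FIELD_RULES.get((segment_id, index), ("text", None))
        if kind == "code":
            if raw not in allowed:
                raise ValueError(f"{segment_id}{index:02d}: unknown code {raw!r}")
            result.append(("code", raw))
        elif kind == "int":
            result.append(("int", int(raw)))
        else:
            result.append(("text", raw))
    return segment_id, result


isa = interpret_segment("ISA*00**00**01*812520286")
se = interpret_segment("SE*01*1052")
```

With this shape, the same characters '01' come out as a code in the ISA segment and as an integer in the SE segment, without the splitting step needing to know about either meaning.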

As a side topic, how do I write a lexer which properly handles both:
a) freeform alphanumeric (and spaces) input such as ('a'..'z'|'A'..'Z'|'0'..'9'|' '|':'|'-'|','|'.')+
b) freeform numeric input such as ('0'..'9')+
Is this doomed to be ambiguous? Should it be handled by the parser? Is there a way to handle it in the lexer?
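
The overlap between (a) and (b) can be seen outside ANTLR with a small Python sketch. The two regular expressions below use the same character sets as the patterns above; the classify and accepts functions are hypothetical helpers, shown only to illustrate one possible "classify after matching" approach, not a recommended ANTLR solution.

```python
import re

# Same character sets as patterns (a) and (b) above.
TEXT = re.compile(r"[a-zA-Z0-9 :\-,.]+")
NUMERIC = re.compile(r"[0-9]+")


def classify(field):
    """Return the narrowest category whose pattern matches the whole field."""
    if NUMERIC.fullmatch(field):
        return "NUMERIC"
    if TEXT.fullmatch(field):
        return "TEXT"
    return "INVALID"


def accepts(field, expected):
    """A NUMERIC field is also acceptable wherever TEXT is expected."""
    kind = classify(field)
    return kind == expected or (expected == "TEXT" and kind == "NUMERIC")
```

Every string that matches (b) also matches (a), so the characters alone cannot say which one was meant; in this sketch the caller supplies the expected kind, for example accepts("09", "TEXT") is true while accepts("ST", "NUMERIC") is false.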

Thanks

Rob
