[antlr-interest] determining tokens at runtime

Mon Jun 21 08:55:25 PDT 2010

I'm very new to ANTLR and am investigating how I might use the tool for an upcoming project.

I need to recognize and parse a language that is similar to X12 syntax, in that it consists of delimited segments, each containing delimited data elements.

The first segment specifies both the element delimiter and the segment delimiter used for the rest of the input. The delimiters must differ from one another and must not occur in the element data, and the segment delimiter can contain multiple characters.

Below is a very simple test grammar that I want to change so that the element delimiter (ED), which is always the first character after 'STA', and the segment delimiter (SD) are determined at runtime rather than hard-coded. I suspect I can't do this entirely in the grammar and may need to subclass or override some core ANTLR classes, or maybe even scan the input buffer.
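To make the "scan the input buffer" idea concrete, here is a rough Java sketch of a pre-scan that reads both delimiters from the header before any lexer is built. It is only an illustration under stated assumptions, not an ANTLR facility: the class DelimiterPrescan and its scan method are hypothetical names, the header is assumed to hold exactly two data elements as in the grammar below, and element data is assumed to be 'A'..'Z' only, so the SD is taken to be everything between the second element and the next uppercase letter.

```java
// Hypothetical helper (not part of ANTLR): discovers ED and SD by
// pre-scanning the first segment of the raw input.
final class DelimiterPrescan {
    final char elementDelimiter;    // ED: the character right after "STA"
    final String segmentDelimiter;  // SD: may span several characters

    private DelimiterPrescan(char ed, String sd) {
        this.elementDelimiter = ed;
        this.segmentDelimiter = sd;
    }

    // Assumption: element data is 'A'..'Z' only, as in the DATA rule.
    private static boolean isData(char c) {
        return c >= 'A' && c <= 'Z';
    }

    // Skips one non-empty run of data characters and returns the new index.
    private static int skipData(String s, int from) {
        int i = from;
        while (i < s.length() && isData(s.charAt(i))) i++;
        if (i == from) throw new IllegalArgumentException("expected data at index " + from);
        return i;
    }

    // Assumption: the header is STA ED DATA ED DATA SD, and the segment
    // after it (if any) starts with an uppercase tag such as BEG.
    static DelimiterPrescan scan(String input) {
        if (!input.startsWith("STA") || input.length() < 4) {
            throw new IllegalArgumentException("input must start with STA plus a delimiter");
        }
        char ed = input.charAt(3);
        if (isData(ed)) throw new IllegalArgumentException("ED must not be a data character");

        int i = skipData(input, 4);                 // first element
        if (i >= input.length() || input.charAt(i) != ed) {
            throw new IllegalArgumentException("expected ED at index " + i);
        }
        i = skipData(input, i + 1);                 // second element

        int sdStart = i;                            // SD runs up to the next data character
        while (i < input.length() && !isData(input.charAt(i))) i++;
        String sd = input.substring(sdStart, i);
        if (sd.isEmpty() || sd.indexOf(ed) >= 0) {
            throw new IllegalArgumentException("SD must be non-empty and must not contain ED");
        }
        return new DelimiterPrescan(ed, sd);
    }

    public static void main(String[] args) {
        DelimiterPrescan d = scan("STA*HEADER*SEGMENT\r\nBEG*TRANSACTION*HEADER\r\n");
        System.out.println("ED=" + d.elementDelimiter + ", SD length=" + d.segmentDelimiter.length());
    }
}
```

The discovered values could then be handed to the lexer, for example through lexer members consulted by semantic predicates, though whether that is the cleanest route in ANTLR is exactly the open question here.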

I'm not sure where to go from here and haven't yet found anything that looks useful, either in The Definitive ANTLR Reference or via Google. I'd appreciate any RTFM links I missed if this has already been discussed many times before, or any pointers on where to look in the source for extending existing ANTLR behavior.

Thanks, Jon

// Simple.g
grammar Simple;

tokens {
  STA = 'STA';
  BEG = 'BEG';
  END = 'END';
}

transaction : header beg_segment footer;

header : STA segment_body;
beg_segment : BEG segment_body;
footer : END segment_body;
segment_body : ED DATA ED DATA SD;

DATA : 'A'..'Z'+;
ED : '*';
SD : '\r' '\n' | '\r';

// test data
STA*HEADER*SEGMENT
BEG*TRANSACTION*HEADER
END*FOOTER*SEGMENT
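
For reference, the structure the grammar is meant to recover from the test data can be sketched without ANTLR at all. The Java below is a hand-rolled illustration only: SegmentSplitter and split are hypothetical names, and the delimiters are passed in as plain parameters on the assumption that they have already been determined by some other means.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper (not part of ANTLR): splits raw input into segments
// and elements once ED and SD are known.
final class SegmentSplitter {
    static List<List<String>> split(String input, char ed, String sd) {
        List<List<String>> segments = new ArrayList<>();
        // Pattern.quote keeps delimiters such as '*' from being read as regex syntax.
        String edPattern = Pattern.quote(String.valueOf(ed));
        for (String segment : input.split(Pattern.quote(sd))) {
            if (segment.isEmpty()) continue;   // ignore empty pieces between delimiters
            segments.add(Arrays.asList(segment.split(edPattern)));
        }
        return segments;
    }

    public static void main(String[] args) {
        String data = "STA*HEADER*SEGMENT\nBEG*TRANSACTION*HEADER\nEND*FOOTER*SEGMENT\n";
        for (List<String> segment : split(data, '*', "\n")) {
            System.out.println(segment);
        }
    }
}
```

Each inner list holds the segment tag followed by its data elements, which corresponds to one header, beg_segment, or footer match in the grammar.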