[antlr-interest] Customizing token separators without recompiling

Jim Idle jimi at temporal-wave.com
Sun Jun 7 18:52:06 PDT 2009


Hi,

If the entire structure is just these lines then a parser is likely  
overkill, to be honest. However, you can create a lexer rule that  
changes its definition at runtime, but you must be careful that the  
set of delimiters would never otherwise appear in the input.

What you do is add a member method to the lexer that accepts the  
delimiter, then use a gated predicate to select the token:

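@lexer::members {
// Delimiter character, set by the caller before lexing starts
protected int delim;
public void setDelim(int d) {
   delim = d;
}
}

// Gated predicate: only viable when the next character is the delimiter
DELIM : {input.LA(1) == delim}?=> . ;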

But note that by using this rule, you will always get DELIM for that  
character and so if you had:

SEMI : ';' ;

and then set the delimiter to ';', you would no longer get SEMI.

Perhaps it would be best to write a custom lexer.
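
As a rough illustration of what such a custom lexer might look like, here is a minimal sketch in plain Java with no ANTLR dependency. The class name DelimitedLexer, the token type names, and the three-delimiter split (element separator, component separator, segment terminator) are assumptions made for the example, based on the UN Edifact sample quoted below; a real file may need release characters, more delimiters, or different whitespace handling.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical hand-written lexer whose delimiters are chosen at runtime. */
final class DelimitedLexer {
    public enum Type { TEXT, ELEMENT_SEP, COMPONENT_SEP, SEGMENT_END }

    public static final class Token {
        public final Type type;
        public final String text;
        Token(Type type, String text) { this.type = type; this.text = text; }
        @Override public String toString() { return type + "(" + text + ")"; }
    }

    // Assumed delimiter roles; all three are supplied by the caller.
    private final char elementSep;
    private final char componentSep;
    private final char segmentEnd;

    DelimitedLexer(char elementSep, char componentSep, char segmentEnd) {
        this.elementSep = elementSep;
        this.componentSep = componentSep;
        this.segmentEnd = segmentEnd;
    }

    /** Maps a character to a delimiter type, or TEXT if it is not a delimiter. */
    private Type classify(char c) {
        if (c == segmentEnd) return Type.SEGMENT_END;
        if (c == elementSep) return Type.ELEMENT_SEP;
        if (c == componentSep) return Type.COMPONENT_SEP;
        return Type.TEXT;
    }

    /** Splits the input into delimiter tokens and the text runs between them. */
    List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            Type type = classify(c);
            if (type == Type.TEXT) {
                // Line breaks that are not delimiters are treated as layout and skipped.
                if (c != '\n' && c != '\r') text.append(c);
                continue;
            }
            flush(text, tokens);
            tokens.add(new Token(type, String.valueOf(c)));
        }
        flush(text, tokens);
        return tokens;
    }

    /** Emits any pending text as a TEXT token; empty fields produce no token. */
    private static void flush(StringBuilder text, List<Token> tokens) {
        if (text.length() > 0) {
            tokens.add(new Token(Type.TEXT, text.toString()));
            text.setLength(0);
        }
    }
}
```

Constructing it as new DelimitedLexer('+', ':', '\'') and calling tokenize on a line such as DTM+2:20080523:102' yields alternating TEXT and separator tokens ending in SEGMENT_END. Because the delimiters are ordinary constructor arguments, nothing needs regenerating when they change. If you still want an ANTLR-generated parser on top, tokens like these could in principle be fed to it by implementing the runtime's TokenSource interface, which is not shown here.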

EDI is another good idea screwed up by design by committee, where none  
of the members will give up their proprietary formats :(

Jim


On Jun 7, 2009, at 4:45 PM, Dukie Banderjee  
<dukie_banderjee at hotmail.com> wrote:

>
> "If you simply want to break apart a line of text based on an  
> arbitrary
> delimiter, it would be much easier to write a program in Perl, Python,
> Java, etc. that split the text based on a configuration setting."
>
> That's basically what I'm doing right now (in C#, by hand). Are you  
> saying that ANTLR can't work at all with this?
>
> At some level it becomes a parsing issue. Each line has a different  
> meaning, and should perform a different action and/or gather  
> different information.
>
> It seems to me that these files would lend themselves very well to  
> an intermediate AST form. For example, the style of document I  
> showed you earlier was an Ansi 830 format. There is another format  
> which is UN Edifact, which looks like this:
> DTM+2:20080523:102'
> QTY+1:1500:EA'
> SCC+1++D:ZZZ'
>
> Although this looks totally different, it is logically the same  
> information as the previous example I showed (FST*...).
>
> I was hoping to use ANTLR to work on two different grammars to  
> translate the raw text into tokens, which could further be  
> translated into a generic command tree (basically to add records  
> into a DB) that would be functionally equivalent whether it  
> originally came from Ansi 830 or UN Edifact.
>
> It seems to me that ANTLR would have been a good tool to use to do  
> this translation. I'd rather not be forced to do the entire thing by  
> hand just because of this token separator issue.
>
> Is there a way I could perform the token splitting manually (as you  
> suggest), but then feed the resulting tokens into an ANTLR-generated  
> parser to do the rest of the work?
>
> Thanks,
>
> Rob
>
> Date: Sun, 7 Jun 2009 15:02:09 -0700
> From: jsrs701 at yahoo.com
> Subject: RE: [antlr-interest] Customizing token separators without  
> recompiling
> To: antlr-interest at antlr.org; dukie_banderjee at hotmail.com
>
> Oh, I'm saying you wouldn't want to use a grammar at all.  The  
> problem you've described is lexical, not grammatical.  If you simply  
> want to break apart a line of text based on an arbitrary delimiter,  
> it would be much easier to write a program in Perl, Python, Java,  
> etc. that split the text based on a configuration setting.
>
> If further parsing needs to happen on the newly-split fields, then  
> you can attack that problem piecemeal on an individual basis.
>
> Make sense?
>
>
> List: http://www.antlr.org/mailman/listinfo/antlr-interest
> Unsubscribe: http://www.antlr.org/mailman/options/antlr-interest/your-email-address

