[antlr-interest] Customizing token separators without recompiling
Jim Idle
jimi at temporal-wave.com
Sun Jun 7 18:52:06 PDT 2009
Hi,
If the entire structure is just these lines, then a parser is likely
overkill, to be honest. However, you can create a lexer rule that
changes its definition at runtime, but you must be careful that the
set of delimiters would never otherwise appear in the input.
What you do is add a member method to the lexer that accepts
the delimiter then use a gated predicate to select the token:
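@lexer::members {
    // Delimiter character; set it before the lexer starts consuming input.
    protected int delim;
    public void setDelim(int d) {
        delim = d;
    }
}
// Gated predicate: the rule is only viable when the next char is the delimiter.
DELIM : {input.LA(1) == delim}?=> . ;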
But note that by using this rule, you will always get DELIM for that
character and so if you had:
SEMI : ';' ;
but set the delimiter to ';', then you would no longer get SEMI.
Perhaps it would be best to write a custom lexer.
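As a rough idea of what such a custom lexer might look like, here is a
minimal sketch in plain Java. It is not ANTLR code and every name in it
(DelimLexer, Type, Token, tokenize) is made up for illustration. It assumes
three single-character delimiters supplied at runtime and simply skips line
breaks; wiring the resulting tokens into an ANTLR-generated parser would need
an adapter that is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical hand-written lexer; none of these names come from ANTLR.
final class DelimLexer {
    enum Type { TEXT, ELEMENT_SEP, COMPONENT_SEP, SEGMENT_END }

    static final class Token {
        final Type type;
        final String text;
        Token(Type type, String text) { this.type = type; this.text = text; }
        @Override public String toString() { return type + "(" + text + ")"; }
    }

    // Delimiters are plain fields, so they can come from configuration.
    private final char elementSep;
    private final char componentSep;
    private final char segmentEnd;

    DelimLexer(char elementSep, char componentSep, char segmentEnd) {
        this.elementSep = elementSep;
        this.componentSep = componentSep;
        this.segmentEnd = segmentEnd;
    }

    List<Token> tokenize(String input) {
        List<Token> out = new ArrayList<>();
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '\n' || c == '\r') continue; // assumption: line breaks carry no meaning
            Type sep = classify(c);
            if (sep == null) {                    // ordinary character: accumulate
                text.append(c);
                continue;
            }
            flush(text, out);                     // emit pending text before the delimiter
            out.add(new Token(sep, String.valueOf(c)));
        }
        flush(text, out);
        return out;
    }

    // Returns the delimiter type for c, or null if c is ordinary text.
    private Type classify(char c) {
        if (c == elementSep) return Type.ELEMENT_SEP;
        if (c == componentSep) return Type.COMPONENT_SEP;
        if (c == segmentEnd) return Type.SEGMENT_END;
        return null;
    }

    private static void flush(StringBuilder text, List<Token> out) {
        if (text.length() > 0) {
            out.add(new Token(Type.TEXT, text.toString()));
            text.setLength(0);
        }
    }
}
```

For the Edifact sample quoted below, new DelimLexer('+', ':', '\'') would turn
QTY+1:1500:EA' into TEXT, ELEMENT_SEP, TEXT, COMPONENT_SEP, TEXT,
COMPONENT_SEP, TEXT, SEGMENT_END. The same caveat applies as with the gated
predicate: a delimiter character can never show up inside TEXT.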
EDI is another good idea screwed up by design by committee, where none
of the members will give up their proprietary formats :(
Jim
On Jun 7, 2009, at 4:45 PM, Dukie Banderjee
<dukie_banderjee at hotmail.com> wrote:
>
> "If you simply want to break apart a line of text based on an
> arbitrary delimiter, it would be much easier to write a program in
> Perl, Python, Java, etc. that split the text based on a configuration
> setting."
>
> That's basically what I'm doing right now (in C#, by hand). Are you
> saying that ANTLR can't work at all with this?
>
> At some level it becomes a parsing issue. Each line has a different
> meaning, and should perform a different action and/or gather
> different information.
>
> It seems to me that these files would lend themselves very well to
> an intermediate AST form. For example, the style of document I
> showed you earlier was an Ansi 830 format. There is another format
> which is UN Edifact, which looks like this:
> DTM+2:20080523:102'
> QTY+1:1500:EA'
> SCC+1++D:ZZZ'
>
> Although this looks totally different, it is logically the same
> information as the previous example I showed (FST*...).
>
> I was hoping to use ANTLR to work on two different grammars to
> translate the raw text into tokens, which could further be
> translated into a generic command tree (basically to add records
> into a DB) that would be functionally equivalent whether it
> originally came from Ansi 830 or UN Edifact.
>
> It seems to me that ANTLR would have been a good tool to use to do
> this translation. I'd rather not be forced to do the entire thing by
> hand just because of this token separator issue.
>
> Is there a way I could perform the token splitting manually (as you
> suggest), but then feed the resulting tokens into an ANTLR-generated
> parser to do the rest of the work?
>
> Thanks,
>
> Rob
>
> Date: Sun, 7 Jun 2009 15:02:09 -0700
> From: jsrs701 at yahoo.com
> Subject: RE: [antlr-interest] Customizing token separators without
> recompiling
> To: antlr-interest at antlr.org; dukie_banderjee at hotmail.com
>
> Oh, I'm saying you wouldn't want to use a grammar at all. The
> problem you've described is lexical, not grammatical. If you simply
> want to break apart a line of text based on an arbitrary delimiter,
> it would be much easier to write a program in Perl, Python, Java,
> etc. that split the text based on a configuration setting.
>
> If further parsing needs to happen on the newly-split fields, then
> you can attack that problem piecemeal on an individual basis.
>
> Make sense?
>
>