[antlr-interest] Need help parsing text format
Isabelle Muszynski
imus at linuxmail.org
Mon Apr 30 05:44:38 PDT 2007
Thanks.
I had done something like that in my first attempts, but I figured there must be a better way.
Before I tried parsing this "grammar" with ANTLR, I was doing it with regular expressions, which did everything I needed. If I
now have to read each character separately, or read runs of arbitrary characters and then do the equivalent of a regular-expression
check afterwards, I seem to be back to square one.
I'll try to do it as you suggest.
Isabelle
> ----- Original Message -----
> From: "Gavin Lambert" <antlr at mirality.co.nz>
> To: "Isabelle Muszynski" <imus at linuxmail.org>, antlr-interest at antlr.org
> Subject: Re: [antlr-interest] Need help parsing text format
> Date: Tue, 01 May 2007 00:00:53 +1200
>
>
> At 23:46 30/04/2007, Isabelle Muszynski wrote:
> >My grammar is shown at the end of this mail.
> >The problem is that it won't parse the cases where alphanumeric
> >fields contain, for example, only letters:
> >
> >BCSABC/12.CHARLIE/HAM-BRE.1/bla&&^^%%$$bla.3
> [...]
> >fragment DIGIT : '0'..'9' ;
> >
> >fragment LETTER : 'a'..'z' | 'A'..'Z' ;
> >
> >fragment ALPHA : LETTER | DIGIT;
> >
> >fragment ANY_CHAR : ALPHA | SPECIAL_CHAR ;
> >
> >fragment WS_CHAR : (' ' | '\t' | '\u000C' ) ;
> >
> >WS : WS_CHAR+;
> >
> >NEWLINE : '\r'? '\n' ;
> >
> >LETTER_WORD : LETTER+ ;
> >NUMBER_WORD : DIGIT+ ;
> >ALPHA_WORD : ALPHA+ ;
> >FREE_WORD : ANY_CHAR+ ;
>
> I think this is the problem: there is ambiguity between ALPHA_WORD
> and both of LETTER_WORD and NUMBER_WORD, since each is a subset of
> ALPHA_WORD. Since they're all token-producing rules (not fragments)
> and the lexer can't change its mind and switch to a different token
> later on, it has to guess which one to use. I suspect (given the
> order here) it'll pick LETTER_WORD if the text consists only of
> letters, and NUMBER_WORD if it consists only of digits. It might
> pick ALPHA_WORD for mixed cases, but then it might also generate a
> combination of LETTER_WORDs and NUMBER_WORDs instead, which is
> another ambiguity.
>
> The same applies for FREE_WORD, since that's another superset.
>
> There are dodges you can use to get around this sort of thing, but
> in this case, since your grammar is so interwoven, I think you might
> be better off making a minimalist lexer and doing most of the heavy
> lifting in the parser instead. In other words, remove the ALPHA,
> ALPHA_WORD, ANY_CHAR, and FREE_WORD rules, and make SPECIAL_CHAR a
> token-producing rule (not a fragment). Then make equivalents of
> ALPHA_WORD and FREE_WORD as parser rules instead of lexer rules.
>
> You could even remove all the WORD rules and make each lexer token
> only contain a single character (except possibly for whitespace).
> Probably wouldn't make a lot of difference in this case :)
>
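A minimal sketch of the restructuring suggested in the quoted reply, in ANTLR v3 syntax. It is an illustration under assumptions, not the poster's actual grammar: the grammar name Fields and the parser rule names alpha_word and free_word are hypothetical, and the SPECIAL_CHAR set is only guessed from the sample input line, since the original definition is elided from the post. The point is that no lexer token is a subset of another, so the lexer never has to choose between overlapping rules; the former superset tokens are rebuilt in the parser.

```
grammar Fields;   // hypothetical grammar name

// Parser rules: the former ALPHA_WORD and FREE_WORD tokens,
// rebuilt from mutually exclusive lexer tokens.
alpha_word : (LETTER_WORD | NUMBER_WORD)+ ;
free_word  : (LETTER_WORD | NUMBER_WORD | SPECIAL_CHAR)+ ;

// Lexer rules: each input character can start exactly one token type.
LETTER_WORD  : LETTER+ ;
NUMBER_WORD  : DIGIT+ ;
SPECIAL_CHAR : '&' | '^' | '%' | '$' ;   // assumed set; now a real token, not a fragment

fragment DIGIT  : '0'..'9' ;
fragment LETTER : 'a'..'z' | 'A'..'Z' ;
```

Because WS in the original grammar is an ordinary token rather than a hidden one, consecutive LETTER_WORD and NUMBER_WORD tokens are always adjacent in the input, so alpha_word still matches one unbroken field. Separators such as '/', '.' and '-' would need their own tokens, and the rules that use alpha_word and free_word would have to rely on those separators to tell where a field ends.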