[antlr-interest] Need help parsing text format

Isabelle Muszynski imus at linuxmail.org
Mon Apr 30 05:44:38 PDT 2007


Thanks.
I had done something like that in my first attempts, but I figured there must be a better way.
Before I tried parsing this "grammar" with ANTLR, I was doing it with regular expressions, which let me do everything I needed.
If I now have to read each char separately, or read runs of arbitrary chars and then do the equivalent of a regular expression
check afterwards, I seem to be back to square one.

I'll try to do it as you suggest.

Isabelle

> ----- Original Message -----
> From: "Gavin Lambert" <antlr at mirality.co.nz>
> To: "Isabelle Muszynski" <imus at linuxmail.org>, antlr-interest at antlr.org
> Subject: Re: [antlr-interest] Need help parsing text format
> Date: Tue, 01 May 2007 00:00:53 +1200
> 
> 
> At 23:46 30/04/2007, Isabelle Muszynski wrote:
>  >My grammar is shown at the end of this mail.
>  >The problem is that it won't parse the cases where alphanumeric
>  >fields contain only letters, for example:
>  >
>  >BCSABC/12.CHARLIE/HAM-BRE.1/bla&&^^%%$$bla.3
> [...]
>  >fragment DIGIT 	: '0'..'9'	;
>  >	
>  >fragment LETTER :	'a'..'z' | 'A'..'Z' ;	
>  >
>  >fragment ALPHA 	:	LETTER | DIGIT;
>  >
>  >fragment ANY_CHAR :	ALPHA | SPECIAL_CHAR ;
>  >
>  >fragment WS_CHAR  : (' ' | '\t' | '\u000C' ) ;
>  >
>  >WS      :       WS_CHAR+;
>  >
>  >NEWLINE  : '\r'? '\n' ;
>  >
>  >LETTER_WORD :	LETTER+ ;
>  >NUMBER_WORD  :	DIGIT+ ;
>  >ALPHA_WORD  : ALPHA+ ;
>  >FREE_WORD : ANY_CHAR+ ;
> 
> I think this is the problem: there is ambiguity between ALPHA_WORD 
> and both of LETTER_WORD and NUMBER_WORD, since each is a subset of 
> ALPHA_WORD.  Since they're all productions and it can't change its 
> mind and become a different token later on, it has to guess which 
> one to use.  I suspect (given the order here) it'll pick 
> LETTER_WORD if it consists only of letters, and NUMBER_WORD if it 
> consists only of numbers.  It might pick ALPHA_WORD for mixed cases 
> but then it also might generate a combination of LETTER_WORDs and 
> NUMBER_WORDs instead -- that's another ambiguity.
> 
> The same applies for FREE_WORD, since that's another superset.
> 
> There are dodges you can use to get around this sort of thing, but 
> in this case, since your grammar is so interwoven, I think you might 
> be better off making a minimalist lexer and doing most of the heavy 
> lifting in the parser instead.  In other words, remove the ALPHA, 
> ALPHA_WORD, ANY_CHAR, and FREE_WORD rules, and make SPECIAL_CHAR an 
> output rule (not a fragment).  Then make equivalents of ALPHA_WORD 
> and FREE_WORD as parser rules instead of lexer rules.
> 
> You could even remove all the WORD rules and make each lexer token 
> only contain a single character (except possibly for whitespace).  
> Probably wouldn't make a lot of difference in this case :)
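
For reference, the restructuring suggested above might look roughly like the following in ANTLR 3 syntax. This is only a sketch under assumptions: the grammar name Fields and the parser rule names alpha_word and free_word are made up for illustration, and the SPECIAL_CHAR alternatives are a placeholder built from the characters visible in the sample input, since the original SPECIAL_CHAR definition is elided in the quoted mail. The separator characters, WS, and NEWLINE are left out for brevity.

```
grammar Fields;   // hypothetical grammar name

// Parser rules standing in for the old ALPHA_WORD and FREE_WORD lexer rules.
// The rule names are illustrative, not from the original grammar.
alpha_word : ( LETTER_WORD | NUMBER_WORD )+ ;
free_word  : ( LETTER_WORD | NUMBER_WORD | SPECIAL_CHAR )+ ;

// Lexer rules: no token's character set overlaps another's,
// so the lexer never has to guess between competing tokens.
fragment DIGIT  : '0'..'9' ;
fragment LETTER : 'a'..'z' | 'A'..'Z' ;

LETTER_WORD  : LETTER+ ;
NUMBER_WORD  : DIGIT+ ;

// Placeholder only: the real SPECIAL_CHAR set is not shown in the thread.
// It is now a normal (non-fragment) rule so the parser can see it.
SPECIAL_CHAR : '&' | '^' | '%' | '$' ;
```

With this split, a field such as ABC always reaches the parser as one LETTER_WORD and 12 as one NUMBER_WORD, while a mixed field such as A1B2 arrives as LETTER_WORD NUMBER_WORD LETTER_WORD NUMBER_WORD and is gathered by alpha_word. The decision about which kind of field is expected moves into the parser, which knows the surrounding context.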
