[antlr-interest] ANTLR Basic Question

Kevin J. Cummings cummings at kjchome.homeip.net
Fri Jul 9 14:03:58 PDT 2010


On 07/09/2010 03:10 PM, Klaus Martinschitz wrote:
>   Hi ANTLR Gurus,
> 
> A beginner's question.
> I want to write a compiler for the Crystallographic Information File
> format (CIF). I don't want to explain the syntax in detail, only the
> problem I am facing.
> 
> The data starts with a token
> 
> 'data_'
> 
> followed by arbitrary characters and an EOL, e.g.
> 
> data_global

Just curious.  Are there a finite number of such tokens?  Or is
something like data_xyzzy legal?

> .
> 
> There is also a token
> 
> 'loop_';
> 
> Somewhere in my BNF I write something like
> 
> DATA
>      :(('d'|'D')('a'|'A')('t'|'T')('a'|'A')'_')
>      ;
> 
> LOOP
>      :
>      (('l'|'L')('o'|'O')('o'|'O')('p'|'P')'_')
>      ;
> 
> dataBlockHeading
>      :    (DATA NONBLANCKCHAR+ EOL)
>      ;
> 
> dataItem
>      :    (tag WHITESPACE value) | (LOOP loopHeader loopBody)
>      ;
> 
> The first two expressions are tokens; the second two are rules. My
> problem is the following. The file starts with
> 
> data_global
> 
> BUT the *lo* of data_g*lo*bal is matched by the LOOP token. How can
> this be if the parser is in the dataBlockHeading rule? The parser must
> know that the characters *lo* belong to NONBLANCKCHAR and not to LOOP,
> right?

Please don't confuse lexing with parsing.  Lexing is the process of
converting the character stream to tokens.  Parsing is the process of
putting the tokens together.  Lexing happens independently of, and
outside the context of, the parser.  So, you should make sure that all of
your tokens are defined without regard to your parser rules.
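
To make the point concrete, here is a toy first-match tokenizer in Python.
It is only a sketch: it is not ANTLR's prediction algorithm, and the rule
names simply mirror the ones in the question. What it shows is that the
tokenizer only ever sees characters, so it cannot know which parser rule
will consume the tokens later.

```python
import re

# Toy token rules, tried in order at every position. The names mirror the
# grammar in the question; this is not how ANTLR itself picks a rule.
TOKEN_RULES = [
    ("DATA", re.compile(r"data_", re.IGNORECASE)),
    ("LOOP", re.compile(r"loop_", re.IGNORECASE)),
    ("EOL", re.compile(r"\r?\n")),
    ("WHITESPACE", re.compile(r"[ \t]+")),
    ("NONBLANKCHAR", re.compile(r"\S")),
]


def tokenize(text):
    """Return (type, text) pairs; no parser state is ever consulted."""
    pos, tokens = 0, []
    while pos < len(text):
        for name, pattern in TOKEN_RULES:
            m = pattern.match(text, pos)
            if m:
                tokens.append((name, m.group()))
                pos = m.end()
                break
        else:
            raise ValueError(f"no token rule matches at position {pos}")
    return tokens


# A block whose name happens to be "loop_": the name is lexed as LOOP,
# even though a parser would expect non-blank characters after DATA.
print(tokenize("data_loop_\n"))
```

The heading name comes out as a LOOP token because nothing in the
tokenizer knows it is "inside" a heading.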

Also, there may be some ordering problems with your lexer, in that rules
defined first might take precedence over rules defined later.  When this
happens to me, I usually use predicates to help my lexer out.
Often this leads to merging certain token rules and overriding the
resulting token types in certain circumstances.
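
As a rough illustration of merging rules and overriding the type, here is
a Python sketch (in ANTLR the override would live in a lexer action; the
RESERVED table and the WORD type are names I made up for the example).
One general rule matches any run of non-blank characters, and the type is
changed afterwards only when the text is a reserved word.

```python
import re

# One merged rule: any run of non-blank characters.
WORD = re.compile(r"\S+")

# Hypothetical reserved words, compared case-insensitively.
RESERVED = {"loop_": "LOOP"}


def tokenize_words(text):
    """Lex every non-blank run as WORD, then override the type if reserved."""
    tokens = []
    for m in WORD.finditer(text):
        word = m.group()
        tokens.append((RESERVED.get(word.lower(), "WORD"), word))
    return tokens


print(tokenize_words("LOOP_ _atom_site_label global"))
# [('LOOP', 'LOOP_'), ('WORD', '_atom_site_label'), ('WORD', 'global')]
```

Because the whole word is matched first, the "lo" inside "global" is never
offered to a separate LOOP rule.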

In your case, can you lex the extra characters into your DATA_ token
(especially since there can be nothing following the DATA_ part except
more characters)?  This might be problematic if anything else can follow
your DATA_ token.
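
A minimal Python sketch of that idea, assuming a heading is "data_" in any
case followed by at least one non-blank character (the DATA_HEADING name
and the helper function are hypothetical, not part of your grammar):

```python
import re

# Assumption: "data_" (any case) plus one or more non-blank characters
# forms a single heading token; the captured group is the block name.
DATA_HEADING = re.compile(r"data_(\S+)", re.IGNORECASE)


def lex_data_heading(line):
    """Return ("DATA_HEADING", block_name) if the line starts with a heading."""
    m = DATA_HEADING.match(line)
    if m is None:
        return None
    return ("DATA_HEADING", m.group(1))


print(lex_data_heading("data_global"))  # ('DATA_HEADING', 'global')
print(lex_data_heading("loop_"))        # None
```

With the name folded into the token, the parser rule no longer needs
NONBLANCKCHAR+ after DATA, and the characters of the name never compete
with other token rules.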

Finally, wildcards can cause no end of problems (especially to
beginners).  Use them only as a last resort.  Usually, they can be avoided.

-- 
Kevin J. Cummings
kjchome at rcn.com
cummings at kjchome.homeip.net
cummings at kjc386.framingham.ma.us
Registered Linux User #1232 (http://counter.li.org)
