[antlr-interest] newbie request for help

Fri Dec 5 08:48:17 PST 2008

On Thu, 2008-12-04 at 22:46 -0800, Kenny Leung wrote:

> Hi All.
> 
> I thought I would get my feet wet by writing a parser for Objective-C  
> type encodings. I thought it would be pretty easy for such a brief  
> "language", but it is turning out to be pretty difficult.
> 
> One of the problems lies in parsing something like this:
> 
>      {vids=^vids}
> 
> which means a struct named "vids", which is composed of void * (^v),  
> int, double, and short.
> 
> After the "{", I need to interpret vids as a single token, and after  
> the "=", I need to interpret the characters as separate tokens.
> 
> One of the interesting things I found was that this is legal:
> 
>      NUMBER : '0'..'9';
> 

This is a lexer rule that turns a stream of characters into a token for
the parser. Lexer rules start with an upper case letter.

> but this is not:
> 
>      number : '0'..'9';

This is a parser rule (it starts with a lower case letter). You cannot
use ranges here because the two separate tokens '0' and '9', which this
rule auto-creates, are not guaranteed to have any meaning as a range.

> 
> I bumped into this because I thought I'd "inline" the rule for the  
> name after the "{". Can someone explain this?
> 
> Is there a way I can say, "use tokenizer rule A after the "{", but use  
> tokenizer rule B after the "=".

No. The lexer (all the UpperCase rules) runs first and creates all the
tokens, then the parser runs (all the lowerCase rules) against the
pre-determined tokens.

The thing that almost everyone runs into is that the parser cannot
influence the lexer, as the lexer has already run. Don't use 'XXX' in
your parser rules; create a token:

XXX : 'XXX' ;

and use the symbol XXX in your parser rules.

For problems like the above you need a rule set something like:

// Lexer
POINT : '^' ;
OPEQ : '=';
LBRACE : '{';
RBRACE: '}';
ID : ('a'..'z'|'A'..'Z') +;
WS : (' '|'\t')+ { $channel=HIDDEN; } ;

// Parser
struct : LBRACE ID OPEQ structSpec RBRACE ;

structSpec : ( i=ID { checkIdChars($i.text); } | p=POINT
{ checkPointer(); } )+ ;
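
To see what the parser actually receives, here is a rough sketch in
Python of the lexing phase only. It is not ANTLR and not generated code;
the regex tokenizer and the name tokenize are stand-ins made up for
illustration, and only the token names come from the lexer rules above.

```python
import re

# Token names mirror the lexer rules above; the regex-based tokenizer
# is an illustrative stand-in, not what ANTLR generates.
TOKEN_SPEC = [
    ("POINT", r"\^"),
    ("OPEQ", r"="),
    ("LBRACE", r"\{"),
    ("RBRACE", r"\}"),
    ("ID", r"[A-Za-z]+"),
    ("WS", r"[ \t]+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(text):
    """Turn the whole input into tokens before any parsing happens."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        if m.lastgroup != "WS":  # whitespace is dropped, like the hidden channel
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


print(tokenize("{vids=^vids}"))
```

Both occurrences of "vids" come out as a single ID token. The tokenizer
has no idea whether it is before or after the "=", which is why the
splitting into individual type characters has to happen later.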

Instead of trying to lex the individual characters of the type spec,
just consume them as a set of natural tokens, then separate everything
out afterwards, or you will get into a mess. Don't try to think of the
parser in human terms. Think about what the easiest token set to
produce is, then what that stream of tokens is going to look like in
the parser. The parser should accept any syntax that is potentially
valid and then apply semantic checks. In the rules above, for instance,
any ID is accepted, and then you check its characters. This allows you
to issue an error such as:
"Invalid type specification character at line n, offset y", instead of
"Syntax error."
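
As a rough sketch of what such a semantic check might look like, here
is a hypothetical helper written in Python for illustration (a real
ANTLR action would be written in the grammar's target language). The
name check_id_chars, its parameters, and the set of valid characters
are all assumptions; the set covers only the four type codes mentioned
in this thread, not the full list of Objective-C type encodings.

```python
# Assumption: only the four type codes mentioned in this thread
# (v, i, d, s) are treated as valid. The real encoding has more.
VALID_TYPE_CHARS = frozenset("vids")


def check_id_chars(text, line, start_offset):
    """Return one error message per character that is not a known type code.

    text is the text of an ID token, line is its line number, and
    start_offset is the offset of the token's first character.
    """
    errors = []
    for i, ch in enumerate(text):
        if ch not in VALID_TYPE_CHARS:
            errors.append(
                f"Invalid type specification character {ch!r} "
                f"at line {line}, offset {start_offset + i}"
            )
    return errors


for message in check_id_chars("viXs", line=1, start_offset=7):
    print(message)
```

Because the ID token was accepted whole, the check knows exactly which
character is wrong and where it sits, so the message can be specific.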

Make sure you read the FAQs and getting started articles on the Wiki,
and if you have the money, buy the book. Inspecting the example grammars
and contributed grammars is a good idea too.

Jim

> 
> AntlrWorks has been great for learning by playing. Thanks for any help!
> 
> -Kenny
