[antlr-interest] case-insensitive parsing

Thomas Brandon tbrandonau at gmail.com
Thu Apr 23 07:43:28 PDT 2009


On Fri, Apr 24, 2009 at 12:19 AM, Andreas Meyer
<andreas.meyer at smartshift.de> wrote:
> Ok ... there are two options:
> (2) use island grammars, as advertised on the Wiki
> (http://www.antlr.org/wiki/display/ANTLR3/Island+Grammars+Under+Parser+Control).
> However, this is quite complicated to set up.
>
> Island grammars are nice for complicated cases, but maybe in this
> case they are just overkill, because the boundary of your comment
> syntax can be identified by the lexer; you do not need the full
> parser for that. Hope that helps :-)
>
Island grammars can also be done under lexer control, which eliminates
much of the complexity and fragility of doing it under parser control.
See the island-grammar example in the examples pack. However, that
approach does not integrate the comments into the main token stream;
if you just need to put the comments in a structure for later use,
that is probably fine. Otherwise you can have your two lexers produce
a single token stream to feed to a single parser. If your main lexer
imports the tokenVocab of your sub-lexer and the parser imports the
tokenVocab of the main lexer, then you should end up with a combined
token vocabulary (a rough sketch of the grammar headers is below).
You could have the start-comment rule emit all the tokens of the
comment using multiple emits (see the wiki), but it's probably better
to override nextToken() and have it handle the sub-lexer.
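For example, the headers of the three grammars might look roughly like
this (each in its own .g file; the grammar names are just placeholders
and this wiring is untested):

  // CommentLexer.g - the sub-lexer that defines COMMENT_END etc.
  lexer grammar CommentLexer;

  // MainLexer.g - imports the sub-lexer's token types so that
  // COMMENT_END refers to the same token type number
  lexer grammar MainLexer;
  options { tokenVocab = CommentLexer; }

  // MainParser.g - sees the combined vocabulary via the main lexer
  parser grammar MainParser;
  options { tokenVocab = MainLexer; }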
For the nextToken() override itself, you would have something like
(untested):
@members {
 CommentLexer commentLexer = null;
 boolean inComment = false;
 void enterComment() {
     // Lazily create the sub-lexer; it shares this lexer's input
     // stream and state object so the two lexers stay in step.
     if ( commentLexer == null )
       commentLexer = new CommentLexer(input, state);
     inComment = true;
 }

 // Essentially a copy of the standard Lexer.nextToken(), with the
 // marked changes below to dispatch to the sub-lexer.
 public Token nextToken() {
   while (true) {
     state.token = null;
     state.channel = Token.DEFAULT_CHANNEL;
     state.tokenStartCharIndex = input.index();
     state.tokenStartCharPositionInLine = input.getCharPositionInLine();
     state.tokenStartLine = input.getLine();
     state.text = null;
     if ( input.LA(1)==CharStream.EOF ) {
       return Token.EOF_TOKEN;
     }
     try {
       // CHANGES HERE
       if ( inComment ) {
         // Lex from commentLexer instead
         commentLexer.mTokens();
         if ( state.type == COMMENT_END )
           inComment = false; // Switch back to main lexer for next token
       } else
         mTokens();
       // END CHANGES
       if ( state.token==null ) {
         emit();
       }
       else if ( state.token==Token.SKIP_TOKEN ) {
         continue;
       }
       return state.token;
     }
     catch (NoViableAltException nva) {
       reportError(nva);
       recover(nva); // throw out current char and try again
     }
     catch (RecognitionException re) {
       reportError(re);
       // match() routine has already called recover()
     }
   }
 }
}

COMMENT: '/**' { enterComment(); };

The CommentLexer is just a standard lexer for the comment contents
(not including the opening '/**'), and it produces COMMENT_END for the
final '*/' of the comment. Because the two lexers share the input
stream and the state object, they should keep in step; it is just a
matter of calling mTokens() from whichever lexer should currently be
in control (mTokens() only fills in the shared state object).
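
A minimal sketch of what such a sub-lexer might look like (the rule
names and the text rules here are purely illustrative and untested):

  lexer grammar CommentLexer;

  // Ends the comment; the nextToken() override above switches back
  // to the main lexer when it sees this token type.
  COMMENT_END : '*/' ;

  // e.g. '@param', '@return' - illustrative only
  TAG  : '@' ('a'..'z'|'A'..'Z')+ ;

  // a lone '*' that is not part of '*/'
  STAR : '*' ;

  // any run of characters that cannot start TAG, STAR or COMMENT_END
  TEXT : (~('*'|'@'))+ ;

Driving the whole thing is then just the usual ANTLR 3 boilerplate
(the class names and start rule below are placeholders):

  CharStream input = new ANTLRFileStream("Foo.java");
  MainLexer lexer = new MainLexer(input);
  CommonTokenStream tokens = new CommonTokenStream(lexer);
  MainParser parser = new MainParser(tokens);
  parser.compilationUnit(); // your start rule here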

Tom.
> Cheers,
> Andreas
>

