[antlr-interest] case-insensitive parsing
Thomas Brandon
tbrandonau at gmail.com
Thu Apr 23 07:43:28 PDT 2009
On Fri, Apr 24, 2009 at 12:19 AM, Andreas Meyer
<andreas.meyer at smartshift.de> wrote:
> Ok ... there are two options:
> (2) use island grammars, as advertised on the Wiki
> (http://www.antlr.org/wiki/display/ANTLR3/Island+Grammars+Under+Parser+Control).
> however, this is quite complicated to set up
>
> Island grammars are nice for complicated cases, but maybe in this case
> they are just overkill, because the boundary of your comment syntax can
> be identified by the lexer; you do not need the full parser for that.
> Hope that helps :-)
>
Island grammars can also be done under lexer control, which eliminates
much of the complexity and fragility of doing it under parser control;
see the island-grammar example in the examples pack. However, that
approach does not integrate the comments into the main token stream. If
you just need to put them in a structure for later use, that is probably
fine. Otherwise, you can have your two lexers produce a single token
stream to feed to a single parser. If your main lexer imports the
tokenVocab of your sub-lexer, and the parser imports the tokenVocab of
the main lexer, you should end up with a combined token vocabulary. You
could have the comment-start rule emit all the tokens of the comment
using multiple emits (see the wiki), but it's probably better to
override nextToken() and have it delegate to the sub-lexer.
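The tokenVocab wiring would look something like this (untested sketch;
the grammar names MainLexer, CommentLexer and MainParser are just
illustrative, use whatever yours are called):

```antlr
// CommentLexer.g -- the sub-lexer; defines COMMENT_END and friends
lexer grammar CommentLexer;

// MainLexer.g -- pulls in the sub-lexer's token types so both
// lexers share one numbering
lexer grammar MainLexer;
options { tokenVocab = CommentLexer; }

// MainParser.g -- imports the main lexer's vocab, which now
// includes the comment tokens too
parser grammar MainParser;
options { tokenVocab = MainLexer; }
```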
So you would have something like (untested):
@members {
    CommentLexer commentLexer = null;
    boolean inComment = false;

    // Called from the COMMENT rule below when '/**' is matched.
    void enterComment() {
        if ( commentLexer == null )
            commentLexer = new CommentLexer(input, state); // share stream and state
        inComment = true;
    }

    public Token nextToken() {
        while (true) {
            // Reset the shared state for the next token.
            state.token = null;
            state.channel = Token.DEFAULT_CHANNEL;
            state.tokenStartCharIndex = input.index();
            state.tokenStartCharPositionInLine = input.getCharPositionInLine();
            state.tokenStartLine = input.getLine();
            state.text = null;
            if ( input.LA(1)==CharStream.EOF ) {
                return Token.EOF_TOKEN;
            }
            try {
                // CHANGES HERE
                if ( inComment ) {
                    // Lex from commentLexer instead
                    commentLexer.mTokens();
                    if ( state.type == COMMENT_END )
                        inComment = false; // Switch back to main lexer for next token
                } else {
                    mTokens();
                }
                // END CHANGES
                if ( state.token==null ) {
                    emit();
                }
                else if ( state.token==Token.SKIP_TOKEN ) {
                    continue;
                }
                return state.token;
            }
            catch (NoViableAltException nva) {
                reportError(nva);
                recover(nva); // throw out current char and try again
            }
            catch (RecognitionException re) {
                reportError(re);
                // match() routine has already called recover()
            }
        }
    }
}

COMMENT : '/**' { enterComment(); } ;
The CommentLexer is just a standard lexer for the comments (not
including the opening '/**') which produces COMMENT_END for the final
'*/' of the comment. By sharing the input stream and state object, the
two lexers should stay in step, and it's just a matter of calling
mTokens() from whichever lexer should be in control (mTokens() just
fills in the state object).
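For example, the CommentLexer could be along these lines (untested; the
rule names and the token types other than COMMENT_END are purely
illustrative, whatever tokens your comment syntax needs go here):

```antlr
lexer grammar CommentLexer;

COMMENT_END  : '*/' ;                 // tells the outer lexer to take over again
COMMENT_TAG  : '@' ('a'..'z')+ ;      // e.g. Javadoc-style tags
COMMENT_TEXT : (~('*' | '@'))+ ;      // runs of plain comment text
STAR         : '*' ;                  // a lone '*' that doesn't end the comment
```

Note the rules should be written so every character inside a comment is
matched by some rule; otherwise the shared state ends up with a lexing
error mid-comment.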
Tom.
> Cheers,
> Andreas
>