[antlr-interest] Allowing lexer modes

Tue Jan 6 06:51:03 PST 2009

Dear all,

I'm trying to write a lexer that changes modes depending on semantic predicates,
but the grammar is pretty temperamental: sometimes the PROSE rule is used,
and sometimes it is not used at all.

The idea of this grammar is to have prose that is interspersed with
expressions.

Expressions are started and ended with the symbols '\[' and '\]'.
One-line expressions can be started with %%.
Comments are started with %.
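
To make the intended modes concrete, here is a rough sketch in plain Java
(no ANTLR involved) of how each line would be classified under the three
rules above. The class name ModeSketch, the method classify and the
PROSE/EXPR/DIRECTIVE/BEGIN/END tags are made up for illustration, and the
sketch assumes that '\[' and '\]' appear alone on a line, which the real
grammar need not require.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: a line-based approximation of the lexer modes.
class ModeSketch {
    enum Mode { PROSE, EXPR }

    static final String OPEN = "\\[";   // assumed expression start, alone on a line
    static final String CLOSE = "\\]";  // assumed expression end, alone on a line

    static List<String> classify(String input) {
        List<String> out = new ArrayList<String>();
        Mode mode = Mode.PROSE;
        for (String raw : input.split("\r?\n")) {
            // Everything before the first '%' belongs to the current mode.
            int p = raw.indexOf('%');
            String head = (p < 0 ? raw : raw.substring(0, p)).trim();

            // "%%" starts a one-line expression; a later '%' starts a comment.
            String directive = null;
            if (p >= 0 && raw.startsWith("%%", p)) {
                String rest = raw.substring(p + 2);
                int c = rest.indexOf('%');
                directive = (c < 0 ? rest : rest.substring(0, c)).trim();
            }

            if (head.equals(OPEN)) {
                mode = Mode.EXPR;
                out.add("BEGIN");
            } else if (head.equals(CLOSE)) {
                mode = Mode.PROSE;
                out.add("END");
            } else if (!head.isEmpty()) {
                out.add(mode + ": " + head);
            }
            // A directive does not change the surrounding mode.
            if (directive != null && !directive.isEmpty()) {
                out.add("DIRECTIVE: " + directive);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String sample = "prose with + and -\n\\[\na + b % note\n%% h + j %c\n\\]\nbye % x";
        for (String line : classify(sample)) {
            System.out.println(line);
        }
    }
}
```

Comment text is simply dropped here, which roughly corresponds to sending
COMMENT to the hidden channel in the grammar.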

The idea is that the following string is valid:

<<<
prose that might contain operators like + and -
followed by an expression:
[
a + b
x * y - z

f - g

]
and then maybe some comments that
% are invisible and
cannot be seen. % (and might be in mid sentence!)

Directives are the same as one line math modes
%% [
but they can be interspersed with prose
%% h + j %comments are valid here too
and are not affected at all
%% ]

This includes comments and directives in expressions:
[
u + v % this is just an example
%% w + x - y % as well as directives

]
I think that covers almost everything!
<<<

The grammar I've constructed is below -- could anybody tell me if this
is the right approach?

As an alternative, I've thought about using "island grammars",
triggered by the '\start' and '%%' tokens,
but I'm not sure which would be better.

Thanks,

Zenzike

grammar Modes;

options {
  language = Java;
}

@lexer::members {
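// Mode flags are per-lexer state, so they are instance fields rather than
// statics; static flags would carry over from one lexer instance to the next.
boolean text = true;   // true while lexing prose
boolean text0 = true;  // mode saved when a one-line %% expression starts
boolean frag = false;  // true inside a one-line %% expression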

public boolean check(String s) {
  for (int i=0; i<s.length(); i++) {
    if (input.LA(i+1) != s.charAt(i))
      return false;
  }
  return true;
}

public boolean isCode() {
  String[] code = {
    "\\begin",
    "\\end",
    "<"
  };

  for (String c:code) {
    if (check(c))
      return true;
  }
  return false;
}
}

prog
  : stat+
  ;
stat
  : BEXPR (expr)+ EEXPR
  ;
expr
  : ID (binop ID)*
  ;
binop
  : PLUS | MINUS | TIMES | DIV | CROSS
  ;

COMMENT : {input.LA(1)=='\%' && input.LA(2)!='\%'}?=> '%' (~NL)* NL*
{
if (frag) {
    frag=false;
    text=text0;
}
$channel=HIDDEN;};

WS      : (' '|'\t') {$channel=HIDDEN;};

VERBATIM: {text}?=> '<' .* '>'  {$channel=HIDDEN;};

BEXPR   : '\\begin'         {text=false;};
EEXPR   : '\\end'         {text=true; };
FRAGMENT: MODMOD      {text0=text;text=false;frag=true;$channel=HIDDEN;};

fragment
MODMOD  : '%%'|'^';

PROSE       : {text && !frag && !isCode() && !check("\%")}?=> .
{
while (!isCode() && !check("\%") && !check("\r") && !check("\n")) {
  matchAny();
  if (state.failed)
    return;
}
}
              NL*
              {$channel=HIDDEN; }
            ;

// I also tried the approaches below ...
//PROSE   : {text && !frag && !isCode()}?=>  ({!isCode() && !check("\r") && !check("\n")}?=> ~'%')+ NL* {$channel=HIDDEN;};
/*
PROSE       : {text && !frag && !isCode() && !check("\%") }?=> . NL*
              {$channel=HIDDEN; }
            ;
*/

ENDLINE : NL+
{if (frag) {
    frag=false;
    text=text0;
}
$channel=HIDDEN;};

fragment
NL      : ('\r'|'\n')
        ;

fragment LETTER : 'a'..'z'|'A'..'Z';
fragment DIGIT  : '0'..'9';

PLUS    : '+';
MINUS   : '-';
TIMES   : '*';
DIV     : '/';
// These should not be tokenized as CROSS or BANG when in PROSE.
CROSS   : 'cross';
BANG    : '\\bang';

ID      : LETTER (LETTER|DIGIT)*
        | '\\' (LETTER|DIGIT)+
        ;
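
For anyone who wants to try the predicate logic outside ANTLR, the sketch
below mirrors check() and isCode() from the lexer members on a plain String.
The class name LookaheadSketch and the explicit pos parameter are assumptions
for illustration; in the lexer the position is implicit and input.LA(i+1)
does the lookahead, returning EOF past the end of the stream.

```java
// Hypothetical stand-alone version of the lookahead helpers.
class LookaheadSketch {
    // True if s occurs in input starting exactly at index pos.
    static boolean check(String input, int pos, String s) {
        for (int i = 0; i < s.length(); i++) {
            int k = pos + i;
            // Running off the end plays the role of LA() returning EOF.
            if (k >= input.length() || input.charAt(k) != s.charAt(i)) {
                return false;
            }
        }
        return true;
    }

    // True if one of the sequences that end prose starts at pos.
    static boolean isCode(String input, int pos) {
        String[] code = { "\\begin", "\\end", "<" };
        for (String c : code) {
            if (check(input, pos, c)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String text = "some prose \\begin a + b \\end";
        System.out.println(isCode(text, 0));
        System.out.println(isCode(text, 11));
    }
}
```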