[antlr-interest] Literals and subrules

Thu Feb 25 02:22:26 PST 2010

Hi All,

I don't understand why the following four grammars, which I expected to recognize the same language, do not all work.  They differ in two ways: whether the character literals appear in the parser rules or in the lexer rules, and whether the ranges are wrapped in parentheses as sub-rules.  The language is very simple: a single letter followed by end of file.  Can someone explain why some of them work and others do not?

1) This grammar, Doit1, places the literals in the parser rules.  ANTLR generates a parser without complaint (exit status 0), but the generated Java does not compile.

  $ cat Doit1.g
  grammar Doit1;

  prog:
          id
          EOF
      ;

  id:
          'a' .. 'z'
          | 'A' .. 'Z'
      ;
  $ java org.antlr.Tool Doit1.g

  $ javac Doit1*.java
  Doit1Parser.java:71: illegal start of expression
              if (  ) {
                    ^
  1 error

2) The second grammar, Doit2, places parentheses around the literal ranges, because I eventually want to match more than one character by applying '+' to the sub-rule.  According to the documentation (http://www.antlr.org/wiki/display/ANTLR3/Grammars), parentheses denote a "Subrule. Like a call to a rule with no name.", so this should be legal.  Instead, ANTLR reports a warning that mentions EOF, followed by an error.  I don't understand why a sub-rule does not work here.

  $ cat Doit2.g
  grammar Doit2;

  prog:
          id
          EOF
      ;

  id:
          ( 'a' .. 'z' )
          | ( 'A' .. 'Z' )
      ;

  $ java org.antlr.Tool Doit2.g
  warning(200): Doit2.g:8:3: Decision can match input such as "EOF" using multiple
   alternatives: 1, 2
  As a result, alternative(s) 2 were disabled for that input
  error(201): Doit2.g:8:3: The following alternatives can never be matched: 2

3) This grammar, Doit3, places the literals in the lexer rules.  ANTLR produces a parser and lexer that compile, and the recognizer accepts the input.

  $ cat Doit3.g
  grammar Doit3;

  prog:
          ID
          EOF
      ;

  ID:
          'a' .. 'z'
          | 'A' .. 'Z'
      ;

  $ java org.antlr.Tool Doit3.g

  $ javac Doit3*.java

  $ java Doit3 < i

  $

4) This grammar, Doit4, is almost the same as Doit3, but uses parentheses for sub-rules.  It works, but I'm not sure why, since that seems inconsistent with Doit2 failing.

  $ cat Doit4.g
  grammar Doit4;

  prog:
          ID
          EOF
      ;

  ID:
          ( 'a' .. 'z' )
          | ( 'A' .. 'Z' )
      ;

  $ java org.antlr.Tool Doit4.g

  $ javac Doit4*.java

  $ java Doit4 < i

  $
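
For reference, the multi-character version mentioned in (2) would presumably look something like the following sketch.  It is built on Doit4, since that is the parenthesized variant that works.  The name Doit5 is only a placeholder, and this grammar has not been run through the tool here, so treat it as an assumption rather than a tested result.

```
// Hypothetical follow-on to Doit4 (not one of the four grammars above).
grammar Doit5;

prog:
        ID
        EOF
    ;

// One or more letters: '+' applied to a parenthesized sub-rule.
ID:
        ( 'a' .. 'z' | 'A' .. 'Z' )+
    ;
```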

Usually, I would simply move all literals into the lexer (i.e., use literals only in lexer rules), because that is how parsers and lexers have traditionally been written.  But I often see ANTLR grammars with literals sprinkled throughout both parser and lexer rules, so I thought I would give it a try.

Ken