[antlr-interest] Question about syntactic predicates in lexer rules

Wed Apr 14 04:07:14 PDT 2010

Hello

I have a question about syntactic predicates in lexer fragment rules. I ran into the following problem while implementing the lexer for a language in which line breaks can be escaped with a backslash. The intention is to catch escaped line breaks in the lexer and simply skip them. A simplified version of the rules is given at the end of this message; it is sufficient to reproduce the problem. The rules are intended to recognize input like the following (I use <\n> to represent a line break):
---
foo \<\n>
<\n>
---
This works as expected. A problem occurs, however, if I remove the whitespace between 'foo' and '\':
---
foo\<\n>
<\n>
---
If I run the parser in ANTLRWorks with this input, I get the following message:
"line 1:3 no viable alternative at character '\'"

I found that if I remove the alternative "UNIVERSAL_CHARACTER_NAME" from "IDENTIFIER_NONDIGIT", it works fine. I assume ANTLR tries to match that alternative because of the backslash, although it cannot succeed because the alternative requires a 'u' or 'U' after it. As a result, I receive the error message and two "NEWLINE" tokens on the stream. I was quite surprised by this outcome; I did not expect the missing whitespace to change the result. Since I suspect that I simply do not fully understand how ANTLR works, I tried to keep the lexer from entering the "UNIVERSAL_CHARACTER_NAME" alternative by adding a syntactic predicate:
---
fragment
IDENTIFIER_NONDIGIT
    :   ('a'..'z' | 'A'..'Z' | '_')+
    |   (UNIVERSAL_CHARACTER_NAME) => UNIVERSAL_CHARACTER_NAME
    ;
---
That did not help either; I still received the error.
However, adding a syntactic predicate in the "IDENTIFIER" rule fixed my problem:

---
IDENTIFIER
    :   (IDENTIFIER_NONDIGIT) ((IDENTIFIER_NONDIGIT) => IDENTIFIER_NONDIGIT | DIGIT)*
    ;
---
This lets the lexer create the expected token stream, containing "foo" and one "<\n>". I do not see a difference between the two places, other than that one of the rules is a fragment rule and the other is not. Is it not possible to use syntactic predicates in fragment rules? Or did I miss something fundamental?
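
The difference between the two placements can be pictured with a small Python sketch. It models only the hypothesis described above (the identifier loop commits to the backslash after looking at a single character); it is not the code ANTLR actually generates, and whether ANTLR behaves this way is exactly the open question. The function name `match_identifier_tail` and the `guarded` flag are made up for illustration.

```python
import re

# Characters that may continue an identifier, and the two UCN forms
# from the grammar ('\\u' HEX_QUAD and '\\U' HEX_QUAD HEX_QUAD).
LETTER_OR_DIGIT = re.compile(r"[a-zA-Z_0-9]")
UCN = re.compile(r"\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}")


def match_identifier_tail(text, pos, guarded):
    """Consume identifier continuation characters from pos; return the end offset.

    guarded=True  mimics a predicate at the loop decision: the whole
                  universal character name is checked before entering.
    guarded=False mimics committing on the backslash alone.
    """
    while pos < len(text):
        if LETTER_OR_DIGIT.match(text, pos):
            pos += 1
        elif text[pos] == "\\":
            ucn = UCN.match(text, pos)
            if ucn:
                pos = ucn.end()
            elif guarded:
                break  # predicate failed: leave the loop, the identifier ends here
            else:
                # Already committed to the UCN alternative; no way back.
                raise ValueError(
                    f"no viable alternative at character {text[pos]!r}"
                )
        else:
            break
    return pos
```

With the input `"foo\\\n\n"` and a start offset of 1, the guarded variant stops at offset 3 and leaves the backslash for the next token, while the unguarded variant raises at the backslash. That matches the position reported in the error message, but it is only a model of the symptom.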

Thanks for reading my questions.

Regards
Thomas

Grammar:
grammar simple;

code
    :   IDENTIFIER* NEWLINE
    ;

WS
    :   (' ' | '\t')+ {skip();}
    ;

SKIPPED_NEWLINE
    :   '\\\n' {skip();}
    ;

NEWLINE
    :   '\n'
    ;

IDENTIFIER
    :   (IDENTIFIER_NONDIGIT) (IDENTIFIER_NONDIGIT | DIGIT)*
    ;

fragment
IDENTIFIER_NONDIGIT
    :   ('a'..'z' | 'A'..'Z' | '_')+
    |   UNIVERSAL_CHARACTER_NAME
    ;

fragment
UNIVERSAL_CHARACTER_NAME
    :   '\\u' HEX_QUAD
    |   '\\U' HEX_QUAD HEX_QUAD
    ;

fragment
HEX_QUAD
    :   HEXADECIMAL_DIGIT HEXADECIMAL_DIGIT HEXADECIMAL_DIGIT HEXADECIMAL_DIGIT
    ;

fragment
HEXADECIMAL_DIGIT
    :   '0'..'9'
    |   'a'..'f'
    |   'A'..'F'
    ;

fragment
DIGIT
    :   '0'..'9'
    ;
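
As a cross-check of the token stream these lexer rules are meant to produce, the following is a minimal Python sketch of the intended behaviour. It is a regex-based stand-in written for illustration, not a model of how ANTLR builds its lexer; the names `TOKEN_RE`, `SKIPPED` and `tokenize` are assumptions and do not come from the grammar.

```python
import re

# Universal character names: '\\u' + 4 hex digits or '\\U' + 8 hex digits.
UCN = r"\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}"

# One named group per lexer rule. At each position the first alternative
# that matches wins, so the escaped newline is tried before NEWLINE.
TOKEN_RE = re.compile(
    r"(?P<WS>[ \t]+)"
    r"|(?P<SKIPPED_NEWLINE>\\\n)"
    r"|(?P<NEWLINE>\n)"
    r"|(?P<IDENTIFIER>(?:[a-zA-Z_]|" + UCN + r")(?:[a-zA-Z_0-9]|" + UCN + r")*)"
)

# Rules whose action is skip(): matched but not emitted.
SKIPPED = {"WS", "SKIPPED_NEWLINE"}


def tokenize(text):
    """Return a list of (rule name, matched text) pairs for the emitted tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(
                f"no viable alternative at character {text[pos]!r} (offset {pos})"
            )
        if m.lastgroup not in SKIPPED:
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```

In this sketch, `tokenize("foo \\\n\n")` and `tokenize("foo\\\n\n")` both yield one IDENTIFIER token for `foo` followed by a single NEWLINE token, which is the result expected from the grammar with and without the whitespace.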