[antlr-interest] Question about syntactic predicates in lexer rules
tcorbat at hsr.ch
Wed Apr 14 04:07:14 PDT 2010
Hello
I have a question about syntactic predicates in fragment lexer rules. I ran into the following problem while implementing the lexer for a language in which line breaks can be escaped with a backslash. The intention is to catch escaped line breaks in the lexer and simply skip them. The grammar at the end of this message is a simplified version of my rules, but it is sufficient to reproduce the problem. The rules are intended to recognize input like the following (I use <\n> to represent a line break):
---
foo \<\n>
<\n>
---
This works as expected. However, a problem occurs if I remove the whitespace between 'foo' and '\':
---
foo\<\n>
<\n>
---
If I run the parser in ANTLRWorks with this input, I get the following message:
"line 1:3 no viable alternative at character '\'"
I found that if I remove the alternative "UNIVERSAL_CHARACTER_NAME" from "IDENTIFIER_NONDIGIT", it works fine. My guess is that ANTLR tries to match that alternative because of the backslash, although it cannot succeed because the alternative requires a 'u' or 'U' after the backslash. As a result, I get the error message and two "NEWLINE" tokens on the stream. I was quite surprised by this outcome, as I did not expect the missing whitespace to change the result. Since I assume that I simply do not fully understand how ANTLR works, I tried to keep the lexer from entering the "UNIVERSAL_CHARACTER_NAME" alternative by adding a syntactic predicate:
---
fragment
IDENTIFIER_NONDIGIT
: ('a'..'z'|'A'..'Z' | '_')+
| (UNIVERSAL_CHARACTER_NAME ) => UNIVERSAL_CHARACTER_NAME
;
---
That did not help either; I still got the error.
However, adding a syntactic predicate in the "IDENTIFIER" rule fixed my problem:
---
IDENTIFIER
: (IDENTIFIER_NONDIGIT) ((IDENTIFIER_NONDIGIT) => IDENTIFIER_NONDIGIT | DIGIT)*
;
---
This lets the lexer create the expected token stream containing "foo" and one "<\n>". I do not see a difference between the two places, other than that one of the rules is a fragment rule and the other is not. Is it not possible to use syntactic predicates in fragment rules? Or did I miss something fundamental?
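To make the two observed outcomes concrete, here is a small Python sketch of a hand-written lexer that reproduces the token streams described above. It is only a toy model, assuming the lexer commits to the IDENTIFIER_NONDIGIT alternative inside the IDENTIFIER loop after seeing a single '\'. It does not claim to show how ANTLR works internally. The names lex, match_identifier, match_nondigit and the predicate_in_loop flag are made up for illustration.

```python
import re

# Toy stand-ins for the fragment rules of the grammar (assumed equivalents).
UCN = re.compile(r"\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}")  # UNIVERSAL_CHARACTER_NAME
LETTERS = re.compile(r"[a-zA-Z_]+")                        # first alternative of IDENTIFIER_NONDIGIT
DIGITS = "0123456789"                                      # DIGIT


def match_nondigit(text, pos):
    """Return the end index of one IDENTIFIER_NONDIGIT at pos, or None."""
    m = LETTERS.match(text, pos) or UCN.match(text, pos)
    return m.end() if m else None


def match_identifier(text, pos, predicate_in_loop):
    """Return (end, None) on success or (None, error_index) on failure."""
    end = match_nondigit(text, pos)
    if end is None:
        return None, pos
    while end < len(text):
        ch = text[end]
        if ch in DIGITS:
            end += 1
            continue
        nxt = match_nondigit(text, end)
        if nxt is not None:
            end = nxt
        elif ch == "\\" and not predicate_in_loop:
            # Assumption: without the predicate, the loop commits to
            # IDENTIFIER_NONDIGIT on the backslash alone and then fails.
            return None, end
        else:
            # With the predicate (or on any other character) the loop exits.
            break
    return end, None


def lex(text, predicate_in_loop):
    """Return (tokens, error_indices) for the simplified grammar."""
    tokens, errors = [], []
    pos = 0
    while pos < len(text):
        if text[pos] in " \t":                # WS, skipped
            pos += 1
        elif text.startswith("\\\n", pos):    # SKIPPED_NEWLINE, skipped
            pos += 2
        elif text[pos] == "\n":               # NEWLINE
            tokens.append(("NEWLINE", "\n"))
            pos += 1
        else:
            end, error_at = match_identifier(text, pos, predicate_in_loop)
            if error_at is not None:
                errors.append(error_at)
                pos = error_at + 1            # drop the partial token, skip one character
            else:
                tokens.append(("IDENTIFIER", text[pos:end]))
                pos = end
    return tokens, errors
```

Calling lex("foo\\\n\n", predicate_in_loop=False) models the failing case (an error at index 3 and two NEWLINE tokens), while predicate_in_loop=True models the variant with the predicate in the IDENTIFIER rule. The input with whitespace before the backslash gives the same result in both modes.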
Thanks for reading my questions.
Regards
Thomas
Grammar:
grammar simple;
code
: IDENTIFIER* NEWLINE
;
WS
: (' '|'\t')+ {skip();}
;
SKIPPED_NEWLINE
: '\\\n' {skip();}
;
NEWLINE
: '\n'
;
IDENTIFIER
: (IDENTIFIER_NONDIGIT) (IDENTIFIER_NONDIGIT | DIGIT)*
;
fragment
IDENTIFIER_NONDIGIT
: ('a'..'z'|'A'..'Z' | '_')+
| UNIVERSAL_CHARACTER_NAME
;
fragment
UNIVERSAL_CHARACTER_NAME
: '\\u' HEX_QUAD
| '\\U' HEX_QUAD HEX_QUAD
;
fragment
HEX_QUAD
: HEXADECIMAL_DIGIT HEXADECIMAL_DIGIT HEXADECIMAL_DIGIT HEXADECIMAL_DIGIT
;
fragment
HEXADECIMAL_DIGIT
: '0'..'9'
| 'a'..'f'
| 'A'..'F'
;
fragment
DIGIT
: '0'..'9'
;