[antlr-interest] Semantic Predicates in a Lexer
"Paul Bouché (NSN)"
paul.bouche at nsn.com
Fri Mar 20 11:02:45 PDT 2009
Hi Jim,
thanks for your response.
Jim Idle wrote:
> Paul Bouché (NSN) wrote:
>> Hi,
> Firstly, do not forget that you cannot set such a flag from the parser
> as the lexer runs first and creates all the tokens.
I know, and personally I consider that a bug: if there were no token buffer,
one could steer the lexer depending on parser state. This would make the
lexer much more powerful. Ter stated in his book that we need tokens because
this makes things easier and because this is also how humans analyse text, but
humans don't buffer up all the words on a page; they read word by word
and keep a short buffer of recently read tokens so as to make
context-sensitive decisions ;-)
Sorry, I don't mean to come off as abrasive; I am glad for your reply!
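To illustrate the point about buffering, here is a minimal Java sketch. It is not ANTLR code: the class PullLexerSketch and its methods are hypothetical names made up for this example, and a real parser would also keep a few tokens of lookahead. It only shows that a flag flipped by the consumer changes the result when tokens are pulled one at a time, but has no effect when all tokens were created up front.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical on-demand lexer for illustration; this is not ANTLR's API. */
final class PullLexerSketch {
    // Flag that a consumer (the "parser") may flip between tokens.
    boolean noColonInNames = false;

    private final String input;
    private int pos = 0;

    PullLexerSketch(String input) { this.input = input; }

    /** Returns the next token as "TYPE(text)", or null at end of input. */
    String nextToken() {
        while (pos < input.length() && input.charAt(pos) == ' ') pos++; // skip blanks
        if (pos >= input.length()) return null;
        int start = pos;
        if (input.charAt(pos) == ':' && noColonInNames) {
            pos++;
            return "COLON(:)";
        }
        while (pos < input.length() && isNameChar(input.charAt(pos))) pos++;
        if (pos == start) throw new IllegalStateException("unexpected character at " + pos);
        String type = noColonInNames ? "SIMPLENAME" : "NAME";
        return type + "(" + input.substring(start, pos) + ")";
    }

    // A colon counts as a name character only while colons are allowed in names.
    private boolean isNameChar(char c) {
        boolean letter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
        return letter || (c == ':' && !noColonInNames);
    }

    /** Lexes everything first; the later flag change cannot affect the tokens. */
    static List<String> lexAllUpFront(String input) {
        PullLexerSketch lexer = new PullLexerSketch(input);
        List<String> tokens = new ArrayList<>();
        for (String t = lexer.nextToken(); t != null; t = lexer.nextToken()) tokens.add(t);
        lexer.noColonInNames = true; // too late, all tokens already exist
        return tokens;
    }

    /** Pulls tokens one by one; the consumer flips the flag after the first token. */
    static List<String> lexOnDemand(String input) {
        PullLexerSketch lexer = new PullLexerSketch(input);
        List<String> tokens = new ArrayList<>();
        for (String t = lexer.nextToken(); t != null; t = lexer.nextToken()) {
            tokens.add(t);
            lexer.noColonInNames = true; // "parser" decision taken after seeing a token
        }
        return tokens;
    }
}
```

For the input "a:b c:d", lexAllUpFront yields two NAME tokens, while lexOnDemand yields NAME(a:b) followed by SIMPLENAME(c), COLON(:), SIMPLENAME(d).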
>> Here is a lexer excerpt:
>> NUMBER : DIGIT_+;
>> SIMPLENAME: {noColonInNames}?=> LETTER_+;
>> COLON: {noColonInNames}?=> COLON_;
>> NAME: {!noColonInNames}?=> (LETTER_ | COLON_)+;
>> fragment DIGIT_: '0'..'9';
>> fragment LETTER_: 'a'..'z' | 'A'..'Z';
>>
> Assuming that you can configure these flags in the lexer context and are not
> expecting them to be respected by the lexer if the parser sets them,
> then you should be able to do this:
>
> grammar ttt;
>
> @lexer::members
> {
> boolean noColonInNames = false;
> }
>
> test
> : (SIMPLENAME | COLON | NAME)* EOF ;
>
> fragment LETTER_
> : ('a'..'z'|'A'..'Z') ('a'..'z'|'A'..'Z')*
> ;
>
> fragment COLON
> : ':'
> ;
>
> fragment SIMPLENAME
> :
> ;
>
> NAME
> : {!noColonInNames}?=> (LETTER_ | COLON)+ { noColonInNames = true; }
> | LETTER_+ { $type = SIMPLENAME; }
> | COLON { $type = COLON; }
> ;
>
> However, I suspect that you will find it much easier to use predicates
> in the parser, even if it is only the first one you come across that
> should be NAME COLON NAME:
>
> grammar ttt;
>
> @lexer::members
> {
> boolean noColonInNames = false;
> }
>
> test
> : names* EOF ;
>
> names
> : {!noColonInNames}?=> name { System.out.println("Var is '" +
> $name.text + "'"); }
> | {noColonInNames}?=> NAME (COLON NAME)*
> ;
>
> name
> : ((NAME | COLON)=>(NAME | COLON))+
> ;
>
> fragment LETTER_
> : ('a'..'z'|'A'..'Z') ('a'..'z'|'A'..'Z')*
> ;
>
> fragment COLON
> : ':'
> ;
>
> NAME
> : LETTER_+
> ;
I tried what you suggested; basically, you moved the gated semantic
predicate into the alternatives of the NAME rule. Here is what I tried:
grammar LexerPredicates;
@lexer::members
{
boolean noColonInNames = true;
boolean parsedSimpleName = false;
}
@members {
boolean noColonInNames = false;
}
test
: (simplename | colon | name)* EOF ;
name
: NAME
;
colon
: COLON
;
simplename
: SIMPLENAME
;
fragment DIGIT_
: '0'..'9'
;
NUMBER
:
{noColonInNames || !noColonInNames}?=>
DIGIT_+
;
fragment LETTER_
: 'a'..'z'|'A'..'Z'
;
fragment COLON
: ':'
;
fragment SIMPLENAME
:
;
fragment LETTERORDIGIT_
:
DIGIT_ | LETTER_
;
NAME
: {!noColonInNames}?=> (LETTERORDIGIT_ | COLON)+
| LETTERORDIGIT_+ { $type = SIMPLENAME; } //A
| COLON { $type = COLON; } { noColonInNames = false;} //B
;
WHITESPACE
:
( ' ' | '\t' /*| NEWLINE_*/)+ // removed new line due to backward compatibility
{ skip(); }
;
If I try to compile this I get the following error:
internal error:
org.antlr.analysis.NFAToDFAConverter.getPredicatesPerNonDeterministicAlt(NFAToDFAConverter.java:1603):
no AST/token for nonepsilon target w/o predicate
which is better than no error, and tells me that I need to cover the
other alternatives within the NAME rule (marked with //A and //B) with
predicates as well. I did that, i.e.:
NAME
: {!noColonInNames}?=> (LETTERORDIGIT_ | COLON)+
| {noColonInNames }?=> LETTERORDIGIT_+ { $type = SIMPLENAME; }
| {noColonInNames }?=> COLON { $type = COLON; } { noColonInNames = false;}
;
If I then set noColonInNames to true, for the input
a : b a 3 a3 3a
I get
SIMPLENAME COLON NAME NAME NAME NAME NAME
which is not what I want: I want the 3 to be recognized as a NUMBER, as
it is without predicates. IMO either there is a bug or I simply do not
understand why it does not work.
I solved this by combining your approach and generating several tokens
when the NAME lexer rule matches, of course with the desired type. The
NAME COLON NAME solution did not work because I would have had to do
major rework in our tree parser which I wanted to avoid. Now it works
nicely as I desired it, but as praiseworthy as ANTLR is, it is also
sometimes a real beast, or maybe I am the beast and just didn't get it :)
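The idea of emitting several tokens from one rule match can be sketched roughly as follows. This is a self-contained Java simulation, not my actual grammar and not the ANTLR runtime: the class MultiTokenNameSketch, its token queue and the "TYPE(text)" strings are all made up for illustration. In a real ANTLR 3 lexer the same effect is usually obtained by overriding emit(Token) and nextToken() in @lexer::members so that emitted tokens are queued. For simplicity the flag stays fixed during a run here, unlike the action at //B above.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical stand-in for a lexer whose NAME rule may emit several tokens. */
final class MultiTokenNameSketch {
    // Tokens that were emitted but not yet handed to the consumer.
    private final Deque<String> pending = new ArrayDeque<>();
    private final String input;
    private final boolean noColonInNames;
    private int pos = 0;

    MultiTokenNameSketch(String input, boolean noColonInNames) {
        this.input = input;
        this.noColonInNames = noColonInNames;
    }

    private void emit(String type, String text) { pending.add(type + "(" + text + ")"); }

    /** Hands out queued tokens first; otherwise matches one more chunk. Null at end. */
    String nextToken() {
        if (pending.isEmpty()) matchChunk();
        return pending.poll();
    }

    // Matches (letter | digit | ':')+ and emits one or more typed tokens for it.
    private void matchChunk() {
        while (pos < input.length() && (input.charAt(pos) == ' ' || input.charAt(pos) == '\t')) pos++;
        int start = pos;
        while (pos < input.length() && isNameChar(input.charAt(pos))) pos++;
        if (pos == start) {
            if (pos < input.length()) throw new IllegalArgumentException("unexpected character at " + pos);
            return; // end of input, nothing emitted
        }
        String chunk = input.substring(start, pos);
        if (isAllDigits(chunk)) { emit("NUMBER", chunk); return; }
        if (!noColonInNames) { emit("NAME", chunk); return; }
        // Colons are separators: split the chunk into SIMPLENAME / NUMBER / COLON tokens.
        int i = 0;
        while (i < chunk.length()) {
            if (chunk.charAt(i) == ':') { emit("COLON", ":"); i++; continue; }
            int j = i;
            while (j < chunk.length() && chunk.charAt(j) != ':') j++;
            String part = chunk.substring(i, j);
            emit(isAllDigits(part) ? "NUMBER" : "SIMPLENAME", part);
            i = j;
        }
    }

    private static boolean isNameChar(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9') || c == ':';
    }

    private static boolean isAllDigits(String s) {
        for (int k = 0; k < s.length(); k++) {
            if (s.charAt(k) < '0' || s.charAt(k) > '9') return false;
        }
        return true;
    }

    /** Convenience driver: lexes the whole input with a fixed flag value. */
    static List<String> lex(String input, boolean noColonInNames) {
        MultiTokenNameSketch lexer = new MultiTokenNameSketch(input, noColonInNames);
        List<String> tokens = new ArrayList<>();
        for (String t = lexer.nextToken(); t != null; t = lexer.nextToken()) tokens.add(t);
        return tokens;
    }
}
```

With the flag set to true, lex("a:b 3", true) gives SIMPLENAME(a), COLON(:), SIMPLENAME(b), NUMBER(3); with the flag set to false the same input gives NAME(a:b), NUMBER(3). Because the parser only ever sees the typed tokens, the tree parser does not have to change.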
Thx,
Paul