[antlr-interest] Context-sensitive lexing and ANTLR v4
Terence Parr
parrt at cs.usfca.edu
Mon Apr 11 13:25:47 PDT 2011
I see in an early 2004 workshop that I intended to handle Context-sensitive lexing:
http://www.antlr.org/workshop/ANTLR2004/proceedings/ANTLR-3.0-Features.pdf
Each parser decision point generates special rule
in lexer with possible choices: e.g., (ID|INT)
Difficulties
“for”, find “fore”must say “missing for, found ID”
whitespace
The C++ template vs ">>" token problem simply
disappears; i.e., when lexing
List<List<int>> a;
nested template has ">>" in it. Lexer, without context,
cannot know which to pick. Only the parser knows that
it expects ">" followed by ">" not ">>" token
Scott Stanchfield also has some thoughts along these lines
http://javadude.com/articles/antlr-context-sensitive-scanner.html
I'm glad I wrote that slide because I couldn't remember what the difficulties were with context-sensitive Lexing. keywords are an issue as is white space. If I remember correctly Rats has a predicate in its identifier rule that makes it fail if it finds the id is also a keyword (yep, just checked). For whitespace, it simply scarfs whitespace I think in between rule references maybe.
Instead of forcing context-sensitive entry points into the lexer, I think a scannerless parser is simpler to understand conceptually. Rats is very good at combining grammars and it might be fun to come up with a scannerless version of ANTLR. It can be done easily right now by simply passing in characters as tokens and turning on backtracking with memoization. Perhaps I'll try that out.
stat : 'return' e ';' | id '=' e ';' {String s = $id.text;} ;
id : 'a' | 'b' | 'c' | ... ;
e : int ;
int : '0' | '1' | '2' ... ;
yep, that should work even with that action. There is no notion of a token really. hhm...cool.
Ter
PS oh crap...I should be preparing to teach in 30 minutes!
More information about the antlr-interest
mailing list