[antlr-interest] Context-sensitive lexing and ANTLR v4

Terence Parr parrt at cs.usfca.edu
Mon Apr 11 13:25:47 PDT 2011


I see in an early 2004 workshop that I intended to handle Context-sensitive lexing:

http://www.antlr.org/workshop/ANTLR2004/proceedings/ANTLR-3.0-Features.pdf

 Each parser decision point generates special rule
in lexer with possible choices: e.g., (ID|INT)
 Difficulties
 “for”, find “fore”must say “missing for, found ID”
 whitespace
 The C++ template vs ">>" token problem simply
disappears; i.e., when lexing
List<List<int>> a;
nested template has ">>" in it.  Lexer, without context,
cannot know which to pick.  Only the parser knows that
it expects ">" followed by ">" not ">>" token

Scott Stanchfield also has some thoughts along these lines

http://javadude.com/articles/antlr-context-sensitive-scanner.html

I'm glad I wrote that slide because I couldn't remember what the difficulties were with context-sensitive Lexing.   keywords are an issue as is white space.   If I remember correctly Rats has a predicate in its identifier rule that makes it fail if it finds the id is also a keyword (yep, just checked). For whitespace, it simply scarfs whitespace I think in between rule references maybe.

Instead of forcing context-sensitive entry points into the lexer, I think a scannerless parser is simpler to understand conceptually. Rats is very good at combining grammars and it might be fun to come up with a scannerless version of ANTLR. It can be done easily right now by simply passing in characters as tokens and turning on backtracking with memoization. Perhaps I'll try that out.

stat : 'return' e ';' | id '=' e ';' {String s = $id.text;} ;

id : 'a' | 'b' | 'c' | ... ;
e : int ;
int : '0' | '1' | '2' ... ;

yep, that should work even with that action. There is no notion of a token really. hhm...cool.

Ter
PS oh crap...I should be preparing to teach in 30 minutes!



More information about the antlr-interest mailing list