[antlr-interest] Context-sensitive lexing and ANTLR v4

Jason Doege jdoege at gmail.com
Wed Apr 13 11:42:34 PDT 2011


One way to handle keywords is with a zero-width look-ahead assertion. You
append it to each keyword production as a test that passes only if the next
character is non-alphanumeric, and that leaves that character on the input
stream.
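
A rough sketch of the idea, using Python's re module rather than ANTLR
syntax. The keyword set, the token names and the tokenize helper are all
made up for illustration. The (?![A-Za-z0-9]) after the keyword alternatives
is the zero-width test: it consumes nothing, so the following character is
still on the input for the next token. It also passes at end of input, which
is my assumption about the desired behavior.

```python
import re

# Hypothetical keyword set, for illustration only.
KEYWORDS = ("for", "return")

# Alternatives are tried left to right. A keyword matches only when the
# negative look-ahead confirms the next character is not alphanumeric;
# the look-ahead itself consumes no input.
TOKEN = re.compile(
    r"(?P<KEYWORD>(?:" + "|".join(KEYWORDS) + r")(?![A-Za-z0-9]))"
    r"|(?P<ID>[A-Za-z][A-Za-z0-9]*)"
    r"|(?P<WS>\s+)"
    r"|(?P<OTHER>.)"
)


def tokenize(text):
    """Return (token_type, text) pairs, dropping whitespace tokens."""
    return [
        (m.lastgroup, m.group())
        for m in TOKEN.finditer(text)
        if m.lastgroup != "WS"
    ]
```

With this, "for" comes out as a KEYWORD, while "fore" fails the look-ahead
and falls through to the ID alternative.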

Best regards,
Jason

On Mon, Apr 11, 2011 at 3:25 PM, Terence Parr <parrt at cs.usfca.edu> wrote:

> I see in an early 2004 workshop that I intended to handle context-sensitive
> lexing:
>
> http://www.antlr.org/workshop/ANTLR2004/proceedings/ANTLR-3.0-Features.pdf
>
> * Each parser decision point generates special rule
>   in lexer with possible choices: e.g., (ID|INT)
> * Difficulties
>    - “for”, find “fore” must say “missing for, found ID”
>    - whitespace
>    - The C++ template vs ">>" token problem simply
>      disappears; i.e., when lexing
>          List<List<int>> a;
>      nested template has ">>" in it.  Lexer, without context,
>      cannot know which to pick.  Only the parser knows that
>      it expects ">" followed by ">" not ">>" token
>
> Scott Stanchfield also has some thoughts along these lines
>
> http://javadude.com/articles/antlr-context-sensitive-scanner.html
>
> I'm glad I wrote that slide because I couldn't remember what the
> difficulties were with context-sensitive lexing. Keywords are an issue, as
> is whitespace. If I remember correctly, Rats has a predicate in its
> identifier rule that makes it fail if it finds the id is also a keyword
> (yep, just checked). For whitespace, I think it simply scarfs whitespace in
> between rule references.
>
> Instead of forcing context-sensitive entry points into the lexer, I think a
> scannerless parser is simpler to understand conceptually. Rats is very good
> at combining grammars and it might be fun to come up with a scannerless
> version of ANTLR. It can be done easily right now by simply passing in
> characters as tokens and turning on backtracking with memoization. Perhaps
> I'll try that out.
>
> stat : 'return' e ';' | id '=' e ';' {String s = $id.text;} ;
>
> id : 'a' | 'b' | 'c' | ... ;
> e : int ;
> int : '0' | '1' | '2' | ... ;
>
> yep, that should work even with that action. There is no notion of a token
> really. hhm...cool.
>
> Ter
> PS oh crap...I should be preparing to teach in 30 minutes!
>
>
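
For what it's worth, the characters-as-tokens idea quoted above can be
sketched outside ANTLR as a tiny hand-written backtracking parser with
memoized rules. This is only my illustration of the quoted stat/id/e/int
grammar, not ANTLR output: parse_stat and the tuple results are hypothetical,
id is taken to be one lowercase letter and int one digit, and whitespace is
deliberately not handled.

```python
from functools import lru_cache


def parse_stat(text):
    """Parse text as one stat; return a tuple, or None if it does not match."""

    @lru_cache(maxsize=None)  # memoize each rule result per input position
    def ident(pos):
        # id : 'a' | 'b' | 'c' | ... ;  (assumed: one lowercase letter)
        if pos < len(text) and text[pos] in "abcdefghijklmnopqrstuvwxyz":
            return text[pos], pos + 1
        return None

    @lru_cache(maxsize=None)
    def expr(pos):
        # e : int ;  int : '0' | '1' | '2' | ... ;  (assumed: one digit)
        if pos < len(text) and text[pos] in "0123456789":
            return text[pos], pos + 1
        return None

    def literal(s, pos):
        # Match a fixed run of characters; there are no tokens, only chars.
        return pos + len(s) if text.startswith(s, pos) else None

    def stat(pos):
        # Alternative 1: 'return' e ';'
        p = literal("return", pos)
        if p is not None:
            e = expr(p)
            if e is not None:
                end = literal(";", e[1])
                if end is not None:
                    return ("return", e[0]), end
        # Alternative 1 failed: backtrack to pos and try id '=' e ';'
        i = ident(pos)
        if i is not None:
            p = literal("=", i[1])
            if p is not None:
                e = expr(p)
                if e is not None:
                    end = literal(";", e[1])
                    if end is not None:
                        return ("assign", i[0], e[0]), end
        return None

    result = stat(0)
    if result is not None and result[1] == len(text):
        return result[0]
    return None
```

Here parse_stat("r=1;") first fails the 'return' alternative and then
succeeds as an assignment, while parse_stat("return 1;") is rejected, which
is the whitespace difficulty showing up. In a grammar this small the
memoization buys nothing; it only matters once several alternatives re-enter
the same rule at the same position.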



-- 
--Jason Doege
jdoege at gmail.com

