[antlr-interest] Re: Lexer vs Parser

Mon May 20 18:37:55 PDT 2002

--- In antlr-interest at y..., "micheal_jor" <open.zone at v...> wrote:
> Hi All,
> 
> Following on from my Unicode post, I would appreciate some advice on 
> where to place some lexing/parsing decisions.
> 
> For instance with my definitions for Unicode categories, the 
> characters to be recognized include the individual characters in the 
> category as well as Unicode escape sequences that resolve to a 
> characters in the category.
> 
> This what I did in the lexer:
> 
> protected UNICODE_CLASS_Nl
>   : ( { IsUnicodeClass_Nl(LA(1)) }? . 
>     | { IsUnicodeClass_Nl(esc_char.getText()) }? 
> esc_char:UNICODE_ESCAPE_SEQUENCE 
>     )
>   ;
> 
> My question is should I be checking that the escape sequence 
resolves 
> to a character in the category in the Lexer or should that be 
> postponed to the Parser's battery of semantic analysis.

Unless there is reason for the parser to see the escape sequence as a 
distinct token, do it in the lexer.  This problem usually does not 
arise because the normal development process is the more top-down 
approach of designing the parse grammar first, and the lexing grammar 
as a consequence--what tokens do I need to recognize, and what can be 
filtered out before it gets to the parsing phase.  I gather that you 
are faced with a complex lexing problem which makes it difficult to do 
this.

The grammar which is most subject to change is the parse grammar:  you 
want to keep that as simple as possible.  At the same time, the lexer 
is the fastest (potentially) processing phase--your language processor 
will be faster if the lexer does more work and the parser and tree 
walkers less.  Lexers tend to be "write once, then tweak"--it is 
usually easy to borrow an existing lexical grammar for a new language 
processor--while borrowed parse grammars often need extensive 
rewriting, even if the language being processed by the derived parser 
is the same as the one recognized by the original.  Actions need to be 
inserted, trees need to be built, and so forth.  When in doubt, try to 
keep the parse grammar as clean as possible.  Cleanliness in the 
lexing grammar often does not matter--it gets edited rarely.  If you 
build more than one tool to process one input language, then the 
parser and tree walkers are subject to change while the lexer probably 
is not.  Focus on minimizing the complexity of the parser and tree 
walker grammars--it will pay off!

> In general, what is a good rule of thumb for deciding what goes into 
> a Lexer or a Parser. With Flex/Bison it is much easier since the 
> lexer/parser have very different capabilities.

You've been led astray by Ter's unified EBNF approach.  ANTLR parsers 
have enough more capability than LALR parsers that you can avoid 
passing information from parser to lexer, something which corrupts 
complex lex/yacc (flex/bison) grammars into a difficult-to-maintain 
mess.  ANTLR lexers can do (in theory) tokenization in situations 
where flex cannot (Ter has a FORTRAN example documented somewhere), 
but the function of the lexer is the same:  to convert a character 
stream into a token stream for processing by the parser.  The parser 
recognizes syntax, and ANTLR provides some capabilities for 
semantics-directed parsing decisions (semantic predicates).  Predicate 
hoisting (not supported by ANTLR 2) adds additional capabilities to 
the parser.  The boundaries between lexer and parser should be sharper 
for ANTLR than for flex/bison!

> What do you fine people suggest?
> 
> 
> Cheers,
> 
> Micheal

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/