[antlr-interest] Re: Lexer vs Parser
lgcraymer
lgc at mail1.jpl.nasa.gov
Mon May 20 18:37:55 PDT 2002
--- In antlr-interest at y..., "micheal_jor" <open.zone at v...> wrote:
> Hi All,
>
> Following on from my Unicode post, I would appreciate some advice on
> where to place some lexing/parsing decisions.
>
> For instance with my definitions for Unicode categories, the
> characters to be recognized include the individual characters in the
> category as well as Unicode escape sequences that resolve to a
> characters in the category.
>
> This what I did in the lexer:
>
> protected UNICODE_CLASS_Nl
> : ( { IsUnicodeClass_Nl(LA(1)) }? .
> | { IsUnicodeClass_Nl(esc_char.getText()) }?
> esc_char:UNICODE_ESCAPE_SEQUENCE
> )
> ;
>
> My question is should I be checking that the escape sequence
resolves
> to a character in the category in the Lexer or should that be
> postponed to the Parser's battery of semantic analysis.
Unless there is reason for the parser to see the escape sequence as a
distinct token, do it in the lexer. This problem usually does not
arise because the normal development process is the more top-down
approach of designing the parse grammar first, and the lexing grammar
as a consequence--what tokens do I need to recognize, and what can be
filtered out before it gets to the parsing phase. I gather that you
are faced with a complex lexing problem which makes it difficult to do
this.
The grammar which is most subject to change is the parse grammar: you
want to keep that as simple as possible. At the same time, the lexer
is the fastest (potentially) processing phase--your language processor
will be faster if the lexer does more work and the parser and tree
walkers less. Lexers tend to be "write once, then tweak"--it is
usually easy to borrow an existing lexical grammar for a new language
processor--while borrowed parse grammars often need extensive
rewriting, even if the language being processed by the derived parser
is the same as the one recognized by the original. Actions need to be
inserted, trees need to be built, and so forth. When in doubt, try to
keep the parse grammar as clean as possible. Cleanliness in the
lexing grammar often does not matter--it gets edited rarely. If you
build more than one tool to process one input language, then the
parser and tree walkers are subject to change while the lexer probably
is not. Focus on minimizing the complexity of the parser and tree
walker grammars--it will pay off!
> In general, what is a good rule of thumb for deciding what goes into
> a Lexer or a Parser. With Flex/Bison it is much easier since the
> lexer/parser have very different capabilities.
You've been led astray by Ter's unified EBNF approach. ANTLR parsers
have enough more capability than LALR parsers that you can avoid
passing information from parser to lexer, something which corrupts
complex lex/yacc (flex/bison) grammars into a difficult-to-maintain
mess. ANTLR lexers can do (in theory) tokenization in situations
where flex cannot (Ter has a FORTRAN example documented somewhere),
but the function of the lexer is the same: to convert a character
stream into a token stream for processing by the parser. The parser
recognizes syntax, and ANTLR provides some capabilities for
semantics-directed parsing decisions (semantic predicates). Predicate
hoisting (not supported by ANTLR 2) adds additional capabilities to
the parser. The boundaries between lexer and parser should be sharper
for ANTLR than for flex/bison!
> What do you fine people suggest?
>
>
> Cheers,
>
> Micheal
Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/
More information about the antlr-interest
mailing list