[antlr-interest] How do you structure a two-part lexer?
Gavin Lambert
antlr at mirality.co.nz
Fri May 29 14:17:06 PDT 2009
At 08:18 30/05/2009, Steve Cooper wrote:
> script: html ('<?' php '?>' html)*;
>
>The problem is that one language will have very different token
>sets; while html might have tokens like LT, GT, and TAGNAME,
>php will have ID, SEMICOLON, etc.
>
>So should I go for a single lexer? Two lexers feeding into a
>single parser? Two parsers? I have no idea how to go about
>interlacing languages like this. Any advice would be greatly
>appreciated.
If the points where you switch between the two are lexically
distinct (as they are in PHP, for example), then the best way to
do this would be to have either two lexers feeding into a single
parser or two lexer/parser combos outputting separate ASTs that
get merged later on (the former being simpler than the latter, in
general).
There are two common patterns for this sort of thing: in one case,
you have a "master" lexer and a "child" lexer (for PHP, the HTML
lexer would probably be the master and the PHP-code lexer the
child). The master lexer produces a single token containing
everything that should be examined by the child lexer instead;
when the parser processes this token, it creates a child lexer to
process the content of that token alone and switches input streams
until it's done. This is the approach taken by the included
examples, I think.
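
As a rough illustration of the master/child arrangement, here is a minimal sketch in plain Python. It is hand-rolled and does not use ANTLR's API; the names master_lex, child_lex, and tokens, and the token types TEXT and PHP_BLOCK, are all made up for the example. The HTML side is reduced to a single TEXT token type and the PHP side to ID and SEMICOLON only.

```python
import re

def master_lex(src):
    """Master (HTML-side) lexer: yields TEXT tokens plus one PHP_BLOCK
    token holding everything between '<?' and '?>'."""
    pos = 0
    while pos < len(src):
        start = src.find("<?", pos)
        if start == -1:
            yield ("TEXT", src[pos:])
            return
        if start > pos:
            yield ("TEXT", src[pos:start])
        end = src.find("?>", start + 2)
        if end == -1:
            raise ValueError("unterminated <? block")
        # The whole island becomes a single token for the child lexer.
        yield ("PHP_BLOCK", src[start + 2:end])
        pos = end + 2

# Hypothetical, deliberately tiny PHP token set: ID and SEMICOLON.
PHP_TOKEN = re.compile(r"\s*(?:(?P<ID>[A-Za-z_][A-Za-z_0-9]*)|(?P<SEMICOLON>;))")

def child_lex(code):
    """Child (PHP-side) lexer: tokenizes the content of one PHP_BLOCK."""
    code = code.rstrip()
    pos = 0
    while pos < len(code):
        m = PHP_TOKEN.match(code, pos)
        if m is None:
            raise ValueError(f"unexpected PHP input at offset {pos}")
        yield (m.lastgroup, m.group(m.lastgroup))
        pos = m.end()

def tokens(src):
    """Stand-in for the parser side: on PHP_BLOCK, create a child lexer
    for that token's content and read from it until it is exhausted."""
    for kind, text in master_lex(src):
        if kind == "PHP_BLOCK":
            yield from child_lex(text)
        else:
            yield (kind, text)

if __name__ == "__main__":
    for tok in tokens("<b>hi</b><? echo x; ?>bye"):
        print(tok)
```

Note that the child lexer only ever sees the text of one token, so offsets it reports are relative to that block rather than to the whole file.
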
The other pattern is to have each lexer itself explicitly transfer
control to the other when it encounters its end sequence (aka the
start sequence for the other lexer). I think this is a little
trickier to code, but it results in the parser receiving a
seamless stream of tokens from both lexers.
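
A minimal sketch of this second pattern, again in plain Python rather than ANTLR's API. The two lexers are modelled as functions that share one input and one position, and each returns which lexer should produce the next token. The names html_lexer, php_lexer, and lex, and the token types PHP_OPEN and PHP_CLOSE, are assumptions made for the example.

```python
import re

WS = re.compile(r"\s*")
# Hypothetical, deliberately tiny PHP token set plus the end sequence.
PHP_TOKEN = re.compile(r"(?P<PHP_CLOSE>\?>)|(?P<ID>[A-Za-z_]\w*)|(?P<SEMICOLON>;)")

def html_lexer(text, pos):
    """Return (token, new_pos, next_lexer) for the HTML side.
    Seeing '<?' hands control to the PHP lexer."""
    if text.startswith("<?", pos):
        return ("PHP_OPEN", "<?"), pos + 2, php_lexer
    end = text.find("<?", pos)
    if end == -1:
        end = len(text)
    return ("TEXT", text[pos:end]), end, html_lexer

def php_lexer(text, pos):
    """Return (token, new_pos, next_lexer) for the PHP side.
    Seeing '?>' hands control back to the HTML lexer."""
    pos = WS.match(text, pos).end()  # skip whitespace between PHP tokens
    m = PHP_TOKEN.match(text, pos)
    if m is None:
        raise ValueError(f"unexpected PHP input at offset {pos}")
    kind = m.lastgroup
    next_lexer = html_lexer if kind == "PHP_CLOSE" else php_lexer
    return (kind, m.group()), m.end(), next_lexer

def lex(text):
    """Driver: one continuous token stream drawn from both lexers."""
    lexer, pos = html_lexer, 0
    while pos < len(text):
        token, pos, lexer = lexer(text, pos)
        yield token

if __name__ == "__main__":
    for tok in lex("a<? x; ?>b"):
        print(tok)
```

Here the parser never needs to know a switch happened; it just sees PHP_OPEN and PHP_CLOSE as ordinary tokens in the stream.
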
Have a look at the "island grammar" examples provided with ANTLR
(and in the book) -- note that this would be an island grammar
under lexer control, which is much simpler than the parser-control
example that's in the Wiki.
Either way you go, though, don't forget to write lots of unit
tests! ;)