[antlr-interest] How do you structure a two-part lexer?

Gavin Lambert antlr at mirality.co.nz
Fri May 29 14:17:06 PDT 2009


At 08:18 30/05/2009, Steve Cooper wrote:
 >    script: html ('<?' php '?>' html)*;
 >
 >The problem is that one language will have very different token
 >sets; while html might have tokens like LT, GT, and TAGNAME,
 >php will have ID, SEMICOLON, etc.
 >
 >So should I go for a single lexer? Two lexers feeding into a
 >single parser? Two parsers? I have no idea how to go about
 >interlacing languages like this. Any advice would be greatly
 >appreciated.

If the points where you switch between the two are lexically 
distinct (as they are in PHP, for example), then the best way to 
do this would be to have either two lexers feeding into a single 
parser or two lexer/parser combos outputting separate ASTs that 
get merged later on (the former being simpler than the latter, in 
general).

There are two common patterns for this sort of thing: in one case, 
you have a "master" lexer and a "child" lexer (for PHP, the HTML 
lexer would probably be the master and the PHP-code lexer the 
child).  The master lexer produces a single token containing 
everything that should be examined by the child lexer instead; 
when the parser processes this token, it creates a child lexer to 
process the content of that token alone and switches input streams 
until it's done.  This is the approach taken by the included 
examples, I think.
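
As a rough illustration of the master/child pattern, here is a 
minimal sketch in plain Python.  It does not use ANTLR at all; 
the names Token, master_lex, child_lex and tokens are made up 
for the example, and the token kinds (HTML, PHP_BLOCK, ID, 
SEMICOLON) are assumptions loosely based on the question.  The 
tokens() function stands in for the parser step that notices the 
opaque token and spins up a child lexer over its text.

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    kind: str
    text: str


def master_lex(source: str) -> Iterator[Token]:
    """Hypothetical master (HTML-side) lexer.

    Everything between '<?' and '?>' is emitted as ONE opaque
    PHP_BLOCK token; everything else is emitted as HTML text.
    """
    pos = 0
    while pos < len(source):
        start = source.find("<?", pos)
        if start == -1:
            yield Token("HTML", source[pos:])
            return
        if start > pos:
            yield Token("HTML", source[pos:start])
        end = source.find("?>", start + 2)
        if end == -1:
            raise SyntaxError("unterminated '<?' block")
        yield Token("PHP_BLOCK", source[start + 2:end])
        pos = end + 2


# Hypothetical child (PHP-side) token set: identifiers and semicolons only.
CHILD_RE = re.compile(r"\s+|(?P<ID>[A-Za-z_]\w*)|(?P<SEMICOLON>;)")


def child_lex(code: str) -> Iterator[Token]:
    """Child lexer: sees only the text of a single PHP_BLOCK token."""
    pos = 0
    while pos < len(code):
        m = CHILD_RE.match(code, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {code[pos]!r}")
        if m.lastgroup is not None:  # whitespace matches no named group
            yield Token(m.lastgroup, m.group())
        pos = m.end()


def tokens(source: str) -> Iterator[Token]:
    """Stand-in for the parser: re-lex each PHP_BLOCK with a child lexer."""
    for tok in master_lex(source):
        if tok.kind == "PHP_BLOCK":
            yield from child_lex(tok.text)
        else:
            yield tok
```

Calling list(tokens("<b><? foo; ?></b>")) yields the HTML text, 
then the ID and SEMICOLON tokens from the child lexer, then the 
trailing HTML text.  Note that the child lexer never sees the 
'<?' and '?>' delimiters in this arrangement.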

The other pattern is to have each lexer itself explicitly transfer 
control to the other when it encounters its end sequence (aka the 
start sequence for the other lexer).  I think this is a little 
trickier to code, but it results in the parser receiving a 
seamless stream of tokens from both lexers.
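
A minimal sketch of this hand-off pattern, again in plain Python 
rather than ANTLR.  All names here (Token, html_lexer, php_lexer, 
lex) and the token kinds are hypothetical.  Each lexer function 
consumes input until it meets its own end sequence, then returns 
the other lexer as the one to run next, so the caller ends up 
with one continuous token list.

```python
import re
from typing import Callable, List, NamedTuple, Optional, Tuple


class Token(NamedTuple):
    kind: str
    text: str


# A lexer takes (source, position, output list) and returns the new
# position plus the lexer that should run next (None at end of input).
Lexer = Callable[[str, int, List[Token]], Tuple[int, Optional[Callable]]]

# Hypothetical PHP-side token set, including the '?>' end sequence.
PHP_RE = re.compile(
    r"\s+|(?P<PHP_END>\?>)|(?P<ID>[A-Za-z_]\w*)|(?P<SEMICOLON>;)"
)


def html_lexer(src: str, pos: int, out: List[Token]):
    """Emit HTML text up to the next '<?', then hand over to php_lexer."""
    start = src.find("<?", pos)
    end = len(src) if start == -1 else start
    if end > pos:
        out.append(Token("HTML", src[pos:end]))
    if start == -1:
        return end, None  # no more input: stop
    out.append(Token("PHP_START", "<?"))
    return start + 2, php_lexer


def php_lexer(src: str, pos: int, out: List[Token]):
    """Emit PHP tokens up to and including '?>', then hand back."""
    while pos < len(src):
        m = PHP_RE.match(src, pos)
        if m is None:
            raise SyntaxError(f"unexpected {src[pos]!r} at offset {pos}")
        pos = m.end()
        if m.lastgroup is None:
            continue  # skip whitespace
        out.append(Token(m.lastgroup, m.group()))
        if m.lastgroup == "PHP_END":
            return pos, html_lexer
    raise SyntaxError("input ended inside a '<?' block")


def lex(src: str) -> List[Token]:
    """Driver: start in HTML and follow whichever lexer is returned."""
    out: List[Token] = []
    pos, lexer = 0, html_lexer
    while lexer is not None:
        pos, lexer = lexer(src, pos, out)
    return out
```

For the input "<b><? foo; ?></b>" this produces HTML, PHP_START, 
ID, SEMICOLON, PHP_END, HTML in that order, which lines up with 
a parser rule shaped like the script rule quoted above.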

Have a look at the "island grammar" examples provided with ANTLR 
(and in the book) -- note that this would be an island grammar 
under lexer control, which is much simpler than the parser-control 
example that's in the Wiki.

Either way you go, though, don't forget to write lots of unit 
tests! ;)


