[antlr-interest] Languages within HTML

Stuart Watt SWatt at infobal.com
Thu Jan 31 13:00:55 PST 2008


An intriguing problem. I did not expect this to work in PHP, and if the PHP
were intended to be processable as XML it would be invalid, as the markup tags
would cease to be processing instructions. PHP authors are usually
encouraged to write <? echo("?".">"); ?> or similar. I have a feeling ASP is
simpler, with the tags being processed before the source code is parsed:
because ASP allows multiple languages, it has to work differently.

So it seems like the PHP processor behaves as if it starts in HTML mode, and
transforms (incrementally) everything up to "<?" into what behaves like an
echo statement. It then drops into PHP mode and parses (again incrementally)
until it hits "?>" where it would accept a statement terminator, switching
back into HTML and repeating. This is not at all what I had expected, and
implies all sorts of problems mixing JSP/ASP and PHP anyway, as JSP is
implemented as a rewrite, more or less. 
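
The two-mode behaviour described above can be sketched in a few lines of
Python. This is only an illustration of the model as guessed at here, not
how the PHP processor is actually implemented. It assumes short <? ... ?>
tags as used in this thread and C-style quoted strings with backslash
escapes; the names PHP_TOKEN and rewrite_as_echo are made up for the example.

```python
import re

# One PHP-mode token: a quoted string (backslash escapes allowed), the close
# marker, or any other single character. Assumption: only strings can hide "?>".
PHP_TOKEN = re.compile(r'''"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*'|\?>|.''', re.DOTALL)


def rewrite_as_echo(source):
    """Return a list of ("echo", html) and ("php", code) steps."""
    steps, pos = [], 0
    while pos < len(source):
        # HTML mode: everything up to the next "<?" behaves like an echo.
        start = source.find("<?", pos)
        if start == -1:
            start = len(source)
        if start > pos:
            steps.append(("echo", source[pos:start]))
        if start == len(source):
            break
        # PHP mode: scan token by token so a "?>" inside a string is skipped.
        code_start = start + 2
        end = len(source)  # an unterminated block runs to end of input
        for match in PHP_TOKEN.finditer(source, code_start):
            if match.group() == "?>":
                end = match.start()
                break
        steps.append(("php", source[code_start:end]))
        pos = end + 2
    return steps
```

For example, rewrite_as_echo('<p><? echo("?>"); ?></p>') keeps the quoted
"?>" inside the single PHP step and treats the surrounding markup as two
echo steps.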

This processing model implies that PHP may need to be the "root" grammar,
with the HTML elements handed off to other grammars if and when needed.
Other grammars can be identified (partly!) by the tags, but <% ... %> can be
Java, JavaScript, VB, even Perl. I haven't tested any of these awkward cases
(such as Perl's heredoc), but I would guess (and it is a guess) that it is
PHP that is the oddity here. 

I've started to look at Pygments as a solution to doing code highlighting.
It does not parse deeply, but for segmenting stuff and handing things off
between different languages, particularly PHP/HTML/JSP/ASP, it works well.
However, it would be confused by this example. 

All the best
Stuart

-----Original Message-----
From: Monty Zukowski [mailto:monty at codetransform.com]
Sent: Thursday, January 31, 2008 3:24 PM
To: Darien Hager
Cc: antlr-interest at antlr.org
Subject: Re: [antlr-interest] Languages within HTML


You could probably get pretty far just by handling strings, comments &
escape sequences for each embedded language.
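
A rough sketch of that idea in Python, producing the flat list of sibling
segments asked about below. Everything in the table is an assumption made
for illustration: the Embedded class, the LANGUAGES entries, and the claim
that both languages use C-style strings and /* */ comments. Line comments
are deliberately left out, and whether a real JSP or ASP processor honours
quoting this way is not something the sketch tries to settle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Embedded:
    name: str
    open_tag: str
    close_tag: str
    quotes: tuple = ('"', "'")           # assumed string delimiters
    block_comment: tuple = ("/*", "*/")  # assumed comment syntax


# Hypothetical per-language table; extend it for each embedded language.
LANGUAGES = (Embedded("PHP", "<?", "?>"), Embedded("JSP", "<%", "%>"))


def find_close(text, pos, lang):
    """Index of lang.close_tag at or after pos, skipping strings and block
    comments; len(text) if the block is never closed."""
    n = len(text)
    while pos < n:
        if text.startswith(lang.close_tag, pos):
            return pos
        ch = text[pos]
        if ch in lang.quotes:
            pos += 1
            while pos < n and text[pos] != ch:
                pos += 2 if text[pos] == "\\" else 1  # skip escaped char
            pos += 1  # step over the closing quote
        elif text.startswith(lang.block_comment[0], pos):
            end = text.find(lang.block_comment[1], pos + len(lang.block_comment[0]))
            pos = n if end == -1 else end + len(lang.block_comment[1])
        else:
            pos += 1
    return n


def segment(text):
    """Split text into a flat list of (kind, body) siblings."""
    out, pos, html_start = [], 0, 0
    while pos < len(text):
        lang = next((l for l in LANGUAGES if text.startswith(l.open_tag, pos)), None)
        if lang is None:
            pos += 1
            continue
        if pos > html_start:
            out.append(("HTML", text[html_start:pos]))
        body_start = pos + len(lang.open_tag)
        end = find_close(text, body_start, lang)
        out.append((lang.name, text[body_start:end]))
        pos = html_start = end + len(lang.close_tag)
    if html_start < len(text):
        out.append(("HTML", text[html_start:]))
    return out
```

Each body could then be handed to a full grammar for that language only if
and when a deeper parse is needed.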

Monty

On Jan 31, 2008 10:15 AM, Darien Hager <darien.hager at etelos-inc.com> wrote:
> I'm experimenting with ANTLR to try to solve a particular problem, and I'd
> like to check some assumptions and ask for any suggestions.
>
> Situation: I have an HTML file with boundaries defining blocks of embedded
> code, such as PHP and JSP. More than one language can be embedded.
>
> Suppose PHP blocks are encapsulated with <? ?> markers, and JSP blocks in
> <% %> markers.
>
> What I want to do is analyze the file and create an AST that begins with a
> line of siblings for each segment (e.g. HTML, PHP, HTML, JSP, PHP, HTML,
> PHP).
>
> However, I don't want it to be so naive that a properly-quoted end-marker
> will be wrongly hit, e.g.: <? echo("?>"); ?>
>
> Question: Is the only robust way to do this to create (or re-use) grammars
> for PHP and JSP?
> I'm assuming the answer is yes, in which case it's no longer a small
> experiment.
>
> --
> Darien Hager
> Developer
> Etelos, Inc.
> darien at etelos.com
>
> http://www.etelos.com
> "Revolutionizing the way applications are developed, distributed and
> consumed."

