[antlr-interest] General questions about something complicated

denstar valliantster at gmail.com
Thu Aug 5 22:49:39 PDT 2010


Hello!

A little backstory (this will be a long message):

I've been working on an editor for my favorite language, CFML, on and
off for years now.  The editor is based on Eclipse.

The way we parse code is this hand-made kludge job.  I want to take
things to a higher level, so I thought, "hey, this ANTLR thing looks
pretty nifty!".  That was like a couple years ago.

I've messed with writing a grammar, on and off, and haven't got far.
Someone recently said they'd buy me The Book, which will probably
help, but what I have to do seems pretty hard, and just lurking on
this list for a bit makes me feel even more out of my depth, which
isn't "teh awesome", so to speak.

I'm not really too concerned with being out of my depth, as the kinds
of deep water sharks/tentacled beasts I fear don't live on the
internet, but I do wonder about the best way to achieve my goal.
Forgive my probably lame questions:

An example of the source code that I have to parse (it's a
markup/scripting language, mixed with HTML sometimes, similar to PHP):

<html>
	<cffunction name="test">
		<cfargument name="fred" test="test"/>
		<cfscript>
			WriteOutput("FREDFREDFRED"); somethinghere = 343;
		</cfscript>
		<cfif thisisatest is 1>
			<cfoutput>#fred#</cfoutput>
		</cfif>
	</cffunction>
	<cfscript>
	  todaysDate = now();
	
	  function doSomething(String doWhat) {
	  	var done = arguments.doWhat & " later";
	  	return done;
	  }
	  function returnSomething(theThing) {
	  	return theThing;
	  }
	</cfscript>

	<cfset fred = 2/>
	<cfset bob = doSomething("build a parser") />
	<cfset test(fred)/>
	<cffunction name="test" >
		<cfset var woo="hoo" />
		<cfargument name="test" default="#WriteOutput("">"")#"/> <!--- I think this is valid! --->
	</cffunction>
<body>
  <cf_myCustomTag action="rock">
	<cfoutput>
		This is a <b>test</b> #fred#
	</cfoutput>
	<table>
		<tr>
			<td style="<cfoutput>#somethinghere#</cfoutput>">asdfasdf</td>
			<td style="fred"></td>
		</td>
	</table>
</body>
</html>

That's some of the nastiest bastard data as an example.  Generally
it's far better than that.

I wrote something that uses the Jericho HTML lib to parse the tags,
and that works well enough, I guess.  When I hit a <cfscript> tag I
hand it off to another (broken) parser.
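
A rough sketch of that hand-off, for illustration only: carve the <cfscript> bodies out of the source and remember their offsets, so a separate script parser can report positions in the original file. This is not the editor's actual code. It is written in Python purely for brevity, the names Region and split_regions are made up, and a regex like this would be fooled by a literal </cfscript> inside a string or comment.

```python
import re
from typing import List, NamedTuple


class Region(NamedTuple):
    kind: str   # "script" for a <cfscript> body, "markup" for everything else
    start: int  # offset of the region's text in the original source
    text: str


# Matches one <cfscript>...</cfscript> block; group 1 is the script body.
CFSCRIPT = re.compile(
    r"<cfscript\b[^>]*>(.*?)</cfscript\s*>",
    re.IGNORECASE | re.DOTALL,
)


def split_regions(source: str) -> List[Region]:
    """Split source into markup regions and cfscript bodies, in document order.

    The <cfscript> open/close tags themselves are left out of both kinds.
    """
    regions = []
    pos = 0
    for m in CFSCRIPT.finditer(source):
        if m.start() > pos:
            regions.append(Region("markup", pos, source[pos:m.start()]))
        # m.start(1) is where the script body begins in the original file.
        regions.append(Region("script", m.start(1), m.group(1)))
        pos = m.end()
    if pos < len(source):
        regions.append(Region("markup", pos, source[pos:]))
    return regions
```

Each "script" region could then go to the ECMAScript-like parser and each "markup" region to the tag parser, with start added to any position the sub-parser reports.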

The cfscript stuff is ECMAScript-ish, so I think I can modify an
existing grammar and get the broken parser going (I don't have as much
trouble modifying stuff as creating it), but how would you guys go
about handling parsing something like this?

Should I try to write an overall ANTLR grammar for everything, maybe
with a sub-grammar-type-deal for the script stuff?  Or just say screw
it, and stick to using ANTLR for just the ECMAScript-like portion?

It gets a lot more complicated than the above code example, too, even
for just the script stuff.  There are a few CFML engines, and some
care about semicolons and some don't (which I've seen handled
elsewhere, so not too worried about), and some can do different "for"
loops, etc. (more worried about things like this).  They change by
version, as well, and I'd like to support different versions in a
perfect world.
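
One way to picture the engine/version problem is as a table of feature flags that the parser consults before accepting a construct. The sketch below is only that, a sketch: the engine names, version numbers, and flags are hypothetical, and treating "for-in" as the loop form that differs is an assumption made for the example. In an ANTLR grammar the same check would presumably live in a semantic predicate rather than in hand-written code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dialect:
    name: str
    version: int
    semicolons_optional: bool = False
    for_in_loops: bool = False


# Hypothetical engines and versions, purely for illustration.
DIALECTS = {
    ("engine-a", 8): Dialect("engine-a", 8),
    ("engine-a", 9): Dialect("engine-a", 9, for_in_loops=True),
    ("engine-b", 3): Dialect("engine-b", 3, semicolons_optional=True,
                             for_in_loops=True),
}


def parse_for_header(tokens, dialect):
    """Classify the tokens between the parentheses of a for (...) header.

    Returns "for-in" or "classic", and raises SyntaxError for a form the
    given dialect does not support.
    """
    if "in" in tokens:
        if not dialect.for_in_loops:
            raise SyntaxError(
                f"for-in loops are not supported by "
                f"{dialect.name} {dialect.version}"
            )
        return "for-in"
    if tokens.count(";") == 2:
        return "classic"
    raise SyntaxError("malformed for header")
```

With this shape, supporting a new engine version means adding one row to the table instead of forking the grammar.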

I have to be honest-- I didn't know anything about ASTs and Lexing and
Parsing a few years ago.  Maybe in some abstract form, but not like I
do now (a lot more, relatively).  And I *still* don't think I've
totally (or even "very much") grokked it, or I wouldn't be asking
these questions.

I'm wondering if I'm insane for thinking about using ANTLR for the
"whole shebang".  In the few years that I've been watching ANTLR, lots
of nifty stuff has been added, which makes me think that maybe it's
not as crazy an idea as it seemed at one time, at least.

But it's probably too much to bite off at once, even if it's not a
crazy idea, neh?  Maybe I should stick to futzing with one of the
existing ECMA grammars for the script-like portions, and try to wrap
my head around ANTLR and parsing in general more first?  Start from
scratch and actually learn this stuff?

I'll probably be the one working on the grammar in the future, so tho
I'm tempted to try to get someone to donate time/money==grammar, I
want to learn.  But I don't have another few years to produce, so
what's the practical approach, given this long and
not-very-well-expressed background?

Apologies for framing my questions as poorly as I fear I have.  =)

:Den

-- 
If all mankind minus one were of one opinion, mankind would be no more
justified in silencing that one person than he, if he had the power,
would be justified in silencing mankind.
John Stuart Mill

