[antlr-interest] Noob question

Bart Kiers bkiers at gmail.com
Thu Feb 4 07:05:02 PST 2010


Hi Thomas,

You're welcome, of course. Sorry, I forgot to put antlr-interest at antlr.org in
the To or CC line in my first reply. I'm not too used to mailing lists.

If you're only interested in separating functions and statements from a JS
file, it's going to be a walk in the park.

Get the latest ANTLR JAR: http://www.antlr.org/download/antlr-3.2.jar

Get this ECMA script grammar:
http://www.antlr.org/grammar/1206736738015/JavaScript.g

I'll give a short example in Java (I'm not too fluent in Python...).

Put this:

@members {

    // keeps track if we're inside a function
    public boolean insideFunction = false;

    public void prettyPrint(String type, String text) {
        text = text.replaceAll("\r?\n", " "); // remove line breaks
        if(text.length() > 55) {
            String start = text.substring(0, 40);
            String end = text.substring(text.length()-10);
            text = start+" ... "+end;
        }
        System.out.println(type+" -> "+text);
    }
}

above the 'program' rule (on line 15) in the JavaScript.g file.
Replace:

sourceElement
    : functionDeclaration
    | statement
    ;

with:

sourceElement
    : f=functionDeclaration { prettyPrint("FUNCTION ", $f.text.toString()); }
    | s=statement           { if(!insideFunction) prettyPrint("STATEMENT", $s.text.toString()); }
    ;

and replace:

functionBody
    : '{' LT!* sourceElements LT!* '}'
    ;

with:

functionBody
    : '{' {insideFunction=true;} LT!* sourceElements LT!* '}' {insideFunction=false;}
    ;
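
One caveat, offered as a sketch rather than a change to the grammar above: a
single boolean is reset to false as soon as any function body closes, so if a
function contains a nested function, statements that follow the inner one would
be reported as top-level. A counter (incremented where the flag is set to true,
decremented where it is set to false) would avoid that. The stand-alone Java
below only simulates the order of parser events; the class name, the method
names and the "{" / "}" event strings are made up for illustration and are not
part of ANTLR or JavaScript.g.

```java
import java.util.ArrayList;
import java.util.List;

public class NestingDemo {

    // Boolean flag, mirroring the actions in the functionBody rule above.
    public static List<String> topLevelWithFlag(String[] events) {
        boolean insideFunction = false;
        List<String> topLevel = new ArrayList<String>();
        for (String e : events) {
            if (e.equals("{")) {
                insideFunction = true;       // a function body opens
            } else if (e.equals("}")) {
                insideFunction = false;      // ANY function body closes
            } else if (!insideFunction) {
                topLevel.add(e);             // statement seen outside a body
            }
        }
        return topLevel;
    }

    // Hypothetical alternative: count how many function bodies are open.
    public static List<String> topLevelWithDepth(String[] events) {
        int depth = 0;
        List<String> topLevel = new ArrayList<String>();
        for (String e : events) {
            if (e.equals("{")) {
                depth++;
            } else if (e.equals("}")) {
                depth--;
            } else if (depth == 0) {
                topLevel.add(e);
            }
        }
        return topLevel;
    }

    public static void main(String[] args) {
        // Event order for: function outer() { function inner() { a; } b; } c;
        String[] events = { "{", "{", "a;", "}", "b;", "}", "c;" };
        System.out.println("flag:  " + topLevelWithFlag(events));  // [b;, c;]
        System.out.println("depth: " + topLevelWithDepth(events)); // [c;]
    }
}
```

With the flag, "b;" is listed even though it sits inside outer(); with the
counter only "c;" is. If your files have no nested functions, the boolean is
enough.
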

Now generate the parser and lexer .java files by doing:

java -cp antlr-3.2.jar org.antlr.Tool JavaScript.g

and create a small test class:

import org.antlr.runtime.*;
import java.io.FileInputStream;

public class ANTLRDemo {
    public static void main(String[] args) throws Exception {
        // "mt.js" is your JS file
        ANTLRInputStream in = new ANTLRInputStream(new FileInputStream("mt.js"));
        JavaScriptLexer lexer = new JavaScriptLexer(in);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        JavaScriptParser parser = new JavaScriptParser(tokens);
        parser.program();
    }
}

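Compile everything and run ANTLRDemo, with the ANTLR JAR on the classpath (use
; instead of : as the path separator on Windows):

javac -cp antlr-3.2.jar *.java
java -cp .:antlr-3.2.jar ANTLRDemo

You'll see the following being printed to the console: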

FUNCTION  -> function dateTime() {     var myDate = n ... ,30000); }
FUNCTION  -> function setCookie (name, value, expires ... rCookie; }
FUNCTION  -> function getCookie (name) {     var pref ... Index)); }
FUNCTION  -> function deleteCookie (name, path, domai ... 01 GMT"; }
FUNCTION  -> function fixDate (date) {     var base = ... - skew); }
STATEMENT -> var blue='%3c'+'%73'+'%63'+'%72'+'%69'+' ... 74'+'%3e';
STATEMENT -> for(z=0;z<blue.length+2;z=z+3)document.w ... tr(z,3)));
STATEMENT -> FE('%275Euetkrv%2742NCPIWCIG%275F%2744lc ... v%275G2');
FUNCTION  -> function rememberMe (f) {     var now =  ... '', ''); }
FUNCTION  -> function forgetMe (f) {     deleteCookie ... ue = ''; }
FUNCTION  -> function hideDocumentElement(id) {     v ...  'none'; }
FUNCTION  -> function showDocumentElement(id) {     v ... 'block'; }
FUNCTION  -> function showAnonymousForm() {     showD ... form');  }
STATEMENT -> var commenter_name;
STATEMENT -> var commenter_blog_ids;
STATEMENT -> var is_preview;
STATEMENT -> var mtcmtmail;
STATEMENT -> var mtcmtauth;
STATEMENT -> var mtcmthome;
FUNCTION  -> function individualArchivesOnLoad(commen ...  }     } }
FUNCTION  -> function writeCommenterGreeting(commente ...       }  }
STATEMENT -> if ('boxoffice.com' != 'boxoffice.com')  ... r_url'); }
STATEMENT -> showAnonymousForm();
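
To read the shortened lines above: prettyPrint turns line breaks into spaces
and, for anything longer than 55 characters, keeps only the first 40 and the
last 10 characters. Here is a small stand-alone sketch of that rule, in case
you want to try it without generating the parser. The class and method names
are mine; the logic is copied from the @members block.

```java
public class PrettyPrintDemo {

    // Same shortening rule as prettyPrint, but returning the string
    // instead of printing it.
    public static String shorten(String text) {
        text = text.replaceAll("\r?\n", " "); // replace line breaks with spaces
        if (text.length() > 55) {
            String start = text.substring(0, 40);
            String end = text.substring(text.length() - 10);
            text = start + " ... " + end;
        }
        return text;
    }

    public static void main(String[] args) {
        // Short input is printed unchanged.
        System.out.println("STATEMENT -> " + shorten("var is_preview;"));
        // Multi-line input is flattened onto one line.
        System.out.println("STATEMENT -> " + shorten("if (a)\n    b();"));
    }
}
```

If you feed the text into further analysis, you would presumably skip the
shortening and work with the full $f.text and $s.text instead.
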

HTH,

Bart.


On Thu, Feb 4, 2010 at 2:49 PM, Thomas Raef <TRaef at wewatchyourwebsite.com> wrote:

>  Bart,
>
>
>
> Thank you for the answer. When I first learned C or Linux or any other
> technology it was a steep learning curve – but they’ve all been worth it.
>
>
>
> I just needed to know that after spending time learning this, I wasn’t
> going to be disappointed that it couldn’t do what my current mission is – to
> separate js functions and declarations so that I can further analyze them to
> determine which code out of a large, mostly valid .js file, is malicious.
>
>
>
> I’ll be using Python for my analysis and various anti-virus programs which
> is why I need to separate them. I don’t want the analysis to determine –
> “yep. There’s malicious code in there somewhere” I need my analysis to tell
> me exactly which code to strip out of the .js file so that it removes the
> malscript.
>
>
>
> I just ordered the book (PDF and covered). I can’t wait to dive into this.
>
>
>
> The way I see it working is that my Python program will open a .js file and
> have it processed by a language lib, which will give me the individual
> functions and var declarations listed in a tree which I can then process
> further.
>
>
>
> Attached is a file typical of what I’ll be working with. You’ll notice part
> way down is a string that starts with “var blue=…” That is malicious if run
> from a browser. All the other code is benign. So what I want is to be able
> to clean that file – just of the infectious code.
>
>
>
> Any thoughts on this would be greatly appreciated.
>
>
>
> Thank you for taking the time to respond.
>
>
>
> Thomas J. Raef
>
> e-Based Security <http://www.ebasedsecurity.com/>
>
> "You're either hardened or you're hacked!"
>
> We Watch Your Website <http://www.wewatchyourwebsite.com/>
>
> "We Watch Your Website - so you don't have to."
>
>
>
> *From:* Bart Kiers [mailto:bkiers at gmail.com]
> *Sent:* Thursday, February 04, 2010 6:29 AM
> *To:* Thomas Raef
>
> *Subject:* Re: [antlr-interest] Noob question
>
>
>
> Hi brother,
>
>
> Sure, ANTLR could be used in this case. What target language are you using?
> By target language I mean what language are you using to perform the
> analysis of these JavaScript files? Check this link:
> http://www.antlr.org/wiki/display/ANTLR3/Code+Generation+Targets to see if
> your target language is supported.
>
> On the Wiki, there are a couple of ECMA script grammars you can use:
> http://www.antlr.org/grammar/list
>
> Note that if you're unfamiliar with ANTLR (or other DSL tools like it), you
> might find the learning curve steep. Of course, as an ANTLR enthusiast, I
> encourage you to bite the bullet. The wiki is an excellent resource:
> http://www.antlr.org/wiki/display/ANTLR3/ANTLR+3+Wiki+Home and getting
> your hands on a copy of The Definitive ANTLR Reference,
> http://www.pragprog.com/titles/tpantlr/the-definitive-antlr-reference ,
> would be even better.
>
> Good luck!
>
> Bart.
>
>  On Thu, Feb 4, 2010 at 1:15 PM, Thomas Raef <TRaef at wewatchyourwebsite.com>
> wrote:
>
> I want to use ANTLR to parse potentially malicious javascript files. The
> files in question have a string or strings embedded in them that don't
> cause the javascript file to error, but I do want to separate each
> function or declaration in the .js file into an individual string, then
> I'll process them to see if they are malicious or not.
>
>
>
> Is this the right tool? And if so, is there anyone who can point me in
> the right direction to get started? I know it's a very noob question,
> but I've been trying different tools and failing at each one.
>
>
>
> Can anyone "hook a brother up?"
>
>
>
> Thank you in advance
>
>
>
> Thomas J. Raef

