[antlr-interest] x86 assembler parser (AT&T syntax)

Tue Feb 28 08:00:10 PST 2006

Hello all,

In case anyone has interest in a grammar for assembler: I wrote a  
parser, transformer, and emitter for x86 assembler (AT&T syntax).  It  
successfully parses all assembler produced in a typical Linux kernel  
build (both Linux 2.4 and 2.6).  Its performance seems pretty nice  
(using the Antlr C++ backend).

This was my first Antlr grammar, and written in a big rush, thus  
excuse strangeness in my grammar, and feel free to share critique.

The grammar file is here:
http://l4hq.org/cvsweb/cvsweb/~checkout~/afterburner/asm-parser/Asm.g

The project plus build instructions are here:
http://l4ka.org/projects/virtualization/afterburn/

Some notes about the grammar :

Assembler has many mnemonics, and I wanted the mnemonic lookups to be  
fast via the hash table.  The problem: AT&T syntax appends a suffix  
to the mnemonics to denote bit-width.  I didn't see a way to separate  
the mnemonic from the suffix in the lexer when using the hash table,  
and so the grammar is a bit verbose with the mnemonic declarations.   
I probably should generate the grammar from another grammar, to avoid  
bugs that could easily accompany all of the tedious mnemonic  
declarations.

My grammar is covered with manual tree construction commands, because  
I want a tree without any syntax residue, i.e., I want the tree nodes  
to represent an intention, independent of the syntax.  Syntax is  
controversial for x86 assembler, because it has two types of syntax,  
Intel and AT&T, which are so different that they even reverse the  
ordering of source and destination operands.  The grammar could be  
the basis of a tool that transforms between the two syntax types  
(although perhaps the grammar is too attached to AT&T syntax).

The parser is incomplete for lack of time, and because I reverse  
engineered the GNU assembler syntax (commands, macros, etc.) for lack  
of a concise definition.  My goal for the grammar is to transform the  
x86 instructions that are sensitive to privilege level, while  
ignoring everything else, and thus I implemented an overly broad  
grammar that probably accepts illegal assembler (and probably raises  
errors on valid assembler).

Joshua