History of Decompilation (2000-present)
University of London's Asm21toc reverse compiler, 2000.
This assembly language decompiler for Digital Signal Processing (DSP) code was written in a compiler-compiler called rdp [
JSW00 ]. The authors note that DSP is one of the last areas where assembly language is still commonly used. As usual, the decompilation from assembly language is considerably easier than from executable code; in fact the authors "doubt the usefulness" of decompiling from binary files.
Proof-Directed De-compilation of Low-Level Code, 2001.
Katsumata and Ohori published a paper [
KO01 ] on decompilation based on proof theoretical methods. The input is Jasmin, essentially Java assembly language. The output is an ML-like simply typed functional language. Their example shows an iterative implementation of the factorial function transformed into two functions (an equivalent recursive implementation). Their approach is to treat each instruction as a constructive proof representing its computation. While not immediately applicable to a general decompiler, their work may have application where proof of correctness (usually of a compilation) is required.
In [
Myc01 ], Mycroft compares his Type-Based decompilation with this work. Structuring to loops and conditionals is not attempted by either system. He concludes that the two systems produce very similar results in the areas where they overlap, but that they have different strengths and weaknesses.
Computer Security Analysis through Decompilation and High-Level Debugging, 2001.
Cifuentes et al suggested dynamic decompilation as a way to provide a powerful tool for security work. The main idea is that the security analyst is only interested in one small piece of code at one time, and so high level code could be generated "on the fly". One problem with traditional (static) decompilation is that it is difficult to determine the range of possible values of variables; by contrast, a dynamic decompiler can provide at least one value (the current value) with no effort [
CWVE01 ].
Type Propagation in IDA Pro Disassembler, 2001.
Guilfanov describes the type propagation system in the popular disassembler IDA Pro
Ida Pro. The types of parameters to library calls are captured from system header files. The parameter types for commonly used libraries are saved in files called type libraries. Assignments to parameter locations are annotated with comments with the name and type of the parameter. This type information is propagated to other parts of the disassembly, including all known callers. At present, no attempt is made to find the types for other variables not associated with the parameters of any library calls [
Gui01 ].
DisC, by Satish Kumar, 2001.
This decompiler is designed to read only programs written in Turbo C version 2.0 or 2.01; it is an example of a compiler specific decompiler. There is no significant advantage to this approach, since general techniques are not much more difficult to implement. It is an interesting observation that since most aspects of decompilation are ultimately pattern matching in some sense, the difference between pattern matching decompilers and general ones is essentially one of the generality of the patterns.
http://www.debugmode.com/dcompile/disc.htm.
ndcc decompiler, 2002.
André Janz modified the dcc decompiler to read 32-bit Windows Portable Executable (PE) files. The intent was to use the modified decompiler to analyse malware. The author states that a rewrite would be needed to fully implement the 80386 instruction set. Even so, reasonable results were obtained, while retaining dcc's severe limitations [
Jan02 ].
The Anatomizer Decompiler, circa 2002.
K. Morisada released a decompiler for Windows 32-bit programs, which appears to be comparable in capability to the REC compiler for that platform. See also
AnatomizerDecompiler and
http://jdi.at.infoseek.co.jp.
Analysis of Virtual Method Invocation for Binary Translation, 2002.
Tröger and Cifuentes show a method of analysing indirect call instructions. If such a call implements a virtual method call and is correctly identified, various important aspects of the call are extracted. The technique as presented is limited to one basic block; as a result, it fails for some less common cases. [
TC02 ].
Boomerang, 2002.
This is an open source decompiler, with several front ends (two are well developed) and a C back end. It uses an internal representation based on the Static Single Assignment form, and pioneers dataflow-based type analysis. At the time of writing, it is still limited to quite small (toy) binary programs.
http://boomerang.sourceforge.net.
Desquirr, 2002.
This is an IDA Pro Plugin, written by David Eriksson as part of his Master's thesis. It decompiles one function at a time to the IDA output window. While not intended to be a serious decompiler, it illustrates what can be done with the help of a powerful disassembler and about 5000 lines of C++ code. Because a disassembler does not carry semantics for machine instructions, each supported processor requires a module to decode instruction semantics and addressing modes. The X86 and ARM processors are supported. Conditionals and loops are emitted as gotos, there is some simple switch analysis, and some recovery of parameters and returns is implemented.
http://desquirr.sourceforge.net/desquirr.
Analyzing Memory Accesses in x86 Executables, 2004.
Balakrishnan and Reps from the University of Wisconsin have developed a framework for analysing binary programs that they call Codesurfer/x86. The aim is to produce intermediate representations that are similar to those that can be created for a program written in a high-level language. The binary is first disassembled with the IDA Pro disassembler. A plug-in for IDA Pro called Connector, provided by
Grammatech Inc, interfaces to the rest of the tool. Value-set Analysis (VSA) is used to produce an intermediate representation, which is presented in a source code analysis tool called
CodeSurfer (sold separately by Grammatech Inc for C/C++ source code analysis) [
BR04,
RBLT05 ]. Codesurfer/x86 may be
commercially available soon.
R. Falke's Type Analysis for Decompilers, 2004
Raimar Falke, in his Diploma Thesis [
Fal04 ] (German) for the Technical University of Dresden, implemented an adaption of Mycroft's type constraint theory in a decompiler called YaDeC. He extended it to handle arrays. To keep the project manageable, he used objdump for the front end, ignored floating point types, assumed only stack parameters, and so on. An English translation of the paper's summary can be found in
FalkeDiplomaSummary.
Andromeda Decompiler, 2004-2005.
Andrey Shulga wrote a decompiler for Windows x86 and C. The decompiler itself was never released, however a GUI program for manipulating the IR generated by the compiler is available at the author's web site [
Shu04 ]. Only an x86 front end is written at present, and only a C/C++ back end, although the decompiler is claimed to be capable of other front and back ends. The output for the provided demonstration IR is extremely impressive, but it is not clear whether the IR is largely automatically generated by the decompiler, or hand edited. The web page has been inactive since May 2005.
Hex Rays Decompiler Plugin, 2007.
Ilfak Guilfanov, author of the IDA Pro disassembler, released a commercial decompiler plugin for IDA Pro [
Gui07a,
Gui07b ]. This plugin adds a decompiler view to the other views available in the interactive disassembler. One function is shown at a time; most functions decompile in a fraction of a second to a quite C-like output. The author stresses that output is not designed for re-compilation, only for more rapid comprehension of what the function is doing. The output includes compound conditionals (with || and &&), loops (for, while, break, etc), and function parameters and returns. There is also an
API which gives access to the decompiler's IR, allowing custom analyses if desired. At this stage, only the x86 architecture is supported. The decompiler relies on the disassembler (and possibly manual intervention) to separate code from data and to identify functions.
Static Single Assignment for Decompilation, 2007.
Mike Van Emmerik completed this PhD theses at the University of Queensland in October 2007 [
VE07 ]. The main theme is how the SSA form enables various aspects of decompilation to be more readily analysed, although this leads to other topics such as type analysis and the analysis of indirect jumps and calls. An industry case study is discussed. Although considerable progress is made, many problems still remain, particularly related to alias analysis. There is a chapter of results, using the Boomerang open source decompiler. A new algorithm for finding preserved locations in procedures in the presence of recursion is given. There is a comprehensive glossary of terms, history, comparison with Java decompilers, a survey of compiler infrastructures that may be suitable for decompilation, and the problems faced by decompilers are summarised in terms of separation problems (code from data, pointers from constants, and original from offset pointers).
History Of Decompilation 1 (1960-1979)
History Of Decompilation 2 (1980-1999)
References
- JSW00
- A. Johnstone, E. Scott, and T. Womack. Reverse Compilation for Digital Signal Processors: a Working Example. In Proceedings of the Hawaii International Conference on System Sciences. IEEE-CS Press, January 2000.
- KO01
- S. Katumata and A. Ohori. Proof-directed de-compilation of low-level code. In European Symposium on Programming, volume 2028 of Lecture Notes in Computer Science, pages 352-366. Springer-Verlag 2001.
- Myc01
- A. Mycroft. Comparing type-based and proof-directed decompilation. In Proceedings of the Working Conference on Reverse Engineering, pages 362-367, Stuttgart Germany. IEEE-CS Press 2001.
- CWVE01
- C. Cifuences, T. Waddington, and M. Van Emmerik. Computer Security Analysis through Decompilation and High Level Debugging. In Proceedings of the Working Conference on Reverse Engineering, pages 375-380, Stuttgart Germany. IEEE-CS Press 2001.
- Gui01
- I. Guilfanov. A simple type system for program reengineering. In Proceedings of the Working Conference on Reverse Engineering, pages 357-361, Stuttgart Germany. IEEE-CS Press 2001.
- Jan02
- André Janz. Experimente mit einem Decompiler im Hinblick auf die forensische Informatik, 2002. http://agn-www.informatik.uni-hamburg.de/papers/doc/diparb_andre_janz.pdf .
- TC02
- J. Tröger and C. Cifuentes. Analysis of virtual method invocation for binary translation. In Proceedings of the Working Conference on Reverse Engineering, Richmond, Virginia; pages 65-74. IEEE-CS Press, 2002.
- BR04
- G. Balakrishnan and T. Reps. Analyzing Memory Accesses in x86 Executables. In Proceedings of Compiler Construction (LNCS 2985) pages 5-23. Springer-Verlag April 2004.
- Fal04
- R. Falke. Entwicklung eines Typeanalysesystem für einen Decompiler (Development of a type analysis system for a decompiler), 2004. Diploma thesis, German language. http://risimo.net/diplom.ps . No longer available, however archived here (archive.org, 491kB).
- Shu04
- Andrey Shulga. Andromeda Decompiler web page, 2004. http://shulgaaa.at.tut.by .
- RBLT05
- T. Reps, G. Balakrishnan, J. Lim and T. Teilelbaum. A Next-Generation Platform for Analysing Executables. To appear in Proc. 3rd Asian Symposium on Programming Languages and Systems, Springer-Verlag 2005.
- Gui07a
- I. Guilfanov. Blog: Decompilation Gets Real, April 2007. http://hexblog.com/2007/04/decompilation_gets_real.html .
- Gui07b
- I. Guilfanov. Hex-rays home page, 2007. http://www.hex-rays.com .
- VE07
- M. Van Emmerik. Static Single Assignment for Decompilation. PhD thesis, University of Queensland, 2007. http://vanemmerikfamily.com/mike/master.pdf or http://vanemmerikfamily.com/mike/master.ps.gz .