This is my idea of how I would write a general decompiler:
- Let's assume we start with the assembler output of the disassembler, which already contains the proper labels. My idea would be to read the whole file into memory (no problem on a Linux box, I would say), and group the instructions into blocks based on the labels.
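A minimal sketch of this grouping step (the listing format and label convention are assumptions, not any particular disassembler's output):

```python
def group_into_blocks(lines):
    """Group assembler lines into label-delimited blocks.

    A line ending in ':' is treated as a label; every instruction
    up to the next label belongs to the current block.
    """
    blocks = {}
    current_label = "_start"          # synthetic label for the entry block
    blocks[current_label] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.endswith(":"):        # a label starts a new block
            current_label = line[:-1]
            blocks[current_label] = []
        else:
            blocks[current_label].append(line)
    return blocks

listing = [
    "mov eax, 1",
    "L1:",
    "add eax, ebx",
    "jnz L1",
    "L2:",
    "ret",
]
blocks = group_into_blocks(listing)
```

In a real tool one would also split blocks at jump instructions that fall between labels, but the label-driven grouping above is the starting point.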
- Then I would decode each instruction into more primitive operations, in order to do away with all the different addressing modes. Most instructions only modify one register or memory location.
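A sketch of that lowering, for a hypothetical two-operand syntax: a memory operand is rewritten into an explicit load into a temporary, leaving a register-only operation (the `t0` temporary and the three-address tuples are illustrative assumptions):

```python
def lower(instr):
    """Lower 'op dst, src' so memory operands become explicit loads.

    Returns a list of primitive operations as (op, dst, src) tuples.
    """
    op, operands = instr.split(None, 1)
    dst, src = [s.strip() for s in operands.split(",")]
    prims = []
    if src.startswith("["):                  # memory addressing mode
        addr = src[1:-1]
        prims.append(("load", "t0", addr))   # t0 := mem[addr]
        src = "t0"
    prims.append((op, dst, src))             # register-only operation
    return prims
```

So `add eax, [ebx+4]` becomes a load followed by a plain register add; other addressing modes would get their own rewrite rules in the same style.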
- From this, it should be possible to determine which registers are read and written by each block (assuming that all memory locations refer to C variables). Based on this we can do a flow analysis, in order to find out whether there are any register-mapped variables.
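The per-block read/write sets can be sketched over the primitive (op, dst, src) tuples assumed above; which operands an opcode reads is a simplifying assumption here:

```python
def use_def(block):
    """Compute the registers a block reads before writing (use)
    and the registers it writes (def), over (op, dst, src) tuples."""
    use, defs = set(), set()
    for op, dst, src in block:
        # 'mov'/'load' read only src; arithmetic reads dst and src
        reads = [src] if op in ("mov", "load") else [dst, src]
        for r in reads:
            if not r[0].isdigit() and r not in defs:
                use.add(r)        # read before any local write
        defs.add(dst)
    return use, defs
```

A classic backward liveness analysis over these use/def sets then reveals which registers carry variables across blocks, i.e. the register-mapped variables.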
- Once this has been done, we can determine the expressions that are calculated by the instructions for each store to a variable (either register-based or memory-based). Normalisation of these expressions might be needed.
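One way to rebuild those expressions is forward substitution of temporaries through a block, again over the assumed (op, dst, src) tuples:

```python
def expressions(block):
    """Map each destination to the (nested) expression assigned to it,
    substituting temporaries by the expressions that produced them."""
    env = {}
    for op, dst, src in block:
        s = env.get(src, src)            # substitute known sub-expressions
        if op == "mov":
            env[dst] = s
        elif op == "load":
            env[dst] = ("mem", s)        # explicit memory read
        else:
            d = env.get(dst, dst)
            env[dst] = (op, d, s)
    return env

block = [("load", "t0", "ebx+4"), ("add", "eax", "t0")]
# expressions(block)["eax"] == ("add", "eax", ("mem", "ebx+4"))
```

Normalisation would then canonicalise these trees (constant folding, operand ordering for commutative operators) so that equal expressions compare equal.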
- Furthermore, we have to recognize the high-level statements that resulted in the block structure and the jumps between them (as described in Reverse Compilation Techniques (Cifuentes 1994) (download gzipped PostScript file here)). From this, we can start generating the code, which will only deal with operations on bytes, words, long words, and floats (doubles).
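As a toy illustration of that structuring step (far simpler than the interval analysis in Cifuentes' thesis): given the blocks in layout order, a conditional jump backwards suggests a loop, a jump forwards suggests an if. The graph encoding below is an assumption for the sketch:

```python
def classify_jumps(cfg, order):
    """Classify each block's conditional jump as 'loop' or 'if'.

    cfg maps block -> jump target (or None for fall-through/return);
    order is the layout order of the blocks.
    """
    pos = {b: i for i, b in enumerate(order)}
    kinds = {}
    for b, tgt in cfg.items():
        if tgt is None:
            continue
        # back edge (target at or before the jump) => loop construct
        kinds[b] = "loop" if pos[tgt] <= pos[b] else "if"
    return kinds
```

Real structuring must also handle unstructured jumps, multi-exit loops, and short-circuit conditions, which is exactly what the cited thesis works out.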
For reconstructing the types, we first of all need type resolution techniques for whenever parameters are passed in function calls, or when variables are assigned to each other. The way to do this is to assign a unique type to each variable and parameter, and derive equivalence rules. Problems are caused when type casts are used in the original source. Also enumeration types, and other types that are equivalent to one of the standard types, cannot be recovered.
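The "unique type per variable plus equivalence rules" idea is essentially unification, which a union-find structure captures; a minimal sketch (variable names are hypothetical):

```python
class TypeVars:
    """Union-find over type variables: each program variable starts
    with its own type; assignments and parameter passing merge them."""

    def __init__(self):
        self.parent = {}

    def find(self, v):
        self.parent.setdefault(v, v)
        while self.parent[v] != v:
            self.parent[v] = self.parent[self.parent[v]]  # path halving
            v = self.parent[v]
        return v

    def unify(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[ra] = rb

tv = TypeVars()
tv.unify("x", "y")        # x = y in the recovered code
tv.unify("y", "param0")   # y passed as a function's first parameter
```

After these two rules, `x`, `y`, and `param0` share one type variable, so a concrete type learned for any of them (say, from a library call's known prototype) propagates to all three. A cast in the original source would wrongly merge two classes, which is exactly the problem noted above.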
The detection of array types and pointer types is not hard. Structured types, especially when unions are used, pose extra problems.
CategoryDecompilation