---++ ON THE INVERSE OF COMPILING

W.L. Caudle

Sperry-UNIVAC
26 April 1980

---+++ 1. INTRODUCTION

It is fitting that the subject of program conversion aids and portability be a major topic at this conference held in San Diego. For, it was from this city that the first decompiler description was published by the late Dr. Maurice H. Halstead in his book entitled Machine Independent Computer Programming [20]. Though its major topic was the description of the ALGOL dialect called NELIAC, it contained a brief report on the implementation of a decompiler for the UNIVAC M-460 Countess computer. That work was done under his direction at the Navy Electronics Laboratory on Point Loma, where he began his pioneering efforts on portability, conversion aids, and, ultimately, the invariant properties of languages [18].

At that time, high level languages (HLL) were in their infancy. There was a continuing debate on the feasibility of programming any but the most mundane of applications in HLL. The suggestion that a program written in machine language could be translated automatically to NELIAC was generally received as a jest. It is not surprising, then, that the first decompiler got its motivation from a friendly wager that the process was infeasible. There were the usual reasons for its impossibility, such as "there is no way to find out what the programmer had in mind when the program was written." This is, of course, irrelevant, since a computer program must be completely unambiguous for its correct operation. The term decompiler was coined then to identify the inverse of compiling.

In addition to the M-460 decompiler, two other decompilers were implemented with the guidance or consultation of Maury Halstead: the IBM 7094 to UNIVAC 1108 NELIAC decompiler at Lockheed Missiles and Space, and the UNIVAC 494 to 1100 Series Inverse Compiler at Sperry*UNIVAC in Minnesota. This paper will describe some of the techniques, problems, and experiences which are the result of the author's association with the development of these two decompilers and their application within conversion projects.

---+++ 2. THE INVERSE OF COMPILING

The inverse of compiling, or decompiling, is part of the idealized software cycle: canonicalization, compilation, optimization, and decompilation [18]. It is a natural process which is applied each time a program is examined with the objective of abstracting its global properties. Decompilation deals with the processes of raising the language level from that in which a program is written to a level which allows translation to a different language, and possible transport to a different architecture, where the process is reversed by compiling. Ideally, decompilation results in the complete decoupling of data and algorithms, i.e., data declarations and procedures.

More commonly, decompiling applies to the conversion of a machine or assembler program to HLL. However, many of the processes used in decompilation are applicable to the conversion of HLL programs which have not been written in an easily portable manner.

Decompiling is distinguished from other forms of language conversion, such as direct translation, by the use of level raising techniques. These techniques rely extensively upon the same kind of processes used for optimization during compilation. However, optimization is not the goal, for a compiler can accomplish this. Rather, the goal is the representation of programs at a high level, with a minimum volume, to achieve readability, maintainability, and portability.
Data base conversion is a related, and necessary, part of the conversion process [29,39,50,55]. To be effective, a program conversion aid must perform in tandem with a data base conversion aid. While the two problems cannot be solved separately, this paper will concentrate on the process of decompilation and its relation to data conversion, but not on data base conversion itself.

---+++ 3. AN INVERSE COMPILATION MODEL

There are several papers [23,24,26,27,28,30,31] which provide models for the decompilation process, although there are few accounts of actual implementations. The following model was implemented as the UNIVAC 494 to 1100 Series Inverse Compiler at the Roseville Software Development Center in Minnesota. The model consists of the following processes: input processing, basic block generation, local data usage analysis, control and data flow analysis, data mapping, reporting, intermediate form preparation, level transformations, and target language production. The model is distinguished from prior models by the extensive use of analysis techniques to derive data structures, in recognition of the fact that data conversion is of primary importance. A general description of the model follows and includes some of the problems in each of the processes involved. Major portions of the processes are independent of the source architecture, and could be made completely independent by using a machine description language [4] to specify the machine language being decompiled.

---++++ 3.1. INPUT PROCESSING

The minimum input required for this model is the absolute code of a program which executes on the UNIVAC 494. In this case, language statements produced by the decompiler will reference generated names, unless names are present in the absolute in the form of diagnostic tables. If assembly and collection (linkage editor) listings are available in machine readable form, these are used to supplement the information derivable from the absolute program. Names are generated from these sources where applicable. Note that raw source statements are not an input: the information available from the source assembly and collection listings replaces much processing which would be required if raw source were used. Specifically, all of the mappings between labels and the absolute code are obtainable from these listings.

The user interface provides a third form of input. It allows every kind of information derived by the decompiler to be changed by the user, an important facility for any conversion aid. A Subprogram Library, accessible through the user interface, allows the global properties of subprograms to be collected, either by the user prior to processing or by the decompiler during its processing, so that the decompilation of subprograms need only occur once. Through it, the user can provide supplemental information such as calling sequences, type of return, and data usage. This is valuable where unconventional calling sequences are employed. Replacement routines can be defined so that a substitute call is made wherever the original is encountered.

Finally, no changes to the program being converted are allowed either prior to, or during, its decompilation. This is an important requirement, since a change to a program should require revalidation of the program prior to its decompilation.
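The Subprogram Library can be pictured as a keyed table of per-subprogram facts that is consulted whenever a call is encountered. The sketch below is a minimal illustration in Python, not a description of the actual implementation; the record fields and all identifiers (SubprogramEntry, calling_sequence, replacement, and so on) are hypothetical stand-ins for the kinds of information the text says the user or the decompiler can supply.

<verbatim>
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

@dataclass
class SubprogramEntry:
    """One Subprogram Library record (illustrative field names)."""
    name: str                                # entry point name as known in the absolute
    calling_sequence: str                    # "standard" or a note on a nonstandard linkage
    return_type: str                         # how control and results are returned
    data_usage: Dict[str, Tuple[str, int]] = field(default_factory=dict)  # operand -> (type, length)
    replacement: Optional[str] = None        # substitute call to emit in place of the original

library: Dict[str, SubprogramEntry] = {}

def register(entry: SubprogramEntry) -> None:
    """Record an entry, supplied by the user or derived by the decompiler,
    so that each subprogram need only be analyzed once."""
    library[entry.name] = entry

def call_to_emit(name: str) -> str:
    """Return the defined replacement routine, if any, else the original name."""
    entry = library.get(name)
    return entry.replacement if entry and entry.replacement else name
</verbatim>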
---++++ 3.2. BASIC BLOCK GENERATION

This process identifies instructions and data by developing the basic block structure and control flow graph of the program. Each basic block, a sequence of instructions having a single entry and exit, is linked to its predecessors and successors in control flow. Block generation begins at the entry point to the program, which is obtained automatically or as input by the user, and analyzes and simulates control flow variables, iterating until no new links or blocks are found. Since the identification of instruction blocks is performed without knowledge of program input data values, it is usually incomplete and requires some manual assistance. If assembly listings are part of the input, the Disassembly Report provides an aid by flagging program words which are coded as instructions but which are not found within a basic block. While an item is coded as an instruction in assembler language, it need not actually be an instruction, or it may not be reachable by program flow. It is not unusual to find instruction sequences which have been isolated, but not removed, during maintenance operations.

---++++ 3.3. DATA USAGE ANALYSIS

Data usage analysis constructs the local view of operand usage. No assumptions are made; that is, only those instructions which have been found to reside in a basic block are examined to obtain operand usage. The data type, data length, and reference type for each operand are recorded in a local data dictionary. Machine registers are treated in the same way as other instruction operands. A classification is made of instructions by grouping them by their general usage of operands. As an example, an integer add instruction and a floating point add instruction are considered to be identical. This classification greatly simplifies data flow analysis by providing a smaller set of cases to consider. It also provides machine independence to the process with its general treatment.

---++++ 3.4. CONTROL AND DATA FLOW ANALYSIS

Using the derived control flow structure and the local data dictionary, control flow analysis derives the interval and loop structure of the program, and includes the computation of operand status, finally resulting in reach information [1]. A critical part of the conversion of programs is the conversion of data, both internal and external, to a form which can be represented by the target language and which represents the usage implied by the program. Data flow analysis, based on control flow analysis, develops a global data dictionary of operand usage to support the declaration of operands in the target language. That analysis requires the propagation of local type and length information to operands which can reach, and which can be reached by, an operand reference from a basic block. As a simple example of this process, consider the following instruction sequence.

<verbatim>
L   A0,ABC     Load register A0 from ABC
FA  A0,FPT     Floating add FPT to A0
S   A0,DEF     Store A0 in DEF
</verbatim>

The following inferences can be drawn:

   * ABC, FPT, and DEF are floating point operands.
   * If either A0 or DEF is busy on exit from the block containing these instructions, then the floating point type can be propagated to the forward reached blocks.
   * If ABC or FPT is busy on entry, then the floating point type can be propagated to the backward reached blocks which reference them.

This kind of analysis may also be used to determine various dependencies among operands, such as input, output, control flow, and data bounding operands. Length information is also propagated in order to derive implied field definitions which are not explicit.
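The propagation just described can be read as a fixed point computation over the reach links. The following Python sketch is only an illustration of that idea, under assumed data structures: each block is taken to carry a local type table plus busy-on-entry and busy-on-exit sets, and the forward_reached and backward_reached attributes stand in for the reach information computed by control flow analysis. A real implementation would also have to reconcile conflicting types discovered for the same operand.

<verbatim>
def propagate_types(blocks):
    """Spread locally inferred operand types along reach links until stable.

    Assumed (hypothetical) block attributes:
      local_types      - {operand: type} inferred from the block's instructions
      busy_on_entry    - operands whose values are used before being set in the block
      busy_on_exit     - operands whose values are still live when the block is left
      forward_reached  - blocks reachable from this block
      backward_reached - blocks which can reach this block
    """
    changed = True
    while changed:
        changed = False
        for block in blocks:
            for operand, typ in list(block.local_types.items()):
                # Forward: a result that is busy on exit carries its type ahead.
                if operand in block.busy_on_exit:
                    for succ in block.forward_reached:
                        if succ.local_types.get(operand) != typ:
                            succ.local_types[operand] = typ
                            changed = True
                # Backward: a source that is busy on entry types its earlier definitions.
                if operand in block.busy_on_entry:
                    for pred in block.backward_reached:
                        if pred.local_types.get(operand) != typ:
                            pred.local_types[operand] = typ
                            changed = True
    return blocks
</verbatim>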
---++++ 3.5. DATA MAPPING

The local and global data usage of operands are combined and used to derive a data mapping from the source machine to the target machine. This mapping is used as input to target language data declaration production and is a major part of the input required by a data base conversion program. In the UNIVAC 494 to 1100 Series application, the 30 bit word storage and registers are mapped one-to-one into the 36 bit storage. Each different usage of data then requires redefinition of the storage involved. In the case of COBOL output, which has no declarable pointer variables (for based addressing), special program linking methods were employed to achieve the required mappings.

There are obvious problems with such a mapping, particularly where multiple word compact operands are involved. These include double precision or multiple word character operands. The problems caused by such mappings depend on how the operands are referenced and what percentage of code is affected. For the usual program, the percentage is small or is localized. For some programs, such a mapping may render the resulting program useless.

It is during the data mapping process that the results of data flow analysis are indispensable. The concern is the relaxation of the strict 30 bit mapping to a 36 bit mapping so that a minimum of code is generated during recompilation. For this purpose, as an example, the knowledge that an operand is used by, or contains results derived from, only single precision binary integer arithmetic allows the operand to be mapped safely to a 36 bit full word in the target language.

---++++ 3.6. REPORTS

The reporting phase usually concludes the first attempt at decompiling a program. The reports generated by this model provide the user with information to aid completion of the decompilation. A Disassembly Report provides a listing of the program, in assembly code format, showing the control flow structure, operand usage, and reach information. It is a useful report even if the program is converted manually. A report on the data mapping is provided so that any particular data item mapping can be changed, via the user interface, on a subsequent decompilation of the same program. Other reports include a basic block and block link listing, a cross reference, and program statistics.

After analyzing these reports, providing additional information to ensure that all basic blocks and links have been found, and verifying data mappings, the complete process is repeated from the beginning; two or three iterations are usually required. If no additional input is necessary, then the decompilation is completed by specifying that the target language is to be produced.

---++++ 3.7. INTERMEDIATE FORM PREPARATION

Each instruction in a basic block is translated to an intermediate form represented by an N-tuple. Twenty-five intermediate forms are provided to represent all basic statement types, including labels, transfer of control, and subroutine invocation. The N-tuples reference the Data Dictionary and Data Map and provide a convenient machine independent data structure for level transformations prior to generating the target language statements.
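The paper does not spell out the shape of these N-tuples, so the Python sketch below is only a guess at a plausible layout: one record per intermediate form, carrying the statement type, the result operand and source operands as keys into the data dictionary, and the owning basic block. The field names, the operator name ADD, and the example values are assumptions for illustration. Note that, because integer and floating point operations were classified together during data usage analysis, the operand types can live in the data dictionary rather than in the tuple itself.

<verbatim>
from typing import NamedTuple, Tuple

class NTuple(NamedTuple):
    """One intermediate form (illustrative layout, not the implemented one)."""
    op: str                    # statement type, e.g. ADD, LABEL, GOTO, CALL
    result: str                # data dictionary key of the result operand, if any
    operands: Tuple[str, ...]  # data dictionary keys of the source operands
    block: int                 # owning basic block, used by level transformations

# The load/floating add/store sequence of section 3.4 might reduce to a single tuple;
# the data dictionary entries for ABC, FPT, and DEF record that the operation is floating point.
example = NTuple(op="ADD", result="DEF", operands=("ABC", "FPT"), block=17)
</verbatim>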
---++++ 3.8. LEVEL TRANSFORMATIONS

Level transformations are another objective of the analyses performed by the decompiler and result in more compact, easily understandable, and maintainable source language. Some of these, being of the form of classical optimizations, could be provided by an optimizing compiler during recompilation. However, this would not provide the natural, readable expression of the statements which is the desired result.

Redundant operand references are removed. The majority of such redundant references arise from operands which represent machine registers.

Subprogram calls are introduced at this point. The detection of subprogram calls and the identification of actual and formal parameters is, in general, an unsolved problem. However, the variety of calling sequences found in assembler programs is limited, due to standardization, and this minimizes the effects of having an incomplete general solution.

Conditional statements are transformed to minimize the number of transfers needed to represent the conditions.

Translation of idioms must be performed to free the target program from nonsensical expressions. An idiom is an instruction or group of instructions which has an effect contrary to the usual or literal meaning. Shifting to perform multiplication or division is a simple idiom. As a more complex example, the particular conversion of integer to floating point illustrated below is an idiom.

<verbatim>
L     A0,INT              Load A0 from integer INT
OR    A0,(0233000000000)  OR in the characteristic
FA,U  A1,0                Floating add to normalize
S     A1,REAL             REAL is the floating result
</verbatim>

A direct translation of this sequence would generally produce an incorrect result, since its correctness is dependent on mixed mode arithmetic and the order of the instructions produced on recompilation.

---++++ 3.9. TARGET LANGUAGE PRODUCTION

The final step in this model is the production of the target language statements to represent the program. No analysis is required in this process; all statements are produced directly from the intermediate forms using the data mapping table, the data dictionary, and the Subprogram Library. In the implementation of this model, two target languages were provided: ASCII COBOL and PLUS, a Sperry*UNIVAC systems programming language. COBOL was added as the second language and required only that the data mapping and target language productions be modified.

---+++ 4. INVERSE COMPILATION - MAJOR TECHNICAL PROBLEMS

The major technical problems in inverse compilation result primarily from the incomplete specification of the input program by its code. That is, while the code is an unambiguous statement of an algorithm, the algorithm may also be highly data dependent in ways which are difficult or impossible to derive from the code. Indeed, the control flow of the program and the references it makes to data may not be identifiable without the information which resides in possibly hundreds of data input files.

It is easy to construct programs which are impossible to decompile with complete success, either automatically or manually, from the program's code. It is also easy to construct programs which do decompile with complete success. What is the difference, and why aren't all programs written so that the latter is achievable? Most likely, the lack of portability objectives is the culprit. The two examples below are simple demonstrations of intractable control structures.

Example - Incomplete Subprogram Knowledge
<verbatim>
      I = RANDOM (expression)
      GO TO I
</verbatim>

Example - External Data Dependency
<verbatim>
      READ (5, 50) I
      GO TO I
</verbatim>

The first example above assumes that a function RANDOM computes an integer label, of bounded value known by the programmer, which is used to select the execution of the code at the label I. Unless the decompiler can analyze the subroutine RANDOM, a complete control structure cannot be determined. This example also demonstrates that decompilation must be performed on complete program units, unless information is provided for the missing units (e.g., through the Subprogram Library facility of the model). In the second example, the control structure is determined from input to the program, and cannot be determined by any amount of analysis which does not consider data input. In either of the examples, a data reference using the value of I as an index, rather than to effect a control transfer, will result in an indeterminate data structure.

These examples might lead a decompiler designer to consider the dynamic analysis of programs together with appropriate sets of input data. However, such analysis is expensive and guarantees failure, because the appropriate data sets cannot be determined. It is simpler to allow the user to supply the missing links directly. Either of the examples above can be rewritten to provide the complete specification of control paths by using conditional statements.
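The difficulty which both examples create for basic block generation can be made concrete with a small sketch. The following Python fragment is purely illustrative and assumes a hypothetical program representation (a map from addresses to instruction objects carrying kind and target fields); it follows control flow from the entry point, but a computed transfer such as GO TO I has no statically known target, so the walk stops there and the missing links must be supplied by the user, as discussed in the next section.

<verbatim>
def discover_blocks(program, entry, user_links=()):
    """Collect basic block leaders reachable from `entry`.

    `program` maps an address to an instruction object with hypothetical
    fields `kind` ("jump", "branch", "computed_jump", or "other") and `target`.
    Targets of computed transfers cannot be followed; they are returned as
    unresolved so that the user may supply the missing links.
    """
    pending = [entry, *user_links]
    leaders, unresolved = set(), []
    while pending:
        addr = pending.pop()
        if addr in leaders or addr not in program:
            continue
        leaders.add(addr)
        while addr in program:                  # walk forward through the block
            insn = program[addr]
            if insn.kind == "jump":             # unconditional: follow the target
                pending.append(insn.target)
                break
            if insn.kind == "computed_jump":    # e.g. GO TO I: target is data dependent
                unresolved.append(addr)
                break
            if insn.kind == "branch":           # conditional: both paths continue
                pending.append(insn.target)
            addr += 1
    return leaders, unresolved
</verbatim>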
---++++ 4.1. CONTROL FLOW DETERMINATION

The examples above demonstrate some of the problems with basic block generation. In the application of the block generation program at Sperry*UNIVAC, about 95% of all blocks in the average program were automatically identified, with user input required to specify the missing links, entries, and exits. It is possible, of course, that a single missing link can hide 99% of the blocks on an initial run with the decompiler. While the process may find all blocks, it may not find all links, and this may result in incorrect level transformations.

---++++ 4.2. CONTROL AND DATA FLOW ANALYSIS

Interval analysis techniques require either that the control flow graph be reducible, or that it be made reducible by node splitting [1]. Techniques for node splitting are available, but none provides an optimal solution. That is, the number of nodes introduced may not be the minimum possible and may incur a performance penalty on analysis using the interval structure. Programs written in assembler language are more likely than HLL programs to contain flow structures which are irreducible, because the programmer has complete freedom to transfer into the middle of loops. Thus, for assembler language programs, irreducibility is a problem.

Data flow analysis and level transformations require a complete and correct program flow structure. Bounds on data indices must be derivable so that data structures can be identified and translated to a higher level representation.
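Reducibility can be tested with the classical T1/T2 collapsing transformations: repeatedly delete self loops and absorb any node that has a single predecessor into that predecessor; the graph is reducible exactly when it collapses to the entry node alone. The Python sketch below illustrates this test on a successor-set representation of the flow graph; it is an aid to the discussion, not part of the Inverse Compiler, and all identifiers are hypothetical.

<verbatim>
def is_reducible(successors, entry):
    """T1/T2 reducibility test.

    `successors` maps each node to the set of its successor nodes; every node
    appears as a key. Returns True when the graph collapses to a single node.
    """
    succ = {n: set(s) for n, s in successors.items()}
    changed = True
    while changed and len(succ) > 1:
        changed = False
        for n in succ:                      # T1: remove self loops
            if n in succ[n]:
                succ[n].discard(n)
                changed = True
        preds = {n: {m for m in succ if n in succ[m]} for n in succ}
        for n, p in preds.items():          # T2: absorb a single-predecessor node
            if n != entry and len(p) == 1:
                (pred,) = p
                succ[pred] |= succ.pop(n)   # the absorbed node's successors move up
                succ[pred].discard(n)
                changed = True
                break                       # predecessor sets are now stale; recompute
    return len(succ) == 1

# A transfer into the middle of a loop produces the classic irreducible pattern:
irreducible = {"a": {"b", "c"}, "b": {"c"}, "c": {"b"}}
assert not is_reducible(irreducible, "a")
</verbatim>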
---++++ 4.3. ALGORITHM SYNTHESIS

The most complex of the problems remaining to be solved is that of decoupling data structures and algorithms and representing these in a form which allows an algorithm referencing different data structures to be synthesized. That is, if data structures are represented at a higher level, then the algorithms which reference them must also be translated to a higher level. The simplest case of decoupling involves, for example, a load, add, and store sequence. The problem here is to ensure that mapping a word reference into storage of possibly another size has no effect on the result of the sequence.

Such transformations require complete identification of the use of each data structure, not only by a single program, but also by all programs which may reference it. This implies that decompilation, or conversion of any sort, cannot be done by considering only single programs. Rather, the properties of a system of programs must be examined globally to achieve the conversion. Just as optimization is made ineffective by applying it to small units of code, decompilation is ineffective on single routines or, in some cases, single programs. The solution to global examination of systems of programs may become possible with the availability of such software support systems as RDL/RDP [25], which may be employed to manage the global information.

---++++ 4.4. CHOICE OF A TARGET LANGUAGE

Given the current state of the art in decompilation techniques, the choice of a target language is critical. The language must be rich in data declaration facilities, must allow the same storage to be described with declarations of differing types, and must allow precise mapping into the storage of the target system. The language must be capable of high level expressions and must allow low level statements where required. Low level statements are required to express instructions which have no direct high level counterpart. Where low level language statements are not available, assembler interfaces must be written.

---+++ 5. EVALUATION OF DECOMPILATION AS A CONVERSION AID

---++++ 5.1. FIGURE OF MERIT

The evaluation of decompilation involves measuring its figure of merit, that is, the cost in resources removed by converting a program with the decompiler, expressed as a ratio to the cost of converting the program manually. Since such an evaluation is not usually practical, the evaluation must deal with averages. While this may lead to dangerous extrapolations in particular instances, it is nevertheless useful.

Based on results of studies and on the conversion of production programs performed at Sperry*UNIVAC and customer sites, the rate of conversion using the decompiler is projected to be between 25 and 40 lines of assembler code per programmer hour, or 150 to 320 lines per day. These figures are based on expert usage of the tool and assume that the person performing the conversion has comprehensive knowledge of the source and target systems and languages. They also assume a high degree of motivation in the use of the decompiler. As a comparison, figures provided by the Federal COBOL Compiler Testing Service [44] specify a potential manual conversion rate of 20 to 100 lines per man day for an assembler to COBOL/FORTRAN manual conversion. This indicates a figure of merit of 67 to 87 percent for the UNIVAC 494 to 1100 Series Inverse Compiler.

On a conversion project using the IBM 7094 to NELIAC decompiler, rates as high as 250 lines per man day were attained. This particular high rate was achieved on a 24,000 line assembler program which had no documentation, and which performed primarily data decoding and reformatting. The high rate is partially attributable to the similarities of the IBM 7094 and the UNIVAC 1108.
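One plausible reading of how the 67 to 87 percent range arises, offered here only as an assumption since the paper does not show the arithmetic, is that cost is taken as inversely proportional to the conversion rate, so the figure of merit is one minus the ratio of the manual rate to the decompiler rate. Pairing the low ends and the high ends of the quoted ranges then gives approximately the published figures:

<verbatim>
def figure_of_merit(manual_rate, tool_rate):
    """Fraction of the manual conversion cost removed by the tool, assuming
    cost per line is inversely proportional to the lines-per-day rate."""
    return 1.0 - manual_rate / tool_rate

print(figure_of_merit(20, 150))    # ~0.87: slow manual rate against slow tool rate
print(figure_of_merit(100, 320))   # ~0.69: fast manual rate against fast tool rate
</verbatim>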
---++++ 5.2. COST

Over the last twenty years, there have been many attempts, and some successes, at developing usable conversion aids. There probably have been many more conversion aids developed and discarded than there have been conversion projects completed. It is relatively simple to develop specific one-shot conversion aids which attack a few common conversion problems. It is very difficult to develop a general conversion aid which attacks all or most of the conversion problems. There is, in fact, a point beyond which generality does not make good economic sense. That point depends on the amount of conversion to be done, the cost to increase generality, and the increase in the figure of merit of the conversion aid [21].

The cost of a decompiler with a high figure of merit is similar to the cost of a modern HLL compiler. Such a cost cannot be amortized unless the decompiler is used extensively, as in the case of a compiler. Because of the high cost, and of users' fear of conversion, it is unlikely that a decompiler will be constructed other than at the direction of a computer vendor, and then only if there is no alternative to migration. However, the techniques involved in decompilation are widely applicable in other areas of software development, and with the correct development strategies, a decompiler may become a byproduct of other tools.

---++++ 5.3. WHAT? ME USE A CONVERSION AID?

In practice, a conversion aid may turn out, instead, to be a hindrance. This can happen even if the aid has a high figure of merit. The individuals using the aid may be motivated against its use because they have a need to show their creativity. While creativity is commendable, it causes problems in conversion. This is a problem in the life of a conversion project, regardless of the tools being used. Errors are introduced by rewriting code that could have been translated automatically. Problems are created by the refusal to admit that the output of a good conversion aid, while not always esthetically pleasing, is a giant step ahead of a rewrite from specifications, if any exist.

---+++ 6. APPLICATION OF TECHNIQUES TO OTHER AREAS

Decompilation, when viewed as a method to raise the level of a language, is applicable to areas of software development other than conversion. The language being decompiled may be an HLL, a common intermediate compiler language, or a design language. The result of such decompilation may be another programming language, a design language, or simply a tabulation of the properties of the language statements being decompiled.

---++++ 6.1. CAPTURE OF EXISTING DESIGN

The use of current design, analysis, and documentation tools such as PSL/PSA of the ISDOS Project [51], or RDL/RDP at Sperry*UNIVAC [25], requires a large initial investment to capture the properties of code which already exists. As with code conversion, the abstraction of design information from existing code involves, at least in principle, decompilation of the programs. While much of the information can be derived quite easily, the abstraction of useful data structure definitions is a nontrivial task, and involves global analysis techniques such as those found in the decompilation model described in this paper.

---++++ 6.2. DESIGN/CODE VERIFICATION

In order to put software development on a solid foundation, the verification that a particular implementation agrees with its corresponding design must become possible by some automatic or semiautomatic method. Unless the refinement of design to increasingly lower levels is done automatically, it is necessary to verify that the lower level description is consistent with the higher. If the design is recorded using a design language such as RDL, then decompilation of the code or intermediate level design statements can be used to check data relationships and assertions against the design.

%INCLUDE{"OnInverseOfCompiling2"}%