Mark van den Brand, Paul Klint, Chris Verhoef
University of Amsterdam,
Programming Research Group
Kruislaan 403, NL-1098 SJ Amsterdam, The Netherlands
CWI,
Department of Software Technology
P.O. Box 94079, NL-1090 GB Amsterdam, The Netherlands
markvdb@wins.uva.nl, paulk@cwi.nl, x@wins.uva.nl
Generic language technology and compiler construction techniques are a prerequisite for building the analysis and conversion tools needed for the re-engineering of large software systems. We argue that generic language technology is a crucial means for fundamental re-engineering. Furthermore, we argue that the application of compiler construction techniques in re-engineering raises new research questions in the field of compiler construction.
Categories and Subject Description: D.2.6 [Software Engineering]: Programming Environments--Interactive; D.2.7 [Software Engineering]: Distribution and Maintenance--Restructuring; D.3.2 [Programming Languages]: Language Classifications--Specialized application languages;
Additional Key Words and Phrases: re-engineering, reverse engineering, system renovation, intermediate data representation, compiler construction techniques, generic language technology, programming environment generator
In 1977, Matthew Hecht wrote in his book [Hec77] on flow analysis of computer programs: ``Flow analysis can be used to derive information of use to human beings about a computer program''. In fact he was referring to what we nowadays call program understanding or reverse engineering. He further motivated the use of flow analysis by stating that ``some automatic program restructuring may be possible'' and that ``perhaps remodularization could be accommodated'', techniques that are relevant for restructuring and remodularizing legacy systems. So, it hardly comes as a surprise that we will argue here that classical compiler construction techniques are extremely useful in aiding re-engineering.
Re-engineering is becoming more and more important. There is a constant need for updating and renovating business-critical software systems for many and diverse reasons: business requirements change, technological infrastructure is modernized, governments change laws, or the third millennium approaches, to mention a few. As a consequence, the subjects of program understanding and system renovation attract growing attention in the area of software engineering. The interest in these subjects originates from the difficulties that one encounters when attempting to maintain large, old software systems. It is not hard to understand that it is very difficult--if not impossible--to renovate such legacy systems.
The purpose of this paper is to show that a substantial part of the technology used in re-engineering originates from the fields of compiler construction and generic language technology. We want to make researchers in these fields aware of the application of their techniques in re-engineering. Furthermore, we will identify topics for further research that are particularly relevant for re-engineering.
In [BKV96b] generic language technology is used as a core technology for re-engineering. For more information on the subject of re-engineering we refer to the annotated bibliographies [Arn93] and [BKV96a].
The term reverse engineering finds its origins in hardware technology and denotes the process of obtaining the specification of complex hardware systems. By now the meaning of this notion has shifted to software. As far as we know there is no standard definition of reverse engineering yet, but in [CC90] we can read:
``Reverse engineering is the process of analyzing a subject system to identify the system's components and their inter-relationships, and to create representations of the system in another form at higher levels of abstraction.''
According to [CC90] the following six terms characterize system renovation:
Forward engineering moves from a high-level abstraction and design to a low-level implementation. Reverse engineering can be seen as the inverse process: analyzing a software system in order to, firstly, identify the system components and their interactions and to, secondly, make representations of the system at a different, possibly higher, level of abstraction. This can be seen as a form of decompilation; it may even be necessary to move from the assembler level (or from the executables) to a higher level.
Reverse engineering restricts itself to investigating a system; adaptation of a system is beyond reverse engineering but within the scope of system renovation. Redocumentation focuses on making a semantically equivalent description at the same level of abstraction and is in fact a simple form of reverse engineering. Tools for redocumentation include, among others, pretty printers, diagram generators, and cross-reference listing generators. In design recovery, domain knowledge and external information are used to make an equivalent description of a system at a higher level of abstraction, so more information than the source code of the system alone is used. The notion restructuring amounts to transforming a system from one representation to another one at the same level of abstraction. An essential aspect of restructuring is that the semantic behaviour of the original system and the new one should remain the same; no modification of the functionality is involved. Finally, the purpose of re-engineering or renovation is to study the system by making a specification at a higher abstraction level, to add new functionality to this specification, and to develop a completely new system on the basis of the original one using forward engineering techniques.
In this section we will give the reader an impression of the relation between specific programming languages and re-engineering.
Since many business-critical systems that need re-engineering are written in COBOL, quite a number of papers are available that discuss methods and techniques focusing on COBOL. For instance, in [GBW95] the re-engineering of 50,000 lines of COBOL to Ada is described. The goal was to perform this conversion as automatically as possible using compiler construction techniques. Examples of the use of program slicing to aid in re-engineering COBOL can be found in [NEK93] and [CFV93]. In [NM93] Software Refinery, a tool originally developed primarily to generate programming environments, is used to build a modularization tool for COBOL. In [Zuy93] aspects of re-engineering and their relation to language technology are discussed. We mention the compilation of COBOL code to equational specifications, their restructuring and simplification, and the regeneration of COBOL code from them. Moreover, COBOL is compiled to an intermediate language supporting the features of both COBOL and JCL. Various tools support these compilations. In [BFK 94] and [BFKO93] denotational semantics is advocated as a formal foundation for understanding COBOL programs. These ideas are implemented in a tool for the reverse engineering of COBOL-74 programs.
Not only COBOL is subject to reverse engineering. We mention [OC93], in which a tool combining static and dynamic information for analyzing C programs is described. In [CCC93] a method is described to produce design-level documents by static analysis of Ada programs using data flow analysis. Finally, in [Byr91] we can find a method to convert Fortran programs into Ada code. This is done via analysis of the Fortran code followed by a reimplementation of the extracted design in Ada. Needless to say, in the above cases generic language technology and compiler construction techniques play an important role.
Many re-engineering tools use compiler construction techniques. When constructing a compiler, these techniques are used to go from a high-level language to a low-level implementation. When re-engineering a legacy system, these techniques are used to move from a low-level implementation to a more abstract level. In compiler construction terms we could say that re-engineering amounts to the decompilation of source code into its specification.
A number of standard techniques in compiler construction are listed below together with their applications in the field of re-engineering.
The usefulness of scanner generators for re-engineering is quite obvious. For the COBOL language numerous dialects exist, including unofficial ones, and the development of a scanner for each of these dialects from scratch is too expensive. Since the scanners are only slightly different, a generic approach using a scanner generator is appropriate.
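To make this concrete, the following minimal Python sketch mimics a generated, table-driven scanner; the token classes and the dialect-specific keyword sets are hypothetical and not taken from any actual COBOL grammar. The generic scanning engine stays the same, and only the declarative tables vary per dialect.

```python
# Sketch of a table-driven scanner; token classes and keyword sets are hypothetical.
import re

TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("ID",     r"[A-Za-z][A-Za-z0-9-]*"),
    ("PERIOD", r"\."),
    ("SKIP",   r"[ \t]+"),
]

KEYWORDS = {
    "ansi85":   {"MOVE", "TO", "PERFORM", "DISPLAY"},
    "vendor-x": {"MOVE", "TO", "PERFORM", "DISPLAY", "EXHIBIT"},  # unofficial extension
}

MASTER_RE = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))

def scan(source, dialect="ansi85"):
    """Yield (token_class, lexeme) pairs; the keyword set depends on the dialect."""
    keywords = KEYWORDS[dialect]
    for m in MASTER_RE.finditer(source):
        kind, lexeme = m.lastgroup, m.group()
        if kind == "SKIP":
            continue
        if kind == "ID" and lexeme.upper() in keywords:
            kind = "KEYWORD"
        yield kind, lexeme

print(list(scan("MOVE 42 TO TOTAL.")))
# [('KEYWORD', 'MOVE'), ('NUMBER', '42'), ('KEYWORD', 'TO'), ('ID', 'TOTAL'), ('PERIOD', '.')]
```

Supporting an additional dialect then amounts to adding one more keyword table rather than writing a new scanner.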
At the syntactical level, there are many variations in the various COBOL dialects, and it is time-consuming to develop specific parsers for them from scratch. The use of parser generators ensures the correctness of parsers and gives a considerable reduction in implementation effort.
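In the same spirit, the sketch below (again hypothetical, and far simpler than a real parser generator) interprets a small declarative grammar with one generic parsing routine over tokens such as those produced by the scanner sketch above; per dialect only the grammar description would change.

```python
# Sketch of a grammar-driven parser; the grammar fragment is hypothetical.
GRAMMAR = {
    # nonterminal -> alternatives; terminals are (token_class, expected_lexeme_or_None)
    "stmt":    [[("KEYWORD", "MOVE"), "operand", ("KEYWORD", "TO"), ("ID", None), ("PERIOD", None)]],
    "operand": [[("NUMBER", None)], [("ID", None)]],
}

def parse(symbol, tokens, pos=0):
    """Try each alternative of `symbol`; return (parse_tree, new_pos) or None."""
    for alternative in GRAMMAR[symbol]:
        children, p = [], pos
        for item in alternative:
            if isinstance(item, str):                      # nonterminal: recurse
                sub = parse(item, tokens, p)
                if sub is None:
                    break
                tree, p = sub
                children.append(tree)
            else:                                          # terminal: match a token
                kind, lexeme = item
                if p < len(tokens) and tokens[p][0] == kind and \
                        (lexeme is None or tokens[p][1].upper() == lexeme):
                    children.append(tokens[p])
                    p += 1
                else:
                    break
        else:                                              # alternative matched completely
            return (symbol, children), p
    return None

tokens = [("KEYWORD", "MOVE"), ("NUMBER", "42"), ("KEYWORD", "TO"), ("ID", "TOTAL"), ("PERIOD", ".")]
print(parse("stmt", tokens))   # prints the parse tree and the position reached
```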
Type checkers can be defined using syntax-directed translation mechanisms, such as attribute grammars [AM91] or algebraic formalisms [DHK96]. The benefits of these formalisms are the strong relationship between syntax and semantics, and the ease of constructing such specifications. Formalisms that support some form of modularity provide facilities for re-usability in the case of dialects.
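As a small illustration of syntax-directed translation, the following Python sketch synthesizes a type attribute per AST node kind; the AST shape and the typing rules are hypothetical (and stricter than actual COBOL), but they show how dialect variation could be handled by overriding individual rules.

```python
# Sketch of a syntax-directed type checker; node kinds and rules are hypothetical.
SYMBOLS = {"TOTAL": "numeric", "NAME": "alphanumeric"}   # stands in for an inherited environment

def check(node):
    """Return the synthesized type of `node` or raise a type error."""
    kind = node[0]
    if kind == "num":                        # ("num", "42")
        return "numeric"
    if kind == "id":                         # ("id", "TOTAL")
        return SYMBOLS[node[1]]
    if kind == "move":                       # ("move", source, target)
        src, dst = check(node[1]), check(node[2])
        if src != dst:
            raise TypeError(f"cannot MOVE {src} value to {dst} item")
        return dst
    raise ValueError(f"unknown node kind {kind!r}")

print(check(("move", ("num", "42"), ("id", "TOTAL"))))   # numeric
```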
We saw above that there are many applications of compiler construction technology in re-engineering. Several phases in which source code is processed by a compiler can be related to the phases that such code passes through during re-engineering. As is well known (see, e.g., [ASU86] or [WM95]), we can distinguish the following analytic phases in a compiler: the lexical, syntactic, and semantic analysis, where the bulk of the analysis is taken care of. In re-engineering we have exactly the same phases, which are likewise meant for analyzing the source code: the lexical phase for a rough inspection of the code, syntax analysis both for composing a parse tree and for more involved analysis, and semantic analysis for the even more involved analyses we discussed above.
Of course the target of a compiler is to generate code from a source program. In that respect re-engineering and compiler construction differ; however, in [CM96] code generation is used for the binary translation of systems from a (legacy) architecture to a modern architecture. Note that code optimization techniques are used in re-engineering as well. To generate optimal code it is necessary that the structure of a program and the dependencies between the variables in the program are made clear. To make a program understandable for human beings, its structure and dependencies have to be clear as well. So, it is not surprising that compiler optimization techniques such as control flow analysis, data flow analysis, and abstract interpretation are also relevant in re-engineering (see above).
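As an example of such an optimization technique reused for program understanding, the sketch below computes the classic reaching-definitions data flow analysis by fixpoint iteration over a small, hypothetical control flow graph; in a compiler the result drives optimization, while for a maintainer it shows where a variable's value may come from.

```python
# Sketch of reaching-definitions analysis; the CFG and gen/kill sets are hypothetical.
CFG  = {0: [1], 1: [2, 3], 2: [4], 3: [4], 4: []}          # block -> successors
GEN  = {0: {"x@0"}, 1: set(), 2: {"x@2"}, 3: {"y@3"}, 4: set()}
KILL = {0: {"x@2"}, 1: set(), 2: {"x@0"}, 3: set(), 4: set()}

def reaching_definitions(cfg, gen, kill):
    preds = {b: [p for p in cfg if b in cfg[p]] for b in cfg}
    out = {b: set() for b in cfg}
    changed = True
    while changed:                                         # iterate to a fixpoint
        changed = False
        for b in cfg:
            in_b = set().union(*(out[p] for p in preds[b])) if preds[b] else set()
            new_out = gen[b] | (in_b - kill[b])
            if new_out != out[b]:
                out[b], changed = new_out, True
    return out

print(reaching_definitions(CFG, GEN, KILL))
```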
We conclude that re-engineering techniques benefit from compiler construction techniques and that other well-established techniques available in the field of compiler design could be fruitfully applied in re-engineering as well. We believe that these techniques will attract new interest through their application in re-engineering.
To what extent do various re-engineering tools depend on specific programming languages? Many tools are geared towards COBOL (or some of its dialects), but some re-engineering tools claim to be language independent.
We will classify language (in)dependence into the following categories:
Generic language technology developed during the eighties and embodied in programming environment generators such as, for instance, the Synthesizer Generator [RT89], Software Refinery [Rea92], the PSG system [BS86], CENTAUR [BCD 89], and the ASF+SDF Meta-Environment [Kli93] now forms the basis for interactive re-engineering systems. Such systems provide facilities for program analysis, visualization, code restructuring, and automatic language conversions.
Since many legacy systems are polylingual, it is important that re-engineering systems are based on generic language technology. We are confronted with programs or even complete systems written in numerous dialects of old-fashioned programming languages which have to be understood and analyzed. Developing new tools for all these dialects is far too expensive; it can be done more effectively using generic techniques. So, there is a strong connection between re-engineering and the fields of programming environment generators and compiler generators. The generic aspect is thus extremely valuable in re-engineering; see [PP94a].
The def-use relations in programs, for example, are in fact language independent; however, their implementation is often language dependent. In [Moo96] a generic data flow language is defined which is powerful enough to perform all kinds of data flow analysis. Programs in an arbitrary language are translated to this data flow language.
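The following Python sketch illustrates this separation; the fact format is hypothetical and is not the data flow language of [Moo96]. Each language-specific front end emits the same kind of def/use facts, and one generic routine derives the def-use relation from them.

```python
# Sketch of language-independent def-use computation over hypothetical def/use facts.
from collections import defaultdict

# facts produced by some front end: (statement_id, defined_vars, used_vars)
FACTS = [
    (1, {"X"}, set()),        # e.g. COBOL:  MOVE 1 TO X
    (2, {"Y"}, {"X"}),        #              COMPUTE Y = X + 1
    (3, set(), {"X", "Y"}),   #              DISPLAY X Y
]

def def_use(facts):
    """Pair every definition with the later uses it may reach (straight-line code only)."""
    last_def = {}
    pairs = defaultdict(set)
    for stmt, defs, uses in facts:
        for var in uses:
            if var in last_def:
                pairs[last_def[var]].add((var, stmt))
        for var in defs:
            last_def[var] = stmt
    return dict(pairs)

print(def_use(FACTS))
# the definition at statement 1 reaches the uses of X at 2 and 3; statement 2 reaches Y at 3
```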
Various new, generic approaches to program analysis exist. In [BDFH96] an equational logic language, PIM [Fie92], is presented which can serve as an intermediate language representation to solve problems in the field of program slicing, symbolic evaluation, and dependence analysis. It is designed to be used by compilers and analysis tools for imperative languages and can be used for re-engineering purposes as well. Other approaches include static shape analysis, security analysis, and the generation of static analysis algorithms from semantic definitions. An overview of recent work in these areas can be found in [HN96].
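As a small illustration of slicing on an intermediate representation (restricted here to straight-line code; this sketch is not PIM and does not reflect the equational approach of [Fie92]), a backward slice can be computed from def/use facts as follows:

```python
# Sketch of a backward slice over hypothetical def/use facts for straight-line code.
FACTS = [
    (1, {"X"}, set()),        # X := 1
    (2, {"Y"}, set()),        # Y := 2
    (3, {"Z"}, {"X", "Y"}),   # Z := X + Y
    (4, {"W"}, {"Y"}),        # W := Y
]

def backward_slice(facts, variable):
    """Statements that may affect `variable` at the end of the fragment."""
    relevant, in_slice = {variable}, set()
    for stmt, defs, uses in reversed(facts):
        if defs & relevant:
            in_slice.add(stmt)
            relevant = (relevant - defs) | uses
    return sorted(in_slice)

print(backward_slice(FACTS, "Z"))   # [1, 2, 3]: statement 4 (W := Y) does not influence Z
```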
One of the challenges is how to formally define the syntax and semantics of the old-fashioned programming languages one encounters during re-engineering. For example, in [Man96] the syntax of COBOL-85 is formally defined in SDF [HHKR92] using the ANSI definition of COBOL-85 [ANS85]. Are the various language definition formalisms powerful enough to be used for this purpose?
Below we will list a number of research questions generated by applying compiler construction techniques in re-engineering.