Mark van den Brand, Paul Klint, and Chris Verhoef
Renovation of business-critical software is becoming increasingly important. We identify fundamental notions and techniques to aid in system renovation and sketch some basic techniques: generic language technology to build analysis tools, a knowledge retrieval system to aid in program understanding, and a coordination architecture that is useful to restructure monolithic systems thus enabling their renovation. We argue that these techniques are not only essential for the renovation of old software but that they can also play an important role during the development and maintenance of new software systems.
Categories and Subject Descriptors: D.2.6 [Software Engineering]: Programming Environments--Interactive; D.2.7 [Software Engineering]: Distribution and Maintenance--Restructuring; D.2.m [Software Engineering]: Miscellaneous--Rapid prototyping; D.3.2 [Programming Languages]: Language Classifications--Specialized application languages; E.2 [Data]: Data Storage Representations--Composite structures
Additional Key Words and Phrases: re-engineering, system renovation, intermediate data representation, coordination language, query algebra
There is a constant need for updating and renovating business-critical software systems for many and diverse reasons: business requirements change, technological infrastructure is modernized, the government changes laws, or the third millennium approaches, to mention a few. Consequently, the subjects of program understanding and system renovation are becoming more and more important in software engineering; see, for instance, [BKV96] for an annotated bibliography. The interest in these subjects originates from the difficulties that one encounters when attempting to maintain large, old, software systems. It is not hard to understand that it is very difficult--if not impossible--to renovate such legacy systems.
The purpose of this paper is to identify core technologies for system renovation and to incorporate reverse engineering techniques in forward engineering so that the maintenance problem may become more manageable in the future. We identify the following core technologies for system renovation: generic language technology for building analysis tools, a common data exchange format, a knowledge retrieval system based on a query algebra to aid program understanding, and a coordination architecture for decomposing and restructuring monolithic systems.
We will briefly review some of these technologies in this paper. The presentation is largely determined by our own current research and practice in re-engineering and does not aim at completeness.
First we introduce some basic terminology in re-engineering and generic language technology.
We briefly recall some re-engineering terminology as proposed in [CC90]. The term reverse engineering finds its origins in hardware technology and denotes the process of obtaining the specification of complex hardware systems. Nowadays the meaning of this notion has shifted to software. While forward engineering moves from a high-level design to a low-level implementation, reverse engineering can be seen as the inverse process. It can be characterized as analyzing a software system in order to, first, identify the system components and their interactions and, second, make representations of the system at a different, possibly higher, level of abstraction. Reverse engineering restricts itself to investigating and understanding a system and is also called program understanding. Adaptation of a system is beyond reverse engineering but within the scope of system renovation. The notion of restructuring amounts to transforming a system from one representation to another one at the same level of abstraction. An essential aspect of restructuring is that the semantic behaviour of the original system and the new one remains the same; no modification of the functionality is involved. The purpose of re-engineering, or renovation, is to study the system by making a specification at a higher abstraction level, to add new functionality to this specification, and to develop a completely new system on the basis of the original one.
Another issue we will encounter is to what extent various tools depend on specific programming languages. We classify language (in)dependence into the following categories: language-dependent tools, which are tied to one specific language; language-parameterized tools, which are generic and can be instantiated with (elements of) a language definition; and language-independent tools, which do not depend on the language involved at all.
There are not many scientific papers that actually discuss the re-engineering of large software systems, as we discovered during the compilation of an annotated bibliography [BKV96]. There are probably two reasons for this. First, academic software is easily put aside, because there is in general no real economic necessity to keep it running. Second, re-engineering projects carried out in industry are not frequently published for commercial reasons. This paper is based on our experiences in re-engineering a large academic software system as well as on the cooperation with a large commercial bank and a software house specialized in financial software.
We discuss a knowledge retrieval system based on a query algebra that is language parameterized and interprets queries directly. It supports multidirectional traversal of the syntax trees representing the code of legacy systems. Some related work on the topic of knowledge retrieval systems is [Ode93, PP94a, PP94b, REQ94, DRW96]. In [Ode93] a query mechanism is used for a completely different purpose than ours, namely to define type checkers; the tree traversal mechanism is, however, similar to ours. In [PP94a, PP94b] it is claimed that the described Source Code Algebra is language-independent, but the data base in which the source code is stored must be initialized with language-specific entries. Adding a new language results in adapting the data base. The Source Code Algebra is conceptually language parameterized, but the implementation technique used requires instantiation of the language parameters, thus yielding a language-dependent system. All examples in [PP94b] use only top-down tree traversal. [DRW96] describes generated tools for analysing C/C++ programs for program understanding; these tools support a procedural mechanism to retrieve information from the C/C++ programs, so this approach is not based on a query algebra and is language-dependent. REQL [REQ94] is a source code query language. The syntactic notions in a language are translated to attributes in a data base and into queries to test for relationships and properties. These attributes are hard-wired in the system but can be used for more than one language. The visualization mechanism and the querying mechanism are fully integrated; this is a disadvantage since it prevents the visualization of results obtained by other mechanisms. REQL also supports some procedural functionality to analyze programs. An implementation of the query language REQL is described that permits the connection of parsers for different languages.
We will also discuss a coordination architecture to implement the decomposition of legacy systems. A detailed overview of related work on software architectures and coordination languages can be found in [BK96b].
In Section 2 we discuss a common representation format that is useful to exchange data both between components in a (legacy) system and between tools in a re-engineering environment. In Section 3 we discuss a knowledge retrieval system based on a query algebra which uses the common representation format of Section 2. In Section 4 we discuss a mechanism to decompose a legacy system into smaller components and show how the exchange of data between these components can be established via the common representation format of Section 2. In the final section we put the core technologies for re-engineering in the perspective of forward engineering.
How can re-engineering tools share and exchange information? How can components of a legacy system exchange information during incremental renovation?
We propose a data structure, called the Annotated Term Format (ATF), specially designed for the exchange of data between (possibly) heterogeneous components in a software system. This data format describes terms extended with annotations and is able to accommodate the representation of all possible data that might be exchanged between components. The representation is such that individual components are, so to speak, immune to information that is added and manipulated by other components.
ATF can also be used as internal data structure of the components themselves. It is a powerful format in which it is possible to represent, for instance, parse trees that can be annotated with many and diverse sorts of information, such as textual coordinates, access paths, or the result of program analysis (e.g., data flow analysis).
Below we give an inductive definition of terms in the Annotated Term Format, discuss the annotation mechanism, and give examples of its use. A full formal specification of ATF is given in [BKOV] and is beyond the scope of this paper.
By instantiating C (the constants), N (the numerals), L (the literals), and F (the function symbols) we can obtain various forms of annotated terms (ATerms).
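To make the flavour of ATF concrete, the following Python sketch (our own illustration, not the specification from [BKOV]) models terms as function applications that may carry annotations; all class and method names are ours.

# Minimal sketch (ours) of terms in Annotated Term Format: a term is a function
# application f(t1, ..., tn); constants are applications with zero arguments;
# numerals and literals are plain ints and quoted strings; every term may carry
# annotations, which are themselves terms.
class ATerm:
    def __init__(self, fun, args=(), annotations=()):
        self.fun = fun                        # function symbol or constant
        self.args = list(args)                # subterms: ATerms, ints, strings, lists
        self.annotations = list(annotations)  # annotations are ATerms as well

    def annotate(self, *notes):
        # Annotations are additive; tools simply ignore annotations they do not know.
        return ATerm(self.fun, self.args, self.annotations + list(notes))

    def __repr__(self):
        def show(t):
            if isinstance(t, list):
                return "[" + ",".join(show(x) for x in t) + "]"
            return repr(t) if isinstance(t, ATerm) else str(t)
        text = str(self.fun)
        if self.args:
            text += "(" + ",".join(show(a) for a in self.args) + ")"
        if self.annotations:
            text += "{" + ",".join(show(a) for a in self.annotations) + "}"
        return text

# Example: the literal node l("true") annotated with an (illustrative) colour.
lit_true = ATerm("l", ['"true"'])
print(lit_true.annotate(ATerm("colour", [ATerm("red")])))
# prints: l("true"){colour(red)}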
The use of ATF can be demonstrated by showing how to represent parse trees in ATF, and by annotating them with path and visualization information, respectively. This can be done by instantiating C, N, L, and F. Note that this example will be used in Section 3.2 when we discuss a knowledge retrieval system.
Let C and N be the empty set, and let L be the set of quoted literals (strings). Let the set of function symbols F consist of the following elements: (appl,2), (prod,1), (w,1), and (l,1), where appl applies a production rule to a list of sons, prod carries the text of a production rule, w carries layout (whitespace), and l carries a literal.
The following context-free syntax rules (in SDF [HHKR92]) are necessary to parse the input sentence true and false.
sorts Bool
context-free syntax
  true          -> Bool
  false         -> Bool
  Bool and Bool -> Bool {left}
The parse tree below represents the input sentence true and false in ATF; this parse tree is depicted in Figure 1.
appl(prod("Bool and Bool -> Bool"), [appl(prod("true -> Bool"),[l("true")]), w(" "),l("and"),w(" "), appl(prod("false -> Bool"),[l("false")]) ])
Note that this parse tree is completely self-contained and does not depend on a separate grammar definition.
Figure 1: The syntax tree for true and false.
To demonstrate the annotation mechanism we extend the above definition to create pointers to trees in the form of paths. A path is a list of natural numbers; the first number represents the root, and each subsequent number i selects the i-th son (counting from 0) of the node reached so far. The syntax of paths in ATF can be obtained by extending N with the natural numbers and extending the set of function symbols with the unary function (path,1).
The example parse tree can be annotated with path information thus:
appl(prod("Bool and Bool -> Bool"), [appl(prod("true -> Bool"), [l("true"){path([0,0,0])}]){path([0,0])}, w(" "),l("and"){path([0,1])},w(" "), appl(prod("false -> Bool"), [l("false"){path([0,2,0])}]){path([0,2])} ]){path([0])}
Note that only the application and literal nodes are annotated with path information. The production rules are not annotated since they do not represent nodes in the parse tree for the original input sentence (true and false).
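Path annotations of this kind can be computed mechanically in a single traversal. The following Python sketch (ours; the simplified tree encoding is an assumption, not ATF itself) shows the idea:

# Sketch (ours): annotate every node of a tree with its path from the root.
# A node is a pair (label, children); paths follow the convention of the text:
# [0] is the root, and each further number selects the n-th son (from 0).
def annotate_paths(node, path=(0,)):
    label, children = node
    annotated_children = [
        annotate_paths(child, path + (i,)) for i, child in enumerate(children)
    ]
    # The path annotation is kept next to the label, mirroring {path([...])}.
    return (label, annotated_children, {"path": list(path)})

tree = ("and", [("true", []), ("false", [])])
print(annotate_paths(tree))
# ('and', [('true', [], {'path': [0, 0]}), ('false', [], {'path': [0, 1]})], {'path': [0]})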
Another example is to use annotations to store visual information, such as colours, for components that support them. We extend the set C with the constants red, yellow, and blue and the set F with the unary function (colour,1). By adding the annotation colour(red), a tool for visualization knows that the annotated subterm should be printed in red. Other tools will simply ignore this colour annotation. As an example, we annotate the node l("true") in the example parse tree:
appl(prod("Bool and Bool -> Bool"), [appl(prod("true -> Bool"), [l("true"){path([0,0,0]),colour(red)}]){path([0,0])}, w(" "),l("and"){path([0,1])},w(" "), appl(prod("false -> Bool"), [l("false"){path([0,2,0])}]){path([0,2])} ]){path([0])}
In a similar way we can define font, font-type, and elision annotations. Fonts can be obtained by extending the set C with constants like times or helvetica and the set F with the unary function (font,1). Font types are defined analogously (bold and italic are examples of font types). Elision can be obtained by extending C with the constant ... (the common notation for elision of text) and the set F with the unary function (elision,1). Elision can be used to suppress (large) parts of program texts.
ATF resembles the intermediate tree representations [ASN87, Sno89, Aus90] usually found in compilers and programming environments. It has, however, a number of properties that make it particularly suited for our purpose:
We will see that ATF can be used as a basis for knowledge retrieval (Section 3) and is also suited for exchange of data in heterogeneous, distributed, systems (Section 4.1).
One of the crucial phases in re-engineering a (legacy) system is to obtain, collect, and store relevant knowledge about it. In this phase it is often desirable to first obtain an overall impression of the system and later on to dive into the details of relevant subsystems. The re-engineer can choose to store relevant (inferred) information to be used later on.
Given a specific re-engineering goal (e.g., year 2000 compliance, or language conversion), a re-engineer will have to zoom in and zoom out on the source code in order to increase her understanding and to identify program fragments that are relevant for the re-engineering goal at hand. Tools supporting such an interactive zooming mechanism are important, given the expectation that not all re-engineering tasks can be automated [Cor89] and that human interaction and intelligence will remain necessary [You89].
Obtaining knowledge about a program is not language specific. Therefore, it is desirable that tools to obtain this knowledge are generic, i.e., can be parameterized with a desired language. An additional advantage of this approach is that a single tool is sufficient to deal with legacy systems implemented in more than one language; even within a single legacy system usually more than one language is used, e.g., the combination of COBOL [ANS85] and Job Control Language (JCL) [Flo71].
A natural means for obtaining knowledge about a legacy system is to view it as a special type of data base that can be queried. We will use a parse tree with the possibility of adding annotations (as described in Section 2) as our data base model. Thus, the queries are functions that inspect the parse tree and modify its annotations when appropriate. Often, combinations of extracted information yield important new information; it is therefore natural to be able to combine queries. This implies the existence of basic queries as well as operations to combine them, in other words an algebra of queries.
A query algebra should have at least the following properties:
A query algebra should be language parameterized since obtaining knowledge is not language specific. It can then also effortlessly deal with polylingual legacy systems. In [PP94b] this is identified as an extremely valuable feature.
Zooming in and out on the code is needed in order to locate the code fragments that cause problems or that are related to a specific re-engineering goal.
Random access to tree nodes can be obtained via the focus, an interactive pointing mechanism that allows the re-engineer to locate (sub)trees. A focus divides a tree into regions: the part before the focus, the part after the focus, the local region (the focused subtree itself), and the spine (the nodes on the path from the root to the focus).
Figure 2: Regions in the syntax tree and program text.
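To illustrate how a focus induces these regions, the following Python sketch (ours, based on the path convention of Section 2) classifies a node as local, spine, before, or after by comparing its path with the path of the focus:

# Sketch (ours): classify every node of a tree into a region relative to a focus.
# Paths are tuples of child indices, with a leading 0 for the root.
def classify(node_path, focus_path):
    """Return 'local', 'spine', 'before', or 'after' for node_path w.r.t. focus_path."""
    if node_path[:len(focus_path)] == focus_path:
        return "local"      # the focused subtree itself
    if focus_path[:len(node_path)] == node_path:
        return "spine"      # on the path from the root to the focus
    return "before" if node_path < focus_path else "after"

focus = (0, 2)
for p in [(0,), (0, 0), (0, 1), (0, 2), (0, 2, 0), (0, 3)]:
    print(p, classify(p, focus))
# (0,) spine  (0,0) before  (0,1) before  (0,2) local  (0,2,0) local  (0,3) after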
A query algebra system for re-engineering should be adjustable in the following ways:
The Interactive Query Algebra Tool (IQAT) is a prototype we have developed to experiment with query algebras for system re-engineering to see whether the proposed properties are feasible. The query algebra should, therefore, at least support a multidirectional propagation mechanism of queries over tree regions; it should be language parameterized; and it should be adjustable. We will argue below that the IQAT system satisfies these properties.
The multidirectional propagation mechanism is explicitly available via a number of (basic) queries in IQAT. In Figure 2(a) we have seen that the syntax tree can be divided into a number of regions. An example of a query that uses these regions is trees(Region), which selects all nodes in the region Region, where Region is one of the constants before, after, local, or spine discussed in the previous section. Note that this query has the actual tree that is being queried as an implicit parameter (see below).
Our query algebra supports a language parameterized mechanism. More precisely, instead of parameterizing it with an entire language definition, some queries can be parameterized with elements of a language definition, such as non-terminal names and production rules (in the sequel we call these SyntacticCategories). Each language element in a query is translated to terms in ATF by the query evaluator. The resulting term is used when parse trees of the code of the legacy system are inspected to answer the query. For example, a query parameterized with the syntactic category of COBOL move-statements checks whether the root of a tree is a move-statement in a COBOL program. Another example is contains(Region, SyntacticCategory), which selects all nodes in region Region that satisfy the property SyntacticCategory.
The IQAT system supports a number of (basic) data types, for example booleans, integers, sets, and trees. It supports a number of operations on these types, for example the operations and, or, and not on the booleans and the operations + and - on the integers. We will not discuss these basic types and their operations any further; instead, we will focus on the more sophisticated part of the query algebra: queries and operations on them. All queries work implicitly on a set of so-called tree addresses (SetofTrees), each consisting of a syntax tree and a path to a node in that syntax tree. Therefore, the first argument of all query definitions is shown between brackets to emphasize that it is an implicit argument of the query. Note that most (but not all) queries take an implicit SetofTrees argument and yield a SetofTrees as well.
We divide queries into the following conceptual categories and give some typical examples of each category:
The query locate(Region, SyntacticCategory) selects the nodes with the property SyntacticCategory in the region Region for all trees in SetofTrees.
The query trees(Region) selects all nodes in the region Region of all trees in SetofTrees.
These queries return all nodes that can be reached by going Int steps down, up, left, right, respectively, in the trees in SetofTrees.
The query exists(Region, SyntacticCategory) checks whether one of the trees in SetofTrees contains a node that has the property SyntacticCategory and that is located in the region Region. In fact, exists is very similar to locate, except that a boolean value is returned instead of a SetofTrees.
This query checks whether Tree has property SyntacticCategory.
The query size counts the number of trees in SetofTrees.
Next we discuss a number of composition operations on queries. Note that such operations are generally partial functions; for example, size and exists cannot be sensibly composed. Below we list the most important operators.
Consider, e.g., the functional composition of trees(local) and size. Given an initial (implicit) set of trees S, trees(local) will compute a new set of trees S' consisting of all nodes of all subtrees of S. Next, size will compute the number of subtrees in S'.
An expression containing this operator returns all the nodes obtained by evaluating the left-hand side that satisfy the criterion of the right-hand side. For example, the provided combinator can be used to locate all while statements that contain if statements; the sketch below illustrates such queries and combinators.
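The following Python sketch (our own illustration, not the IQAT implementation) shows the flavour of such queries and combinators; the query names follow the text, but the tree-address encoding and helper functions are assumptions:

# Sketch (ours) of query-algebra style queries and combinators over sets of
# "tree addresses" (root, path). Only the 'local' region is implemented here;
# the tree encoding (label, children-tuple) is assumed.
def node_at(root, path):
    node = root
    for i in path[1:]:                    # the leading 0 addresses the root
        node = node[1][i]
    return node

def local_addresses(root, path):
    """All addresses in the subtree at (root, path), i.e. the 'local' region."""
    stack = [path]
    while stack:
        p = stack.pop()
        yield (root, p)
        stack.extend(p + (i,) for i in range(len(node_at(root, p)[1])))

def trees(region):                        # trees(Region)
    assert region == "local"
    return lambda addrs: {a for (root, p) in addrs for a in local_addresses(root, p)}

def contains(region, category):           # contains(Region, SyntacticCategory)
    return lambda addrs: {a for a in trees(region)(addrs) if node_at(*a)[0] == category}

def exists(region, category):             # exists(Region, SyntacticCategory)
    return lambda addrs: len(contains(region, category)(addrs)) > 0

def size():                                # size
    return lambda addrs: len(addrs)

def compose(q1, q2):                       # functional composition of queries
    return lambda addrs: q2(q1(addrs))

def provided(lhs, condition):              # keep lhs results satisfying condition
    return lambda addrs: {a for a in lhs(addrs) if condition({a})}

# Example fragment: a while statement containing an if statement.
program = ("while", (("if", (("assign", ()),)), ("assign", ())))
start = {(program, (0,))}

print(compose(trees("local"), size())(start))                          # 4 nodes
print(provided(contains("local", "while"), exists("local", "if"))(start))
# the address of the while node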
A prototype of the query evaluator has been developed using ASF+SDF [DHK96]. It takes a set of trees and applies the specified query to them. Before the actual evaluation starts, the query expression is type-checked. The evaluation function takes a query expression and a set of trees and yields the value of the query: a boolean, an integer, or a set of trees.
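We cannot reproduce the actual ASF+SDF specification here, but the following Python sketch (ours; the result types assigned to the queries are assumptions) illustrates the kind of type check that rejects meaningless compositions such as composing size with exists:

# Sketch (ours): a minimal well-formedness check for query compositions,
# based on the result type of each (basic) query.
RESULT_TYPE = {"trees": "SetofTrees", "contains": "SetofTrees",
               "locate": "SetofTrees", "exists": "Bool", "size": "Int"}

def check_composition(chain):
    """Every query except the last must yield a SetofTrees, which becomes the
    implicit input of its successor."""
    for q in chain[:-1]:
        if RESULT_TYPE[q] != "SetofTrees":
            raise TypeError(f"{q} yields {RESULT_TYPE[q]} and cannot be composed further")
    return True

print(check_composition(["trees", "size"]))     # True
try:
    check_composition(["size", "exists"])       # 'size' yields an Int
except TypeError as error:
    print("rejected:", error)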
Based on these experiences an implementation of the IQAT system is being made using the TOOLBUS [BK96a, BK96b], see Section 4.1. Due to the nature of the IQAT system it is crucial that it can be easily modified. This is achieved as follows:
A visualization mechanism is necessary to show the results of queries. We have separated this mechanism from the query algebra, so that it can also be used to view the results of other tools, such as a data flow analysis tool. Although we recognize the importance of graphical views, we currently restrict ourselves to textual ones.
We have shown in Section 2 that parse trees in annotated term format can be annotated with path, colour, font, font-type, and elision information. This information can be used both to give a structured view on legacy code and to visualize the result of query evaluation. These evaluations are either booleans or integers, or sets of trees. We will now only discuss the visualization of trees.
A number of access functions can be defined on annotated parse trees to manipulate the various visual annotations. For example, the colour annotation can be manipulated with functions that set, retrieve, and remove the colour of a node; a sketch of such access functions is given below.
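As an illustration only, the following Python sketch (ours; the node encoding and function names are assumptions, not part of ATF) shows such access functions for the colour annotation:

# Sketch (ours): access functions on the colour annotation of a node; the node
# is encoded as a dict with an 'ann' table of annotations.
def set_colour(node, colour):
    node["ann"]["colour"] = colour
    return node

def get_colour(node):
    return node["ann"].get("colour")

def remove_colour(node):
    node["ann"].pop("colour", None)
    return node

node = {"fun": "l", "args": ["true"], "ann": {"path": [0, 0, 0]}}
set_colour(node, "red")
print(get_colour(node))        # red
# Analogous functions can be defined for font, font-type, and elision annotations.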
The annotations colour, font, and font-type can be used to emphasize parts of the legacy code, whereas elision is useful to suppress parts of the legacy code.
The set of trees provided by queries, such as trees, serves as input for the access functions of the visualization mechanism. The access functions return a parse tree annotated with colour, font, font-type, and elision information. A viewing component traverses the resulting tree and interprets the visual annotations by providing the requested colours, fonts, font-types, and elisions.
We have presented a generic approach for querying source code. Although we are currently concentrating on the functionality of the query system itself rather than already applying it in large case studies, we expect the following benefits from this approach:
How can a re-engineering tool invoke functionality provided by other tools? How can components of a legacy system call each other's services during incremental renovation?
We will first describe in Section 4.1 an implementation of a software bus, called the TOOLBUS [BK96a, BK96b], that facilitates the construction of software systems consisting of independent, cooperating components. In Section 4.2 we focus on the creation and modification of systems by means of the TOOLBUS. In Section 4.3 we concentrate on using the TOOLBUS to re-engineer legacy systems, demonstrated by an example: the re-engineering of the ASF+SDF Meta-Environment [Kli93].
The TOOLBUS [BK96a, BK96b] is a component interconnection architecture resembling a hardware communication bus. To control the possible interactions between the software components connected to the bus, direct inter-component communication is forbidden.
Figure 3: General architecture of the TOOLBUS.
The TOOLBUS serves the purpose of defining the cooperation of a number of components T1, ..., Tm that are to be combined into a complete system, as shown in Figure 3. The internal behaviour or implementation of each component is irrelevant: they may be implemented in different programming languages or be generated from specifications. Components may, or may not, maintain their own internal state. The parallel process in Figure 3 describes the initial behaviour of the interaction of the components and the interaction between the sequential processes P1, ..., Pn. A sequential process can, for example, describe communication between components, communication between processes, creation of new components, or creation of new processes.
To give an idea of the expressive power of the sequential processes we give a simplified BNF definition:

  P ::= send | receive | create | delta | P + P | P . P | P*

We only give the main operators and we have abstracted from the data part of the atomic actions send, receive, and create. The operators +, ., and * stand for choice, sequential composition, and iteration, respectively. The constant process delta stands for the deadlocked process, send and receive are actions for communication, and create is an action for the creation of processes and components. A TOOLBUS process consists of parallel sequential processes P1 || P2 || ... || Pn, where || stands for parallel composition. The operators in TOOLBUS processes stem from the algebra of communicating processes (ACP) [BW90, BV95]. In fact, TOOLBUS processes are expressions over a process algebra that also supports relative and absolute time aspects [BB96]. A detailed discussion of TOOLBUS processes is beyond the scope of this paper and can be found in [BK96a].
The adapters in Figure 3 are small pieces of software needed for each component to adapt it to the common data representation (ATF, see Section 2) and the TOOLBUS process expression. A number of adapters are available via TOOLBUS libraries, but for specific components they have to be implemented separately.
The TOOLBUS provides an architecture to compose systems from (basic) components. Within this architecture there is a mechanism to add and replace components without affecting the rest of the system. The components use a common data format (ATF) to exchange information. The common data format can also be used as internal data structure by the components themselves.
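The following Python sketch is emphatically not the TOOLBUS itself (which is driven by process-algebra scripts and exchanges ATF terms), but it illustrates the architectural idea that components never address each other directly: all traffic passes through the bus, which decides how messages are routed.

# Illustrative sketch (ours) of the bus idea: components register with a bus
# and exchange term-like messages only via the bus, never directly.
class Bus:
    def __init__(self):
        self.components = {}

    def register(self, name, handler):
        """handler(message) -> reply; called only by the bus."""
        self.components[name] = handler

    def send(self, sender, receiver, message):
        # The bus decides whether this interaction is allowed; in the real
        # TOOLBUS this is governed by a process-algebra script.
        print(f"{sender} -> {receiver}: {message}")
        return self.components[receiver](message)

bus = Bus()
bus.register("parser", lambda text: ("parse-tree", text.split()))
bus.register("ui", lambda tree: f"displaying {tree}")

tree = bus.send("ui", "parser", "true and false")
print(bus.send("parser", "ui", tree))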
Systems developed with the TOOLBUS can more easily be modified than classical, monolithic, ones. Reasons for modifying an existing system are deleting obsolete components, replacing components by better ones, extending the system with new functionality, and system maintenance. Modifications of a TOOLBUS-based system can take place at various levels:
If we apply the approach just described in a top-down manner, it can also be used to decompose an existing system into (smaller) components. By analyzing a monolithic system one can identify its logical components and their relationships; these are the arrows in Figure 4(a). Having identified these logical components and relationships, it may be possible to decompose such a monolithic system. If the decomposition is feasible, we can implement these relationships in the software bus, see Figure 4(b), where the arrows in the monolithic system have been moved to the software bus. The monolithic system is connected to the software bus via an adapter that redirects the internal communication to the software bus; we call this communication extraction. Finally, we can decompose the monolithic system by physically decomposing the logical components and connecting them to the software bus, see Figure 4(c); we call this component restructuring. Furthermore, it is possible to reorganize the communication in the software bus; we call this communication restructuring, which is depicted by the reshuffling of the arrows in Figure 4(c). After having identified these components it has to be decided whether the functionality of the components is obsolete or re-usable. Obviously, components with obsolete functionality can be thrown away. For components with re-usable functionality it must be decided whether it is possible to re-use existing code and adapt it, to reimplement the entire component, or to use existing third-party software. We can now use the techniques described in Section 4.2 to renovate the system in an incremental way. A similar migration strategy for legacy systems is elaborately discussed in [BS95].
There is not necessarily a one-to-one correspondence between the processes and the components. The possibility of controlling a complex component by more than one sequential process frees us from the need to physically split such a component. This is an extremely valuable feature in the communication extraction phase of the decomposition of a legacy system: communication between the logical components in the legacy system is transferred to communication between the sequential processes in the TOOLBUS. It is also possible to have more physical components than sequential processes, e.g., components that operate sequentially can be described by a single sequential process; this is a valuable feature in the communication restructuring phase of the decomposition of a legacy system. Even if n = m, there is not necessarily a one-to-one correspondence between processes and components.
Figure 4: General organization of restructuring.
The above strategy is used at CWI/UvA to re-engineer the ASF+SDF Meta-Environment, an interactive programming environment generator developed in the mid eighties.
The re-engineering goals of the ASF+SDF Meta-Environment were manifold; we will enumerate them below:
A first attempt to restructure the ASF+SDF Meta-Environment in order to re-engineer it is described in [KB93]. The main goal of this restructuring was to use third-party software for the user interface and the text editor. Those components were successfully separated from the main system and every part worked stand-alone. However, the interaction between those parts failed, since it led to unexpected deadlocks. These problems were studied and described in [VVW94] and gave rise to the development of a more sophisticated coordination architecture, the TOOLBUS [BK96a, BK96b].
At the time of writing this paper the re-engineering process is still in progress. The relationships between the components are implemented in the TOOLBUS. All components are being reimplemented in parallel by various people.
Interestingly enough, re-engineering techniques can play a useful role during forward engineering as well. Combined with the observation that most systems undergo many forward as well as reverse engineering activities during their lifetime, we argue that a discipline of software engineering should be developed in which forward and reverse engineering are seamlessly integrated. This approach will improve both the cost effectiveness of software production and the quality of the resulting software.
Figure 5: Classical life cycle of a software system: a creation phase, a maintenance phase, and, eventually, a legacy phase (plotted against time).
One can distinguish the following phases in the life cycle of a software system: a creation phase, a maintenance phase, and, eventually, a legacy phase (see Figure 5).
The above approach yields a legacy system that can only be re-engineered at high costs. In order to avoid this we propose an alternative scenario that reduces the maintenance effort in the long run and prevents the creation of a legacy system. We distinguish the following two phases: a creation phase and a maintenance phase (see Figure 6).
Figure 6: Desired life cycle of a software system: a creation phase followed by a maintenance phase (plotted against time).
The core technologies for system renovation that we have briefly reviewed in this paper may thus also contribute to the discipline of software engineering as a whole.
We thank Pieter Olivier for useful comments related to Section 4.1. In addition, we thank all our colleagues in the Resolver project for general discussions on the topic of system renovation.