DEVICE INTEROPERATION - BRITISH TELECOMM

Title:

DEVICE INTEROPERATION

Document Type and Number:

WIPO Patent Application WO/2008/119975

Kind Code:

Abstract:

A device (50) for interoperating with a plurality of remote devices (20, 22, 24, 26) comprises means for causing a program (50) specifying a number of distinct operations to be carried out by selecting one or more remote devices (20, 22, 24, 26) to carry out remote programs offered by the remote devices, the first device including means for receiving information about the nature of and availability of the remote programs offered by the remote devices accessible to the first device, means for automatically analysing the computer program together with the received information about remote programs to select and request the services of a plurality of remote programs and auto-recovery means for, in the event that a selected remote program fails, identifying plural possible recovery strategies and automatically analysing these in order to determine a reliability value in respect of each possible recovery strategy and selecting the recovery strategy having the highest determined reliability value.

Inventors:

THOMPSON SIMON GILES (GB)
NGUYEN DUONG (GB)
LI YANG (GB)

Application Number:

PCT/GB2008/001122

Publication Date:

October 09, 2008

Filing Date:

March 31, 2008

Export Citation:

Click for automatic bibliography generation Help

Assignee:

BRITISH TELECOMM (GB)
GHARIB HAMID (GB)
THOMPSON SIMON GILES (GB)
NGUYEN DUONG (GB)
LI YANG (GB)

International Classes:

H04L29/08

Other References:

BENATALLAH B ET AL: "QoS-Aware Middleware for Web Services Composition" IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, IEEE SERVICE CENTER, LOS ALAMITOS, CA, US, vol. 30, no. 5, 1 May 2004 (2004-05-01), pages 311-327, XP011111883 ISSN: 0098-5589 -& ZENG ET. AL: "Quality Driven Web Services Composition" WWW2003, BUDAPEST, HUNGARY, [Online] 24 May 2003 (2003-05-24), pages 1-11, XP007909004 Retrieved from the Internet: URL:http://eprints.qut.edu.au/1778/1/p358-zeng.pdf> [retrieved on 2009-06-29] -& JUSTIN O SULLIVAN ET AL: "What's in a Service? Towards Accurate Description of Non-Functional Service Properties" INTERNET CITATION, [Online] 1 January 2002 (2002-01-01), pages 117-133, XP007908998 Retrieved from the Internet: URL:http://www.springerlink.com/content/t17847156u838120/fulltext.pdf> [retrieved on 2009-06-25]
VINCENZO GRASSI AND SIMONE PATELLA: "Reliability Prediction for Service-Oriented Computing Environments" IEEE INTERNET COMPUTING, IEEE SERVICE CENTER, NEW YORK, NY, US, 1 May 2006 (2006-05-01), pages 43-49, XP007908983 ISSN: 1089-7801
JAEGER M C ET AL: "Improving the QoS of WS Compositions Based on Redundant Services" NEXT GENERATION WEB SERVICES PRACTICES, 2005. NWESP 2005. INTERNATIONA L CONFERENCE ON SEOUL, KOREA 22-26 AUG. 2005, PISCATAWAY, NJ, USA,IEEE, 22 August 2005 (2005-08-22), pages 189-194, XP010892566 ISBN: 978-0-7695-2452-8 -& JAEGER M C ET AL: "QoS aggregation for web service composition using workflow patterns" ENTERPRISE DISTRIBUTED OBJECT COMPUTING CONFERENCE, 2004. EDOC 2004. P ROCEEDINGS. EIGHTH IEEE INTERNATIONAL MONTEREY, CA, USA 20-24 SEPT. 2004, PISCATAWAY, NJ, USA,IEEE, 20 September 2004 (2004-09-20), pages 149-159, XP010730220 ISBN: 978-0-7695-2214-2

Attorney, Agent or Firm:

WILLIAMSON, Simeon, Paul (PPC5A BT Centre,81 Newgate Street,London, Greater London EC1A 7AJ, GB)

Download PDF:

View/Download PDF PDF Help

Claims:

Claims

1. A method of interoperating between devices in which a first device operates to perform a computer program specifying a number of distinct operations to be carried out at least some of which are performed by remote programs carried out by remote devices to which the first device has access, the first device receiving information about the nature of and availability of the remote programs offered by the remote devices accessible to the first device, automatically analysing the computer program together with the received information about remote programs to identify plural possible strategies and automatically analysing these in order to determine a reliability value in respect of each possible strategy; and selecting and requesting the services of a remote device or devices based on a strategy selected in dependence upon its determined reliability value.

2. A method according to claim 1 wherein, in the event that a selected remote program fails, the method further comprises identifying plural possible recovery strategies and automatically analysing these in order to determine a reliability value in respect of each possible recovery strategy and selecting and requesting the services of a remote device or devices based on a recovery strategy selected in dependence upon its determined reliability value.

3. A method according to either preceding claims wherein the reliability value factors in the number of different ways that an individual remote program or a component thereof can be achieved using different remote programs.

4. A method of interoperating between devices, the method comprising the steps of: running a computer program for use on a first device, the program specifying a number of distinct operations to be carried out, the first device receiving information about the nature of and availability of remote programs offered by remote devices accessible to the first device, automatically analysing the computer program together with the received information about remote programs to select and request the services of a plurality of remote programs and, in the event that a selected remote program fails, identifying plural possible recovery strategies and automatically analysing these in order to determine a reliability value in respect of each possible recovery strategy and selecting the recovery strategy having the highest determined reliability value.

5. A device for interoperating with a plurality of remote devices, the device comprising means for causing a program specifying a number of distinct operations to be carried out by selecting one or more remote devices to carry out remote programs offered by the remote devices, the first device including means for receiving information about the nature of and availability of the remote programs offered by the remote devices accessible to the first device, means for automatically analysing the computer program together with the received information about remote programs to identify plural possible strategies and

means for automatically analysing these in order to determine a reliability value in respect of each possible strategy; and means for selecting and requesting the services of a remote device or devices based on a strategy selected in dependence upon its determined reliability value.

6. A device for interoperating with a plurality of remote devices, the device comprising means for causing a program specifying a number of distinct operations to be carried out by selecting one or more remote devices to carry out remote programs offered by the remote devices, the first device including means for receiving information about the nature of and availability of the remote programs offered by the remote devices accessible to the first device, means for automatically analysing the computer program together with the received information about remote programs to select and request the services of a plurality of remote programs and auto-recovery means for, in the event that a selected remote program fails, identifying plural possible recovery strategies and automatically analysing these in order to determine a reliability value in respect of each possible recovery strategy and selecting the recovery strategy having the highest determined reliability value. 7. A computer program or programs for carrying out the method of any one of claims 1 to 4 during execution.

8. Carrier means carrying the computer program or programs of claim 7.

Description:

Device lnteroperation

Field of the Invention

The present invention relates to device interoperation, and in particular to the interoperation of devices connected via a telecommunications network.

Background to the Invention

Increasingly, the new generation of distributed software systems, in particular distributed business systems, are being constructed using Web Services. A Web Service defines a set of Web Service operations (WS-operations) that may be called upon by the client software systems.

Each Web Service is realised through one or more instances of that Web Service (WS-instance). Each WS-instance is an independent, stand-alone software system physically residing somewhere on an organisation's private Intranet or the global, public Internet. Each WS-instance can operate independently from other WS-instances of the same Web Service or WS-instances of other Web Services (i.e. is loosely-coupled to others), and is highly inter-operable.

Given the fact that WS-instances run as independent software systems, it is possible that multiple WS-instances of a Web Service may be provided by one or more independent service-providers. These providers may, for example, be companies: (i) competing in an open, competitive market for provision of that Web Service, (ii) offering the service freely on the Internet. It is also likely that different Web Services might have different number of their WS-instances available for use at any point in time. This might be due to various factors, including: (i) market conditions (e.g. popularity of certain Web Services might encourage service-providers to bring more WS-instances to the market), (ii) criticality of a specific Web Service might require employing more WS ^: instances of a Web Service compared to others.

Given the possibility of multiple, independent WS-instances of a Web Service, it is also possible that the number of those WS-instances might change in real time during the execution of a software system using that Web Service. The change might occur due to commercial or technical reasons. Examples of the former include: (i) forced withdrawal of the WS-instances offered by a service-provider due to e.g. bankruptcy, (ii) emergence of new WS-instances or withdrawal of existing ones due to fluctuations in market demand for a Web Service. An example of the latter category is the failure of hardware/software platforms used by the WS-instance.

Although, by definition, each WS-instance is capable of executing all operations of the Web Service of which it is an instance, the WS-instance may have different

probabilities for successfully executing different operations. This could, for example, be due to the different complexity of those operations.

One consequence of the above feature, together with the fact that there may be multiple, independent WS-instances of a Web Service, is that each WS-instance normally may have a different probability (compared to other WS-instances) to successfully execute each specific operation of that Web Service. The difference between success probabilities of different WS-instances can be due to technical reasons such as: (i) the type and age of employed hardware platforms, (ii) the maturity and robustness of the software platforms, (iii) speed and reliability of communications links providing access to the Web Service.

Finally, in addition to each WS-instance having a distinct probability to successfully execute each specific WS-operation, it is also possible that this probability might also change temporarily in real time or permanently over time. For example, the probability might temporarily go up or down at peak (and off- peak) demand periods; or goes down permanently as the employed hardware gets older and consequently, becomes less reliable. The combination of these aspects, namely,

(i) multiple WS-instances per Web Service,

(ii) dynamically varying number of WS-instances per Web Service,

(iii) distinct probability of a specific WS-instance to successfully execute a specific WS-operation, (iv) different WS-instances of a Web Service might have different probabilities to successfully (and independently) execute a specific operation of that Web Service, (v) dynamically changing success probabilities for each (WS-instance,

WS-operation) pair. brings about a complex, dynamic execution environment for the software systems composed of Web Services. In such environments the reliability of the individual WS-instances and their client software systems can change dynamically as a result of change in any of the above aspects. [Chen, Huang, 1992] This paper presents a unified algorithm to efficiently compute the "Distributed Program Reliability (DPR)" and "Distributed System Reliability (DSR)" of a distributed processing system (DPS). The components of such a DPS include:

• a set of processing elements (e.g. computers), • a set of programs available at each processing element,

• a set of data files available at each processing element,

• the set of data files (from the above set) needed to execute each program,

• the communication links between processing elements, and their states (up or down)

The paper explains that one approach to formulate the DPR and DSR is to generate all disjoint File Spanning Trees (FSTs) in the DPS graph such that the DPR and DSR can be expressed by the probability that at least one of these FSTs is working. Based on this approach, the algorithm presented in this paper efficiently generates disjoint FSTs by cutting different links and then compute the DPR and DSR using a simple and consistent union operation on the probability space of the FSTs.

[Genama, Rosenblum, Uchitel, 2005] This paper presents an approach for predicting the reliability of component-based software. The approach involves extending a scenario specification to model (1) the probability of component failure, and (2) scenario transition probabilities derived from an operational profile of the system.

From the extended scenario specification, probabilistic behaviour models are synthesized for each component and are then composed in parallel into a model for the system. Finally, a user-oriented reliability model described by [Cheung, 1980] is used to compute a reliability prediction from the system behaviour model.

[Kusiak, Zakarian, 1996, IEEE] This paper deals with the reliability analysis of process models defined using the Integration DEFinition Methodology (IDEF3). The paper extends the system reliability evaluation techniques, i.e. the system reduction approach and minimum path and cut sets methods for reliability evaluation of IDEF3 models. It also proposes an algorithm for computing reliability of an IDEF3 model from a path set-activity incidence matrix.

Summary of the Invention

According to a first aspect of the present invention, there is provided a method of interoperating between devices in which a first device operates to perform a computer program specifying a number of distinct operations to be carried out at least some of which are performed by remote programs carried out by remote devices to which the first device has access, the first device receiving information about the nature of and availability of the remote programs offered by the remote devices accessible to the first device, automatically analysing the computer program together with the received information about remote programs to identify plural possible strategies and automatically analysing these in order to determine a reliability value in respect of each possible strategy; and selecting and requesting the services of a remote device or devices based on a strategy selected in dependence upon its determined reliability value.

Preferably, in the event that a selected remote program fails, the method further comprises identifying plural possible recovery strategies and automatically analysing these in order to determine a reliability value in respect of each possible recovery strategy and selecting and requesting the services of a remote device or devices based on a recovery strategy selected in dependence upon its determined reliability value.

Preferably, the reliability value factors in the number of different ways that an individual remote program or a component thereof can be achieved using different remote programs. For example, if a first strategy involves performing three consecutive remote programs each of which is deemed to have a 90% reliability (giving an overall reliability for the strategy of 90%.90%.90%=73%) and a second strategy involves performing two consecutive remote programs each of which is deemed to have only an 80% reliability (giving an overall reliability of 80%.80%=64%), then the first strategy would generally be preferred. However, if in the first strategy none of the remote programs had any possible substitutions in the event of a failure, whereas the first of the remote programs in the second strategy had a backup alternative remote program in the event of the main program failing with a reliability of 70%, then the reliability of the second strategy if up to one failure is permitted is (100% - 20%.30%).80% = 75% and would thus be preferred if the user would tolerate up to one failure and back-up at the subcomponent level. The exact ways in which these strategies can be evaluated are described in greater detail below. However, it is preferable if a user is able to specify the extent to which individual failures within a (recovery) strategy followed by backup remote programs are to be tolerated in finding an optimum strategy or recovery strategy.

According to a second aspect of the present invention, there is provided a device for interoperating with a plurality of remote devices, the device comprising means for causing a program specifying a number of distinct operations to be carried out by selecting one or more remote devices to carry out remote programs offered by the remote devices, the first device including means for receiving information about the nature of and availability of the remote programs offered by the remote devices accessible to the first device, means for automatically analysing the computer program together with the received information about remote programs to identify plural possible strategies and means for automatically analysing these in order to determine a reliability value in respect of each possible strategy; and means for selecting and requesting the services of a remote device or devices based on a strategy selected in dependence upon its determined reliability value.

According to a third aspect of the present invention, there is provided a method of interoperating between devices, the method comprising the steps of: running a computer program for use on a first device, the program specifying a number of distinct operations to be carried out, the first device receiving information about the nature of and availability of remote programs offered by remote devices accessible to the first device, automatically analysing the computer program together with the received information about remote programs to select and

request the services of a plurality of remote programs and, in the event that a selected remote program fails, identifying plural possible recovery strategies and automatically analysing these in order to determine a reliability value in respect of each possible recovery strategy and selecting the recovery strategy having the highest determined reliability value.

According to a fourth aspect of the present invention, there is provided a device for interoperating with a plurality of remote devices, the device comprising means for causing a program specifying a number of distinct operations to be carried out by selecting one or more remote devices to carry out remote programs offered by the remote devices, the first device including means for receiving information about the nature of and availability of the remote programs offered by the remote devices accessible to the first device, means for automatically analysing the computer program together with the received information about remote programs to select and request the services of a plurality of remote programs and auto- recovery means for, in the event that a selected remote program fails, identifying plural possible recovery strategies and automatically analysing these in order to determine a reliability value in respect of each possible recovery strategy and selecting the recovery strategy having the highest determined reliability value.

It is clearly useful to be able to analyse a software system composed of WS-operations to dynamically compute: (i) the probability of successful execution of each WS-operation used by the software system, (ii) the probability of successful execution of the software system as a whole. The Reliability Analysis System, RAS, described in this report performs this task.

The reliability information computed by the RAS is useful both to the operational units responsible for the correct operation of software systems, and also, designers of such systems who strive to develop ever more reliable systems.

Note that in this application the terms 'software system' and 'application' are used interchangeably.

Further aspects of the present invention relate to computer programs for carrying out the methods of the invention and carrier means, especially tangible carrier means such as magnetic or optical storage disks or non-volatile solid-state memory devices etc., carrying such programs.

In the distributed processing systems considered by the [Chen, Huang, 1992] algorithm, each program is a unified entity residing and executing on a single processing element (e.g. a computer). The data files required by the programs are distributed on different processing elements. In our proposal, each program (application) consists of a number of components (Web Service operations) which can run on any one of a set of different processing elements. The size of

the set could be different for different components. Also, each processing element has a distinct probability to successfully execute the corresponding component. This probability may change in real-time while the program is executing. Furthermore, the number of processing elements for each component may increase and decrease in real-time too. In summary, the execution environment envisaged in preferred embodiments of the present invention is more complex and much more dynamic.

The [Genama, Rosenblum, Uchitel, 2005] paper has no concept of multiple, independent service-providers for each component (Web Service) which is inherent in preferred embodiments of the present invention. It also fails to describe a method for determining "component reliability information"; in preferred embodiments of the present invention, a "component reliability information" is calculated based on reliability information about the service providers of that component.

The [Kusiak, Zakarian, 1996, IEEE] paper does not address a situation in which there are multiple independent computing entities for performing each task (activity) of the workflow (process). Furthermore, it fails to consider a situation in which the number of computing entities associated with each task may change in real-time nor that the probability of a particular computer entity in successfully completeing a task may also change in real-time, rather it assumes that each task has a fixed probability of successful execution.

Brief Description of the Figures

In order that the present invention may be better understood, embodiments thereof will now be described, by way of example only, with reference to the accompanying drawings in which: Figure 1 is a schematic illustration of the relationship between Web Services, their instances and the Web Service instance providers

Figure 2 is a schematic illustration of the failure of a Web Service (WS) operation (WS-operation) and of the creation of recovery strategies or solutions for recovering from the failure; ,

Figure 3 is a schematic illustration of the architecture of an Automatic Recovery Handling System (ARHS) according to an embodiment of the present invention; Figure 4 is a schematic illustration of the transformation of initial conditions to a set of goals via the execution of an application;

Figure 5 is a schematic illustration similar to Figure 4 showing the transformation of initial conditions to a set of goals via the execution of an application showing intermediate computational states; Figure 6 is a schematic illustration of the transformation from an initial computational state to a final computational state via either one of two functionally equivalent WS compositions;

Figure 7 is a schematic illustration of a process of finding a set of functional equivalents of a WS-operation;

Figure 8 is a schematic illustration of the architecture of a reliability-based analysis module (RBAM) forming part of the ARHS of Figure 3; Figure 9 is a schematic illustration of an example of the structure of a software system containing parallel execution paths;

Figure 10 is a schematic illustration of an example of a linearization of a software system containing parallel execution paths; and

Figure 11 is a schematic illustration of an example of the structure of a software system containing two switch constructs;

Detailed Description of Embodiments Prior to a detailed description of embodiments of the present invention, there firstly follows a general discussion of Web-Based Software Systems.

Increasingly, the new generation of software systems, in particular business systems, are being constructed using the Service Oriented Architecture, SOA. In this approach, a software system is developed by using software components called Web Services.

Each Web Service provides the specification of a number of operations (WS-operations) and the syntax of request/reply messages that are exchanged to invoke those operations. The Web Service specification is realised in terms of a Web Service instance ( WS-instance) which acts as an independent, stand-alone software system and has a physical presence somewhere on a company's private Intranet (for use by the company's own software systems) or the global, public Internet (for paid/free public use). The number of instances of a Web Service is not limited to one; in the private Intranet scenario, there could be multiple instances of a Web Service in support of two aims: (i) to share the workload so that the required quality of service can be achieved, (ii) act as backup to enhance reliability of the software systems run by the company. In the public Interηet scenario, as a commercial market for Web Services gets

established and the Web Services become increasingly commoditised, different sen/ice providers (or even the same service provider) may offer multiple WS-instances of the same Web Service for public use; these WS-instances are identical m terms of functionality but may differ in terms of non-functional attributes such as price (for usage), performance (i.e. execution speed) and reliability (i.e. the probability of successful execution). The difference in the nonfunctional attributes would normally be the determining factor in choosing a particular WS-instance when executing a software system An example of a situation where multiple WS-instances of a Web Service are provided by multiple service providers is shown in figure 1. Here, two service providers 10, 12, in particular Service Provider 1 (reference numeral 10) and Service Provider 6 (reference numeral 12) provide three WS-instances 20, 22, 24 of the 'Web Service A' and one instance 26 of 'Web Service B'; Service Provider 1 offers one WS-instance 20 of 'Web Service A' while Service Provider 6 provides two WS-instances 22, 24 of 'Web Service A' and one WS-instance 26 of 'Web Service B'. Each WS-instance 20, 22, 24, 26 provides all the WS-operations specified by the corresponding Web Service specification 30, 32. Each WS-instance has only one service provider. To summarise, the figure illustrates the following relationships:

• (Web Service) specifies (WS-operation)

• (WS-instance) implements (Web Service)

• (Service Provider) provides (WS-instance)

• (WS-instance) executes (WS-operation) The figure also shows a software system 50 composed of four WS-operations (get stock price, open, write, close) which intends to write a stock price into a file. The specific WS-instances used by the application are also identified as is a network 40 which can be either the Internet or a private or public intranet of some sort.

Successful execution of the software systems composed of Web Services relies on successful execution of the WS-operations that they call upon. The failure of any of those WS-operations normally prevents the client system from achieving its stated goals. The likelihood of such failures cannot be ignored for (at least) three (main) reasons.

First, as stated above, the WS-instances may reside anywhere on an organisation's private Intranet (possibly with global span) or the global, public Internet. Accessing these WS-instances requires reliable, local and/or global communication links which may fail or provide unacceptable quality of service.

Second, WS-instances may run on hardware and software platforms whose reliability is unknown to, or hard to assess by, their users (client systems).

Third, the WS-instances may be withdrawn by their providers (i.e. organisations or individuals) either permanently, for example, due to commercial reasons, or temporarily, for instance, for maintenance. These reliability issues point to the fact that the likelihood of failure of software systems utilising Web Services is not negligible and is normally higher than a centralized one.

To handle the failure of a client software system caused by the failure of the WS-operations that it uses, fault handling actions can be built into the software system's logic. These typically result in a graceful termination of the software system in the event of a failure, but do not achieve the software system's stated goals. A more desirable fault handling scheme would be one which enables the software system to not only recover from the failure of a WS-operation but also achieve its stated goals. We term this capability recovery handling to distinguish it from normal fault handling.

Ideally, recovery handling should be performed automatically. The embodiment described herein provides a system for automatic recovery handling of software systems that are composed of WS-operations. We call this system the Automatic Recovery Handling System, ARHS.

The ARHS aims to provide recovery solutions in case of failure of any of the WS-operations used by a software system during the execution of that software system.

The ARHS offers the following capabilities as well: i. It provides an Execution Engine for executing the software systems that are composed of WS-operations. In addition to executing software systems, the Execution Engine maintains the computational states of a software system during the course of its execution. This is needed for recovery from failed WS-operations. The computational state of a software system normally changes as a result of the execution of each WS-operation used by that software system. At any point in time, the computational state comprises: the values of the variables used by the software system, the program counter, and any other piece of information necessary for the software system execution at that point in time, ii. The capability of finding the different instances of an individual

WS-operation, ranking them according to one or more ranking criteria and then identifying the highest ranked instance. This capability is used by the

Execution Engine in two situations: o During the normal execution of the software system: Here, the highest ranked instances, or consecutive sequences of instances, of the WS-operations used in the software system are found.

- o After a recovery solution or strategy has been found: In this case, the highest ranked instances, or consecutive sequences of instances, of the WS-operations used in the solution are found, iii. In some circumstances the ARHS may find multiple solutions or strategies for recovery from a failed WS-operation. The ARHS would rank these recovery solutions using its Recovery-Solution Ranking System, RSRS. The highest ranked recovery solution would then be used to recover from the failed WS-operation. iv. The RSRS includes a novel Reliability Analysis System (RAS) to allow ranking of the recovery solutions based on their reliability. The reliability measured by the RAS indicates the probability that the recovery solution can be successfully executed. This will be described in detail later, v. In addition to reliability, the RSRS is capable of taking into account other criteria (e.g. financial cost of using a recovery solution, execution speed of a recovery solution, etc.) for ranking the recovery solutions. It can evaluate the overall rank of each recovery solution by taking into account all the chosen criteria (by assigning a weight to each criterion), vi. The RSRS provides an extensible ranking system where new criteria can be introduced (and existing ones removed, if so desired) for ranking the recovery solutions.

Note that in the rest of this specification we use the terms 'software system' and 'application' interchangeably. The rest of this specification is organized as follows. First, using an example, an overview of the activities undertaken during the recovery from a WS-operation failure is presented. Second, the design of the ARHS is examined. Third, the process of forming recovery solutions is examined. Fourth, the concept of the reliability of the recovery solutions is introduced and a novel algorithm for computing their reliability is described in detail.

Overview of the Recovery Handling Activities,

The ARHS aims to provide recovery solutions in the event of the failure of any WS-operation used by a software system during the execution of that system. This section presents an overview of the recovery handling activities that will take place at the time of such failures.

As stated in the previous section, each Web Service specifies one or more WS-operations. Also, there may be multiple instances (WS-instances) of each Web Service, and by extension, multiple instances of each WS-operation. The instances of a WS-operation are identical in terms of the functions they perform (i.e. are functionally equivalent). Each instance is offered by a single service provider; when multiple instances are available, these may be offered by a single service provider or multiple ones. To execute a software system composed of a set of WS-operations, one instance of each of those WS-operations should be

identified and invoked. When such an invocation fails, the following actions will be undertaken by the ARHS: i. The computational state of the software system is reverted to the point before the failed WS-operation was invoked, ii. Another instance of the failed WS-operation (if any) will be found and then invoked. iii. If the new instance fails as well, step (ii) will be repeated until such time that there are no more instances of the failed WS-operation are available. At that point, step (iv) below will be undertaken. iv. The ARHS will attempt to find a recovery solution using composition of other WS-operations. We call each such composition a WS-composition. This type of recovery solution is based on the concept that it is possible to achieve the effects of a WS-operation through execution of two or more other WS-operations. In other words, a failed WS-operation can be replaced by a composition of two or more other WS-operations which collectively have the same effects as (i.e. are functionally equivalent to) the failed WS-operation.

For instance, a WS-operation, called open-read-close-file, may offer the capability to open a file, read a number of bytes from the file and then close the file. Let us assume that there are also three separate

WS-operations, namely, (i) open-file which opens a file, (ii) read-file which reads the required number of bytes from the file, and finally (iii) close-file which closes the file. A software system that needs to do a one-off read operation would usually be designed to use the open-read-close-file WS-operation. However, if that WS-operation fails during the execution of the software system, it would be possible to use the composition of the three separate WS-operations open-file, read-file, and close-file to achieve the same effects. The ARHS can find such compositions. This is achieved by taking into account the pre-conditions and effects of the individual WS-operations. This process is described in detail in later sections in this specification.

An example of the above recovery activities are illustrated in figure 2 which illustrates the case where the software system makes use of five WS-operations, WS0P1 , WS0P19, WSOP5, WSOP7 and WSOP78. To execute this software system one instance of each of these five WS-operations should be chosen and then invoked. Here, the invocation of the first instance of WS0P1 has failed. Two other instances of WSOP1 have been found and invoked but neither invocation has been successful. Subsequently, the ARHS has created two compositions of WS-operations, WS-composition- 1 and WS-composition-2, which are functionally equivalent to WSOP1 , and hence, can independently replace it. The composition WS-composition- 1 consists of three WS-operations (WS0P111 , WSOP112, WSOP113), and the WS-compositipn-2 comprises two WS-operations (WSOP131. WSOP113). Subsequently, the following steps (not shown in Figure 2) take place:

• The two composite recovery solutions are analysed to determine which one scores higher with respect to a set of criteria (e.g. reliability, cost, speed, etc.). The one with a higher score (e.g. WS-composition-2) is selected to replace WSOP1. • The computational state of the software system is reverted to the point before (any instance of) WS0P1 was invoked.

• The selected composition, WS-composition-2, is instantiated. The instantiation is accomplished by choosing an instance of each constituent WS-operation, i.e. WS0P131 and WS0P113. The instances may be chosen based on similar criteria as mentioned above, e.g. reliability, speed, cost, etc.

• Finally, the chosen instances of WSOP131 and WS0P113 will be invoked in that order. Design of Automatic Recovery Handling System (ARHS)

The entities used in the design of the ARHS are shown in Figure 3. Note that the software systems (i.e. the applications running on a first device) which are composed of WS-operations may be encoded in a procedural programming language such as BPEL4WS, Java or C++, These software systems are executed using an execution engine, an example of which is a computer system with network connectivity. The execution engine has the capability to parse the code of the software system and execute it. To execute the WS-operations used in the WS-composition, the execution engine typically sends requests (over Internet or Intranet) to the WS-instances that are capable of executing those WS-operations. Currently, there are several examples of either commercial or open-source execution engines for software systems encoded in BPEL4WS.

The role played by each entity illustrated in Figure 3 is as follows:

• Execution Engine Module: This module is responsible for executing software systems which are composed of WS-operations and are encoded in programming languages such as BPEL4WS, Java or C++. The execution involves invoking the individual WS-operations. This is achieved through sending an invocation request (over the Internet or Intranets) to an instance of each WS-operation, and possibly receiving a reply back. The replies may be modified, stored and subsequently used as input to other WS-operations. The WS-instances (to which the invocation requests are sent) are identified by the Instance Finder module described below. The following points should also be noted regarding the operation of the Execution Engine: o During the execution of a software system, the Execution Engine creates and maintains a read-only copy of the current computational state of the software system before each WS-operation is invoked. Following the successful invocation of a WS-operation, the current computational state is updated accordingly; however, the read-only copies are untouched. They

are kept until the execution of the software system is successfully completed. The read-only copies allow the computational state of the software system to be restored to the point prior to invocation of any of the constituent WS-operations in case the invocation of that WS-operation fails and a recovery solution is adopted which requires reversion back to a point in the execution of the software system prior to the successful completion of a WS-operation (for a discussion of which see below). o If the invocation of a WS-operations fails, the Execution Engine restores the computational state of the software system to the point before the failed WS-operation was invoked. It then attempts to find an alternative instance of the same WS-operation by sending a request to the Instance Finder module. If none of the alternative instances identified by the Instance Finder module can be successfully invoked, the Execution Engine requests a recovery solution from the Recovery-Solution Generation module. Once a recovery solution is received, it is instantiated and subsequently executed as a replacement for the failed WS-operation. The instantiation process finds a WS-instance for each WS-operation used in the recovery solution. o In many circumstances, the WS-operations used in a software system are tightly linked to each other. For example, let us consider the example in the previous section again. There, the output from the open-file WS-operation is normally a file handle which should be used as the input to both the read-file and close-file WS-operations. This requires that the same WS-instance issuing the file handle should be used (by the Execution Engine) for reading and closing the file. That is, the WS-instance chosen for executing one WS-operation may have to be used for executing one or more other WS-operations. It is the responsibility of designers of WS-operations to specify such 'WS-instance interdependence' between the WS-operations. The Execution Engine simply uses this information and does not infer this information itself. Instance Finder Module: This module is responsible for finding all the WS-instances capable of executing a WS-operation and then ranking them according to a set of criteria such as reliability, cost, execution speed, etc. The highest ranked WS-instance is identified. The requests sent to this module consist of: (i) the identity of a WS-operation (including its name, the list of its input and output parameters, and/or the identity of a Web Service specifying it), and optionally, (ii) the required ranking criteria, which should be a subset of the criteria this module can process.

When no ranking criteria are provided, this module uses its own pre-

defined criteria. In making its decisions, this module normally consults the databases containing attributes of different WS-instances of the WS- operations. • Recovery-Solution Generation Module: This module finds all recovery solutions that can substitute a failed WS-operation. It then uses the Recovery-Solution Ranking module (see below) to identify the best solution amongst them according to a set of ranking criteria. The recovery solutions created by this module are compositions of

WS-operations (WS-compositions). In the present embodiment they are formed dynamically when a failure occurs. In alternative embodiments however the recovery solutions for all WS-operations used by a software system might be created and stored before the execution of the software system commences. The recovery solutions for a specific WS-operation can then be selected from the stored solutions and then ranked.

In order to create the recovery solutions, the Recovery-Solution Generation module takes into account the pre-conditions and effects of all WS-operations to determine whether and how a subset of them may be composed into one or more recovery solutions. This process is detailed below.

• Recovery-Solution Ranking Module: This module performs the following functions: o Receives a set of recovery solutions, o Uses the individual ranking modules within the Index Calculation module to evaluate each recovery solution in the set, o Passes the results obtained from the individual ranking modules for each recovery solution to the Index Aggregation module to obtain the overall rank of that recovery solution, o Selects the recovery solution with the highest overall rank and passes it to the Recovery-Solution Generation module.

• Index Calculation Module: This module receives a recovery solution in the form of a WS-composition from the Recovery-Solution Ranking module and, in the present embodiment, measures three attributes of that recovery solution, namely (i) reliability, (ii) execution speed, and (iii) financial cost of execution. Each attribute is computed via a separate sub-module, namely, (i) Reliability-Based Analysis module, (ii) Execution Speed-Based Analysis module, and (iii) Cost-Based Analysis module.

The result of each computation is represented as a numerical index.

It is possible to incorporate new sub-modules capable of computing other attributes of a recovery solution. It is also possible to request that only a subset of the attributes (e.g. only reliability) be computed. The raw data required to compute the required attributes are obtained from external databases. Each sub-module may apply specific algorithms to the raw data to compute the required index.

Also, all sub-modules employed in the Index Calculation module take into account the concept of 'WS-instance inter-dependence' described earlier in the section on Execution Engine Module. o Reliability-Based Analysis Module: This module computes the reliability of the recovery solutions. The reliability of a recovery solution is measured as the probability of its successful execution.

This translates to measuring the probability of successful execution of all WS-operations used in a recovery solution. The details of the computations performed by this module are presented later in this report. o Execution Speed-Based Analysis Module: This module computes the minimum time required to execute a recovery solution. The computed value is the sum of the individual times required to execute the constituent WS-operations used in the recovery solution. When a constituent WS-operation has multiple

WS-instances capable of executing it, the time associated with the fastest WS-instance will be used in computing the overall time. o Cost-Based Analysis Module: This module computes the minimum financial cost of executing a recovery solution. The computed value is the sum of the individual costs that will incur to execute the constituent WS-operations used in the recovery solution. When a constituent WS-operation has multiple WS-instances capable of executing it, the lowest cost WS-instance is used in computing the overall cost.

Note that instead of having each of the above analysis modules act entirely independently in performing their analysis (since this could lead to a situation where say one instance is very fast, but very expensive, and another is very slow, but cheap and where the resulting calculated indices un realistically take the best of both worlds) in some embodiments a dependency can be specified and if a first analysis module (e.g. the speed based one) selects a particular instance for calculating its index, then the same instance must be used by any dependent analysis modules (e.g. the cost-based analysis module).

• Index Aggregation Module: This module aggregates the individual indices, computed by the Index Calculation module for a recovery solution, into an overall numerical index. The aggregation is accomplished by assigning a distinct weight to each individual index to take account of its relative importance, multiplying the index value by the weight and then adding up the weighted indices.

The number of individual indices that are input into the Index Aggregation module and the weight assigned to each one is configurable to take account of the requirements of different user communities or operational environments. This means that this module may rank the same recovery solution differently depending on its current configuration.

Recovery-Solution Generation Module

Generating recovery solutions in case of a WS-operation failure is a core activity of the ARHS. This activity is undertaken by the Recovery-Solution Generation Module (RSGM) of ARHS. The concepts and processes involved in finding the recovery solutions (in the form of WS-compositions) are detailed in the following sections.

Concepts Used in Finding Recovery Solutions

The concepts underlying the process of finding the recovery solutions are presented in the following sections.

Computational State

We define the computational-state of a software system/application to be the set of entities and the relationships between those entities, which are known to the application at any point in time during its execution.

The entities within the computational-state could represent any physical (e.g. travellers, airlines, flights, hotels) or logical (price, reservation) concepts that the application is concerned with. The entities in the computational-state are logically linked to each other through binary relationships of the form relationship(entityi, entity ₂ ). We call this an entity- relationship statement. The computational-state consists of a set of entity- relationship statements. The entities in an entity-relationship statement are similar to variables in a programming language such as Java. That is, each entity is declared to be of a specific data type (either a primitive type such as integer, string, boolean, real, etc. or a user-defined type) and can be assigned a value of that type. An entity's data type and value are defined using two pre-defined relationships, namely, hasType and hasValue, respectively.

NB. In the rest of this document, we use the convention that entity names are composed of a character string starting with '?', e.g. ?traveller, ?bookedHotel, etc. The hasType relationship is of the form hasType(entity, dataType) indicating that the named entity has the type dataType. For example, hasType(?traveller,

TravellerType) indicates that the entity ?traveller is of type TravellerType.

Similarly hasType(?international Flight, FlightType) indicates that

?intemationalFlight is an entity of type FlightType.

The hasValue relationship has two forms: hasValue(entity, value) and hasValue(entityi, entity2). An example of the former is: hasValue(?traveller, {<name>"John Smith"</name> <age>32</age>})

This indicates that the entity ?traveller has a value (of a user-defined type, and encoded in XML notation) consisting of the name and age components.

An example of the latter is: hasValue(?internationalFlight_1 , ?internationalFlight_2)

This indicates that the entity ?internationalFlight_1 has the same value as entity

?internationalFlight_2.

The following points should be noted about computational-states:

• Each entity in the computational-state must have both hasType and hasValue relationships. It might have other relationships as well.

• It is possible for two entities to have multiple relationships with each other, i.e., relationship 1 (entityi, entity ₂ ) and relationship2(entityi, entity ₂ ) can exist simultaneously. An example might be: isFriend(?person1 , ?person2) and isColleague(?person1 , ?person2).

Similarly, an entity might have the same relationship with multiple other entities, e.g., isFriend (?person1 , ?person2) and isFriend (?person1 ,

?person3).

• The only exceptions to the above rule are the relationships hasType and hasValue, for which only one instance is allowed per entity.

• It is assumed that any relationship not explicitly stated in the computational-state does not exist.

An example of a relationship, other than hasType and hasValue, could be the 'bookedHotel' relationship in the context of a holiday reservation application. This relationship links a traveller and a hotel, and is expressed as bookedHotel (?traveller, ^l ?bookedHotel).

The relationships between entities can be implemented in an object oriented language such as Java using an object class (e.g. BinaryRelationship in the example below) with three components, two of which would refer to two entities, while the third one will name the relationship, e.g.

Class BinaryRelationship { entityl EntityClass; entity2 EntityClass; relationship RelationshipClass; }

In the code fragment above, the entities and relationship are, in turn, represented as instances of object classes EntityClass and RelationshipClass. WS-Operation's Pre-Conditions and Effects

To use a WS-operation in an application, two aspects of the WS-operation should be taken into account, namely, its pre-conditions and its effects, as described below. Pre-conditions: Each WS-operation has a set of pre-conditions (possibly empty) that must be satisfied (be true) before the WS-operatioή can be invoked. The preconditions are defined via a set of entity-relationship statements. They describe the WS-operation's input parameters, their types, range of values, and their relationships with other entities (if any). These can be defined using the notation introduced in the previous section for describing the computational-states.

The pre-conditions should be satisfied by the client application's computational-state at the point in time the WS-operation is being invoked. The pre-conditions can be specified using the set theory notation that will be described later.

Effects: The effects of a WS-operation consist of the changes that will be made to the existing computational-state of the client application as a result of the successful execution of the WS-operation. The effects are defined via a set of entity-relationship statements, and can result in:

• Creation of a new entity. This is indicated using an entity-relationship statement of the form "hasType (entityName, entityType)". Note that an entity can only have one hasType relationship at any point in time. If a new hasType relationship is defined for an entity, its existing hasType relationship will be discarded.

• Removal of an existing entity: This is represented using the entity- relationship statement of the form "~ hasType (entityName, entityType)", which is the negation of the "hasType (entityName, entityType)" relationship. Note that: o Removal of an entity results in the removal of other entity- relationship statements defined for that entity. o Removal of a non-existent entity has no effect.

• Creation of a relationship between two entities. As stated earlier, this is achieved using entity-relationship statements of the form: relationship(entityi, entity ₂ ). If the relationship already exists in the computational-state, a duplicate one is not created.

• Removal of an existing relationship between two entities. This is achieved using entity-relationship statements of the form: ~relationship(entityi, entity ₂ ). Removal of a non-existent relationship has no effect.

• Assigning a value to an entity or updating its existing value. These are accomplished using entity-relationship statements of the form: hasValue(entityi, entity ₂ ), and hasValue(entity, value) as explained earlier. An entity can only have one hasValue relationship at any point in time. If a new hasValue relationship is defined for an entity, its existing hasValue relationship will be discarded.

An example of the above rules is now given. Let us assume that a client application's current computational-state and the effects of a WS-operation used by the application are as follows:

Application's current computational-state = {r^e^ e ₂ ), r ₂ (ei,e ₃ ), r ₃ (e ₇ ,e ₁₀ )} WS-operation's effects = {ri(ei, e ₂ ), ~r ₂ (e _{1 )} e ₃ ), T ₁₁ (e ₁₍ e ₂ ), ~ ri ₄ (e ₁₅ ,ei ₈ )} The application's new computational-state resulting from applying the WS-operation's effects to the current computational-state will be:

Application's new computational-state = {n(ei, e ₂ ), r ₃ (e ₇ ,eio), rn(ei, e ₂ )}. NB. The WS-operation's pre-conditions and effects are defined by the WS-operation designer, not the client applications.

Transformation of Application's Initial-Conditions to Goals

Each application has a set of initial-conditions and a set of goals. The initial- conditions represent the initial computational-state of the application, that is, the computational-state that should exist before the application can start its execution. The goals indicate what the application is expected to achieve. They represent a subset of the final computational-state of the application. The final computational-state will be generated by the successful execution of the application.

In summary: initial-conditions = initial computational-state goals isSubsetOf final computational-state

The execution of an application can be perceived as the means to transform the application's initial-conditions to its goals, as depicted in Figure 4.

The process of application execution normally consists of a number of separate execution steps, where at each step either an internal operation or a WS-operation is executed. These steps transform the application's initial computational-state to its final computational-state, as illustrated in Figure 5 which illustrates the transformation of an Application's Initial Conditions to its Goals via various intermediate computational states.

In the example illustrated in Figure 5, the application makes use of three WS-operations, WS0P1 , WSOP2, and WSOP3. These are defined by three Web Services, namely, Web Service 10, Web Service 20, and Web Service 30, respectively. Also, there is one WS-instance of each of these Web Services to which the execution requests are dispatched.

In the example shown in Figure 5, the first execution step (Internal OP1) transforms the initial computational-state to the first intermediate computational- state. Thereafter, each execution step (i.e. call WSOP1, call WSOP2, receive response for WS0P2, Internal OP2, call WS0P3, receive response for WSOP3) changes (through its effects) the current intermediate computational-state, resulting in a new intermediate computational-state. This process is repeated until the final computational-state is achieved. Note that when WS-operations are executed remotely by WS-instances, the response is dispatched either synchronously (immediately), as in the case of WSOP1 , or asynchronously (at a later time), as in the case of WSOP2 and WSOP3. The receipt of a response, which might, for example, carry the new value of an entity, would change the computational state of the application.

Atomic Execution of WS-operations

In the present embodiment, it is assumed that the execution of a WS-operation by a WS-instance is atomic, that is, either the execution is completed successfully, or no execution takes place. In practice, if the execution starts but fails to complete any changes already made by the execution steps to the computational-state of the client application are undone so that it is equivalent to the case where no execution steps have taken place.

WS-Compositions A WS-composition as the term is used in the present specification in respect of the present embodiment is defined to be a linear sequence of: (i) one or more WS-operations and (ii) zero or more internal-operations. The sequence starts and ends with a WS-operation. Apart from this, there is no restriction on the order in which the WS-operations and internal-operations may be placed in the sequence.

An example of a WS-composition is the following sequence of two WS-operations and one internal-operation:

WSOP2 lnternal-OP10

WSOP3

Such a WS-composition normally exists in the context of an application and is executed as part of the application. At the same time, it has both pre-conditions and effects. The pre-conditions may include none, some or all of the pre-conditions of each constituent WS-operation (e.g. WS0P2, and WSOP3 in the example above). The execution of the WS-composition, in the context of the enclosing application, can commence only if its pre-conditions are satisfied by (i.e. be a subset of) the active computational-state of the application. Its successful execution will generate a new computational-state for the enclosing application. The new computational-state is formed by applying the WS-composition's effects to the application's computational-state that existed at the point in time when the execution of the WS-composition was started. NB. It is possible to regard a WS-operation as a WS-composition that consists of only a single WS-operation.

Functional Equivalence of WS-Compositions

Two WS-compositions are functionally equivalent with respect to a specific initial computational-state if:

• The pre-conditions of the two WS-compositions are identical.

• The initial computational-state is transformed to the same final computational-state regardless of which of the two WS-compositions is executed. The above conditions are also applicable when considering the functional equivalence of a single WS-operation and a WS-composition. This is because, as stated earlier, a single WS-operation can be regarded as a WS-composition comprising only that WS-operation. An example of two functionally equivalent WS-compositions is illustrated in Figure 6, where the WS-compositions (WSOP2, lntemal-OP10, WSOP3) and (WSOP4, lntemal-OP13, WSOP5, lnternal-OP15, WSOP6) are functionally equivalent. In this example, each WS-operation is provided by a separate Web Service, but this is not a requirement for functional equivalence.

The execution steps taken by the two WS-compositions shown in Figure 6 are detailed below. The pre-conditions and effects of each WS-operation and internal-operation used in the WS-compositions of Figure 6 are as follows:

the initial computational-state under which the two WS-compositions start their execution is denoted by "Initial-CS". The intermediate computational-states

(detailed below) created by (WSOP2, lnternal-OP10, WSOP3) are represented by "lntermediate-CS-1.1", "lntermediate-CS-1.2", "lntermediate-CS-1.3".

Similarly, the intermediate computational-states created by

(WSOP4, lnternal-OP13, WSOP5, lnternal-OP15, WSOP6) are represented by

"lntermediate-CS-2.1", "lntermediate-CS-2.2", "lntermediate-CS-2.3" ₎

"lntermediate-CS-2.4" and "lntermediate-CS-2.5".

The execution of the WS-composition (WSOP2, lntemal-OP10, WSOP3) involves the following steps:

Before invoking WS0P2: PC2 isSubsetOf Initial-CS

Execution Step: Invoke WSOP2

After invoking WSOP2: lntermediate-CS-1.1 = EF2 appliedTo Initial-CS

Before executing lnternal-OP10: PC10 isSubsetOf lntermediate-CS-1.1

Execution Step: Execute I nternal-OP 10

After executing lnternal-OP10: lntermediate-CS-1.2 = EF10 appliedTo lntermediate-CS-1.1

Before invoking WSOP3: PC3 isSubsetOf lntermediate-CS-1.2 Execution Step: Invoke WSOP3 After invoking WSOP3: lntermediate-CS-1.3 = EF3 appliedTo lntermediate-CS-1.2 The execution of the WS-composition (VVS0P4, lnternal-OP13, WSOP5, Internal- OP15, WSOP6) involves the following steps:

Before invoking WSOP4: PC4 isSubsetOf Initial-CS Execution Step: Invoke WSOP4 After invoking WSOP4: lntermediate-CS-2.1 = EF4 appliedTo Initial-CS

Before executing lnternal-OP13: PC13 isSubsetOf lntermediate-CS-2.1

Execution Step: Execute lnternal-OP13

After executing lnternal-OP13: lntermediate-CS-2.2 = EF13 appliedTo lntermediate-CS-2.1

Before invoking WSOP5: PC5 isSubsetOf lntermediate-CS-2.2 Execution Step: Invoke WSOP5 After invoking WSOP5: lntermediate-CS-2.3 = EF5 appliedTo lntermediate-CS-2.2

Before executing lnternal-OP15: PC15 isSubsetOf lntermediate-CS-2.3

Execution Step: Execute lntemal-OP15

After executing lntemal-OP15: lntermediate-CS-2.4 = EF15 appliedTo lntermediate-CS-2.3

Before invoking WSOP6: PC6 isSubsetOf lntermediate-CS-2.4 Execution Step: Invoke WSOP6 After invoking WSOP6: lntermediate-CS-2.5 = EF6 appliedTo lntermediate-CS-2.4

The two WS-compositions (WSOP2, lntemal-OP10, WSOP3) and (WSOP4, lntemal-OP13, WSOP5, lnternal-OP15, WSOP6) are functionally equivalent if their final computational-states are identical, i.e. lntermediate-CS-1.3 = lntermediate-CS-2.5

NB. Two WS-compositions may be functionally equivalent with respect to one initial computational-state but not with respect to another one. This is the case if, for example, a WS-operation in one of the WS-compositions would remove an entity-relationship statement that no WS-operation in the other WS-composition would remove. Now, given an initial computational-state that includes the said entity-relationship statement, the final computational-states of the two WS- compositions will not be the same. This is demonstrated below using the two WS-compositions (WSOP2, Intemal-OPIO, WSOP3) and (WSOP4, lnternal-OP13, WSOP5, lnternal-OP15, WSOP6).

Here, if the initial computational-state includes the entity-relationship statement r ₂₂ 2(e ₆ , e ₈ ) then the final computational-state after execution of (WSOP2, Internal- OP10, WSOP3) will include T ₂₂₂ (e ₆ , e ₈ ), whereas the final computational-state ' after execution of (WSOP4, lnternal-OP13, WSOP5, lnternal-OP15, WS0P6) will not include that statement because WSOP4 will have removed it. However, if r ₂₂ 2(e ₆ , e ₈ ) is not in the initial computational-state, the final computational-states of the two WS-compositions will be the same, and they will be functionally equivalent.

Process of Finding Recovery Solutions

Based on the concepts described above, the algorithm for finding WS-compositions (recovery solutions) that are functionally equivalent to a (failed) WS-operation is now presented. Algorithm for Finding Functional Equivalents for WS-operations

Above, the concept of functional equivalence between two WS-compositions was discussed. It was also noted that a WS-operation can be regarded as a WS-composition, and therefore, functional equivalence could also exist between a WS-operation and a WS-composition. One consequence of this functional equivalence is that, in the case of applications composed of WS-operations, if a WS-operation cannot be successfully executed, it can be replaced by a WS-composition that is functionally equivalent to it. In this way the application can resume its execution and successfully achieve its goals. One obvious issue is how to find the WS-compositions that are functionally equivalent to a WS-operation used by an application. This calls for an algorithm that can create the functional equivalents of a WS-operation using a set of available WS-operations, as depicted in Figure 7 which shows the process of finding a set of functional equivalents for a WS-operation WSOP[i].

The manner in which this problem space can be formulated is firstly discussed below and then an appropriate algorithm for solving this problem once thus formulated is identified. Problem Formulation

The problem of finding the functional equivalents of a WS-operation (in the context of an application) in the present embodiment has three inputs:

• The application's computational-state that exists before the WS-operation is invoked,

• The application's computational-state that is expected to exist after the WS-operation is successfully executed, • A set of WS-operations each having its own pre-conditions and effects.

This set is mutually exclusive with the set of WS-operations used by the application in the present embodiment.

Given the above inputs, the problem can be formulated as: find sequences of WS-operations where each sequence has the same effect as the target

WS-operation on the application's computational-state. This problem can be treated as a planning problem. One well-known approach to describing the inputs to a planning problem is STRIPS [Fikes & Nilsson, 1971]. STRIPS defines the problem space using three components: i. an initial world model, ii. a goal state,

iii. a set of operations, where each one has a set of propositions describing its pre-conditions and effects.

Clearly, STRI PS's three components correspond directly to the three components of our problem as stated above. Therefore, in the present embodiment, the STRIPS algorithm is used to formulate the inputs to our problem (note there are many other ways in which this could be done however).

Algorithm

One algorithm that can automatically find solutions to the planning problems expressed in STRIPS is Graphplan search [Blum, Furst, 1997]. Graphplan identifies the sequence of operations (if any) for reaching the goal state from the initial world model.

In the present embodiment, Graphplan is used as the algorithm for finding the functional equivalents of WS-operations (note that there are also other algorithms which could be used for this purpose).

Reliability-Based Analysis Module (RBAM) The Reliability-Based Analysis Module (RBAM) is a component of the Index

Calculation Module of the ARHS (see above). It is responsible for computing the reliability of recovery solutions that are to replace failed WS-operations in the software systems composed of WS-operations. In order to compute the reliability of a recovery solution, the RBAM first computes the reliability of the individual WS-operations used in that recovery solution. It is also possible to use the RBAM to compute the reliability of a specific WS-operation with no reference to a recovery solution.

NB. The reliability concept and its different aspects, as detailed here and below, are comprehensive enough to be fully applicable to general strategy formation for entire software systems composed of WS-operations as well as recovery strategies or solutions. From the RBAM's perspective, a recovery solution is considered to be a simple case of a general strategy for a software system. Compared to a software system, a recovery solution usually employs a smaller number of WS-operations and makes limited use of control constructs. For this reason, this section and the next one consider the more general case of entire software systems rather than specifically referring to recovery solutions. However, all the concepts and algorithms are applicable to both software systems and recovery solutions.

In the rest of this section the term 'software system' should be taken to mean 'software system or recovery solution'. This applies to diagrams as well.

Definition of Reliability The reliability of a software system is measured as the probability of its successful execution. This translates to measuring the probability of successful

execution of a//WS-operations used in that software system. The reliability is represented by a positive, rational number in the range zero to one.

Factors Affecting Reliability of Software Systems This section identifies the factors influencing the reliability of the software systems. It also identifies the events that might lead to a change in reliability over time.

Each software system is constructed using WS-operations and internal operations (see Figure 5). It is assumed that the internal operations would always be successfully executed, and any unreliability is due to failure of the WS-operations.

Different software systems may share one or more WS-operations with each other, but there are always differences between them in terms of the

WS-operations used in their construction. These differences are the main source of variations between their reliabilities. This is elaborated further below.

i. Multiple WS-iπstances offering the same WS-operation: The reliability of each software system depends on the probability of successful execution of each of its constituent WS-operations. Each constituent WS-operation is executed by sending a request to a WS-instance which implements the Web Service that specifies the WS-operation. It is possible to have multiple WS-instances capable of executing the same WS-operation. In this case, if one of those WS-instances fails to execute the WS-operation successfully, another one may be used in its place. Clearly, a WS-operation with a larger number of WS-instances has a higher probability of successful execution. This in turn contributes to higher reliability for the software system as a whole. Multiple WS-instances of a Web Service (and by extension a

WS-operation) are normally provided by service providers which may be commercial enterprises or individuals. In either case, the service providers may be competing in an open, competitive market where a charge is levied for using the WS-instances. Alternatively, the providers may offer the WS-instances freely on the Internet for public usage.

The reason for availability of multiple WS-instances of the same Web Service may include: (i) market conditions, e.g. popularity of certain Web Services might encourage service-providers to bring more WS-instances to the market (ii) criticality of a specific Web Service might require employing more of its WS-instances compared to others. ii. Each WS-instance has a distinct success probability: The

WS-instances capable of executing a specific WS-operation each have a distinct probability to successfully execute that WS-operation. The difference between their success probabilities can be due to technical reasons such as: (i) the type and age of employed hardware platforms, (ii)

the maturity and robustness of their software platforms, (iii) speed and reliability of communications links providing access to the WS-instances. iii. Changing number of WS-instances: Given the possibility of multiple, independent WS-instances capable of executing a WS-operation, it is also possible that the number of WS-instances might change over time. This might even occur in real time during the execution of a software system using that WS-operation. The change might occur due to commercial or technical reasons. Examples of the former include: (i) forced withdrawal of one or more WS-instances due to, for example, bankruptcy of the service-providers, (ii) fluctuations in market demand for a Web Service might result in provision of a smaller or larger number of WS-instances of that Web Service. An example of the technical reasons is the failure of hardware/software platforms used by the WS-instances. iv. Changing reliability of WS-instances: Finally, in addition to each WS-instance having a distinct success probability, it is also possible that this probability might change temporarily or permanently over time. For instance, the probability might temporarily go down at peak demand periods, and conversely go up at off-peak periods; or alternatively, the reliability may go down permanently as the employed hardware gets older, and consequently, becomes less reliable.

The combination of above factors amount to a complex, dynamic execution environment in which the assessment of the reliability of the software systems composed of WS-operations is an important but non-trivial task. Any change in the above factors could dynamically alter the reliability of the software systems. The RBAM takes into account all these factors to compute the reliability of software systems.

Steps Taken by Reliability-Based Analysis Module (RBAM) The role of the RBAM is as follows: i. Analyses a given software system to identify its (i) structure, (ii) control constructs and execution paths, and (iii) the WS-operations used by it, ii. For each WS-operation used by the software system, the RBAM first identifies the Web Service which specifies that WS-operation, and then discovers all available WS-instances of that Web Service. Finally, for each of those WS-instances, the probability that the WS-instance can successfully execute the WS-operation is also obtained. So, at the end of this phase, there should be one or more statements of the form: <WS-instance X> can successfully execute <WS-operation Y> with probability Z>.

To obtain the success probability, i.e. probability Z>, three sources of information may be used: (i) Registries: these hold information about Web

Services, their WS-instances and their service providers; an example of such registries is the UDDI registry, (ii) Service Providers: each service-provider can supply information about the Web Services and WS-instances that it provides, (iii) Independent Organisations: these are responsible for monitoring the operation of the WS-instances and assessing their success probability. iii. Using the success probability obtained above, the RBAM computes the following:

• Reliability Index of each WS-operation used by the software system. This is measured as the probability of successful execution of the

WS-operation by at least one of the WS-instances capable of executing it. The term 'reliability index' refers to the numerical value of the reliability.

• Reliability Index of the software system as a whole. This is measured as the probability that all the WS-operations used by the software system can be successfully executed (by at least one of their ^J associated WS-instances).

Architecture of Reliability-Based Analysis Module (RBAM) The architecture of the RBAM is shown in Figure 8. The components within the box in the centre of Figure 8 constitute the RBAM. The components outside the box are used by RBAM but are not part of it. A description of the components in each category is given below. Components External to Reliability-Based Analysis Module

As shown in Figure 8, there are three main types of components which are external to the RBAM, namely the Code of the Software System (or a recovery solution), Service Providers, Web Service Instance Monitoring and Assessment systems and one or more repositories of Information Relevant to the calculation of a reliability index. These elements are discussed below.

Code of Software System: The input to the RBAM is the source code of a software system written in a procedural programming language such as BPEL4WS, Java or C++,. The code includes references to one or more WS-operations and zero or more internal-operations (see Figure 5). It may also include one or more control constructs specifying the flow of execution. The control constructs are limited to Sequence, Parallel, While and Switch as defined in below. Service Provider: Each service provider, in the present embodiment, provides (or should provide) information about the WS-instances it provides and the probability of successful execution of each of them.

Web Service Instance Monitoring & Assessment (M&A) System: This system monitors the operation of WS-instances and assesses the probability of successful execution of each WS-operation they provide against pre-defined

criteria. The assessment may take into account historical data, expected future work load for the WS-instances, and other relevant factors. Multiple, independent instances of this kind of M&A system might be accessible to the RBAM. It is envisaged that these M&A systems are normally owned and operated by enterprises independent of service providers to add credibility to their findings.

Repository for Information related to Reliability Index: This repository in the present embodiment holds the following information:

• list of all WS-instances (and their service providers) capable of executing a WS-operation,

• probability of successful execution of that WS-operation independently by each individual WS-instance in the above list.

The sources of the above information include: (i) service providers, and (ii) Web Service Instance Monitoring & Assessment Systems.

Components Internal to Reliability-Based Analysis Module

As shown in Figure 8, there are three main components which are interna] to the RBAM, namely a parser, a reliability index evaluator and a reliability index information collector; between them, these components operate to produce a reliability index as discussed beiow. the three main components of the RBAM are now discussed in a little more detail. Parser: The Parser module receives the source code of a software system (or of a recovery solution) and performs the following actions:

• Converts the parallel execution paths in the software system code to linear ones.

• Identifies all WS-operations used in the software system. • Identifies the While (loop) control constructs in the software system code and the WS-operations used within them.

• Identifies the Switch constructs (e.g. 'if-then-else' and 'switch' statements as defined in languages such as Java, C++, or the <switch> activity in BPEL4WS) in the software system code and the WS-operations within them.

• identifies the different execution paths within each Switch construct. NB. Given the possibility of using different programming languages/notations for encoding a software system (BPEL4WS, Java or C++,), the Parser module is able to deal with these languages and notations.

Reliability Index Information Collector: This module receives the identification of a WS-operation and returns: (i) the list of all WS-instances (and their service providers) that can execute that WS-operation, and (ii) for each WS-instance in above list, its probability to successfully execute the given WS-operation.

The information returned by this module is obtained from one or more of the following sources: (i) service providers, (ii) Web Service Instance Monitoring & Assessment Systems, (iii) Repositories for Information related to Reliability Index NB. The list of WS-instances returned by this module for a given WS-operation may change on each invocation of this module. This is because new WS-instances may start operation for the first time, existing ones may halt (temporarily or permanently) their operation or resume their operation after a pause. Even when the list remains unchanged, the 'success' probability associated with each list member may change over time. This could be due to factors such as: changes in work load, changes in the quality of service (QoS) of the networks providing access to a WS-instance, and/or changes in the hardware/software platforms used by the WS-instances.

Reliability Index Evaluator: This module requires the following information to perform its task. These are:

• List of WS-operations used in the software system. This is obtained from the Parser module.

• List of different execution paths within the software system, and the WS-operations utilized by each path. This is obtained from the Parser module.

• For each WS-operation used in the software system, the list of WS-instances that can execute that WS-operation.

• For each pair (WS-operation, WS-instance) obtained above, the probability of successful execution of the WS-operation by the WS-instance. This information is supplied by the 'Reliability Index

Information Collector' module.

Using the above information, the Reliability Index Evaluator computes the reliability index of each WS-operation. These index values and the information about the software system's execution paths are then combined to compute the reliability index of the software system. The details of this computation are presented later.

Computing Reliability Indexes This section describes the concepts behind and the procedure for computing the reliability index of both the recovery solutions (or of a general strategy or a software system, etc.) and their constituent WS-operations (as stated above the concepts and the procedure are comprehensive enough to not only deal with recovery solutions but also be fully applicable to computing the reliability index of the software systems and their constituent WS-operations - since a recovery solution can be considered to be a simple case of a software system, this section will focus on the more general case of the software systems rather than recovery solutions). In the rest of this section the term 'software system' should be taken to mean 'software system or recovery solution'. This applies to diagrams as well.

Reliability Index Definition

The 'Reliability Index, Rl' of a WS-operation indicates the probability that the WS-operation can be successfully executed by at least one of the WS-instances capable of executing it. The Rl of the WS-operation is computed in the context of the software system of which it is a part.

This implies that the same WS-operation may have a different Rl value in the context of each individual software system employing it. The reason for this will be explained later. The 'Reliability Index, Rl' of a software system indicates the probability that the software system can successfully complete its execution given the possibility of failure of the WS-instances that can execute its constituent WS-operations. More precisely, the software system's Rl indicates the probability that each constituent WS-operation can be successfully executed by at least one WS-instance.

Characteristics of Software System & its Execution Environment

To compute the Rl value of a software system, it is assumed that the software system and its execution environment have the characteristics described below: i. The software system is composed of a set of n WS-operations, each uniquely identified by an Operation IDentity (OID). For example, the software system in Figure 1 is composed of four WS-operations, 'Get Stock Price', 'Open', 'Write' and 'Close', ii. There is a set of w Web Services. Each WS-operation is part of (specified by) exactly one of those Web Services. The n WS-operations used by a software system may be part of one or more (at most n) Web Services. For example, the four WS-operations used by the software system in Figure 1 are part of two different Web Services, 'Web Service A' and 'Web Service B'. iii. Each Web Service (and by inference each WS-operation) is implemented by one or more WS-instances. Each WS-instance implements exactly one

Web Service. For example, in Figure 1 the 'Web Service A' is implemented by three WS-instances. iv. There is a set of s WS-instances. Each WS-instance is capable of executing one or more of the n WS-operations used in the software system. For example, in Figure 1 there are four WS-instances in total.

These include three WS-instances of the 'Web Service A', where each one can execute three of the WS-operations used in the software system. The fourth WS-instance is the only instance of 'Web Service B' and can execute only one of the WS-operations used in the software system. Figure 1 also shows that two WS-instances (one of 'Web Service A' and one of 'Web Service B') are selected to execute all four WS-operations of the software system. Each WS-instance is uniquely identified by a WS-instance ID, HD. v. There is a set of k service-providers which provide the s WS-instances. For example, in Figure 1 there are two service-providers providing the four

WS-instances. Each service-provider is uniquely identified by a Provider ID, PID. vi. Each WS-instance has exactly one service-provider. However, each service-provider may provide one or more (and possibly all) of the s WS-instances. By inference, each service-provider may provide (indirectly through its WS-instances) one or more (and possibly all) of the n WS-operations that the software system is composed of. vii. To summarise:

• (k service-providers) provide (s WS-instances) (1 =< k =< s) • (s WS-instances) implement (w Web Services) (w =< s)

• (w Web Services) define (n WS-operations) (1 =< w =< n)

• (r WS-instances) execute (n WS-operations) (1 =< r =< s) & (1 =< r =< n) viii. There is a many-to-many relationship between the members of the WS- operation set and the members of the WS-instance set, as detailed by the next two points. ix. Each WS-operation can be executed by a subset of r WS-instances, where 1 =< r =< s. NB. r may have a different value for each WS-operation. For example, in Figure 1 the WS-operation 'Open' can be executed by three WS-instances (which are provided by two service-providers, 'Service

Provider 1 ' and 'Service Provider 6') whereas WS-operation 'Get Stock Price' can be executed by only one WS-instance (which is provided by 'Service Provider 6'). x. Each WS-instance can execute a subset of m WS-operations of the software system, where 1 <= m <= n. NB. m may have a different value for each WS-instance. For example, in Figure 1 each instance of the 'Web Service A' can execute three WS-operations, 'Open', 'Write' and 'Close', used by the software system; however, each instance of the 'Web Service B' can execute only one WS-operation, 'Get Stock Price'. xi. The service-providers operate independently of each other. Also, the

WS-instances operate independently of each other. The only exception is when a service-provider provides multiple WS-instances, in which case the failure (or withdrawal) of that service-provider causes the failure (or withdrawal) of all of its WS-instances. xii. The WS-operations are executed independently of each other. That is, the probability of a WS-operation to be successfully executed by a WS-instance is independent of the probability of other WS-operations to be successfully executed by the same WS-instance. Similarly, the probability of a WS-operation to be successfully executed by a WS-instance is independent of the probability of the same WS-operation to be successfully executed by other WS-instances. xiii. The probability of a WS-instance (uniquely identified by an UD) to successfully execute a WS-operation (uniquely identified by an Ol D) is represented by ^p< " ^{D 0ID)} .

xi v. Conversely, the probability that a WS-instance (identified by HD) fails to execute a WS-operation (identified by OID) is represented by ^ ^{~ Pu} '° ^{0ID) )} .

Pre-requisites of Reliability Index Computation

^■ Assumptions about Software System Code

The RBAM needs to analyse the software system. To allow this, the software system must satisfy the following conditions: i. It must be syntactically and semantically correct. The RBAM is not intended to verify the syntactic and semantic correctness of software systems, nor can it undertake its analysis on syntactically or semantically incorrect software systems, ii. If the software system contains a Switch control construct, the software system developers might optionally provide the probability associated with the selection of each execution path (branch) within the Switch construct during the software system execution. If this information is not provided, it will be assumed that all execution paths have equal probability of selection for execution. iii. If the software system contains a While (loop) control construct, the software system developers might optionally provide the number of times the While construct will be executed during the software system execution. If this information is not provided, an iteration count of one will be assumed. ■ Permissible Control Constructs in Software Systems

We assume only four types of control constructs will be used within a software system. These are:

Sequence: This construct consists of a sequence of WS-operations that will be executed in the order in which they are specified. The sequence may include multiple instances of the same WS-operation.

While: This construct consists of a conditional expression and a set of WS-operations. The conditional expression is evaluated and if its value is 'true' the set of WS-operations will be executed. This process will be repeated until the conditional expression evaluates to 'false'.

Switch: This construct consists of an ordered sequence of one or more conditional execution paths. Each path is associated with a conditional expression which would evaluate to 'true' or false'. There is also an optional 'otherwise' path with no associated conditional expression. The conditional expressions are evaluated in the order in which they are given. The first execution path whose condition expression evaluates to 'true' is executed. No other paths would be executed afterwards. If none of the conditional expressions evaluates to 'true' then the 'otherwise' path is executed. The execution of Switch construct is complete when the execution of the selected path is finished.

Parallel: This construct consists of two or more sets of WS-operations where these sets can be executed in parallel. Note that each 'set of WS-operations' used in the above control constructs is itself constructed using one of the four given control constructs. In this way complex, nested control constructs can be created.

^■ Finding Linear Execution Paths within Software Systems A software system can include multiple control constructs, namely, Sequence, While, Switch, and Parallel. These constructs can be arbitrarily nested inside each other to any depth. They are identified by the Parser module of the RBAM.

However, the RBAM can only compute the reliability index of the Sequence construct directly. Other constructs should first be converted into equivalent Sequence constructs before their reliability index can be computed.

When this conversion is complete, the resulting Sequence constructs are concatenated, in the order they have been originally specified in the software system, to create a single, end-to-end Sequence construct representing the whole software system.

There is an exception to above process and that is when the software system contains a Switch construct. The conversion of a Switch construct normally results in two or more Sequence constructs (this will be elaborated later). When these are concatenated with other Sequence constructs they would form multiple, end-to-end Sequence constructs representing the whole software system.

We shall now consider the conversion process for each type of control construct.

Conversion of Sequence Constructs

As stated above the Sequence control construct does not need any conversion. A software system consisting of a single sequence of WS-operations can be directly analysed by RBAM to compute its reliability index.

Conversion of Parallel Constructs: Linearising Parallel Execution Paths A software system may include Parallel control constructs resulting in two or more parallel execution paths. Each path includes a set of WS-operations, and the paths can be executed in any order with respect to each other. Such parallel execution paths are simply an optimisation of the sequential execution of those paths.

An example of a software system with three parallel execution paths is shown in Figure 9. Here, the three paths include three sets of operations, {WSOP102, WSOP103}, {WSOP104} and {WSOP105} respectively. These sets can be executed in parallel after the WSOP1 has been executed.

Before such a software system can be analysed by RBAM, each collection of parallel execution paths is converted to a single sequential (linear) path. Figure 10 depicts one possible linear version of the software system of Figure 9 where the three parallel execution paths are executed in the following order: {WS0P105} followed by {WSOP102, WSOP103} followed by {WSOP104}.

Note that the kind of parallelism discussed here is fundamentally different from the alternative execution paths in a Switch construct. The Switch construct provides alternative paths depending on the value of some expression. Those paths are not meant to be executed in parallel.

• Conversion of While Constructs

The While control construct indicates that the control constructs within it should be executed zero or more times, according to the value of some conditional expression. In general, it is not possible to determine the number of iterations of a While construct simply by parsing the software system's code. There are two possibilities as regards the number of iterations: i. The software system's designer/operator may indicate to RBAM how many times the While construct will be executed, e.g. the repetition-count \s r. ii. In the absence of the information from designer/operator, the RBAM assumes that the construct will be executed only once, i.e. its repetition-count is 1.

Here, we also assume that the set of WS-operations inside the While construct is a Sequence construct, or has been converted to one. More complex cases will be considered later. As an example, let us consider the following While construct: While construct 1 { Sequence construct 1 WS-operation 1 WS-operation 2 WS-operation n

} / ^* Sequence construct ends ^* / } /* While construct ends*/

Given the repetition-count of 1 or r, the above While construct will be converted to an equivalent Sequence construct which is composed of the 'Sequence construct 1' repeated 1 or r times, i.e.

Sequence construct 1

WS-operation 1

WS-operation 2 ...

WS-operation n

} /* Sequence construct ends*/

Sequence construct 1 WS-operation 1 WS-operation 2

WS-operation n } / ^* Sequence construct ends*/ ...

Sequence construct 1 (r ^th instance of the sequence) WS-operation 1 WS-operation 2

WS-operation n

} / ^* Sequence construct ends ^* /

The situation where the While construct contains one or more Switch constructs will be discussed later.

• Conversion of Switch Constructs

This section describes the process of converting a Switch control construct into two or more Sequence control constructs. A Switch construct defines at least two alternative execution paths.

The conversion process consists of two parts: i. It identifies the distinct execution paths contained within the Switch construct. This is straightforward when the software system is encoded in notations such as:

• BPEL4WS where the 'case' and 'otherwise' statements explicitly identify the individual execution paths within BPEL4WS's Switch statement. • the Java programming language where the 'case' and 'otherwise' statements identify the alternative execution paths within Java's Switch statement. Also, the if-then-else statement of the Java programming language serves the same purpose. ii. For each execution path identified above, the probability of its selection during the execution of that Switch construct is computed. To clarify the conversion process, we use an example of a software system containing two Switch constructs, one enclosing the other, as shown in figure 11. In Figure 11 , the software system contains two switch constructs, 'Switch

Construct 1' and 'Switch Construct 2', where the former contains the latter. The 'Switch Construct 1 ' introduces three execution paths, starting at WSOP11 , WS0P12 and WSOP13. The 'Switch Construct 2' defines two execution paths starting at WSOP22 and WSOP23. Accordingly, the combination of two Switch constructs results in multiple Sequence constructs each comprising a linear sequence of WS-operations. These are:

Sequence construct 1 : (WS0P11 , WSOP21 ) Sequence construct 2: (WS0P12, WSOP22, WSOP31) Sequence construct 3: (WSOP12, WSOP23, WSOP31 ) Sequence construct 4: (WS0P13)

Each of the above Sequence constructs is a valid. execution path within the scope of the Switch construct 1.

The next task is to compute for each individual Sequence construct its probability of being selected ior execution at execution time. To do this, we need to know the probability of selection of the individual execution paths at each Switch construct in the software system. These probability values can be obtained in two ways: i. The software system designer/operator can explicitly provide this information to the RBAM. The sum of the probability values for each Switch construct should be 1. ii. In the absence of this information being provided by the software system designer/operator, the RBAM will assume that there is equal probability of selecting any of the alternative execution paths at a Switch construct. For instance, in the software system shown in the figure above, the probability of opting for any of the three WS-operations WS0P11 , WS0P12 and WS0P13 in the 'Switch construct 1' would be 1/3; similarly, in the 'Switch construct 2' the probability of opting for WSOP22 or WSOP23 would be

1/2.

Using these probability values, the RBAM computes the probability of each specific Sequence construct to be selected for execution at the execution time as follows.

Sequence construct 1 : (WS0P11 , WS0P21 ):

(Probability of WS0P11 be selected) x (Probability of WSOP21 be selected) =

1 I

3 x 1 = 3

The selection probability for the other three Sequence constructs will be:

L i I

Sequence construct 2: (WS0P12, WSOP22, WS0P31): 3 x 2 x i ₌ 6

Sequence construct 3: (WS0P12, WSOP23, WS0P31): 3 x 2 x i = 6 .

1 1 Sequence construct 4: (WS0P13): ³ = ³

Note that the sum of the individual probabilities of these four Sequence constructs is

I 1 1 I

_{1 =} 3 + 6 + 6 + 3

Selection Probability of End-to-End Sequence Control Constructs

The previous section presented a scheme for computing the probability of a Sequence construct, which lies within a Switch construct, to be selected during the software system execution. The same scheme can also be used for computing the selection probabilities of the Sequence constructs spanning the application end-to-end. These end-to-end Sequence constructs represent alternative, complete execution paths of the software system during its execution. In other words, the application execution would follow one of these execution paths.

In the context of the application in Figure 1 1 , the end-to-end Sequence constructs and their respective selection probabilities are: i. End-to-end Sequence construct 1 : (WS0P1 , WS0P11 , WS0P21 , WSOP41 ) Probability of selection: 1/3 ii. End-to-end Sequence construct 2: (WSOP1 , WS0P12, WSOP22, WSOP31 , WSOP41) Probability of selection: 1/6 iii. End-to-end Sequence construct 3: (WS0P1 , WS0P12, WSOP23,

WSOP31. WSOP41) Probability of selection: 1/6 iv. End-to-end Sequence construct 4: (WSOP1 , WS0P13, WSOP41 ) Probability of selection: 1/3

• Conversion of Switch Constructs within While Constructs The conversion of a Switch construct which is within a While construct requires special attention. We consider two possibilities:

i. The While construct will be executed only once. This is the situation when the RBAM has not received any information about the repetition-count from the software system designer/operator. In this situation, the While construct is equivalent to a Sequence construct, hence, the conversion of the enclosed Switch construct does not require any special consideration and can be performed using the scheme presented in the previous section. ii. The While construct is to be executed more than once based on the repetition-count information supplied by the software system designer/operator. This is further considered below.

When a While construct is to be executed multiple times the Switch construct contained within it will be executed multiple times as well. This is semantically equivalent to having multiple instances of the Switch construct concatenated to

each other. For example, if a While construct is to be executed twice, then the enclosed Switch construct will be executed twice as well, as shown below:

While construct 1 { / ^* to be executed twice ^* / Switch construct 1 {

WSOP70 or

WSOP80 or

WSOP90

} / ^* Switch ends ^* / } / ^* While ends ^* / is equivalent to:

Switch construct 1 { WSOP70 or WSOP80 or WSOP90 } / ^* Swicth ends ^* / Switch construct 1 {

WSOP70 or

WSOP80 or

WSOP90

} /* Swicth ends ^* /

The number of alternative Sequence constructs (execution paths) resulting from this concatenation would be the result of the following multiplication: the number of alternative execution paths of the first Switch construct instance x the number of alternative execution paths of the second Switch construct instance = 3 >< 3 = 9

The actual Sequence constructs will be: (WSOP70, WSOP70) (WSOP70, WSOP80) (WSOP70, WSOP90)

(WSOP80, WSOP70)

(WSOP80, WSOP80) (WSOP80, WSOP90)

(WSOP90, WSOP70) (WSOP90, WSOP80) (WSOP90, WSOP90)

Reliability Index of WS-Operations within Sequence Constructs: Disjoint WS-instance Sets

The previous sections described how the different types of control constructs used in a software system can be converted into individual Sequence constructs and then merged to create one or more end-to-end Sequence constructs each representing a distinct end-to-end execution path within the software system. We also noted earlier that the RBAM can directly compute the reliability index of the Sequence constructs. This section describes how the reliability index of individual WS-operations used within a Sequence construct can be computed.

We start with the simpler case where the WS-instances that can execute a WS-operation cannot execute any other WS-operations used by Sequence construct. That is, the set of WS-instances that can execute a WS-operation is disjoint from the sets that can execute other WS-operations. An example of such a scenario is now given. Let us consider a sample Sequence construct consisting of a sequence of three WS-operations whose OIDs are WS0P1 , WSOP10, and WSOP20, that is:

Sequence construct : (WS0P1 , WSOP10, WSOP20)

Each WS-operation in this construct can be executed by multiple WS-instances as shown in scenario 1 below (WSI = WS-instance):

Scenario 1 :

WS0P1 : WSM, WSI2, WSI3, WSI4 WSOP10: WSM 1 , WSI12 WSOP20: WSI21 , WSI22, WSI23

In this scenario, the probability that a WS-operation such as WSOP20 is successfully executed by at least one of its WS-instances, namely, WSI21 , WSI22 or WSI23 is represented by Rlwsop2o and computed as:

R1 _WSOP10 = 1 - [probability that WSOP20 cannot be excuted successfully)

RI wsopio ⁼ 1 ~ {probability that all WS ins tan ces of WSOP20 fail to execute it)

RI _WSOP20 = \ - ( (probability that WSI2 \ fails to execute WSOP2θ] x (probability that WSI 22 fails to execute WSOP2θ) x [probability that WSl 23 fails to execute WSOP2θ) )

If the probability of a WS-instance, denoted by HD, to successfully execute a WS- operation, denoted by OID, is represented by P(HD. OID) then the above equation can be represented as:

To generalise the above equation, we use the following notation:

• The id of the i ^th WS-operation used by the Sequence construct is denoted - by WSOP[i]

• List_Of_W SI[WSOP[i]] represents the list of ids of WS-instances capable to execute the WS-operation WSOP[i], • The number of ids in List_Of_WSI[WSOP[i]] is obtained by sizeOf(List_Of_WSI[WSOP[i]])

• getWSI(r, List_Of_WSI[WSOP[i]]) returns the r ^th WS-instance id from the List_Of_WSI[WSOP[i]] list.

The probability that the i ^th WS-operation, with id WSOP[i], is successfully executed by at least one of its WS-instances is computed by:

Equation 1. Reliability Index of i WS-operation of a Sequence construct: Each WS-instance can execute only a single WS-operation used by the Sequence construct.

RIi indicates the probability that the i ^th WS-operation can be successfully executed by at least one of its WS-instances. In practice, to execute a WS- operation, one of its WS-instances is randomly selected and then is requested to execute the WS-operation. In case that WS-instance fails, another WS-instance will be randomly selected and invoked. This is repeated until either the WS-operation is successfully executed or no WS-instance is left.

Note that if a new WS-instance, with the same attributes as the existing ones, is added to the list of WS-instances that can execute a WS-operation, the probability that the WS-operation is successfully executed (i.e. its reliability index) will increase. This is because

[l - ø ) would have an extra ^{l {} "° ^{υlD) l} term on the right hand side, which will result in a larger Rl, value. This can be shown in the case of the scenario 1 above, if, for instance, WSOP20 is given a new WS-instance, namely, WSI24, i.e. WSOP20: WSI21, WSI22, WSI23, WSI24

Then the reliability index of WSOP20, denoted by RIW _S O _P20> would increase to:

This shows the obvious case that as the number of WS-instances capable to execute a WS-operation grows, the reliability index of the WS-operation (i.e. the probability that the WS-operation is successfully executed by at least one of its WS-instances) increases as well. This, in turn, increases the probability that the Sequence construct (or software system) utilising that WS-operation can be successfully executed.

Reliability Index of WS-Operations within Sequence Constructs: Overlapped WS-instance Sets

In this section we show how the reliability index of a WS-operation is computed if the disjointness constraint imposed in the previous section is removed. This means that it is possible for a single WS-instance to execute multiple WS-operations used by the Sequence construct. This is the case, for example, when the software system includes multiple WS-operations defined by a single Web Service, in which case, each WS-instance implementing that Web Service can execute all those WS-operations. We shall show how Equation 1 above should be transformed to take account of this more general case. To figure out the details of the transformation, a comparison is made of the impact of WS-instances executing only one vs. multiple WS-operations, on the reliability of the Sequence construct as a whole.

Let us assume that a Sequence construct comprises n WS-operations and each WS-operation can be executed by one or more WS-instances. New WS-instances are then introduced under two separate regimes as follows: i. Only one WS-instance capable of executing all WS-operations will be introduced.

ii. A separate WS-instance will be introduced for each WS-operation (i.e. in total of n new WS-instances will be introduced).

It is also assumed that the probability of successful execution of any of the WS-operations by the new WS-instances is the same in both of the above cases. It is intuitively obvious that in terms of Sequence construct's reliability as a whole case (i) improves the reliability less than case (ii). This is because in case (i) the failure of the newly introduced WS-instance will remove one WS-instance from each WS-operation, whereas in case (ii) the failure of one of the newly introduced WS-instances only affects one WS-operation. To have the same situation as in case (i) all the newly introduced WS-instances in case (ii) should fail. However, simultaneous failure of n WS-instances is normally (much) less likely than failure of a single WS-instance. Clearly Equation 1 should be transformed in such a way to take into account the impact of the exact number of WS-operations that can be executed by each WS-instance. This is because in addition to the cases (i) and (ii) above, there are other cases where a WS-instance can execute more than one, but not all of the WS-operations used by the Sequence construct. To exemplify these various cases, let us consider the introduction of new WS-instances in the context of the sample Sequence construct presented previously as Scenario 1 , and repeated below:

Scenario 1 : There is a Sequence construct consisting of three WS-operations: WS0P1 , WS0P10 and WSOP20, and there are multiple WS-instances that can execute each of those WS-operations. Also, the set of WS-instances associated with each WS-operation is disjoint from others:

WS0P1 : WSM , WSI2, WSI3, WSI4 WSOP10: WSM 1 , WSI12

WSOP20: WSI21 , WSI22, WSI23

We derive three new scenarios from the above scenario by introducing new WS-instances for each WS-operation.

Scenario 2: The WS-instance, WS1100, is added for all three WS-operations:

WS0P1 : WSH , WSI2, WSI3, WSI4, WSϊI00 WSOP10: WSM 1 , WSh 2, WSMOO WSOP20: WSI21 , WSI22, WSI23, WSMOO

In this scenario, there are a total of 60 (=5x3x4) distinct permutations of WS-instances of the form (WSI i, WSI j, WSI k) that can execute the three WS-operations. The failure of WSMOO will eliminate the following 45 permutations (of which WS1100 is a member):

Scenario 3: The WS-instance, WS1100, is added for WSOP1 and WSOP10, and another one, WSI300, is added for WSOP20:

WSOP1 : WSM, WSI2, WSI3, WSI4, WSI100

WSOP10: WSH 1 , WSI12, WSI100 WSOP20: WSI21 , WSI22, WSI23, WSI300 In this case the failure of WSI100 will only eliminate the 28 permutations shown below:

Scenario 4: In this case, three distinct service-providers, WSHOO, WSI200 and WSI300 are added for the three WS-operations: WSOP1 : WSH, WSI2, WSI3, WSI4, WSI100

WSOP10: WSH 1 , WSI12, WSI200 WSOP20: WSI21, WSI22, WSI23, WS1300

In this case the failure of WSHOO will only eliminate the 12 permutations shown below:

Reliability Index of WS-Operations within Sequence Constructs: Scale-Down Factor

From the scenarios presented in the previous section, it can be deduced that as the number of WS-operations (of a Sequence construct) that a single WS-instance can execute increases, the number of WS-instance permutations that would be eliminated by that WS-instance's failure also increases, and

consequently, the less contribution that WS-instance can make towards the reliability of the Sequence construct. In other words, the contribution of a WS-instance is inversely proportional to the number of WS-operations that it can execute. We propose to represent this inverse relationship via the scale-down factor as defined below.

Assumptions:

• A Sequence construct consists of n WS-operations, which can be executed by s WS-instances. • The r ^th WS-instance, WSI[r], can execute a subset of size, say m, where

1 <= m <= n, of WS-operations of the Sequence construct. The number m is obtained via the function getOpCount(WSI[r]).

• For each of the m WS-operations, the r ^th WS-instance, WSI[r], has a separate probability to successfully execute it. This probability value is supplied to the RBAM by external entities.

• The probability of successful execution of each of those m WS-operations by WSI[r] is independent of each other. Action to take: • The probability (supplied by external entities) that a WS-operation (from the subset of m WS-operations) can be successfully executed by the WSI[r] should be adjusted (multiplied) by a scale-down factor before that probability vaiue is used in computing the reliability index of that WS-operation. The 'scale-down factor' is defined as follows: c , _n ., , (n + 1)- getOpCountjWSI[r])

Scale Down Factor - - — n n is the number of WS-operations used by the Sequence construct. getOpCount(WSI[r]): is the number of WS-operations (used by the Sequence construct) that can be executed by the r ^th WS-instance, WSI[r]

Equation 2. 'Scale-Down Factor' for adjusting the probability of a WS-instance, WSI[r], to successfully execute a WS-operation used by a Sequence construct.

Reliability Index of WS-Operations within Sequence Constructs: Applicable to both Disjoint & Overlapped WS-instance Sets

The following formula, which is a generalisation of Equation 1 defined earlier, incorporates the scale-down factor and defines how the reliability index of a WS-operation can be computed given the possibility that each WS-instance capable of executing that WS-operation may be capable of executing zero or more other WS-operations (used by the Sequence construct) as well:

Equation 3. Reliability Index of i WS-operation of a Sequence construct when: (i) the Sequence construct consists of a sequence of n WS-operations , and Qi) each WS-instance may execute one or more WS-operations of the Sequence construct

Note that Equation 3 subsumes Equation 1.

To test Equation 3, we apply it to the scenarios 2, 3 and 4 defined earlier.

Scenario 2:

WSOP1: WSM, WSI2, WSI3, WSI4, WSI100

WSOP10: WSM 1, WS112, WSIIOO

WSOP20: WSI21 , WSI22, WSI23, WSH 00

Scenario 3:

WSOP1: WSM ₁ WSI2, WSI3, WSI4, WSM 00 WSOP10: WSM 1, WSI12, WSMOO WSOP20: WSI21, WSI22, WSI23.WSI300

Scenario 3:

WSOP1: WSM, WSI2, WSI3, WSI4, WSI100 WSOP10: WSMI, WSI12, WSI100 WSOP20: WSI21, WSI22, WSI23. WSI300

Scenario 4:

WSOP1 : WSH, WSI2, WSI3, WSI4, WSI100 WSOP10: WSH 1 , WS112, WSI200 WSOP20: WSI21 , WSI22, WSI23, WSI300

Assuming the following is true:

• WS1100 and WSI200 have equal probability to successfully execute

WSOP1 0,i.e.

• WS1100 and WSI300 have equal probability to successfully execute

WoUrdU, '-®- Piwsnoo, WSOP20) ⁼ Pwsπoo. WSπPIO)

then it can be seen that the individual terms ^λ/wso/>1 , ^RIMOP\O _an d ^{RI W} SOPJ ^O have increasingly larger values in the three scenarios, that is:

Reliability Index of a Sequence Construct

The previous sections described how the reliability index of the individual WS-operations used within a Sequence construct can be computed. This section described how these individual probabilities can be combined to compute the reliability index of the enclosing Sequence construct. The reliability index of a Sequence construct consisting of n WS-operation is computed via the following equation:

As a usage example, we apply Equation 4 to the scenarios 2, 3 and 4 defined in the previous section. Recall that these three scenarios deal with the same Sequence construct, namely, (WSOP1 , WSOP10, WSOP20); they only differ with respect to the WS-instances capable of executing the constituent WS-operations WS0P1 , WSOP10, and WSOP20. The result of applying Equation 4 is:

(Scenario 2 ^λ/» w ) < (Scenario 3 ^RI » ^SOPI ) < (Scenario 4 ^RI wson ) (Scenario 2 ^RI ^OP\O ) < (Scenario 3 ^RI ^OPIO ) < (Scenario 4 ^RI wso ^p _i0 )

Rl Rl RI

(Scenario 2 ^λ 'H™/ ^> 2O ) < (Scenario 3 ^λ1 »«o«o) ₌ (Scenario 4 ^λ1 »wo«o) It can be concluded that:

Reliability Index of Sequence construct defined by Scenario 2 < Reliability Index of Sequence construct defined by Scenario 3 < Reliability Index of Sequence construct defined by Scenario 4

Reliability Index of Software Systems

As described earlier, in order to compute the reliability index of a software system, first each of its control constructs are converted to a Sequence control construct. The resulting Sequence constructs are then merger together generating one of the following two alternatives: i. a single, end-to-end Sequence control construct representing the software system. This is the case when the software system does not include any Switch control constructs. The reliability index of software systems is computed using equation 4. ii. multiple, end-to-end Sequence control constructs, each representing one alternative, end-to-end execution path of the software system. This occurs when the software system contains at least one Switch construct. Note that it is possible for pair of alternative execution paths to have one or more WS-operations in common. The following section describes how the reliability index of software system containing Switch constructs is computed.

Reliability Index of Software Systems with Switch Constructs

As stated above, a software system containing one or more Switch control constructs consists of two or more alternative, end-to-end Sequence control constructs. The execution of the software system follows one of those Sequence control constructs (execution paths). The reliability index of the whole software system is computed according to the following steps: i. Identify the alternative, end-to-end Sequence control constructs (execution paths) contained within the software system, ii. The reliability index of each alternative Sequence construct (execution path) is computed using Equation 4. iii. The resulting reliability index for each Sequence construct is multiplied (adjusted) by the probability of that Sequence construct to be selected for execution during the execution of the software system, iv. Add the result of the above multiplications together.

These steps are summarised in the following equations:

Reliability Index of an end-to-end Sequence construct within an application containing Switch constructs =

(Probability of that end-to-end Sequence construct be selected for execution) X (Reliability Index of that end-to-end Sequence construct) =

(Probability of that end-to-end Sequence construct be selected for execution) x

•th where RI ₁ indicates the reliability index of the i WS-operation used by the end-to-end Sequence construct. RI ₁ is computed according to Equation 3.

The probability of an end-to-end Sequence construct to be selected for execution = Product of the probabilities of selecting the branch associated with the end-to-end Sequence construct at each Switch point of the software system.

Equation 5. Reliability Index of an end-to-end Sequence construct within an application containing Switch constructs

Equation 6. Reliability Index of a software system containing Switch constructs

As an example of using Equation 5 and Equation 6, the reliability index of the software system in Figure 11 is now computed. First, we need to compute the reliability indexes of the four alternative, end-to-end Sequence constructs of the application:

End-to-end S , WS0P41) reliability inde So the reliability index of the software system as a whole would be computed according to Equation 6 as follows:

WS-lnstance Inter-dependence

During the execution of an end-to-end Sequence construct (within a software system), the output obtained from one WS-operation may be used (possibly after some modification) as the input to one or more other WS-operations. The source WS-operation can be regarded as the provider of data and the others as its consumers. In some situations, it may be possible to use different WS-instances for the provider and consumer WS-operations. An example is given below: currentStockPrice = øefStoc/cP/7ce(stockNarne) buyStock(stockName, quantityToBuy, currentStockPrice),

The WS-operation getStockPrice() is the provider of the currentStockPrice information, and the WS-operation buyStock() is its consumer. Here, it would be possible to use different WS-instances to execute these two operations, assuming that all WS-instances offering the getStockPrice() operation return the same stock price for a given stock name. However, in the case of some provider WS-operations and their associated consumer WS-operations, it would not be possible to use different WS-instances. As an example, consider a Web Service comprising three WS-operations as follows: fileHandle = openF/7e(fileName) readBuffer = reacF//e(fileHandle) c/oseF/7e(fileHandle)

Here the openFileø operation returns a 'file handle' that is used by the readFileO and closeFileO operations. Clearly, the file handle obtained from one

WS-instance cannot be used to invoke readFileO on another WS-instance, and

closeFileO on yet another one. To achieve correct result, the same WS-instance must be used for all three WS-operations.

We call this kind of inter-dependence between the WS-operations WS-instance-inter-dependence. The designer or provider of a WS-operation should state whether that WS-operation has WS-instance-inter-dependence on any others, and if so, identify those WS-operations.

Reliability Index of Software Systems with WS-lnstance Inter-dependence The existence of WS-instance-inter-dependence between a set of WS-operations requires that the same WS-instance be used to execute all of them. So even if multiple WS-instances may be available for each of the inter-dependent WS-operations, once a WS-instance has been selected for one of them, the possibility of selecting other WS-instances for the remaining inter-dependent WS-operations is eliminated. This means that the remaining WS-operations would have only one WS-instance to execute them. We show this through an example below.

Let us consider scenario 5 (a revised version of scenario 1 presented earlier)

Scenario 5:

WSOP1 : WS1100, WSI200, WSI300

WSOP10: WS1100, WSI200, WSI300

WSOP20: WS1100, WSI200, WSI300

Here the software system consists of a Sequence construct comprising three

WS-operations:

WSOP1 , WSOP10, WSOP20

We assume these three WS-operations are WS-instance-inter-dependent on each other, i.e. all should be executed by one WS-instance. This means that there would be only three valid permutations of WS-instances for executing the

Sequence:

(WSh 00, WSH 00, WSM 00), (WSI200, WSI200, WSI200) and (WSI300, WSI300,

WSI300)

Note that in each case the WS-instance selected for executing one of the WS-operations has to execute the other two as well. Using Equation 3 the reliability index of each of the above WS-operations can be computed as follows.

WSOPl) I

The reliability index of the software system (WSOP1 , WSOP10 and WSOP20) as a whole will be computed using Equation 4 as follows:

References

[Active Endpoints]: A runtime environment capable of executing systems (process definitions) encoded in the Business Process Execution Language (BPEL) standard; http.V/www.active-endpoints.com/active-bpel-enqine- overview.htm.

[Andrews, 2003]: Andrews, Tony & et al. (2003): "Business Process Execution Language for Web Services Version 1.1 (BPEL4WS, v1.1)"; 5 May 2003; ftp://www6.software.ibm.com/software/developer/library/ws-bp el.pdf

[Chen, Huang, 1992]: Chen, Deng-Jyi & Huang, Tien-Hsiang (1992): "Reliability Analysis of Distributed Systems Based on Fast Reliability Algorithm", IEEE Transactions on Parallel and Distributed Systems, Vol. 3, No. 2, March 1992.

[Cheung, 1980]: Cheung, R. C. (1980): A User-Oriented Software Reliability Model. In IEEE Transactions on Software Engineering, volume 6(2), pages 118— 125. IEEE, Mar. 1980.

[Genama, Rosenblum, Uchitel, 2005]: Genama, Rodrigues & Rosenblum, David & Uchitel. Sebastian (2005): "Using Scenarios to Predict the Reliability of Concurrent Component-Based Software Systems", M. Cerioli (Ed.): FASE 2005, LNCS 3442, pp. 111-126, 2005, Springer-Verlag Berlin Heidelberg 2005.

[Kusiak, Zakarian, 1996, IEEE]: Kusiak, Andrew & Zakarian, Armen (2006): "Reliability Evaluation of Process Models", IEEE Transactions on Components, Packaging and manufacturing Technology - Part A, Vol. 19, No. 2, June 1996. Fikes, R.E. and Nilsson, NJ. (1971): STRIPS : A New Approach to the Application of Theorem Proving to Problem Solving, Technical Note 43r. Al Center, SRI International, 333 Ravenswood Ave, Menlo Park, CA 94025, May 1971. SRI Project 8259; http://www.ai.sri.com/pubs/files/tnO43r-fikes71.pdf Blum, A. and Furst, M. (1997). Fast planning through planning graph analysis; Artificial intelligence. 90:281-300; http://www.cs.cmu.edu/~ayrim/Papers/graphplan.pdf

Glossary

ARHS Automatic Recovery Handling System

HD Instance ID (or WS-instance ID)

OID Operation ID (WS-operation ID) PID Provider ID (or Service-Provide ID)

QoS Quality of Service

RBAM Reliability-Based Analysis Module

Rl Reliability Index

RSGM Recovery-Solution Generation Module

SOA Service Oriented Architecture

UDDI Universal Description, Discovery, and Integration

WS-composition Web Service composition

WS-instance Web Service instance

WS-operation Web Service operation

Previous Patent: DIAGNOSTIC MARKERS OF WOUND INFECTION

Next Patent: A METHOD OF OPERATING A TELECOMMUNICATIONS NETWORK