Title:
AN APPARATUS FOR ESTABLISHING METADATA
Document Type and Number:
WIPO Patent Application WO/2009/003762
Kind Code:
A1
Abstract:
An apparatus for establishing provenance metadata for use with one or more data processing systems, each operable to process data, the apparatus comprising: a metadata generator, responsive to a first data processing system processing the data, for generating first provenance metadata associated with the processing; a constructor for constructing a first data structure comprising the data and the first provenance metadata; and a transmitter for transmitting the first data structure to a second data processing system operable to store the first provenance metadata in a storage component.

Inventors:
BILLER ALEXIS STANISLAS (GB)
IBBOTSON JOHN BRYAN (GB)
NIDD MICHAEL ELTON (CH)
STANFORD-CLARK ANDREW JAMES (GB)
Application Number:
PCT/EP2008/056258
Publication Date:
January 08, 2009
Filing Date:
May 21, 2008
Assignee:
IBM (US)
BILLER ALEXIS STANISLAS (GB)
IBBOTSON JOHN BRYAN (GB)
NIDD MICHAEL ELTON (CH)
STANFORD-CLARK ANDREW JAMES (GB)
International Classes:
G06F19/00
Other References:
The technical aspects identified in the present application (Art. 15 PCT) are considered part of common general knowledge. Due to their notoriety no documentary evidence is found to be required. For further details see the accompanying Opinion and the reference below.
Attorney, Agent or Firm:
SEKAR, Anita (Intellectual Property Law, Hursley Park, Winchester Hampshire SO21 2JN, GB)
Claims:
CLAIMS

1. An apparatus for establishing provenance metadata for use with one or more data processing systems, each operable to process data, the apparatus comprising: a metadata generator, responsive to a first data processing system processing the data, for generating first provenance metadata associated with the processing; a constructor for constructing a first data structure comprising the data and the first provenance metadata; and a transmitter for transmitting the first data structure to a second data processing system operable to store the first provenance metadata in a storage component.

2. An apparatus as claimed in claim 1, further comprising an extractor for extracting the data and the first provenance metadata from the first data structure.

3. An apparatus as claimed in claim 2, wherein the transmitter is operable to transmit the extracted data to a third data processing system.

4. An apparatus as claimed in claim 1, wherein: the transmitter is operable to transmit the first data structure to a fourth data processing system; in response to the fourth data processing system processing the data, the metadata generator is operable to generate second provenance metadata associated with the processing; and the constructor is operable to construct a second data structure comprising the data, the first provenance metadata and the second provenance metadata.

5. An apparatus as claimed in claim 1, further comprising means for determining whether to store the first provenance metadata.

6. An apparatus as claimed in claim 1, wherein the first provenance metadata comprises at least one of: a fourth identifier associated with the first data processing system; a fifth identifier associated with processing of the data or a timestamp.

7. An apparatus as claimed in claim 1, wherein the constructor is operable to append further provenance metadata to the first data structure.

8. An apparatus as claimed in claim 1, wherein the first data processing system and the fourth data processing system comprise intermittent network connections.

9. A method for establishing provenance metadata for use with one or more data processing systems, each operable to process data, the method comprising the steps of: generating, in response to a first data processing system processing the data, first provenance metadata associated with the processing; constructing a first data structure comprising the data and the first provenance metadata; and transmitting the first data structure to a second data processing system operable to store the first provenance metadata in a storage component.

10. A method as claimed in claim 9, further comprising the step of: extracting the data and the first provenance metadata from the first data structure.

11. A method as claimed in claim 10, further comprising the step of: transmitting the extracted data to a third data processing system.

12. A method as claimed in claim 9, further comprising the steps of: transmitting the first data structure to a fourth data processing system; generating, in response to the fourth data processing system processing the data, second provenance metadata associated with the processing; and constructing a second data structure comprising the data, the first provenance metadata and the second provenance metadata.

13. A method as claimed in claim 9, further comprising the step of: determining whether to store the first provenance metadata.

14. A method as claimed in claim 9, wherein the first provenance metadata comprises at least one of: a fourth identifier associated with the first data processing system; a fifth identifier associated with processing of the data or a timestamp.

15. A method as claimed in claim 9, further comprising the step of: appending further provenance metadata to the first data structure.

16. A method as claimed in claim 9, wherein the first data processing system and the fourth data processing system comprise intermittent network connections.

17. A computer program comprising program code means adapted to perform all the steps of any of claims 9 to 16 when said program is run on a computer.

Description:

AN APPARATUS FOR ESTABLISHING METADATA

FIELD OF THE INVENTION

The present invention relates to an apparatus for establishing metadata.

BACKGROUND OF THE INVENTION

"Provenance" can be defined as "the place of origin or history especially of a work of art etc." (The Concise Oxford Dictionary, Eighth edition, Clarendon press, 1991).

Today, the term provenance can also be associated with computing environments. For example, in some computing systems, provenance metadata can be associated with data and in this case, it is important to establish a "trace" of provenance metadata right back to its originating source. The establishment of provenance metadata associated with data improves "confidence" associated with the data.

Provenance metadata can be used for a number of purposes, for example: showing that a process that was used to create the data was compliant with a set of rules or confirming whether different executions of data analysis were associated with the same semantic data.

A system (100) for supporting collection and analysis of provenance metadata is depicted in figure 1. The system (100) comprises a number of data processing systems (105, 110, 115) for processing data and a storage component (120) for storing provenance metadata. Preferably, each data processing system is operable to generate provenance metadata in response to one or more functions associated with the data.

In an example, a first data processing system (105) generates data and in response, generates provenance metadata which the first data processing system (105) stores in the storage component (120). The first data processing system (105) passes the data to a second data processing system (110). The second data processing system (110) transforms the data and in response, generates provenance metadata which the second data processing system (110) stores in the storage component (120). The second data processing system (110) passes the data to a third data processing system (115). The third data processing system (115) also transforms the data and in response, generates provenance metadata which the third data processing system (115) stores in the storage component (120).

It should be understood that each of the data processing systems is also operable to generate and store provenance metadata in response to receipt of the data.

Typically, the provenance metadata is persisted in the storage component (120) and can be subsequently queried.
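
The arrangement of figure 1 can be sketched in Python as follows. This is only an illustrative sketch: the class names Record and StorageComponent, the query method and the choice of fields are assumptions made for the example and do not appear in the source. Each data processing system writes its own record directly to the shared storage component, which can subsequently be queried.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Record:
    system: str    # reference numeral of the system, e.g. "105"
    function: str  # what the system did, e.g. "generate"


class StorageComponent:
    """In-memory stand-in for the storage component (120)."""

    def __init__(self) -> None:
        self._records: list[Record] = []

    def store(self, record: Record) -> None:
        self._records.append(record)

    def query(self, system: Optional[str] = None) -> list[Record]:
        """Return all records, or only those written by `system`."""
        return [r for r in self._records if system is None or r.system == system]


storage = StorageComponent()
# Each system writes its own record directly, in processing order.
storage.store(Record("105", "generate"))
storage.store(Record("110", "transform"))
storage.store(Record("115", "transform"))

trace = [f"{r.system}:{r.function}" for r in storage.query()]
print(trace)
```

Note that every call to store presupposes a working connection from the calling system to the storage component.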

In a system (100) such as the one shown in figure 1, the data processing systems (105, 110, 115) are, for example, computing devices associated with a managed infrastructure, capable of providing rich functionality and requiring a number of resources to be able to reliably store provenance metadata. Such resources include, for example, reliable network connections to the storage component (120) and reliable and sufficient CPU, memory and battery power, such that e.g. a connection can be made to the storage component (120).

Other more "ad-hoc" systems comprise for example, portable devices with limited resources having a pervasive nature (e.g. unreliable network connections). Such a system can be dynamic and heterogeneous in its configuration, having a mixture of device types e.g. routers; mobile phones; wireless connections and satellite up/down links. Reliable establishment of provenance metadata in such a system is difficult to achieve.

There is a need for an improved solution for supporting the establishment of provenance metadata for use with such an ad-hoc system.

DISCLOSURE OF THE INVENTION

According to a first aspect, the present invention provides an apparatus for establishing provenance metadata for use with one or more data processing systems, each operable to process data, the apparatus comprising: a metadata generator, responsive to a first data processing system processing the data, for generating first provenance metadata associated with the processing; a constructor for constructing a first data structure comprising the data and the first provenance metadata; and a transmitter for transmitting the first data structure to a second data processing system operable to store the first provenance metadata in a storage component.

According to a second aspect, the present invention provides a method for establishing provenance metadata for use with one or more data processing systems, each operable to process data, the method comprising the steps of: generating, in response to a first data processing system processing the data, first provenance metadata associated with the processing; constructing a first data structure comprising the data and the first provenance metadata; and transmitting the first data structure to a second data processing system operable to store the first provenance metadata in a storage component.

According to a third aspect, the present invention provides a computer program comprising program code means adapted to perform all the steps of the method described above when said program is run on a computer.

It should be understood that in an environment wherein a data processing system operable to process data has, for example, intermittent network connections and limited computing resources, reliable establishment of provenance metadata is difficult to achieve.

Advantageously, the present invention allows for reliable establishment of provenance metadata in such an environment.

Advantageously, the present invention allows for provenance metadata associated with a path over which data is transmitted to be established.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention will now be described, by way of example only, with reference to preferred embodiments thereof, as illustrated in the following drawings:

Figure 1 is a block diagram of a prior art system for establishing provenance metadata;

Figure 2 is a block diagram of a system for establishing provenance metadata according to the preferred embodiment;

Figure 3A is a block diagram of an apparatus associated with a first data processing system;

Figure 3B is a block diagram of an apparatus associated with a second data processing system;

Figure 4 is a block diagram of an apparatus associated with a proxy data processing system; and

Figure 5 is a flow chart showing the operational steps involved in a process for establishing provenance metadata.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

A system (200) for supporting establishment of provenance metadata according to a preferred embodiment is shown in figure 2. The system (200) comprises a first data processing system (205) and a second data processing system (210). In a first example, each of the first data processing system (205) and the second data processing system (210) has limited resources (e.g. an unreliable network connection). In the first example, the first data processing system (205) comprises a blood pressure monitor and the second data processing system (210) comprises a mobile telephone.

The first data processing system (205) is depicted in more detail in figure 3A and comprises a number of components, namely, a first data processor (300); a first provenance metadata generator (305); a first constructor (310); a first transmitter (315) and a first storage component (320) for storing format data.

The second data processing system (210) is depicted in more detail in figure 3B and comprises a number of components, namely, a first receiver (325); a first extractor (330); a second data processor (335); a second provenance metadata generator (340); a second constructor (345); a second transmitter (350) and a second storage component (355) for storing format data.

It should be understood that the first data processing system (205) can comprise a receiver and an extractor.

The system (200) also comprises a first proxy data processing system (215). In the first example, the first proxy data processing system (215) has sufficient resources and a sufficiently reliable network connection in order to reliably store provenance metadata.

The first proxy data processing system (215) is depicted in more detail in figure 4 and comprises a number of components, namely, a second receiver (400); a second extractor (405); an instructor (410) and a third transmitter (415).

The system (200) also comprises a plurality of conventional data processing systems, namely, a third data processing system (225) and a fourth data processing system (230) (e.g. as described with reference to figure 1). Each of the third and fourth data processing systems (225 and 230 respectively) is operable to generate provenance metadata as is known in the art.

The system (200) also comprises a fourth storage component (220) for storing provenance metadata, wherein the fourth storage component (220) is accessible by the first proxy data processing system (215) and each of the third and fourth data processing systems (225 and 230 respectively).

A process according to the preferred embodiment will now be described with reference to figure 5.

In the first example, the first data processor (300) monitors blood pressure of a patient and in response generates (step 500) blood pressure data. An example of the blood pressure data is shown below:

80 mmHg

In the first example, preferably, each of the provenance metadata generators (305 and 340) is operable to generate provenance metadata in response to processing data.

In response to the generation of the blood pressure data, the first provenance metadata generator (305) generates (step 505) first provenance metadata in accordance with first format data stored in the first storage component (320).

An example of the first format data is shown below, wherein, for a generate function, the format comprises an identifier associated with the data processing system that generates the data (e.g. the first data processing system (205)); a timestamp associated with a time of generation of the data and a function identifier associated with generation of the data:

If function = generate, then
    identifier = client id
    timestamp = generate time
    function identifier = "generate"

An example of the first provenance metadata is shown below:

First thin client; 14:00; "generate"
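
The generation of provenance metadata in accordance with format data can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the names FORMAT_DATA, ProvenanceRecord and generate_provenance are hypothetical, and the format data is modelled as a simple mapping from a function name to the fields that a record for that function carries.

```python
from dataclasses import dataclass

# Hypothetical encoding of the "format data": for each function, the
# fields that a provenance record must contain.
FORMAT_DATA = {
    "generate": ("identifier", "timestamp", "function_identifier"),
    "transform": ("identifier", "timestamp", "function_identifier"),
}


@dataclass(frozen=True)
class ProvenanceRecord:
    identifier: str           # system that performed the function
    timestamp: str            # time at which the function was performed
    function_identifier: str  # e.g. "generate" or "transform"

    def render(self) -> str:
        """Render in the 'id; time; "function"' style used in the text."""
        return f'{self.identifier}; {self.timestamp}; "{self.function_identifier}"'


def generate_provenance(function: str, client_id: str, time: str) -> ProvenanceRecord:
    """Build a provenance record for `function` according to FORMAT_DATA."""
    if function not in FORMAT_DATA:
        raise ValueError(f"no format data for function {function!r}")
    values = {
        "identifier": client_id,
        "timestamp": time,
        "function_identifier": function,
    }
    # Populate exactly the fields that the format data lists for this function.
    return ProvenanceRecord(**{name: values[name] for name in FORMAT_DATA[function]})


record = generate_provenance("generate", "First thin client", "14:00")
print(record.render())  # First thin client; 14:00; "generate"
```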

At step 510, the first constructor (310) constructs a first data structure comprising the blood pressure data having an associated first identifier and the first provenance metadata having an associated second identifier. At step 515, the first transmitter (315) transmits (e.g. using a wireless connection) the first data structure to the first receiver (325) of the second data processing system (210).

In response to receipt of the first data structure, the first extractor (330) extracts the blood pressure data using the first identifier and extracts the first provenance metadata using the second identifier.
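
The construction, transmission and extraction steps above can be sketched in Python as follows. The sketch rests on assumptions that are not in the source: the data structure is serialised as JSON, the identifier strings DATA_ID and METADATA_ID are invented for illustration, and transmission is represented simply by passing the serialised bytes from one function to another.

```python
import json

# Hypothetical identifier values; the source says only that the data and
# the metadata each have an associated identifier within the structure.
DATA_ID = "first-identifier"
METADATA_ID = "second-identifier"


def construct(data: str, metadata: str) -> bytes:
    """Constructor (step 510): bundle data and metadata, each under its own identifier."""
    return json.dumps({DATA_ID: data, METADATA_ID: metadata}).encode("utf-8")


def extract(payload: bytes, identifier: str) -> str:
    """Extractor: recover one element of the structure by its identifier."""
    return json.loads(payload)[identifier]


payload = construct("80 mmHg", 'First thin client; 14:00; "generate"')

# On the receiving system (210), after transmission (step 515):
data = extract(payload, DATA_ID)
metadata = extract(payload, METADATA_ID)
print(data)      # 80 mmHg
print(metadata)  # First thin client; 14:00; "generate"
```

Because the metadata travels inside the same structure as the data, the sending system needs no connection of its own to a storage component.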

In response to the extraction, the second data processor (335) transforms (step 520) the blood pressure data (e.g. into another format).

An example of the transformed blood pressure data is shown below, wherein the blood pressure data is converted to a particular data structure:

Blood_pressure.data_structure

In response to the transformation of the blood pressure data, the second provenance metadata generator (340) generates (step 525) second provenance metadata in accordance with second format data stored in the second storage component (355).

An example of the second format data is shown below, wherein, for a transform function, the format comprises an identifier associated with the data processing system that transforms the data (e.g. the second data processing system (210)); a timestamp associated with a time of transformation of the data and a function identifier associated with transformation of the data:

If function = transform, then
    identifier = client id
    timestamp = transform time
    function identifier = "transform"

An example of the second provenance metadata is shown below:

Second thin client; 14:05; "transform"

At step 530, the second constructor (345) constructs a second data structure comprising the transformed blood pressure data having an associated third identifier; the first provenance metadata having the associated second identifier and the second provenance metadata having an associated fourth identifier.
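
The construction of the second data structure can be sketched in Python as follows. The class DataStructure, the function construct_next and the identifier strings are hypothetical names chosen for this sketch. The point illustrated is that the first provenance metadata is carried forward unchanged, under its original identifier, alongside the newly generated second provenance metadata.

```python
from dataclasses import dataclass, field


@dataclass
class DataStructure:
    data_id: str
    data: str
    # Maps a metadata identifier to its provenance record, in insertion order.
    provenance: dict[str, str] = field(default_factory=dict)


def construct_next(previous: DataStructure, new_data_id: str, new_data: str,
                   metadata_id: str, metadata: str) -> DataStructure:
    """Build a new structure: new data, all earlier metadata, plus one new entry."""
    provenance = dict(previous.provenance)  # copy, so `previous` is not modified
    provenance[metadata_id] = metadata
    return DataStructure(new_data_id, new_data, provenance)


first = DataStructure(
    "first-identifier", "80 mmHg",
    {"second-identifier": 'First thin client; 14:00; "generate"'},
)
second = construct_next(
    first,
    "third-identifier", "Blood_pressure.data_structure",
    "fourth-identifier", 'Second thin client; 14:05; "transform"',
)
print(list(second.provenance))  # ['second-identifier', 'fourth-identifier']
```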

At step 535, the second transmitter (350) transmits (e.g. using a wireless connection) the second data structure to the second receiver (400) of the first proxy data processing system (215).

In response to receiving the second data structure, the second extractor (405) extracts (step 540) the transformed blood pressure data using the third identifier, extracts the first provenance metadata using the second identifier and extracts the second provenance metadata using the fourth identifier.

In the first example, in response to the extraction, preferably the third transmitter (415) transmits (step 545) the transformed blood pressure data to the third data processing system (225). In response to receiving the transformed blood pressure data, the third data processing system (225) preferably processes the transformed blood pressure data (e.g. after which the third data processing system (225) can transmit the data to the fourth data processing system (230)) and generates and stores associated provenance metadata in the fourth storage component (220).

In response to the extraction, the instructor (410) instructs the fourth storage component (220) to store (step 550) (e.g. using a wired connection) the first provenance metadata and the second provenance metadata in the fourth storage component (220).
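
The behaviour of the first proxy data processing system (215) at steps 540 to 550 can be sketched in Python as follows. The names StorageComponent and handle_at_proxy, the dictionary layout of the received structure and the use of a plain list as the downstream system are all assumptions made for this sketch.

```python
from typing import Callable


class StorageComponent:
    """In-memory stand-in for the fourth storage component (220)."""

    def __init__(self) -> None:
        self.records: list[str] = []

    def store(self, record: str) -> None:
        self.records.append(record)


def handle_at_proxy(structure: dict,
                    forward: Callable[[str], None],
                    storage: StorageComponent) -> None:
    """Split a received structure into data (forwarded) and metadata (stored)."""
    data = structure["data"]              # step 540: extract the data
    provenance = structure["provenance"]  # step 540: extract the metadata
    forward(data)                         # step 545: data only, no metadata attached
    for record in provenance:             # step 550: persist the accumulated trace
        storage.store(record)


second_structure = {
    "data": "Blood_pressure.data_structure",
    "provenance": [
        'First thin client; 14:00; "generate"',
        'Second thin client; 14:05; "transform"',
    ],
}
received_downstream: list[str] = []  # stands in for the third system (225)
storage = StorageComponent()
handle_at_proxy(second_structure, received_downstream.append, storage)
print(received_downstream)
print(storage.records)
```

In this sketch the downstream system receives only the data, so a conventional system needs no knowledge of the embedded metadata format.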

Advantageously, a trace of provenance metadata can be provided right back to an originating source of the data (e.g. the first data processing system (205)). The provenance metadata can subsequently be queried and analysed.

Advantageously, each data processing system can append provenance metadata to data whilst the data is in transit through a network. Furthermore, provenance metadata can be extracted and stored in a storage component once the data reaches a data processing system which is operable to reliably store the provenance metadata. Thus advantageously, the preferred embodiment allows for provenance metadata associated with data to be established and subsequently analysed even if the system in which the data is processed comprises data processing systems which are unable to reliably store the provenance metadata.

It should be understood that in some systems, it may be impractical to store provenance metadata associated with each function associated with each data processing system. For example, if a number of data processing systems in a system is particularly large, an amount of associated provenance metadata will also be large and it may be impractical or not possible to store the large amount of provenance metadata. In another example, content of the data may not warrant overhead associated with a large amount of provenance metadata being stored (e.g. unlike in the first example wherein data associated with a patient probably would warrant an associated large amount of provenance metadata being stored).

Preferably, in a system that requires a reduction in the amount of provenance metadata being stored, "trust" data is associated with each data processing system (and preferably, with each data processing system). For example, the trust data comprises a value associated with whether data generated by a data processing system is grossly out of range compared to data generated by another data processing system; a value associated with a state of a data processing system or a value associated with data processed by a data processing system meeting a pre-configured quality threshold. Alternatively, the trust data can comprise any number of a plurality of values.

In a second example, preferably, trust data is associated with a data processing system (e.g. by an administrator; by a system; by using statistical data). Preferably, when the first proxy data processing system (215) receives a data structure comprising provenance metadata, the first proxy data processing system (215) uses (e.g. by comparing the trust data with reference trust data) the trust data in order to determine whether the associated provenance metadata should be stored in a storage component.

For example, the first proxy data processing system (215) preferably does not store provenance metadata associated with a data processing system having a "high degree" of associated trust. If the degree of trust decreases, preferably the first proxy data processing system (215) stores provenance metadata associated with the data processing system.

Advantageously, although this mechanism allows for provenance metadata to be captured, it also allows for selective capture of provenance metadata.
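
The selective capture described in the second example can be sketched in Python as follows. The sketch makes several assumptions not found in the source: trust data is reduced to a single number between 0 and 1, the reference trust data is a fixed threshold named REFERENCE_TRUST, and a system for which no trust data is available is treated as untrusted.

```python
REFERENCE_TRUST = 0.8  # hypothetical reference trust value


def should_store(trust: float, reference: float = REFERENCE_TRUST) -> bool:
    """Store metadata only for systems whose trust falls below the reference."""
    return trust < reference


def select_for_storage(records, trust_by_system):
    """Filter (system, record) pairs down to the records worth persisting.

    Systems with no trust data are treated as untrusted (an assumption).
    """
    return [
        record
        for system, record in records
        if should_store(trust_by_system.get(system, 0.0))
    ]


trust_by_system = {"First thin client": 0.95, "Second thin client": 0.40}
records = [
    ("First thin client", 'First thin client; 14:00; "generate"'),
    ("Second thin client", 'Second thin client; 14:05; "transform"'),
]
kept = select_for_storage(records, trust_by_system)
print(kept)  # ['Second thin client; 14:05; "transform"']
```

If the trust value of the first thin client later dropped below the reference, its records would be kept as well, which corresponds to the behaviour described for a decreasing degree of trust.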

It will be clear to one of ordinary skill in the art that all or part of the method of the preferred embodiments of the present invention may suitably and usefully be embodied in a logic apparatus, or a plurality of logic apparatus, comprising logic elements arranged to perform the steps of the method and that such logic elements may comprise hardware components, firmware components or a combination thereof.

It will be equally clear to one of skill in the art that all or part of a logic arrangement according to the preferred embodiments of the present invention may suitably be embodied in a logic apparatus comprising logic elements to perform the steps of the method, and that such logic elements may comprise components such as logic gates in, for example a programmable logic array or application-specific integrated circuit. Such a logic arrangement may further be embodied in enabling elements for temporarily or permanently establishing logic structures in such an array or circuit using, for example, a virtual hardware descriptor language, which may be stored and transmitted using fixed or transmittable carrier media.

It will be appreciated that the method and arrangement described above may also suitably be carried out fully or partially in software running on one or more processors (not shown in the figures), and that the software may be provided in the form of one or more computer program elements carried on any suitable data-carrier (also not shown in the figures) such as a magnetic or optical disk or the like. Channels for the transmission of data may likewise comprise storage media of all descriptions as well as signal-carrying media, such as wired or wireless signal-carrying media.

The present invention may further suitably be embodied as a computer program product for use with a computer system. Such an implementation may comprise a series of computer-readable instructions either fixed on a tangible medium, such as a computer readable medium, for example, diskette, CD-ROM, ROM, or hard disk, or transmittable to a computer system, via a modem or other interface device, over either a tangible medium, including but not limited to optical or analogue communications lines, or intangibly using wireless techniques, including but not limited to microwave, infrared or other transmission techniques. The series of computer readable instructions embodies all or part of the functionality previously described herein.

Those skilled in the art will appreciate that such computer readable instructions can be written in a number of programming languages for use with many computer architectures or operating systems. Further, such instructions may be stored using any memory technology, present or future, including but not limited to, semiconductor, magnetic, or optical, or transmitted using any communications technology, present or future, including but not limited to optical, infrared, or microwave. It is contemplated that such a computer program product may be distributed as a removable medium with accompanying printed or electronic documentation, for example, shrink-wrapped software, pre-loaded with a computer system, for example, on a system ROM or fixed disk, or distributed from a server or electronic bulletin board over a network, for example, the Internet or World Wide Web.

In an alternative, the preferred embodiment of the present invention may be realized in the form of a computer-implemented method of deploying a service comprising steps of deploying computer program code operable to, when deployed into a computer infrastructure and executed thereon, cause said computer system to perform all the steps of the described method.

It will be clear to one skilled in the art that many improvements and modifications can be made to the foregoing exemplary embodiment without departing from the scope of the present invention.