Login| Sign Up| Help| Contact|

Patent Searching and Data


Title:
A DATA PROCESSING APPARATUS AND METHOD AND A DATA CONTAINER STRUCTURE
Document Type and Number:
WIPO Patent Application WO/2017/118474
Kind Code:
A1
Abstract:
The invention relates to a data processing apparatus and a method and a corresponding data container structure. The data processing apparatus (300) is configured to process a data stream (301) comprising a plurality of data elements (301 a, b) arranged in a chronological order. The data processing apparatus (300) comprises a processor (303) configured to generate a plurality of data container structures (305 a, b) on the basis of the data stream (301), wherein each data container structure (305 a, b) comprises a subset of the plurality of data elements (301a,b) of the data stream (301) in chronological order, wherein the processor (303) is further configured to provide each data container structure (305 a, b) with metadata (307 a, b), wherein the metadata (307 a, b) defines the chronological order of each data container structure (305 a, b) relative to the other data container structures (305 a, b) of the plurality of data container structures (305 a, b).

Inventors:
TUDORAN RADU (DE)
BRASCHE GOETZ (DE)
Application Number:
PCT/EP2016/050053
Publication Date:
July 13, 2017
Filing Date:
January 05, 2016
Export Citation:
Click for automatic bibliography generation   Help
Assignee:
HUAWEI TECH CO LTD (CN)
TUDORAN RADU (DE)
BRASCHE GOETZ (DE)
International Classes:
G06F17/30
Domestic Patent References:
WO2009024926A12009-02-26
Foreign References:
US20110093491A12011-04-21
US20060277230A12006-12-07
US7383253B12008-06-03
US20080256326A12008-10-16
US20110307545A12011-12-15
US20140012852A12014-01-09
US20150363464A12015-12-17
US8983952B12015-03-17
US20020009172A12002-01-24
Other References:
ARVIND ARASU ET AL: "CQL: A Language for Continuous Queries over Streams and Relations", 3 February 2004, DATABASE PROGRAMMING LANGUAGES; [LECTURE NOTES IN COMPUTER SCIENCE;;LNCS], SPRINGER-VERLAG, BERLIN/HEIDELBERG, PAGE(S) 1 - 19, ISBN: 978-3-540-20896-9, XP019002360
Attorney, Agent or Firm:
KREUZ, Georg (DE)
Download PDF:
Claims:
CLAIMS

1 . A data processing apparatus (300) for processing a data stream (301 ) comprising a plurality of data elements (301 a,b) arranged in a chronological order, wherein the data processing apparatus (300) comprises: a processor (303) configured to generate a plurality of data container structures (305a,b) on the basis of the data stream (301 ), wherein each data container structure (305a,b) comprises a subset of the plurality of data elements (301 a,b) of the data stream (301 ) in chronological order, wherein the processor (303) is further configured to provide each data container structure (305a,b) with metadata (307a,b), wherein the metadata (307a,b) defines the chronological order of each data container structure (305a,b) relative to the other data container structures (305a,b) of the plurality of data container structures (305a,b).

2. The data processing apparatus (300) of claim 1 , wherein each data container structure (305a,b) comprises at least one data window comprising at least one data element of the subset of the plurality of data elements and/or at least one data sub- container comprising at least two data sub-windows, wherein each data sub-window comprises at least one data element of the subset of the plurality of data elements.

3. The data processing apparatus (300) of claim 1 or 2, wherein the metadata (307a,b) of each data container structure (305a,b) comprises a pointer to the data element of the subset of the plurality of data elements (301 a,b), which is the first data element in the chronological order of the subset of the plurality of data elements (301 a,b).

4. The data processing apparatus (300) of any one of the preceding claims, wherein the processor (303) is further configured to provide each data container structure (305a,b) with statistical data associated with the subset of the plurality of data elements (301 a,b) of the data container structure (305a,b).

5. The data processing apparatus (300) of claim 4, wherein the metadata (307a,b) of each data container structure (305a,b) comprises a pointer to the statistical data associated with the subset of the plurality of data elements (301 a,b) of the data container structure (305a,b).

6. The data processing apparatus (300) of any one of the preceding claims, wherein the processor (303) is further configured to adjust each data container structure (305a,b) by adding data elements of the data stream (301 ) to the subset of data elements of the data container structure (305a,b) or by removing data elements from the subset of data elements of the data container structure (305a,b).

7. The data processing apparatus (300) of any one of the preceding claims, wherein the data processing apparatus (300) further comprises a memory (309) configured to store the plurality of data container structures (305a,b).

8. A data container structure (305a,b) configured to store a plurality of data elements (301 a,b) arranged in a chronological order in a data stream (301 ), wherein the data container structure (305a,b) comprises: a subset of the plurality of data elements (301 a,b) of the data stream (301 ) in

chronological order; and metadata (307a,b) defining the chronological order of the data container structure (305a,b) in the data stream (301 ).

9. The data container structure (305a,b) of claim 8, wherein the data container structure (305a,b) comprises at least one data window comprising at least one data element of the subset of the plurality of data elements (301 a,b) and/or at least one data sub-container comprising at least two data sub-windows, wherein each data sub-window comprises at least one data element of the subset of the plurality of data elements (301 a,b).

10. The data container structure (305a,b) of claims 8 or 9, wherein the metadata (307a,b) of the data container structure (305a,b) comprises a pointer to the data element of the subset of the plurality of data elements (301 a,b), which is the first data element in the chronological order of the subset of the plurality of data elements (301 a,b).

1 1 . The data container structure (305a,b) of any one of claims 8 to 10, wherein the data container structure (305a,b) further comprises statistical data associated with the subset of the plurality of data elements (301 a,b) of the data container structure (305a,b).

12. The data container structure (305a,b) of claim 1 1 , wherein the metadata (307a,b) of the data container structure (305a,b) comprises a pointer to the statistical data associated with the subset of the plurality of data elements (301 a,b) of the data container structure (305a,b).

13. The data container structure (305a,b) of any one of claims 8 to 12, wherein the data container structure (305a,b) is configured to be adjustable by adding data elements of the data stream (301 ) to the subset of data elements of the data container structure (305a,b) or by removing data elements from the subset of data elements of the data container structure (305a,b).

14. A data processing method (400) for processing a data stream (301 ) comprising a plurality of data elements (301 a,b) arranged in a chronological order, wherein the data processing method (400) comprises: generating (401 ) a plurality of data container structures (305a,b) on the basis of the data stream (301 ), wherein each data container structure (305a,b) comprises a subset of the plurality of data elements of the data stream (301 ) in chronological order; and providing (403) each data container structure (305a,b) with metadata (307a,b), wherein the metadata (307a,b) defines the chronological order of each data container structure (305a,b) relative to the other data container structures (305a,b) of the plurality of data container structures (305a,b). 15. A computer program comprising program code for performing the method (400) of claim 14 when executed on a computer.

Description:
A DATA PROCESSING APPARATUS AND METHOD AND A DATA CONTAINER

STRUCTURE

TECHNICAL FIELD

Generally, the present invention relates to the field of data processing. More specifically, the present invention relates to a data processing apparatus and method for processing a stream of data as well as a corresponding data container structure. BACKGROUND

In today's information-rich environment, quickly handling massive volumes of data can be both challenging and important. Such data is often provided in the form of data streams, i.e. continuous or semi-continuous streams of data with, in many instances, data elements being generated in real-time, as events occur. For example, sensors used in radio- frequency identification (RFID) in tracking and access applications can provide streaming data on locations of objects being tracked. Responding to particular signals in the streaming data quickly is often a critical aspect of many applications. As an example, network monitoring systems configured to detect security threats need to detect and report events represented in streams of data collected through monitoring.

Conventionally, processing on streaming data was performed by first storing the data in a database. The database could then be queried to retrieve the data for further processing. Therefore, analyzing the data in real-time was difficult, because of the limits imposed by database access times, particularly for streams with high data rates. To address this issue, several traditional software technologies, such as main memory database management systems, have been redesigned.

More recently, techniques known as "complex event processing" or "event stream processing" have been developed. By means of these techniques events, manifested as meaningful patterns within the data streams, can be detected as a result of processing the data stream. In this context, a new class of infrastructure has emerged in the form of stream processing engines, such as "Aurora", "STREAM", "TelegraphCQ", to specifically support high-volume, low-latency data stream processing applications. The conventional stream processing paradigm involves a set of operations that are applied on all the elements/events of a data stream. In a lot of scenarios processing a data stream involves specific computations, such as computing a sliding average value, that have to be computed repeatedly in time on the basis of the continuously changing data elements or events of the data stream. Differently put, often new data elements or events that are being received as part of the data stream have to be put in perspective with older data elements of the data stream. Alternatively, in certain application scenarios, each new data element triggers repeating the computation on the basis of all available data elements of the data stream, i.e., older data elements and new data elements.

Such a situation typically involves partitioning the data stream into distinct windows, i.e. time intervals, to which the computation is applied. A window is a delimitation with respect to time or the sequence defined by the data elements or events of the data stream that contains a subset of the data elements or events. This concept is exemplified in figure 1 . A window is used by a conventional stream processing engine to apply the processing function (e.g., a computational operation) to the stream of data or events. The same mechanism is used if the events are received in real time or if they are stored in a database and processed iteratively as they are replayed to mimic the stream behavior. A particular aspect is that a conventional window is transitive. It is formed for the purpose of applying the computational operation to the data element or events of the data stream, but once this is completed, the window is not preserved. Because the windows corresponding to the processing of the stream are generally discarded, any successive operation on the data stream will require recreating and re-computing the window(s). This happens even for small updates of the windows such as a time recalibration. This is a common operation in practice considering that for many analysis scenarios, the partition of the stream into a plurality of windows is done with respect to the time instant of the analysis. As time progresses, the reference time of the system progresses as well and this requires readjustments of the windows.

An exemplary scenario, which might occur in a conventional stream processing engine, is illustrated in figure 2. For instance, at a time instance X the data elements or events of a data stream can be grouped into windows having a size of 1 hour starting from the current moment in time and going back 6 months. On the basis of this grouping of the data elements or events into windows different statistical measures can be computed, such as the average value of the data elements or events of a window. Once a new data element or event arrives as part of the data stream at a time instance Χ+Δ, these statistical measures have to be computed anew, as the windows have to be re-created from scratch for the time instance Χ+Δ. In the light of the above there is a need for an improved data processing apparatus and method as well as an improved data container structure, which allow, in particular, harnessing the temporal structure of the data elements in the data stream in an improved way. SUMMARY

It is an object of the invention to provide an improved data processing apparatus and method as well as an improved data container structure, which allow, in particular, harnessing the temporal structure of the data elements in the data stream in an improved way.

The foregoing and other objects are achieved by the subject matter of the independent claims. Further implementation forms are apparent from the dependent claims, the description and the figures.

According to a first aspect, the invention relates to a data processing apparatus for processing a data stream comprising a plurality of data elements arranged in a

chronological order within the data stream. The data processing apparatus comprises a processor configured to generate a plurality of data container structures on the basis of the data stream, wherein each data container structure comprises a subset of the plurality of data elements of the data stream in chronological order. The processor is further configured to provide each data container structure with metadata, wherein the metadata defines the chronological order of each data container structure relative to the other data container structures of the plurality of data container structures, i.e. the chronological order of each data container structure in the data stream.

Thus, an improved data processing apparatus is provided, which allows, in particular, harnessing the temporal structure of the data elements in the data stream. On the basis of the metadata a time-based management of the data container structures by the data processing apparatus is provided. The performance benefits provided consists in saving computational cycles for re-occurring stream operations triggered by updates. The amount of computational cycles saved varies depending on the specific operation and context, reaching in some cases more that 90% of conventional options.

In a first possible implementation form of the data processing apparatus according to the first aspect as such, each data container structure comprises at least one data window comprising at least one data element of the subset of the plurality of data elements and/or at least one data sub-container comprising at least two data sub-windows, wherein each data sub-window comprises at least one data element of the subset of the plurality of data elements.

Providing for such a hierarchy of data container structures, data sub-containers, data windows and/or data sub-windows allows, for instance, utilizing different time granularities at different hierarchy levels. In a second possible implementation form of the data processing apparatus according to the first aspect as such or the first implementation form thereof, the metadata of each data container structure comprises a pointer to the data element of the subset of the plurality of data elements, which is the first data element in the chronological order of the subset of the plurality of data elements. In an implementation form the metadata of each data container structure can further comprise a data container structure identifier and/or a start time or end time of the data container structure.

In a third possible implementation form of the data processing apparatus according to the first aspect as such or the first or second implementation form thereof, the processor is further configured to provide each data container structure with statistical data associated with the subset of the plurality of data elements of the data container structure. In an implementation form, the statistical data can comprise at least one of the following: a count, a sum, an average, a median, a variance, a standard deviation and/or a coefficient of variation of the plurality of data elements. The statistical data can comprise a plurality of pairs, wherein each pair comprises a statistical function and the value of the statistical function applied to the data elements within the data container structure or a window thereof.

In a fourth possible implementation form of the data processing apparatus according to the third implementation form of the first aspect, the metadata of each data container structure further comprises a pointer to the statistical data associated with the subset of the plurality of data elements of the data container structure.

In a fifth possible implementation form of the data processing apparatus according to the first aspect as such or any one of the first to fourth implementation form thereof, the processor is further configured to adjust each data container structure by adding data elements of the data stream to the subset of data elements of the data container structure or by removing data elements from the subset of data elements of the data container structure. In an implementation form, the processor is configured, when updating or adjusting a data container structure, to employ a lazy update scheme or an active update scheme for the statistical data associated with the data elements of a data container structure.

In a sixth possible implementation form of the data processing apparatus according to the first aspect as such or any one of the first to fifth implementation form thereof, the data processing apparatus further comprises a memory configured to store the plurality of data container structures.

According to a second aspect the invention relates to a data container structure configured to store a plurality of data elements arranged in a chronological order in a data stream. The data container structure comprises a subset of the plurality of data elements of the data stream in chronological order, and metadata defining the chronological order of the data container structure in the data stream, i.e. the chronological order of the data container structure relative to other container structures based on the data stream.

In a first possible implementation form of the data container structure according to the second aspect as such, the data container structure comprises at least one data window comprising at least one data element of the subset of the plurality of data elements and/or at least one data sub-container comprising at least two data sub-windows, wherein each data sub-window comprises at least one data element of the subset of the plurality of data elements.

In a second possible implementation form of the data container structure according to the second aspect as such or the first implementation form thereof, the metadata of the data container structure comprises a pointer to the data element of the subset of the plurality of data elements, which is the first data element in the chronological order of the subset of the plurality of data elements.

In a third possible implementation form of the data container structure according to the second aspect as such or the first or second implementation form thereof, the data container structure further comprises statistical data associated with the subset of the plurality of data elements of the data container structure.

In a fourth possible implementation form of the data container structure according to the third implementation form of the second aspect, the metadata of the data container structure comprises a pointer to the statistical data associated with the subset of the plurality of data elements of the data container structure.

In a fifth possible implementation form of the data container structure according to the second aspect as such or any one of the first to fourth implementation form thereof, the data container structure is configured to be adjustable by adding data elements of the data stream to the subset of data elements of the data container structure or by removing data elements from the subset of data elements of the data container structure. According to a third aspect, the invention relates to a data processing method for processing a data stream comprising a plurality of data elements arranged in a chronological order. The data processing method comprises the steps of generating a plurality of data container structures on the basis of the data stream, wherein each data container structure comprises a subset of the plurality of data elements of the data stream in chronological order, and providing each data container structure with metadata, wherein the metadata defines the chronological order of each data container structure relative to the other data container structures of the plurality of data container structures.

The data processing method according to the third aspect of the invention can be performed by the data processing apparatus according to the first aspect of the invention. Further features of the data processing method according to the third aspect of the invention result directly from the functionality of the data processing apparatus according to the first aspect of the invention and its different implementation forms described above. According to a fourth aspect the invention relates to a computer program comprising program code for performing the data processing method according to the third aspect of the invention or any of its implementation forms when executed on a computer. The invention can be implemented in hardware and/or software.

BRIEF DESCRIPTION OF THE DRAWINGS

Further embodiments of the invention will be described with respect to the following figures, wherein:

Fig. 1 shows a schematic diagram illustrating an aspect of conventional stream processing engines; Fig. 2 shows a schematic diagram illustrating an aspect of conventional stream processing engines;

Fig. 3 shows a schematic diagram illustrating a data processing apparatus for processing a data stream according to an embodiment;

Fig. 4 shows a schematic diagram illustrating a data processing method for processing a data stream according to an embodiment;

Fig. 5 shows a schematic diagram illustrating a plurality of data container structures according to an embodiment provided by a data processing apparatus according to an embodiment;

Fig. 6 shows a schematic diagram illustrating different aspects of a data container structure according to an embodiment provided by a data processing apparatus according to an embodiment;

Fig. 7 shows a schematic diagram illustrating different aspects of a data container structure according to an embodiment provided by a data processing apparatus according to an embodiment; Fig. 8 shows a schematic diagram illustrating different aspects of a data processing apparatus according to an embodiment and a plurality of data container structures according to an embodiment; and Fig. 9 shows a table illustrating the performance of a data processing apparatus and method according to an embodiment.

DETAILED DESCRIPTION OF THE EMBODIMENTS In the following description, reference is made to the accompanying drawings, which form part of the disclosure, and in which are shown, by way of illustration, specific aspects in which the present invention may be placed. It is understood that other aspects may be utilized and structural or logical changes may be made without departing from the scope of the present invention. The following detailed description, therefore, is not to be taken in a limiting sense, as the scope of the present invention is defined be the appended claims.

For instance, it is understood that a disclosure in connection with a described method may also hold true for a corresponding device or system configured to perform the method and vice versa. For example, if a specific method step is described, a corresponding device may include a unit to perform the described method step, even if such unit is not explicitly described or illustrated in the figures. Further, it is understood that the features of the various exemplary aspects described herein may be combined with each other, unless specifically noted otherwise. Figure 3 shows a schematic diagram illustrating a data processing apparatus 300 for processing a data stream 301 according to an embodiment. The data stream 301 is made up of a plurality of data elements 301 a,b arranged in a chronological order within the data stream 301 , for instance, due to the arrival time of a respective data element 301 a,b at the data processing apparatus 300.

The data processing apparatus 300 comprises a processor 301 configured to generate a plurality of data container structures 305a,b on the basis of the data stream 301 , wherein each data container structure 305a,b comprises a subset of the plurality of data elements 301 a,b of the data stream 301 in chronological order. A memory 309 can be provided as part of the data processing apparatus 300 for storing the plurality of data container structures 305a,b. The processor 301 is further configured to provide each data container structure 305a,b with metadata 307a,b, wherein the metadata 307a,b defines the chronological order of each data container structure 305a,b relative to the other data container structures 305a,b of the plurality of data container structures 305a,b, i.e. the chronological order of each data container structure 305a,b in the data stream 301 . For instance, in the embodiment shown in figure 3, the metadata 307b of the data container structure 305b can define the chronological order of the data container structure 305b relative to the data container structure 305a. Thus, the data container structures 305a,b shown in figure 3 are configured to store the data elements 301 a,b arranged in a chronological order in the data stream 301 . Each data container structure 305a,b comprises a subset of the plurality of data elements 301 a,b of the data stream 301 in chronological order as well as metadata 307a,b defining the chronological order of the data container structure 305a,b in the data stream 301 , i.e. the chronological order of the data container structure 305a,b relative to other container structures 305a,b based on the data stream 301 .

Figure 4 shows a schematic diagram illustrating a data processing method 400 for processing a data stream, for instance the data stream 301 shown in figure 3, according to an embodiment, wherein the data stream 301 comprises a plurality of data elements 301 a,b arranged in a chronological order.

The data processing method 400 comprises a step 401 of generating a plurality of data container structures 305a,b on the basis of the data stream 301 , wherein each data container structure 305a,b comprises a subset of the plurality of data elements 301 a,b of the data stream 301 in chronological order, and a step 403 of providing each data container structure 305a,b with metadata 307a,b, wherein the metadata 307a,b defines the chronological order of each data container structure 305a,b relative to the other data container structures 305a,b of the plurality of data container structures 305a,b.

Further implementation forms, embodiments and aspects of the data processing apparatus 300, the data container structures 305a,b and the data processing method 400 will be described in the following. As already described above, the present invention provides a data container structure, such as the data container structures 305a,b shown in figure 3, that allows organizing the data elements or events of the data stream 301 into windows for storage. This structure encapsulates the notion of grouping the data elements or events into data container structures or windows, and preserving such partition when storing the data. In an embodiment, as internal references for delimitation, management, access and navigation, the data processing apparatus and the data container structure employ time-based references. In an embodiment, the organization and holding of the events or data elements in windows is done in a hierarchical fashion. That is, in an embodiment, a data container that holds a window will be composed of sub containers that correspond to smaller granularity/partitioning of the data. In an embodiment, these granularities can go down to the level of an event or data element. In an embodiment, the data container structure can have time boundaries as a delimiters as well as navigation references. Additionally, in an embodiment, the inner structures that form the data container structure can also have their own corresponding time references. In an embodiment, the data processing apparatus 300 is configured to apply a plurality of operations on the data container structures 305a,b. The purpose of this is to save compute cycles when updates are required either on the partition of the events into windows or on the statistics that are computed on them. By preserving the window partition of a stream, when an update needs to be applied, windows can be simply readjusted according to the specific operation rather than re-created from the raw data. Moreover, the organization of the data in windows is flexible in the sense that it enables to combine windows to obtain new partitions of data or to re-adjust the window time boundaries to delete, add, and insert new events or containers. To achieve this later goal, the data container structures that hold data can be extendible and allow internal alteration. In addition to this, the data container structure can record predefined statistics about the windows, which are linked with the metadata that enables navigation to the corresponding container. In an embodiment, these statistics can be maintained and updated as new updates are produced. This allows fast access to common synthetic information that is typically computed on stream of events. Several of the above aspects are described in more detail further below.

In an embodiment of the data processing apparatus 300, each data container structure 305a,b can comprise at least one data window comprising at least one data element of the subset of the plurality of data elements and/or at least one data sub-container comprising at least two data sub-windows, wherein each data sub-window comprises at least one data element of the subset of the plurality of data elements. This is described in the following in more detail in the context of figure 5, which shows a hierarchical structure of a plurality of data container structures 305a-c according to an embodiment. The data container structures 305a-c shown in figure 5 can hold either other containers (herein referred to as data sub-containers) or a plurality of events or data elements. Each upper level in the hierarchy will correspond thus to a window corresponding to a certain time size granularity. This structure encapsulates time references and meta-information that enables to navigate through the data based on time.

Thus in an embodiment, the metadata 307a,b of each data container structure 305a,b can comprise a pointer to the data element of the subset of the plurality of data elements, which is the first data element in the chronological order of the subset of the plurality of data elements. In an implementation form the metadata of each data container structure can further comprise a data container structure identifier and/or a start time or end time of the data container structure.

Thus, the data container structures 305a-c shown in figure 5 are able to hold any type of partition that is created on the basis of a stream, such as the data stream 301 shown in figure 3. In an embodiment, this metadata about the window, time granularity organization and the required navigation information is collected in the metadata 307a.

In addition to the metadata 307a, information about certain statistics computed based on the windows can also be stored for fast retrieval. Thus, in an embodiment, the processor 303 is further configured to provide each data container structure 305a-c with statistical data 31 1 a associated with the subset of the plurality of data elements of the respective data container structure 305a-c. For the sake of clarity in figure 5 only the statistical data 31 1 a of the data container structure 305 has been indicated.

In an embodiment, the statistical data 31 1 a can comprise at least one of the following: a count, a sum, an average, a median, a variance, a standard deviation and/or a coefficient of variation of the plurality of data elements. The statistical data 31 1 a can comprise a plurality of pairs, wherein each pair comprises a statistical function and the value of the statistical function applied to the data elements within the data container structure or a window thereof.

In an embodiment, the metadata 307a of the data container structure 305a further comprises a pointer to the statistical data 31 1 a associated with the subset of the plurality of data elements of the data container structure 305a. At each level in the structure hierarchy, manipulation, extensions and readjustments are possible. Some of these operations are supported simply by readjusting the metadata information without actual access to the data itself. Moreover, being able to manipulate at any granularity the respective data container structures as independent objects, makes it possible to re-use the data container structure when updates are required, such as recalibration of windows, insertion/removal of events, data elements or windows and the like. Moreover, having a mechanism to loosely link windows, makes it possible to perform windows-based operations such as composition or decomposition of windows to obtain other time granularities.

Thus, in an embodiment, the processor 303 is further configured to adjust each data container structure 305a-c by adding data elements of the data stream 301 to the subset of data elements of the data container structure 305a-c or by removing data elements from the subset of data elements of the data container structure 305a-c.

Figure 6 shows the metadata 307a of a data container structure 305a according to an embodiment. As already described above, a main purpose of the metadata 307a is to hold information about the time partitioning of the stream of events, i.e. about the delimitation of each window itself. Additionally the metadata 307a can contain references to the containers that hold the corresponding data in a window and to the statistical data associated therewith. To hold such information each element in the metadata 307a can be configured as a record (i.e., an n-tuplet) with each field in the record holding a specific type of this information: window ID, start time, end time, pointer to head of container, pointer to end of container, pointer to statistic data, and reference id to the encapsulating window. The organization of the records when implemented can take multiple forms, such as a hashtable or a tree.

Figure 6 shows, moreover, the exemplary statistical data 31 1 a associated with the data elements of the data container structure 305a. If statistics are to be computed, than for each window (or enclosed window) a list of pairs of function type and value can be kept as part of the statistical data 31 1 a. When new statistical functions are applied on the windows of a data stream, their type and the resulting value can be recorded in the statistical data 31 1 a. Examples of such statistics functions are: count, sum, average, median, variance, standard deviation, coefficient of variation, and the like. Each window (across the different layers of granularity used) can have its own such list of pairs. The collection of all these lists of pairs can be organized, as in the case of the metadata 307a in different forms, ranging from hashtables to trees or lists. The key point is that each item in the statistical data 31 1 a (i.e., the list of pairs) must allow being referenced, such that a pointer to it can be maintained in the metadata 307a. When updates will be applied on the data container structures (e.g., recalibration, new items, deletion ...) two options are available for the statistical data 31 1 a: lazy updates or active updates. Which of this is to be used, can be set when the structure is created and can be updated during its lifetime. Lazy evaluation will imply that when any operation is applied on the data container structures (or on a data container structure) a dirty flag will be set to mark that the corresponding statistical data 31 1 a are not updated. When one of the functions recorded in the statistical data 31 1 a will be re-computed in the future the updates will be recorded and the dirty flag set to clear. In case of the active updates option, each update that is applied to a data container structure or window will trigger a new computation of all statistical values. In this way, correct statistical information will be available immediately in the future about the windows in the data stream.

The data container structure is responsible for holding the events in a stream organized in windows. The data container structure will contain the events that are in between certain boundaries. Typically the delimitation boundaries refer to time. As a data container structure can hold a window, the time boundaries will refer to the time delimitations of the window. As a stream is partitioned in multiple windows, each such window will be recorded into a distinct container, preserving thus the time delimitation that was applied. These containers can be linked to each other based on the same logical order that the input stream has. The time granularity to define a window can be either the one used when the partition was done or is set when operations on the data container structure are applied, such as combination of distinct data container structures or re-adjustment of the windows to new time granularities.

In an embodiment, the data processing apparatus 300 is configured to provide time-based management of data. To this end, the start and end time delimitation of the data container structure are not the only time references. Within a data container structure it is possible to have time references corresponding to finer granularities. This internal granularity is set when the structure is created or when operations that alter the time organization in the structure are applied. In this case a hierarchical structure is obtained, as already explained further above. Containers can thus be internally organized based on sub- containers, which would correspond to time divisions of the time interval of the encapsulating container. At the lowest granularity a sub-container will hold just a data element or event. Figure 7 shows a hierarchy of data container structures according to an embodiment with an exemplary hierarchical depth of 3. However, based on the specific needs of the scenario and the succession of operations applied, larger hierarchy levels can be created using the same principle.

Thus, in an embodiment, a container can be formed either as a collection of other interlinked containers (i.e. sub-containers) or as a collection of events or data elements. The top level data container structures can themselves be inter-linked with each other capturing the partition of the stream. Each data container structure can have a

corresponding record in the metadata 307a and optionally corresponding statistical data 31 1 a. In an embodiment, this is not the case for events or data elements, as this would lead to a large number of metadata records. At any granularity, the data container structures support expansion or compression by simply moving the time boundaries in the metadata 307a and by updating the links of the impacted sub-set of sub-containers or events. From the implementation point of view, the functionality described above for a data container structure can be obtained through a double linked list of fixed-size events or data elements, with a chronological order of the elements. As already described above, a primary object of embodiments of the invention is to provide a mechanism to save the partitioning of a stream in structures with the purpose of increasing the performance. This is achieved by re-using the existing computation. To do so, a number of operations should be enabled on the proposed format to perform time- based operation on the created data container structures. Additionally, performance improvements can be obtained also by enabling to operate on the data container structure directly at the metadata level, such as retrieving pre-computed statistical data. Finally, another benefit brought is the option to manage the data container structure in the form of windows, in an isolated mode or across other similar structures. Each such functionality is exposed through a specific operation. Examples of such operations are shown in figure 8. In the following, some exemplary operations from each category are described in the context of figure 8.

As already described above, embodiments of the invention allow to avoid re-computing the partition of a data stream each time an update is available. Such updates concern insertion/addition, deletion, update, appending of new events or data elements in the stream. These operations can be supported by creating the needed adjustments at the level of the metadata and by performing the event or window movements at the level of the data container structures. For example, at the level of the metadata, the modifications can imply an adjustment of the time boundaries of a data container structure and possibly the reference for the encapsulation container for some of the sub-containers. At the level of a data container structure the operations can involve to move part of the data in the data container structure to a neighboring data container structure. As an optimization, the fine grain time-based access to the events or data elements can be used in order to reduce the amount of data read and then written on the storage backend, such as the memory 309 shown in figure 3, where the data container structure is kept.

Moreover, embodiments of the invention allow for time manipulation of the stored data. To this end, a data container structure can leverage time references for the partitioned data, which can be used for fine grain navigation. However, the windows that partition the data have a static time dimension. The data container structure according to the present invention supports operations to change this time partition or to group it into larger intervals. The operation to change the time partition is generally meant for dividing a data container structure or window into subintervals. This will imply creating sub-containers, in which the events will be grouped as the data elements will be moved into them. Although a similar mechanism is possible for restructuring the data into a new time partition, one will have to appreciate that this is equivalent to creating a totally new data container structure that corresponds to a new partition of the stream. Alternatively, in what concerns the grouping of the windows into larger intervals, this can be done by creating new containers in which the existing ones will be grouped. This will imply also to create the corresponding fields for the metadata. This is an important operation as it allows getting a higher view summary on the stream.

As previously mentioned, this statistical data allow to record mathematical functions and operators that are applied on the stream and record their values. This enables to get fast access to this pre-computed data. Operations that will be enabled at the level of the statistics operations are to read the values, to search for a pre-computed value of a function, to record/update a pair of function type and result and to mark if the result is valid (particularly for the case of lazy evaluation).

A set of operations that is enabled is to apply functions of one data container structure over the other. Such functions include concatenation, merging, differences, comparisons and the like. Depending on the specific function, the containers of the source structures will be copied into a resulting structure (can be one of the source ones) and linked according to the function logic. This will imply also updates on the level of metadata and the statistical data. In order to highlight some of the benefits provided by embodiments of the present invention, an exemplary scenario will be described in the following in the context of figure 8. In this example, the respective averages of bank accounts are computed every 5 minutes for the last 6 months for each 1 hour interval. This is one example of a processing required in fraud prevention analytics (for credit card payments, bank money laundry prevention, etc.). The values computed per each window are used as ground reference values for validating new transactions. Such analyses are common across many business domains, which could thus be optimized by this solution which optimizes fundamental processing operations. Conventionally, the above scenario would be handled in the following way. All events are stored in a table. When the new computation is triggered, the stream processing engine will start to read each event, starting with the most recent one. All 1 hour windows will be recomputed as the events are read. The average in each window is recomputed. The cost of these operations, where N= number of events is: O(N) read from disk operations, O(N) "+" operations and 4392 (6 * 30.5 * 24) 7" operations.

According to embodiments of the invention, the above scenario can be handled in the following way. All events are stored and formatted with respect to the data container structure according to an embodiment with 1 hour window containers with internal time references of 5 minutes. The metadata of each window container is read and for each container a remove and add operation is done for a range of 5 minute timeframe. The cost is: number of read from disks is reduced to 8% (only 1 out of the 12 5 minutes timeframes in each 1 hour window is read) and then write (depending on implementation, if all data is kept in the same file than read and writes can be fully removed and replaced only with updates in metadata table). Number of "+"operation is reduced according to the number of events in the border 5 minutes windows (can be considered 8% of all data).

Embodiments of the invention provide, for instance, for the following advantages. An apparatus and method is provided enabling the notion of time for stored data originating from a data stream. A data container structure is provided for stream partition in data windows that allows performing compute cycle optimization on streams through memorization. A data container structure is provided for efficient handling of discrete updates on streams. A data container structure for a windows based manipulation is provided. A data container structure supporting a hierarchical format is provided for organizing stream events. A data container structure for supporting event functions and statistics is provided.

While a particular feature or aspect of the disclosure may have been disclosed with respect to only one of several implementations or embodiments, such feature or aspect may be combined with one or more other features or aspects of the other implementations or embodiments as may be desired and advantageous for any given or particular application. Furthermore, to the extent that the terms "include", "have", "with", or other variants thereof are used in either the detailed description or the claims, such terms are intended to be inclusive in a manner similar to the term "comprise". Also, the terms "exemplary", "for example" and "e.g." are merely meant as an example, rather than the best or optimal. The terms "coupled" and "connected", along with derivatives may have been used. It should be understood that these terms may have been used to indicate that two elements cooperate or interact with each other regardless whether they are in direct physical or electrical contact, or they are not in direct contact with each other. Although specific aspects have been illustrated and described herein, it will be

appreciated by those of ordinary skill in the art that a variety of alternate and/or equivalent implementations may be substituted for the specific aspects shown and described without departing from the scope of the present disclosure. This application is intended to cover any adaptations or variations of the specific aspects discussed herein.

Although the elements in the following claims are recited in a particular sequence with corresponding labeling, unless the claim recitations otherwise imply a particular sequence for implementing some or all of those elements, those elements are not necessarily intended to be limited to being implemented in that particular sequence. Many alternatives, modifications, and variations will be apparent to those skilled in the art in light of the above teachings. Of course, those skilled in the art readily recognize that there are numerous applications of the invention beyond those described herein. While the present invention has been described with reference to one or more particular embodiments, those skilled in the art recognize that many changes may be made thereto without departing from the scope of the present invention. It is therefore to be understood that within the scope of the appended claims and their equivalents, the invention may be practiced otherwise than as specifically described herein.