Login| Sign Up| Help| Contact|

Patent Searching and Data


Title:
STREAM PROCESSING DEVICE AND METHOD OF PROCESSING A STREAM OF DATA
Document Type and Number:
WIPO Patent Application WO/2019/161931
Kind Code:
A1
Abstract:
The present invention provides a stream processing device (100) with a processor (110) for processing a stream of data (120) inside a streaming window (130). The processor (110) is configured to generate a higher order representation (140) from the content of the streaming window (130), and the processor (110) is configured to process the higher order representation (140).

Inventors:
BORTOLI STEFANO (DE)
TUDORAN RADU (DE)
AXENIE CRISTIAN (DE)
BRASCHE GOETZ (DE)
LI HAILIN (DE)
Application Number:
PCT/EP2018/054668
Publication Date:
August 29, 2019
Filing Date:
February 26, 2018
Export Citation:
Click for automatic bibliography generation   Help
Assignee:
HUAWEI TECH CO LTD (CN)
BORTOLI STEFANO (DE)
International Classes:
G06F17/30
Other References:
PARIS CARBONE ET AL: "Apache flink", BULLETIN OF THE IEEE COMPUTER SOCIETY TECHNICAL COMMITTEE ON DATA ENGINEERING, 1 January 2015 (2015-01-01), XP055515353, Retrieved from the Internet
DO LE QUOC ET AL: "Approximate Stream Analytics in Apache Flink and Apache Spark Streaming", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 9 September 2017 (2017-09-09), XP080819754
UNKNOWN ET AL: "Stream Learner : Distributed Incremental Machine Learning on Event Streams: Grand Challenge", DISTRIBUTED AND EVENT-BASED SYSTEMS, 1 January 2017 (2017-01-01), 2 Penn Plaza, Suite 701New YorkNY10121-0701USA, pages 298 - 303, XP055514956, ISBN: 978-1-4503-5065-5, DOI: 10.1145/3093742.3095103
KSHITIJ KUMAR ET AL: "Stream Processing with StreamSQL, Apache Kafka", 1 December 2017 (2017-12-01), XP055518107, Retrieved from the Internet [retrieved on 20181023]
JONAS TRAUB ET AL: "I2: Interactive Real-Time Visualization for Streaming Data", 1 January 2017 (2017-01-01), XP055518162, Retrieved from the Internet DOI: 10.5441/002/edbt.2017.61
Attorney, Agent or Firm:
KREUZ, Georg (DE)
Download PDF:
Claims:
Claims

1. Stream processing device (100) with a processor (110) for processing a stream of data (120) inside a streaming window (130), wherein

the processor (110) is configured to generate a higher order representation (140) from the content of the streaming window (130), and

the processor (110) is configured to process the higher order representation (140).

2. Stream processing device (100) according to claim 1, wherein

the processor (110) is configured to generate the higher order representation (140) by performing at least one of: Discrete Fourier Transformation; Discrete Wavelet Transformation; Piecewise Aggregate Approximation; Symbolic Aggregate Approximation; Bezier Curve generation; Shape Description Alphabet; Singular Value Decomposition; Autoregressive Moving Average; indexing e.g. iSAX2, ADS; hashing functions; and Segmented Sum of Variation.

3. Stream processing device according to claim 1 or 2, wherein

the processor (110) is configured to process the higher order representation (140) by performing functions of at least one of: filter; map; FlatMap; time-series clustering/matching methods allowing e.g. GroupBy and Union implementation; higher order group reduction allowing e.g. Reduce and Fold Group functions to create groups of first-order entities; correlation/matching functions allowing e.g. Filter and Multi-filter functions; power spectrum transformation e.g. Fourier transformation, square root, cube root, log; seasonal adjustments; trends amplification or removal; smoothing; filtering events to simplify correlations; and join of stream based on time-series matching.

4. Stream processing device according to one of claims 1 to 3, wherein

the processor (110) is configured to generate the higher order representation (140) triggered by a trigger event, wherein buffered data of the streaming window (130) is used to generate a higher order representation (140) of the state of the streaming window (130) at a triggering time.

5. Stream processing device according to one of claims 1 to 4, wherein the processor (110) is configured to return to first-order stream processing by processing metadata and/or context of the generated higher order representation (140).

6. Stream processing device configured to claim 5, wherein

the processor (110) is configured to pass a logic for returning a higher order object as a parameter.

7. Stream processing device according to one of claims 1 to 6, wherein

the processor (110) is configured to provide an interface to retrieve and to query the generated higher order representation (140), and to store the higher order representation (140), context and/or metadata related to the generation of the higher order representation (140).

8. Stream processing device according to claim 7, wherein

the metadata includes a description of the streaming window (130) in terms of boundaries and/or a reference to the processing of the streaming window (130).

9. Stream processing device according to one of claims 1 to 8, wherein

the processor (110) is configured to dynamically update the generated higher order representation (140) according to the state of the streaming window (130).

10. Stream processing device according to one of claims 1 to 9, wherein

the processor (110) is configured to pass the type of the higher order representation (140) during generation of the higher order representation (140) as a parameter.

11. Stream processing device according to one of claims 1 to 10, wherein

the processor (110) is configured to generate and manipulate a higher order representation (140) comprising at least one statistical model suitable to detect changes in distribution, wherein the processor (110) is configured to initiate an update of the statistical model according to the changes in distribution.

12. Stream processing device according to one of claims 1 to 11, wherein the processor (110) is configured to generate a higher order representation (140) in form of a second-order or third-order representation.

13. Method of processing a stream of data (120) inside a streaming window (130), comprising

generating a higher order representation (140) from the content of the streaming window (130), and

processing the higher order representation (140).

Description:
STREAM PROCESSING DEVICE AND METHOD OF PROCESSING A

STREAM OF DATA

TECHNICAL FIELD

The present invention relates to a stream processing device and a method of processing a stream of data.

BACKGROUND

The present invention targets the area of Big Data distributed stream processing. Streams are sequences of events i.e. tuples containing various types of data that are generated by various sources like e.g. sensors, machines or humans in a chronologically ordered fashion. The stream processing paradigm involves applying analytic functions over the events in the stream. A typical approach to stream processing assumes accumulating such events within certain boundaries at a given time and applying analytic functions on the resulting collection. Such transient event collections are termed windows.

Stream processing engines are providing the tools to process events on-the-fly (as they come in the system). In terms of data ingestion techniques, stream processing engines are able to support both data arriving in real-time from the stream source, as well as loading data that was pre-stored in a storage media. The data is typically referred to as events and represents an aggregation of different pieces of data, possibly with different logical meaning. This data is/was generated and received in the system in a certain order. We can consider at least three notions of time (or at least the notion of sequence order in case no timing source is available), associated to each event: arrival-time is the time to which the event is notified to the processing engine (or buffering queue); event- time is the actual time of the generation of the event; processing-time is the time at which the event is actually processed by the system. Typically the processing can be triggered either at regular time intervals (e.g. based on the notion of wall-clock time watermark), or at the arrival of each event. The logic of the processing is typically handled by a specific triggering function. Most functions applied on data stream require at any given moment in time a sub-set of the overall processed events from stream. Namely, the functions are applied over a window, where a window is a delimitation, with respect to time or to the logical sequence of the events, that contains the events within that given boundaries (e.g. 2 hours preceding the current time). The content of these windows varies in time as new events arrive and old events fall out of the boundaries of the window and are removed. Typically a window and the processing function to be applied are assigned to be executed on a machine. However, as the data sizes can rapidly grow, particularly in the case of large window boundaries, there is a high interest in designing an efficient method to perform pattern matching on such windows. Yet, window stream operators require a large amount of resources to iteratively compute pattern matching verification over large windows. Window operator holds all the events in memory and at each triggering moment all elements are (re-)processed iteratively to compute the window functions, as shown in Figure 1. Noticeably, computing pattern matching for large windows can require both keeping large states in memory as well as the iterative re-computation over windows with millions of events which makes it hard to keep up with the (near) real-time requirements. This is a major issue that prevents obtaining a very efficient and general stream processing with very low latencies and computing times for Big Data applications.

The major problem is that existing stream technologies do not provide default solutions for implementing complex event processing and pattern matching with very low latency over event windows. This would imply to enable the description and execution of pattern matching computation over windows of events while still preserving the timing constraints of real-time systems. The description and analysis of patterns are expensive, inefficient and complex to be expressed in terms of rules or user defined functions. However, there exist convenient higher-order representations for describing patterns and executing pattern-matching, but at the current stage they are not available on stream engines, and they do not cope with real-time requirements as they are typically executed on warm or cold data.

Online complex event processing and pattern matching represent a challenge because it may require to iteratively reprocess large windows with possibly millions of events and execute pattern matching as the stream progresses. The computation of pattern matching, when expressed according to the available stream operators e.g. UDF (User-Defined Function) or a rule with a long list of clauses must be recomputed over windows of events that slide, i.e. progress with the stream and might share events between successive instances. Flowever, data intensive applications lead to windows with millions of elements e.g. # of credit card transaction in the past 6 months in a country, thus increasing the cost of computation. This leads to performance degradation as the resource and computation costs grow linearly with the number of elements to be aggregated. Another relevant aspect is the fact that complex patterns with many clauses are very difficult to capture and represent both as rules or processing clauses in a purposely defined function. In fact, current tools support CEP (Complex event processing) pattern matching for rather small number of clauses expressed as regular expression.

The present invention overlaps with several areas of data management and processing. As discussed next, none of these technologies provide or compose an integrated solution to the specific problem solved by this invention. Existing stream engines and related mechanisms focus on using features that require their own window operator with dedicated window functions and keep all events in the window buffer. Processing Function can work for some types of functions, but require the re-computation over the window state for varying functions. This obviously affects the real-time constraints and resource usage when scaling to a larger set of events and high-frequency streams and long/large windows. There is no mechanism, stream operator or solution that enables complex event and pattern matching: 1) relying on second-order data model, 2) enabling online operations on such higher-order data model, 3) operate with very low latencies. The most representative technologies and concepts to this invention are related to stream processing.

Stream engines (e.g. Spark Streaming, Flink, Storm, Samza, and Dataflow) are the main stream technologies discussed here. Stream engines have the role of processing data on- the-fly, i.e. on data in movement. They provide computing capabilities based on the time ordering of the stream. Depending on the specific engine, the time can be further set to refer to event time, processing time, computer time, or arrival time of the events. Most of the stream engines allow some form of grouping the events in windows. Depending on the API of the stream engine, different flexibility levels to define and to drive the computation on the window exist. The main limitation is that window operators work with user defined functions and thus they are not optimized based on function properties. Additionally, all windows keep all the data that falls within the window scope typically in memory even if only part of it would be used by the window function.

Most of the stream engine support CEP in form of SQL (Structured Query Language) query expressions. Spark and Ignite support time-series analytics in batch mode, but do not in the streaming mode. Druid system is designed for fast ingestion and time-series analytics with high-throughput and low latency, however does not deal with streaming processing issues and requires ETL (Extract, Transform, Load) to input data correctly. Druid is used as streaming sink for subsequent query execution, thus is outside the real time pipeline.

Current stream handling approaches treat pattern matching as user defined window functions or as a rather simple state machine, an approach which is plagued by three main limitations. The first limitation is that the target pattern description is left to a custom user implementation or a regular expression describing a sequence of events types. The second limitation stems from the fact that each window function keeps the whole sequence of events in the window buffer (or in a state) that grows with the number of events included in the window. The third limitation is that pattern matching requires to iterate through the whole data at each trigger event. Hence, the definition of the pattern matching can be costly and the computation time can fail to meet the real-time performance requirements, unless the pattern is very simple.

Typically, a set of mles are defined to model the behavior of a user based on different aggregations (sum, count, average, etc.) computed in a specific window of time. However, complex patterns require many complex rules to be defined and validated based on the analyst interpretation over the fraud scheme.

In a default approach, large windows are created to hold all historic data in the case of the global feature computation or sub-domains of the data in the case of features over certain partitions. Each feature will require its own windows. Results will be computed by going through the entire data set and re-computing the pattern matching or rules verification from scratch, leading to a slow result computation or limited verification capacity. Furthermore, small variation in scale or trend over the pattern may result in false negative detection. A first option is to define a User Defined Function capturing the pattern. The function should scan the whole window to verify iteratively, clause by clause the corresponding matching between the current state of the window and the target pattern. The main limitation of such an approach is that it requires large amounts of resources to describe the pattern as well as to execute its verification. Furthermore, it is prone to errors due to high complexity as well as not scalable as the whole logic would be embedded in a unique specialized function. Another option is to rely on CEP API defining a state machine. E.g. elementl=x FOLLOWED BY element2=y FOLLOWED BY element3=z, etc. The main limitation of such method is that it works only for rather simple analytics and cannot be distributed, i.e. is limited to a single node state machine.

SUMMARY

In view of the above-mentioned problems and disadvantages, the present invention aims to improve Big Data distributed stream processing. The present invention has thereby the object to provide a stream processing device and a method of processing a stream of data, which operate with better performance compared to the corresponding solutions known in the art.

The object of the present invention is achieved by the solution provided in the enclosed independent claims. Advantageous implementations of the present invention are further defined in the dependent claims.

The present invention proposes a new window-like processing system that on the one hand provides a mechanism to transform windows of incoming stream of data into second-order representations, and on the other hand provides a mechanism to process such windows’ second-order representations as the stream progresses. More specifically, there are multiple scenarios such as fraud prevention, outlier detection, or monitoring, in which complex pattern matching processes must be executed with very low latencies. Moreover, in such scenarios, the second-order representations can be used to embed statistical properties or mathematical models of the events in a window, and such tools become useful to support online machine learning modules as well as predictive analytics.

According to the present invention a novel system is proposed that enables efficient and effective pattern matching and CEP over streaming windows in order to be able to tackle a wide range of Big Data scenarios. For this a dedicated system is created that transforms the window of events in a second-order data model (e.g. a time-series index) and defines a set of higher-order operators capable of performing operations based on such representations.

A first aspect of the present invention provides a stream processing device with a processor for processing a stream of data inside a streaming window, wherein the processor is configured to generate a higher order representation from the content of the streaming window, and the processor is configured to process the higher order representation.

The present invention is based on two new operator types. First, a transformation-window operator generates a second-order representation for example time-series and implements thereby an elevate method. Different types of representations can be generated according to different elevate logics. Second, higher-order-operators are second or third-order functions implemented as streaming operators taking second-order representations as input and parameters. In principle, all existing first-order operator types will have a corresponding implementation as higher-order-operators.

The present invention has the advantage of overcoming the complexity of iterative approaches applied for pattern matching execution on windows, the de-facto approach in state-of-the-art, by computing pattern matching, and other second-order functions, relying on efficient second-order representations. This allows to: a) improve the computational costs of pattern matching (e.g. log complexity of indexes search vs. polynomial complexity of multiple iterations); b) save resources reducing data dimensionality and; c) boost accuracy by applying statistical methods to reduce noise.

The present invention has further the advantage of performance improvement by reducing complexity of pattern matching from polynomial to logarithmic complexity over the number of events in the window (plus a constant costs for building higher order representation). Complex events processing and pattern matching can be executed on highly dimensional and complex data without having to specify fine grained rules or complex programs representing patterns. The present invention has the further advantage of resource/cost savings using a more efficient representation, with high-dimensional data compression up to 1000 times. Methods can be applied natively on second-order representation to reduce dimensionality and therefore memory usage for window representations during pattern matching.

The present invention has the further advantage of an accuracy boost by improved pattern matching precision reducing noise. Statistical methods normalize or accentuate specific features of the data in the window making it adaptive with respect to changes in the windowed events without the needs of correcting query parameters (e.g. smoothing).

The present invention has also the advantage of functionality enrichment. At least 2x operators compared to the current state of the art streaming engine. The present invention enables a whole new set of real-time analytics process proposing a system leveraging all the tools existing for time-series analysis and processing (e.g. spectral, intervention, explanative and forecasting analysis).

The present invention can be applied to a large number of scenarios that require the execution of pattern matching over large data streams of data. Immediate domains of applicability are IoT (Internet Of Things), finance, fraud detection, risk analysis, as well as futuristic areas such as AI and real-time intelligent decision making.

The primary advantage of the system of the first aspect is to enhance the streaming system capacity of dealing with complex event processing and pattern matching as well as to simplify the programming model for such analytics. In fact, second-order representation can be optimized to improve performances of pattern matching execution over large windows. The system will pay the cost of building the second-order representation, but at each iteration it will be possible to verify the matching towards target patterns by simply executing a transformations and comparisons over such optimized data structures. For example, a second-order representation could be a time- series index like SAX or ADS. Such time-series representation guarantee logarithmic complexity over the number of points for the match execution. Furthermore, they accommodate also the inclusion of geometrical distance between points to accommodate time drifts in similar events. The invention is highly relevant for the current stream processors landscape as it addresses the issue that current stream engines require to apply an iterative approach to perform pattern matching on windows of incoming events. This implies the usage of window operators that possibly buffer large amounts of events that must be iterated each time. On the one hand, the expression and verification of complex pattern matching as iterative algorithm is rather complex in current streaming settings. On the other hand, as data quantity increases in time, the required resource budget increases as well up to a point where it can make computation too costly, with respect to real-time requirements, or even unfeasible (e.g., given the computation resources of the operator executing the processing). Additionally, different scenarios may require different types of pattern matching techniques, with different tolerance in different aspects of the problem. Each of the different scenarios would have to be addressed with a purposely defined pattern matching function, increasing costs of analytics and time to deploy of novel analytics.

The present invention tackles use cases requiring real-time pattern matching and complex events processing (CEP) with high dimensional data. The present invention proposes a novel system to simplify the definition of complex analytics enabling the usage of techniques typical of time-series mining in stream processing. The present invention describes an innovative system based on two novel operator types, leveraging a mechanism for online transformation of time-series (windows) of events into second- order representations e.g. patterns (1), to enable seamless stream processing based on such representations (2).

The system of the first aspect enables a novel paradigm for data stream analysis, providing the foundation for stream processing operating directly on patterns. Examples of such operators are: stream correlation detectors, outlier detectors, match operators, filters, map and grouping functions.

The higher-order representations would enable the application of existing toolkits and methods for pattern analysis in the context of a streaming engine. An example of application is online machine learning, as some time-series representations embed a large number of relevant statistics e.g. autoregressive moving average that are relevant also in the financial and trading market. In an implementation form of the first aspect, the processor is configured to generate the higher order representation by performing at least one of: dimensionality reduction transformation e.g. Discrete Fourier Transformation, Fast Fourier Transformation, Inverse Fast Fourier Transformation, Discrete Wavelet Transformation; Piecewise Aggregate Approximation, Symbolic Aggregate Approximation, Bezier Curve generation, Shape Description Alphabet, Singular Value Decomposition, Autoregressive Moving Average, Martingale Difference Sequence, indexing methods e.g. iSAX2, ADS, hashing functions, and Segmented Sum of Variation.

The higher order representation can be of a second or third order. Even higher orders are also possible. The above transformations or functions are a non-exhaustive list. All suitable functions may be utilized to generate the higher order representation.

In a further implementation form of the first aspect, the processor is configured to process the higher order representation by performing functions of at least one of: filter and multi-filter function based on time-series correlation/matching functions e.g. Euclidean distance and Dynamic time wrapping; map; flat-map; time-series clustering/matching methods allowing e.g. group-by and union function implementation; higher order group reduction allowing e.g. reduce and fold group functions to create groups of first-order entities; power spectrum transformation e.g. Fourier transformation, square root, cube root, log; seasonal adjustments; trends amplification or removal; smoothing; filtering events to simplify correlations; join of stream based on time-series matching; time-series functions to represent time-series model prior, likelihood, support search strategy using probabilistic programs and extrapolation of parametric and parametric models for regression e.g. Automatic Bayesian Covariance Discovery; extract features from time- series to support prediction based on Feature Derivation Window, Forecast point and Forecast Window and Distance; synthesis metrics for time-series to derive descriptive features e.g. rolling mean, rolling max, rolling min, Bollinger band and statistics, rolling entropy for categorical features, rolling majority for categorical features, rolling text statistics.

The above functions or higher-order operators are a non-exhaustive list. All suitable first- order functions may be adapted to process the higher order representation. The output of such a higher-order operator can either be the input of another higher-order operator, or can be an input for a first-order operator relying on a demote method. When the higher order representation or time-series is too dense, it is possible to apply a dimensionality reduction method e.g. time-series smoothing to speed up the pattern matching execution with minimal lost in accuracy.

In a further implementation form of the first aspect, the processor is configured to generate the higher order representation triggered by a trigger event, wherein buffered data of the streaming window is used to generate a higher order representation of the state of the streaming window at a triggering time.

Such a trigger event may be based on a time period, a number of events in a window or the like.

In a further implementation form of the first aspect, the processor is configured to return to first-order stream processing by processing metadata and/or context of the generated higher order representation.

The processor or higher-order operators may implement a demote method that, given an output of function, produces a related output based on the metadata and/or context of the entities for which the higher order representation was generated. By demoting, results, representations and the like can be returned from a higher order to the first order for further processing, evaluation or the like.

In a further implementation form of the first aspect, the processor is configured to pass a logic for returning a higher order object as a parameter.

Providing the logic as a parameter increases flexibility.

In a further implementation form of the first aspect, the processor is configured to provide an interface to retrieve and to query the generated higher order representation, and to store the higher order representation, context and/or metadata related to the generation of the higher order representation.

Providing such a standardized interface eases data transfer and function calls. In a further implementation form of the first aspect, the metadata includes a description of the streaming window in terms of boundaries and/or a reference to the processing of the streaming window.

Metadata eases data handling and reduces communication bandwidth thereby increasing overall performance.

In a further implementation form of the first aspect, the processor is configured to dynamically update the generated higher order representation according to the state of the streaming window.

Such a dynamic update increases flexibility and decreases processing needs as the already generated higher order representation may be adapted with regard to the streaming window.

In a further implementation form of the first aspect, the processor is configured to pass the type of the higher order representation during generation of the higher order representation as a parameter.

Providing the type of the higher order representation as a parameter increases flexibility.

In a further implementation form of the first aspect, the processor is configured to generate and manipulate a higher order representation comprising at least one statistical model suitable to detect changes in distribution, wherein the processor is configured to initiate an update of the statistical model according to the changes in distribution.

Such an implementation allows an easy and quick adaption to changing distribution without the need of a new setup.

In a further implementation form of the first aspect, the processor is configured to generate a higher order representation in form of a second-order or third-order representation. Even higher order representations are possible, but may incur a higher processing effort. A second aspect of the present invention provides a method of processing a stream of data inside a streaming window, comprising generating a higher order representation from the content of the streaming window, and processing the higher order representation.

The same advantages as described above for the system of the first aspect apply also to the method of the second aspect. Further, the method of the second aspect can have implementation forms corresponding to the implementation forms of the first aspect described above. Thus, the method can also achieve the advantages described above for these implementation forms.

It has to be noted that all devices, elements, units and means described in the present application could be implemented in the software or hardware elements or any kind of combination thereof. All steps which are performed by the various entities described in the present application as well as the functionalities described to be performed by the various entities are intended to mean that the respective entity is adapted to or configured to perform the respective steps and functionalities.

Even if, in the following description of specific embodiments, a specific functionality or step to be performed by external entities is not reflected in the description of a specific detailed element of that entity which performs that specific step or functionality, it should be clear for a skilled person that these methods and functionalities can be implemented in respective software or hardware elements, or any kind of combination thereof.

BRIEF DESCRIPTION OF THE DRAWINGS

The above described aspects and implementation forms of the present invention will be explained in the following description of specific embodiments in relation to the enclosed drawings, in which:

Figure 1 shows an example of a higher order stream processing device according to an embodiment of the present invention.

Figure 2 shows a first example of higher order stream processing. Figure 3 shows a second example of higher order stream processing.

Figure 4 shows a third example of higher order stream processing.

Figure 5 shows a flowchart of higher order stream processing.

Figure 6 shows a flowchart of a method of processing a stream of data according to an embodiment of the present invention.

DETAILED DESCRIPTION OF THE EMBODIMENTS

Figure 1 shows an example of a higher order stream processing device 100 according to an embodiment of the present invention. The stream processing device 100 includes a processor 110 for processing a stream of data 120 inside a streaming window 130. The processor 110 may be implemented in hard and/or software. The processor 110 may be part of the server or a DSP. This stream of data 120 includes events or data items, which pass subsequently through the streaming window 130.

The processor 110 is configured to generate or elevate a higher order representation 140 from the content of the streaming window 130. The higher order could be a second or third order, even higher orders are encompassed. The higher order might be a second order in form of a time- series.

The processor 110 is configured to generate the higher order representation 140 by performing at least one of: dimensionality reduction transformation e.g. Discrete Fourier Transformation, Fast Fourier Transformation, Inverse Fast Fourier Transformation, Discrete Wavelet Transformation; Piecewise Aggregate Approximation, Symbolic Aggregate Approximation, Bezier Curve generation, Shape Description Alphabet, Singular Value Decomposition, Autoregressive Moving Average, Martingale Difference Sequence, indexing methods e.g. iSAX2, ADS, hashing functions, and Segmented Sum of Variation.

The processor 110 is further configured to process the higher order representation 140 for example using one or more higher order operators. The higher order operators can be of the same order as the higher order representations 140. The processor 110 is configured to process the higher order representation 140 by performing functions of at least one of: filter and multi-filter function based on time-series correlation/matching functions e.g. Euclidean distance and Dynamic time wrapping; map; flat-map; time-series clustering/matching methods allowing e.g. group-by and union function implementation; higher order group reduction allowing e.g. reduce and fold group functions to create groups of first-order entities; power spectrum transformation e.g. Fourier transformation, square root, cube root, log; seasonal adjustments; trends amplification or removal; smoothing; filtering events to simplify correlations; join of stream based on time-series matching; time-series functions to represent time-series model prior, likelihood, support search strategy using probabilistic programs and extrapolation of parametric and parametric models for regression e.g. Automatic Bayesian Covariance Discovery; extract features from time-series to support prediction based on Feature Derivation Window, Forecast point and Forecast Window and Distance; synthesis metrics for time-series to derive descriptive features e.g. rolling mean, rolling max, rolling min, Bollinger band and statistics, rolling entropy for categorical features, rolling majority for categorical features, rolling text statistics.

Figure 2 shows a first example of higher order stream processing in which a stream of data 120 inside a streaming window 130 is processed. The stream processing is carried out in a device 100 according to an embodiment of the present invention that builds on the device 100 shown in Figure 1.

First, events are buffered in a streaming window 130 or a streaming window operator. Then, at a trigger event, the buffered events are used to generate a second-order representation 140 of the state of the window 130 at triggering time. The logic of the transformation is executed in a elevate method.

A transformation-window operator 200 generates in an elevating step the second-order representation 140, i.e. time-series, and implements the elevate method. Different types of representations 140 can be generated according to different elevate logics. Second- order representations 140 offer an API to retrieve and query the generated representation and store the time-series representation, context and metadata related to the entity subjects of the elevation process. In reference to the higher-order processing as proposed by this invention, the logical phase or step that transforms a sequence of events in a window in a second-order representation is named ELEVATE. The second-order representation 140 may contain as well some metadata and reference to the generated window, enabling down-stream operators to process the generated data model consistently. Metadata may include, for example, a description of the elevated window in terms of boundaries, and a reference to the entity subject of the windowed processing.

One or more higher-order-operators 210 are second or third-order functions implemented as streaming operators taking second-order representations 140 as input and parameters. In principle, all existing first-order operator types will have a corresponding implementation as second-order operators. The higher-order-operator 210 operates on the second-order representations 140 and delivers an output in the second-order domain.

Additionally or alternatively, the higher-order operator 210 implements a demote method that, given an output of function, will produce a related output 230 based on the metadata, context of the entities for which the second order representation was generated. The output 230 is again in the first order domain.

The higher-order streaming operator type that takes as input second-order representations 140 and performs functions like for example business logic over them. With respect to the single events, these can be third-order operators, however assuming the second-order representation 140 is generated by a non-invertible function, we can implement such operators relying on second-order functions dealing the complex and rich representation. Aiming at extending the current stream engines, the output of a higher-order representation 140 can either be in term of second-order e.g. a higher-order map or filter function, or in terms of the original first-order processing based on the metadata attached to the second-order representations. The logical phase or step in which the higher-order operator output goes back to the first-order stream processing is named DEMOTE.

Once the second-order representation 140 are input to the higher-order streaming operator 210 executing for example some business logic according to the defined semantic e.g. map, GroupBy, reduce, filter, etc. The output of the higher-order operator can either be the input of another higher-order operator, or can be an input for a first-order operator relying on the demote method. Figure 3 shows a second example of higher order stream processing. The stream processing is carried out in a device 100 according to an embodiment of the present invention that builds on the device 100 shown in Figure 1. Figure 3 particularly shows an example a higher-order GroupBy operator used to group entities based on their behaviour in a specific time frame. In a first phase, events are accumulated in windows l30a, l30b based on some key identifier (partitioned streaming). Then at trigger time, each special window operator generates a specific second-order representation l40a, l40b each input to the higher-order GroupBy operator 210. This operator 210 then performs clustering operations based on the patterns defined by the entity behaviour in the considered time window, and outputs groups or clusters 220a, 220b of second-order representations according to their similarity.

Figure 4 shows a third example of higher order stream processing in which a second order filter is employed. The stream processing is carried out in a device 100 according to an embodiment of the present invention that builds on the device 100 shown in Figure 1.

In the following the use of the system for higher-order stream processing is described. The proposed operators will be instantiated for each partition of the data considered and will operate on bounded amount of resources. The second order representation 140 is generated at a trigger event, and the comparison can be executed on such representation, for example by a second-order filter operator 400. The target pattern 410 could be represented according to the chosen second-order representation model (e.g. SAX or ADS) and then compared with respect to some distance metric and filtered based on a threshold parameter. The one or more target patterns 410 may be stored in a storage and retrieved when needed.

At each trigger time, the generated second-order representation 140 would be matched with the target pattern 410 once for all the events contained in the window 130. When the match is satisfied the corresponding metadata/key related to the entity to which the events are associated can be output. Many concurrent windows can generate second-order representations 140 and be filtered in parallel, making the system scalable and distributed. Furthermore, the higher-order streaming system can build and manipulate second-order representations 140 suitable to detect changes in distribution e.g. ARIMA models, that can be used for online machine learning algorithms as well as to trigger the need to update train the deployed models. These types of tools, designed as a stream-based processing component for moving complex features i.e. mathematical functions, statistics, enable online continuous machine learning by optimizing the computation of complementary mathematic functions to feed and fine-tune the lifecycle of the deployed model.

A further use case of outlier detection monitoring a large scale cloud infrastructure for fault detection and root cause analysis is considered. Currently, infrastructure events are analyzed in batch to extract traces (sequences of operations) grouped by fingerprints (a set of characteristics providing context for trace comparison) and then analyzed and compared to individualize traces that could indicate ongoing problems (outlier detection). Outlier traces are then shown to an operator for further analysis. The monitoring does not rely on any distributed framework for data analysis, and any further operation over the traces should be custom implemented.

With higher-order streaming operators according to the present invention, the monitoring process becomes a streaming process where groups of time-series generated by groups of traces could be analyzed and compared for monitoring purposes. Number of false positive warnings to the operator could be reduced by triggering further analysis and comparison with normalized correlated indicators e.g. removing trends. Furthermore, as the monitoring tools can leverage a distributed streaming engine, the heavy lifting of the scalability of the monitoring capabilities would be managed by the data processing framework.

Figure 5 shows a flowchart or workflow diagram of a higher order stream processing. The workflow of the defined system starts with step 500. The workflow may be carried out in a device 100 according to an embodiment of the present invention that builds on the device 100 shown in Figure 1. According to step 510 the continuous stream reads and accumulates data in a window like operator, when the trigger event is reached e.g. emission time is reached or a default trigger is invoked. In step 520 data of the events of the stream are collected.

Next, in step 530 is checked whether the second order representation can be built incrementally. In case the second order representation is an index or a compressed representation of the collected events, an incremental build is possible. Then in step 540 the second order representation is build. In step 550 metadata and a key are added as a reference.

In step 560 is checked, whether a trigger event is present. If no, the flow reverts back to the event data collection of step 520. If yes, the flow continues to step 570. There is checked whether the higher order representation is ready. If yes, the flow continues. If no, the flow reverts back to the second order representation build of step 540.

The above description focuses on the incremental build of second order representations. For a non-incremental build of second order representations, a no is decided at step 530 and the method carries on to step 560 and from there to 570 in case of a trigger event. There is decided that the higher order representation is not ready. Accordingly, it is branched back to step 540 where the second order representation is built non- incrementally. From there, the procedure works as described above.

With the emission of the one or more second order representations in step 580 is the elevate part of the method completed. Now in step 582 is the second order representation collected by or transferred to a higher order stream operator. The one or more higher order stream operators execute second or higher order functions for example of statistical or business nature on the one or more second order representations.

In step 586 is decided whether higher order processing should be continued. If yes, the flow reverts back to the emission of the second order representation build of step 580. If no, the flow continues to step 588. There is decided whether the output is the sink or final. If yes, the flow continues to the sink in step 590 and the end 592 of the procedure. If no, the flow continues to step 594 where the demote method is started. In step 594 the key and the metadata are extracted from the second order representation as they are needed for demoting back to the first order. In step 596 the first order object like a row or tuple is emitted as a result of the demoting method. In step 598 the first order processing is continued or finished. Then, the flow continues to the sink in step 590 and the end 592 of the procedure.

The elevate method is executed and the second-order representation is output to a higher- order operator that executes the defined logic or function. Then, the higher-order operator can be programmed to demote and switch back the processing to a first-order event processing, or continue the high-order processing with further higher-order operators executed in pipeline. The higher-order streaming job typically ends with a sink where the output of the complete pipeline is emitted.

The elevate method of the special window operator generating the time-series representation will take the following formats as an input.

A list of numbers that represent the notification moments when partial results should be output as second-order object, the default is related to the trigger mechanism of the window operator.

A parameter indicating whether a new second-order output stream is generated for each specified timing or only one time-series needs to be created and updated. Update mode must be supported by the target second order representation.

A parameter indicating a specific implementation for the generation of the second-order representation from the time series to allow different types of transformation suitable for different types of analytics e.g. pattern matching, statistics extraction, etc., an incremental enumeration of the transformation types can be defined and implemented relying on existing methods adapted to be executed in the window environment according to the type of operator chosen to be extended with the elevate method.

For each logical window generating a second-order representation, e.g. sliding/tumbling/rolling/session window can be used, both the metadata and actual parameters of the window will be attached to the second-order representation as context. Furthermore, a specific parameter pointing part of original record/row/tuple accumulated in the window target analysis will be attached as well. In principle, this can be of type Tuple, Row or a standard POJO.

There exist efficient time-series indexes representations, e.g. iSAX2+ builds a tree at bit- level to support approximate search, ADS+ is the most novel algorithm for indexing. Currently stream operators work with row or tuples, whereas time-series operator will work on special time-series objects for which we’ll have to define serialization mechanisms to avoid POJOs serialization overhead.

The higher-order operators output will be input to the higher-order operators that will apply the logic accordingly, e.g. second-order map function will transform the second- order representation objects applying appropriate functions (e.g. smoothing), the second- order filter will implement the filter based on a pattern matching method, etc.

Fligher-order operator results are either output directly through the demote method or pass to another operator the second-order representation as in standards data processing streaming topology. As second-order representations will be associated typically to entities e.g. customers, sensors, etc. the grouping key for the window kept with the second-order time-series representation will be typically used to integrate the process with normal first-order stream processing.

Figure 6 shows a flowchart of a method of processing a stream of data according to an embodiment of the present invention. In a first step 600 a stream of data 120 is processed inside a streaming window 130. Such processing may for example include buffering, providing and executing a trigger event. In a second step 610 a higher order representation 140 is generated from the content of the streaming window 130. Such generation may be started by the trigger event. In a third step 620 the higher order representation 140 is processed, for example by a higher order operator.

As a use case example one can consider the case of automated fraud detection, which generically attempts to capture and verify user behaviour according to some pattern i.e. a fraud signature. This scenario is inspired from detecting fraud rules for credit card transactions validation. Fraud detection systems typically employ many concurrent indicators to build up a complex detection system. Typically, a set of mles are defined to model the behaviour of a user based on different aggregations like sum, count, average, etc. computed in a specific window of time. Flowever, complex patterns require many complex rules to be defined and validated based on the analyst interpretation over the fraud scheme. The transformation into a higher order reduces complexity and allows to handle big data sets or streams in real-time.

The present invention will ground the development of a full set of higher-order stream operators, including adaptations of the most relevant second-order functions already available for stream processing e.g. map, reduce, GroupBy as well as other new operators that would not be possible otherwise.

Each of the higher-order representations 140 will be classified according to properties of the embedded functions e.g. allow to update or not, and it will be responsibility of the developer to create pipelines of second-order transformations and corresponding higher- order processing such that the chosen representations are compatible with the processing purposes.

The present invention has been described in conjunction with various embodiments as examples as well as implementations. Flowever, other variations can be understood and effected by those persons skilled in the art and practicing the claimed invention, from the studies of the drawings, this disclosure and the independent claims. In the claims as well as in the description the word“comprising” does not exclude other elements or steps and the indefinite article“a” or“an” does not exclude a plurality. A single element or other unit may fulfill the functions of several entities or items recited in the claims. The mere fact that certain measures are recited in the mutual different dependent claims does not indicate that a combination of these measures cannot be used in an advantageous implementation.