

Title:
AN EFFICIENT METHOD FOR AGGREGATING AND MONITORING IN LARGE SCALE DISTRIBUTED SYSTEMS
Document Type and Number:
WIPO Patent Application WO/2018/103839
Kind Code:
A1
Abstract:
According to a first aspect of the invention there is provided a monitoring server for monitoring performance metric values by dynamically controlling a resolution model, comprising: a memory storing a code, at least one hardware processor coupled to the memory for executing the code, the code comprising: instructions to aggregate data points representing measured values of a performance metric, the data points sorted to value ranges according to a resolution model, the data points received from a plurality of monitored nodes that received the resolution model from the monitoring server, instructions to calculate a required percentile value range by calculating in which of the value ranges a required percentile is located; and instructions to determine a modified resolution model based on a required resolution of the required percentile value range, and to send the modified resolution model to the plurality of monitored nodes.

Inventors:
HOROVITZ SHAY (DE)
ARIAN YAIR (DE)
WU WENLIANG (DE)
Application Number:
PCT/EP2016/080092
Publication Date:
June 14, 2018
Filing Date:
December 07, 2016
Assignee:
HUAWEI TECH CO LTD (CN)
HOROVITZ SHAY (DE)
International Classes:
G06F11/30; G06F11/34
Domestic Patent References:
WO2016160008A1 2016-10-06
Foreign References:
US6901442B1 2005-05-31
US20100279622A1 2010-11-04
US20140280880A1 2014-09-18
US7743136B1 2010-06-22
Attorney, Agent or Firm:
KREUZ, Georg (DE)
Claims:
CLAIMS

1. A monitoring server for monitoring performance metric values by dynamically controlling a resolution model, comprising:

a memory storing a code;

at least one hardware processor coupled to the memory for executing the code, the code comprising:

instructions to aggregate data points representing measured values of a performance metric, the data points sorted to value ranges according to a resolution model, the data points received from a plurality of monitored nodes that received the resolution model from the monitoring server;

instructions to calculate a required percentile value range by calculating in which of the value ranges a required percentile is located; and

instructions to determine a modified resolution model based on a required resolution of the required percentile value range, and to send the modified resolution model to the plurality of monitored nodes.

2. The monitoring server of claim 1, wherein the code comprises instructions to modify the resolution model for the calculated required percentile value range, and in response to sending the modified resolution model to the monitored nodes, to receive and aggregate data points representing measured values within the required percentile value range, sorted to value sub-ranges according to the modified resolution model.

3. The monitoring server of claim 1, wherein the code comprises instructions to send the modified resolution model exclusively to monitored nodes with data points within the calculated required percentile value range.

4. The monitoring server of claim 1, wherein the code comprises instructions to receive, in response to sending the modified resolution model, exclusively data points within the calculated required percentile value range.

5. The monitoring server of claim 1, wherein the code comprises instructions to modify the resolution model to include higher resolution in proximity to the required percentile value range than in other value ranges.

6. The monitoring server of claim 1, wherein the code comprises instructions to modify the resolution model by calculating a logarithmic resolution around the middle value of the required percentile value range, without changing the number of value ranges.

7. The monitoring server of claim 1, wherein the aggregation comprises receiving from each of the plurality of monitored nodes an indication of a quantity of data points located in each of the value ranges and summing the quantities of data points in each of the value ranges.

8. The monitoring server of claim 1, wherein the code comprises instructions to repeat the aggregating, calculating and determining until the required percentile value range has the required resolution.

9. The monitoring server of claim 1, wherein the required resolution is determined according to a predefined required accuracy for a monitored quantity.

10. The monitoring server of claim 1, wherein the code comprises instructions to repeatedly aggregate the data points, calculate the required percentile value range and determine the modified resolution model in each of successive time intervals, wherein in each time interval the data points are aggregated according to the resolution model determined in the previous time interval.

11. The monitoring server of claim 10, wherein the modified resolution model in each time interval is predefined, determined according to a desired number of repetitions or determined according to a predefined resolution increase in each repetition.

12. The monitoring server of claim 10, wherein the code comprises instructions to calculate a rate of change of the required percentile value range according to the required percentile value ranges calculated in successive time intervals, to predict based on the calculated change rate an estimated required percentile value range of a next interval, and to modify the resolution model to include higher resolution in proximity to the estimated required percentile value range than in other value ranges.

13. The monitoring server of claim 12, wherein the code comprises instructions to predict in each time interval an estimated required percentile value range by an auto-regressive model dependent on the number of previous time intervals.

14. The monitoring server of claim 12, wherein the code comprises instructions to determine lengths of the time intervals according to the rate of change of the required percentile value range, so that the estimated required percentile value range is within a predetermined range from the required percentile value range calculated in the previous interval.

15. The monitoring server of claim 1, wherein an initial resolution model is determined by a user configuration.

16. The monitoring server of claim 1, wherein the code comprises instructions to receive from the monitored nodes the measured values of the performance metric located in a value range in case the number of measured values in the value range is smaller than a predetermined threshold.

17. The monitoring server of claim 1, wherein the resolution model includes different resolutions in different value regions.

18. A method for monitoring performance metric values by dynamically controlling a resolution model, comprising:

aggregating data points representing measured values of a performance metric, the data points sorted to value ranges according to a resolution model, the data points received from a plurality of monitored nodes that received the resolution model;

calculating a required percentile value range by calculating in which of the value ranges a required percentile is located;

determining a modified resolution model based on a required resolution of the required percentile value range; and

sending the modified resolution model to the plurality of monitored nodes.

19. The method of claim 18, further comprising modifying the resolution model for the calculated required percentile value range, and in response to sending the modified resolution model to the monitored nodes, receiving and aggregating data points representing measured values within the required percentile value range, sorted to value sub-ranges according to the modified resolution model.

20. The method of claim 18, further comprising sending the modified resolution model exclusively to monitored nodes with data points within the calculated required percentile value range.

21. The method of claim 18, further comprising receiving, in response to sending the modified resolution model, exclusively data points within the calculated required percentile value range.

22. The method of claim 18, further comprising modifying the resolution model to include higher resolution in proximity to the required percentile value range than in other value ranges.

23. The method of claim 18, further comprising modifying the resolution model by calculating a logarithmic resolution around the middle value of the required percentile value range, without changing the number of value ranges.

24. The method of claim 18, further comprising receiving from each of the plurality of monitored nodes an indication of a quantity of data points located in each of the value ranges and summing the quantities of data points in each of the value ranges.

25. The method of claim 18, further comprising repeating the aggregating, calculating and determining until the required percentile value range has the required resolution.

26. The method of claim 18, further comprising determining the required resolution according to a predefined required accuracy for a monitored quantity.

27. The method of claim 18, further comprising repeatedly aggregating the data points, calculating the required percentile value range and determining the modified resolution model in each of successive time intervals, wherein in each time interval the data points are aggregated according to the resolution model determined in the previous time interval.

28. The method of claim 27, wherein the modified resolution model in each time interval is predefined, determined according to a desired number of repetitions, or determined according to a predefined resolution increase in each repetition.

29. The method of claim 27, further comprising calculating a rate of change of the required percentile value range according to the required percentile value ranges calculated in successive time intervals, predicting based on the calculated change rate an estimated required percentile value range of a next interval, and modifying the resolution model to include higher resolution in proximity to the estimated required percentile value range than in other value ranges.

30. The method of claim 29, further comprising predicting in each time interval an estimated required percentile value range by an auto-regressive model dependent on the number of previous time intervals.

31. The method of claim 29, further comprising determining lengths of the time intervals according to the rate of change of the required percentile value range, so that the estimated required percentile value range is within a predetermined range from the required percentile value range calculated in the previous interval.

32. The method of claim 18, wherein an initial resolution model is determined by a user configuration.

33. The method of claim 18, further comprising receiving from the monitored nodes the measured values of the performance metric located in a value range in case the number of measured values in the value range is smaller than a predetermined threshold.

34. The method of claim 18, wherein the resolution model includes different resolutions in different value regions.

Description:
AN EFFICIENT METHOD FOR AGGREGATING AND MONITORING IN LARGE SCALE DISTRIBUTED SYSTEMS

BACKGROUND

The present invention, in some embodiments thereof, relates to a monitoring server for monitoring the performance of distributed computing systems, and more specifically, but not exclusively, to monitoring performance by dynamically modifying the resolution of a distribution model representing the measured performance of distributed computers.

Cloud computing and data center services are popular storage and computing solutions for large organizations. However, monitoring the performance metrics of multiple monitored nodes presents a technical challenge to operators. An application in a Cloud computing service may run on thousands or tens of thousands of monitored nodes, for example VMs, containers, computers, and the like. To monitor system-wide performance, each of the associated monitored nodes, referred to herein as monitored entities, may report performance data to a monitoring server, which aggregates the performance data and calculates the system performance. The required performance of the system is often specified in a service level agreement (SLA) between the service provider and the client. Client applications often have dynamic requirements for resources, which may be allocated by the Cloud and/or data center using advanced auto scaling. However, scaling of resources complicates monitoring of performance in real time. Feedback is required from the monitored entities only when they are allocated to a specific client, in order to calculate the effect of auto scaling actions on the SLA of that client.

For example, an SLA may specify reporting the 99th percentile of the response time of all monitored entities. In order to determine the 99th percentile of an application executing on a cloud and/or data center, every monitored entity must transmit all response times to a monitoring server, which aggregates the data and calculates the 99th percentile response time.

Calculating performance metrics may often result in the monitored entities transmitting a quantity of data that overloads the local network, reducing the quality of service to the client and/or reducing the availability of the cloud operator network. The quantity of reported data may also cause processing load on the monitoring server, which may increase reporting latency. Existing solutions for collecting performance data include NewRelic, AppDynamics, Dynatrace, and Sysdig; however, these solutions generate volumes of performance data that may reduce the quality of service to a client. Existing solutions for reducing the volume of performance data include the Q-Digest algorithm. However, the Q-Digest algorithm cannot process dynamic allocation of monitored entities, requires predefining an appropriate compression parameter, and its accuracy is limited by the chosen compression factor.

SUMMARY

It is an objective of the present invention to provide systems and methods for monitoring performance metric values by dynamically controlling a resolution model. The foregoing and other objects are achieved by the features of the independent claims. Further implementation forms are apparent from the dependent claims, the description and the figures.

According to a first aspect of the invention there is provided a monitoring server for monitoring performance metric values by dynamically controlling a resolution model, comprising: a memory storing a code, at least one hardware processor coupled to the memory for executing the code, the code comprising: instructions to aggregate data points representing measured values of a performance metric, the data points sorted to value ranges according to a resolution model, the data points received from a plurality of monitored nodes that received the resolution model from the monitoring server, instructions to calculate a required percentile value range by calculating in which of the value ranges a required percentile is located; and instructions to determine a modified resolution model based on a required resolution of the required percentile value range, and to send the modified resolution model to the plurality of monitored nodes. This first aspect of the invention provides the advantage of reducing the quantity of data required to be transmitted in order for a monitoring server to monitor performance metrics of a plurality of monitored entities.

According to a second aspect of the invention there is provided a method for monitoring performance metric values by dynamically controlling a resolution model, comprising: aggregating data points received from a plurality of monitored entities that received the resolution model, the data points representing measured values of a performance metric and sorted to value ranges according to the resolution model, calculating in which of the value ranges a required percentile value is located, determining a modified resolution model based on a required resolution of the required percentile value range, and sending the modified resolution model to the plurality of monitored nodes. This second aspect of the invention provides a method for reducing the quantity of data required to be transmitted in order for a monitoring server to monitor performance metrics of a plurality of monitored entities.

With reference to the first and/or second aspects, in a first possible implementation the resolution model is modified for the calculated required percentile value range, and in response to sending the modified resolution model to the monitored nodes, data points representing measured values within the required percentile value range, sorted to value sub-ranges according to the modified resolution model, are received and aggregated. This implementation provides the advantage of enabling the monitoring server to calculate a new distribution model based on the transmitted modified resolution model, by having the monitored entities transmit distribution models to the monitoring server that are based on the modified resolution model.

With reference to the first and/or second aspects, or the first possible implementation, in a second possible implementation, the modified resolution model is sent exclusively to monitored nodes with data points within the calculated required percentile value range. This implementation provides the advantage of reducing network traffic by eliminating null responses from monitored nodes.

With reference to the first and/or the second aspects, or the first or the second possible implementation, in a third possible implementation, in response to sending the modified resolution model, only data points within the calculated required percentile value range are received. This implementation provides the advantage of reducing network traffic from the monitored nodes by eliminating the sending of data points that are not within the required percentile value range.

With reference to the first and/or the second aspects, or the first, the second, or the third possible implementation, in a fourth possible implementation, the resolution model is modified to include higher resolution in proximity to the required percentile value range than in other value ranges. This implementation provides the advantage of increasing the accuracy of the data points, potentially reducing the number of iterations necessary to achieve a required resolution.

With reference to the first and/or the second aspects, or the first, the second, the third, or the fourth possible implementation, in a fifth possible implementation, the resolution model is modified by calculating a logarithmic resolution around the middle value of the required percentile value range, without changing the number of value ranges. This implementation provides the advantage of increasing the accuracy of the data points, potentially reducing the number of iterations necessary to achieve a required resolution.

With reference to the first and/or the second aspects, or the first, the second, the third, the fourth, or the fifth possible implementation, in a sixth possible implementation, the aggregation comprises receiving from each of the plurality of monitored nodes an indication of a quantity of data points located in each of the value ranges and summing the quantities of data points in each of the value ranges. This implementation provides the advantage of allowing the aggregated data points to collectively represent the measured performance metrics of all the monitored nodes.

With reference to the first and/or the second aspects, or the first, the second, the third, the fourth, the fifth, or the sixth possible implementation, in a seventh possible implementation, the aggregating, calculating and determining are repeated until the required percentile value range has the required resolution. This implementation provides the advantage of automatically repeating the code instructions until a required resolution is achieved.

With reference to the first and/or the second aspects, or the first, the second, the third, the fourth, the fifth, the sixth, or the seventh possible implementation, in an eighth possible implementation, the required resolution is determined according to a predefined required accuracy for a monitored quantity. This implementation provides the advantage of a default required accuracy, allowing the monitoring server to operate independently of user instructions.

With reference to the first and/or the second aspects, or the first, the second, the third, the fourth, the fifth, the sixth, the seventh, or the eighth possible implementation, in a ninth possible implementation, the data points are aggregated, a required percentile value range is calculated, and the modified resolution model is determined in each of successive time intervals, wherein in each time interval the data points are aggregated according to a resolution model determined in the previous time interval. This implementation provides the advantage of reducing latency by determining the modified resolution model prior to receiving the data points.

With reference to the first and/or second aspects and/or to the ninth possible implementation, in a tenth possible implementation the modified resolution model in each time interval is predefined, determined according to a desired number of repetitions or determined according to a predefined resolution increase in each repetition. This implementation provides the advantage of allowing a user to control the number of repetitions or the rate of change of the resolution model between repetitions.

With reference to the first and/or second aspects and/or to the ninth possible implementation, in an eleventh possible implementation, a rate of change of the required percentile value range is calculated according to the required percentile value ranges calculated in successive time intervals, an estimated required percentile value range of a next interval is predicted based on the calculated change rate, and the resolution model is modified to include higher resolution in proximity to the estimated required percentile value range than in other value ranges. This implementation provides the advantage of reducing the number of iterations required to achieve a desired resolution by predicting the required percentile value range based on previous iterations.

With reference to the first and/or second aspects and/or to the eleventh possible implementation, in a twelfth possible implementation, an estimated required percentile value range is predicted in each time interval by an auto-regressive model dependent on the number of previous time intervals. This implementation provides the advantage of reducing the number of iterations required to achieve a desired resolution by predicting the required percentile value range based on an auto-regressive model.

With reference to the first and/or second aspects and/or to the eleventh possible implementation, in a thirteenth possible implementation, lengths of the time intervals are determined according to the rate of change of the required percentile value range, so that the estimated required percentile value range is within a predetermined range from the required percentile value range calculated in the previous interval. This implementation provides the advantage of determining a resolution model appropriate for the current interval by adapting the lengths of the time intervals to the rate of change of the required percentile value range.

With reference to the first and/or the second aspect, or the first, the second, the third, the fourth, the fifth, the sixth, the seventh, the eighth, the ninth, the tenth, the eleventh, the twelfth, or the thirteenth possible implementation, in a fourteenth possible implementation an initial resolution model is determined by a user configuration. This implementation provides the advantage of allowing user control of the initial resolution model.

With reference to the first and/or the second aspect, or the first, the second, the third, the fourth, the fifth, the sixth, the seventh, the eighth, the ninth, the tenth, the eleventh, the twelfth, the thirteenth, or the fourteenth possible implementation, in a fifteenth possible implementation, the measured values of the performance metric located in a value range are received from the monitored nodes when the number of measured values in the value range is smaller than a predetermined threshold. This implementation provides the advantage of reducing network traffic in the case where the distribution model would be larger than the measured values themselves.

With reference to the first and/or the second aspect, or the first, the second, the third, the fourth, the fifth, the sixth, the seventh, the eighth, the ninth, the tenth, the eleventh, the twelfth, the thirteenth, the fourteenth, or the fifteenth possible implementation, in a sixteenth possible implementation the resolution model includes different resolutions in different value regions. This implementation provides the advantage of allowing greater resolution in value ranges predicted to contain the required percentile.

Other systems, methods, features, and advantages of the present disclosure will be or become apparent to one with skill in the art upon examination of the following drawings and detailed description. It is intended that all such additional systems, methods, features, and advantages be included within this description, be within the scope of the present disclosure, and be protected by the accompanying claims. Unless otherwise defined, all technical and/or scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the invention pertains. Although methods and materials similar or equivalent to those described herein can be used in the practice or testing of embodiments of the invention, exemplary methods and/or materials are described below. In case of conflict, the patent specification, including definitions, will control. In addition, the materials, methods, and examples are illustrative only and are not intended to be necessarily limiting.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

Some embodiments of the invention are herein described, by way of example only, with reference to the accompanying drawings. With specific reference now to the drawings in detail, it is stressed that the particulars shown are by way of example and for purposes of illustrative discussion of embodiments of the invention. In this regard, the description taken with the drawings makes apparent to those skilled in the art how embodiments of the invention may be practiced. In the drawings:

FIG. 1 is a flowchart schematically representing a Performance Measurement method for monitoring performance of a distributed computing system by dynamically modifying the resolution of a distribution model representing performance measurements, according to some embodiments of the present invention;

FIG. 2A is a schematic illustration of an exemplary Performance Monitoring server for monitoring performance of a distributed computing system by dynamically modifying the resolution of a distribution model representing performance measurements, according to some embodiments of the present invention;

FIG. 2B is a schematic illustration of an exemplary monitored entity for reporting measured performance values to a Performance Monitoring Server, according to some embodiments of the present invention;

FIG. 2C is a schematic illustration of an exemplary distributed computer network comprising a Monitoring Server 200 and multiple monitored entities 250;

FIG. 3A is a histogram graphical representation of a master distribution model, according to some embodiments of the present invention;

FIG. 3B is a master distribution model of the histogram represented in FIG. 3A, according to some embodiments of the present invention;

FIG. 4A is a histogram graphical representation of a master distribution model received in response to transmitting a modified resolution model, according to some embodiments of the present invention; and

FIG. 4B is a master distribution model of the histogram represented in FIG. 4A, according to some embodiments of the present invention.

DETAILED DESCRIPTION

The present invention, in some embodiments thereof, relates to a monitoring server for monitoring the performance of distributed computing systems, and more specifically, but not exclusively, to monitoring measured performance of distributed computers by dynamically modifying the resolution of a distribution model representing the measured performance.

Monitoring performance metrics of distributed computers of a Cloud computing service and/or data centers is a requirement of many SLAs. An SLA may, for example, require reporting every minute on metrics of a service, for example response latency to service requests, dropped packets, Quality of Service (QoS), and/or any other metric of performance.

The required monitoring may generate a volume of network traffic in a distributed system that causes increased network latency. For example, thousands or more monitored nodes, referred to herein as monitored entities, may be required to report millions or more values of measured performance per minute. Performance measuring may cause degradation of service, for example by increasing network latency and/or clogging a cloud network with monitoring traffic. Further, a monitoring server must collect and analyze data representing measurements of a performance metric from each of multiple monitored entities, potentially increasing the latency of the monitoring server reporting.

A monitoring server may be a computing device adapted to monitor performance metric values received from monitored entities, and a monitored entity may be a computing device that measures and transmits values of performance metrics to a monitoring server, for example over a computer network. Values of measured performance metrics may be, for example, the number of milliseconds (ms) for a monitored entity to respond to a service request, the percentage of dropped and/or out-of-order packets received by a monitored entity, the number of completed requests, and the like.

A solution for reducing the volume of data transmitted by monitored entities to a monitoring server is to transmit in place of the performance measurement values a statistical frequency distribution, referred to herein as a distribution model, which represents the performance measurement values. The distribution model may be graphically represented, for example as a histogram.

A distribution model may be a set of numerals, referred to herein as data points, each representing the frequency or count of performance measurement values that occur within a particular interval or value range. According to implementations of the present invention, the performance measurement may represent the response times to service requests, in units of milliseconds, of a monitored entity during a time interval. Each value range may be a scope of measured performance values, for example defined by a pair of minimum and maximum values of the measured performance values, such as 300ms to 350ms, 350ms to 400ms, and the like.

The value ranges of a distribution model may be represented by a set of numbers and/or pairs of numbers, for example a list and/or a table, where each number and/or pair of numbers represents the scope of a corresponding value range, as described above. The set of value ranges of a distribution model is referred to herein as a resolution model.

For example, a resolution model with 10 equal value ranges for response times measured by monitored entities may be calculated by code instructions executing on a monitoring server in the following manner: the values of the lowest and highest measured response times may be predetermined, for example to be 800ms and 1,300ms respectively. The scope of all the measured values is the result of subtracting the lowest measured value from the highest measured value, where in this example the scope is 500ms. The scope of each value range is the quotient obtained by dividing the scope of all measured values by the number of value ranges, where in this example the quotient of 500ms divided by 10 yields value ranges of 50ms. The scope of each of the 10 value ranges is defined in the following manner: the 50ms scope of the lowest value range starts from the lowest measured response time of 800ms, resulting in a value range of 800ms to 850ms. Each of the remaining 9 value ranges of 50ms is defined with a lowest value equal to the highest value of the preceding value range. The resolution model as calculated above comprises the set of 10 value ranges of 50ms each, with a combined scope of 800ms to 1,300ms.
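
The calculation described above may be sketched as follows; this is a minimal illustrative sketch in Python (the function and variable names are assumptions, not taken from the patent), assuming equal-width value ranges and predetermined lowest and highest measured values:

    # Sketch: building a resolution model of equal-width value ranges,
    # as in the 800ms-1,300ms example above.
    def build_resolution_model(low_ms, high_ms, num_ranges):
        """Return a list of (lower, upper) value ranges covering [low_ms, high_ms]."""
        width = (high_ms - low_ms) / num_ranges  # 500ms / 10 = 50ms in the example
        return [(low_ms + i * width, low_ms + (i + 1) * width)
                for i in range(num_ranges)]

    resolution_model = build_resolution_model(800, 1300, 10)
    # [(800.0, 850.0), (850.0, 900.0), ..., (1250.0, 1300.0)]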

Continuing the above example, a distribution model representing 1,000 values of response time measured by a monitored entity may be calculated by code instructions executing on the monitored entity in the following manner. The monitored entity receives from the monitoring server the resolution model as calculated above. The monitored entity calculates 10 data points by identifying the number of measured response times whose values fall within each of the 10 value ranges. The resulting data points comprise the distribution model, which may be transmitted to the monitoring server.
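
A corresponding illustrative sketch of the monitored entity's side, under the same assumptions, sorts the measured response times into the value ranges it received and returns the resulting data points:

    # Sketch: a monitored entity counting how many of its measured response
    # times fall within each value range of the received resolution model.
    def build_distribution_model(measurements, resolution_model):
        """Return one data point (a count of measurements) per value range."""
        counts = [0] * len(resolution_model)
        last = len(resolution_model) - 1
        for value in measurements:
            for i, (low, high) in enumerate(resolution_model):
                if low <= value < high or (i == last and value == high):
                    counts[i] += 1
                    break
        return counts
    # 1,000 measured values are reduced to 10 data points, one per value range.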

In order to represent the performance of a cloud computing system and/or data center comprising multiple monitored entities, a master distribution model may be aggregated from multiple distribution models by code instructions executing on the monitoring server. The aggregation comprises summing the data points of corresponding value ranges of distribution models received from multiple monitored entities.
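
Under the same illustrative assumptions, the aggregation performed by the monitoring server reduces to summing corresponding data points:

    # Sketch: aggregating distribution models received from multiple monitored
    # entities into a master distribution model by summing corresponding counts.
    def aggregate(distribution_models):
        return [sum(counts) for counts in zip(*distribution_models)]

    # e.g. three entities each reporting [12, 12, 12] yield a master
    # distribution model of [36, 36, 36], as in the example given in the
    # detailed description below.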

An important aspect of monitoring performance of monitored entities is achieving a required resolution; for example, an SLA may require reporting accuracy within 5ms for a required percentile of the response time of all monitored entities. In the above example each value range of the distribution model is 50ms, and therefore the value of the performance measurements is only represented within a 50ms resolution, which would not fulfill the required accuracy set out by the SLA.

In order to achieve a required resolution of performance measurement, code instructions executing on a monitoring server may modify a resolution model, for example according to requirements of an SLA, and transmit the modified resolution model to the monitored entities. By iteratively modifying and transmitting the resolution model to monitored entities, a monitoring server may control the resolution of performance measurements over time.

The present invention, in some embodiments thereof, presents methods and systems for receiving multiple distribution models representing performance measurement values from monitored entities, aggregating the received distribution models into a master distribution model, modifying a resolution model according to requirements to fulfill a required level of resolution, and transmitting the modified resolution model to the monitored entities.

By substituting a distribution model for measured performance values, the present invention reduces the amount of data transmitted from monitored entities to a monitoring server. As shown in the above example, 1,000 measured values of performance may be represented by a distribution model comprising 10 data points, whereby the quantity of transmitted data may be reduced by two orders of magnitude. By transmitting a modified resolution model to monitored entities, the present invention enables the monitoring server to achieve any required resolution for a distribution model, and to dynamically modify the resolution of the data points transmitted by monitored entities.

Before explaining at least one embodiment of the invention in detail, it is to be understood that the invention is not necessarily limited in its application to the details of construction and the arrangement of the components and/or methods set forth in the following description and/or illustrated in the drawings and/or the Examples. The invention is capable of other embodiments or of being practiced or carried out in various ways. The present invention may be a system, a method, and/or a computer program product. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention. The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.

Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network.

The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.

Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.

The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.

Reference is now made to FIG. 1, a flowchart schematically representing Performance Measurement method 100 for monitoring performance of a distributed computing system by dynamically modifying the resolution of a distribution model representing measured performance of the distributed computers, according to some embodiments of the present invention. A distribution model representing performance measurements may comprise a set of entries, for example a list and/or table of entries. The entries may be a number, a character, and/or any other representation of a real number. Each entry corresponds to a range of values of performance measurements, and a numerical value of each entry represents the number of performance measurements whose measured value falls within the range of that entry. The order of the entries of the distribution model and the resolution model may correspond to the order of the value ranges, for example from the lowest value range to the highest value range.

Performance Measurement method 100 begins when, in response to sending a resolution model, distribution models representing performance measurements are received. For example, a plurality of monitored entities may transmit performance monitoring data to a monitoring server in response to receiving a resolution model from the monitoring server. The monitoring server may be for example a Performance Monitoring Server 200 as described below in FIG. 2A. The monitored entity may be for example a monitored entity 250 as described below in FIG. 2B.

When the resolution of the distribution model is lower than a required resolution, for example a requirement from a client for a specific accuracy of reporting performance values, the monitoring server may modify the resolution model and transmit the modified resolution model, for example to the plurality of monitored entities, as described below.

For example, Performance Measurement method 100 may comprise calculating the 99th percentile of measured response time of monitored entities with a resolution of 5ms. Multiple distribution models are received and aggregated into a master distribution model. The value range that contains the 99th percentile of measured response times is identified, and the value range is compared to the 5ms required resolution. When the value range containing the 99th percentile value is less than or equal to 5ms, Performance Measurement method 100 is complete. When the received distribution model value range is greater than 5ms, Performance Measurement method 100 enters a second iteration by generating and transmitting a modified resolution model with value ranges of 5ms or less.

Reference is now made to FIG. 2A, a schematic illustration of an exemplary Performance Monitoring server 200 for monitoring performance of a distributed computing system by dynamically modifying the resolution of a distribution model representing performance measurements, according to some embodiments of the present invention. A Performance Monitoring server 200 comprises an input/output (I/O) interface 202, a hardware (HW) processor(s) 204, and storage 208.

Performance Monitoring Server 200 is adapted to receive distribution models from monitored entities 250, as described below in FIG. 2B, and to transmit to monitored entities 250 modified resolution models, for example by executing code in Communications Module 213 to instruct I/O 202 to receive and transmit.

I/O 202, HW processor(s) 204, and storage 208 may comprise for example a server, a desktop computer, an embedded computing system, an industrial computer, a ruggedized computer, a laptop, a cloud computer, a private cloud, a public cloud, a hybrid cloud, and/or any other type of computing system. Optionally, Performance Monitoring server 200 comprises a virtual machine (VM) in place of I/O 202, HW processor(s) 204, and storage 208. Performance Monitoring method 100 may be executed by HW Processor(s) 204 executing code from one or more software modules in storage 208, for example Aggregator module 210, Percentile Calculator module 211, Modify Resolution module 212, and Communications module 213. A software module refers to a plurality of program instructions stored in a non-transitory medium such as the storage 208 and executed by a processor such as the processor(s) 204.

Storage 208 may include one or more non-transitory persistent storage devices, for example, a hard drive, a Flash array and the like. The storage 208 may further comprise one or more network storage devices, for example, a storage server, a network accessible storage (NAS), a network drive, and/or the like.

I/O 202 may include one or more interfaces, for example, a network interface, a memory interface and/or a storage interface for connecting to the respective resource(s), i.e. network resources, memory resources and/or storage resources. Optionally, I/O 202 may comprise one or more input interfaces, for example a keyboard, a soft keyboard, a touch screen, a graphical user interface (GUI), a voice to text system, and/or any other data input interface. I/O 202 may comprise one or more output interfaces, for example a screen, a touch screen, video display, and/or any other visual display device. Optionally, I/O 202 may comprise a network interface card (NIC), a wireless router, and/or any other type of network interface adapted to communicating with network 230.

Network 230 may be any type of data network, for example, a local area network (LAN), a fiber optic network, an Ethernet LAN, a fiber optic LAN, a digital subscriber line (DSL), a wireless LAN (WLAN), a wide area network (WAN), a broadband connection, an Internet connection using an Internet Service Provider (ISP) and/or any other type of computer network. Network 230 may employ any type of data networking protocols, including transport control protocol and/or internet protocol (TCP/IP), user datagram protocol (UDP), Bluetooth, Bluetooth low energy (BLE), 802.11 compliant wireless local area network (WLAN), and/or any other wired or wireless LAN or WAN protocol.

Reference is now made to FIG. 2B, a schematic illustration of an exemplary monitored entity 250 for reporting measured performance values to a Performance Monitoring Server 200, according to some embodiments of the present invention.

Optionally, monitored entity 250 comprises an input/output (I/O) interface 202, a hardware (HW) processor(s) 204, and a storage 208 comprising software code instructions, for example Performance Measurement module 260, Distribution Model module 261, Communications module 262.

Optionally, the software code instructions stored in storage 208, when executed on processor(s) 204, instruct receiving a resolution model from Performance Monitoring Server 200, collecting data representing monitored entity 250 performance measurements, calculating a distribution model representative of the collected data according to the resolution model, transmitting the distribution model to Performance Monitoring Server 200, receiving a modified resolution model from Performance Monitoring Server 200, recalculating a distribution model according to the modified resolution model, and transmitting the recalculated distribution model to the Performance Monitoring server 200.

Reference is now made to FIG. 2C, a schematic illustration of an exemplary distributed computer network comprising a Monitoring Server 200 and multiple monitored entities 250, according to some embodiments of the current invention. As shown in FIG. 2C, multiple monitored entities 250 are connected via network 230 to a Performance Monitoring server 200.

Reference is now made again to FIG. 1. As shown in 101, Performance Monitoring method 100 begins when distribution models are received according to a transmitted resolution model, for example by code instructions from Communications Module 213 executing on processor(s) 204 instructing I/O 202.

Optionally, the resolution models are transmitted to a plurality of monitored entities 250, and the distribution models are received from monitored entities 250. For example, a list of monitored entities may be received as input to Aggregator Module 210 from user input and/or from a network resource via I/O 202, and/or from storage 208. The list may be updated as resources are allocated and/or deallocated to a specific service whose performance is being monitored, for example when the Cloud and/or data center employ advanced auto scaling to allocate/deallocate resources.

Each received distribution model may be for example digital data contained in a computer file and/or any other type of computer message received from network 230 to I/O 202.

Each distribution model may be representative of performance values measured by the monitored entity 250, for example response time measured in ms to a service request.

As shown in 102, the received distribution models are aggregated, for example by code instructions from Aggregator module 210 executing on processor(s) 204. Optionally, as described above, corresponding value ranges from all received distribution models are summed to calculate a master distribution model. For example, when a resolution model comprises three value ranges and three distribution models are received, where all three entries in each distribution model are the number 12, the master distribution model would comprise three entries, each with the value of 36.

As shown in 103, a required percentile value range is calculated, for example by code instructions in Percentile Calculator Module 211 executing on processor(s) 204.

The required percentile value range comprises a value range in which a required percentile is found. The required percentile is a percentile of the values of the master distribution model that is required to be calculated. Optionally, a required percentile may be received as input to Percentile Calculator Module 211 from user input and/or a network resource via I/O 202, and/or from storage 208.

For example, a performance measurement requirement of an SLA may be to calculate the 99th percentile of the response times to service requests by a plurality of monitored entities 250. In this case, 99% is the required percentile. The required percentile value range is the value range within the master distribution model that contains the smallest value which is greater than 99% of all the values in the master distribution model.
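
As an illustrative sketch (function names assumed, not from the patent), the required percentile value range can be located from the cumulative counts of the master distribution model:

    # Sketch: locating the value range of a master distribution model that
    # contains the required percentile.
    def percentile_range_index(master_counts, percentile):
        """Return the index of the value range containing the required percentile."""
        total = sum(master_counts)
        rank = total * percentile  # e.g. 355 * 0.99 = 351.45, i.e. the 352nd value
        cumulative = 0
        for i, count in enumerate(master_counts):
            cumulative += count
            if cumulative >= rank:
                return i
        return len(master_counts) - 1

Applied to the example of FIG. 3A and 3B below (10 value ranges of 50ms and 355 values in total), this kind of calculation identifies the 450ms-500ms range as the required percentile value range.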

As shown in 104, the required percentile value range is compared to a required resolution, for example by code instructions in Modify Resolution Module 212 executing on processor(s) 204. The required resolution comprises a received requirement. Optionally, a required resolution may be received as input to Percentile Calculator Module 211 from user input and/or from a network resource via I/O 202, and/or from storage 208. For example, a performance measurement requirement of an SLA may be to calculate within 5ms the 99th percentile of the response times to service requests by a plurality of monitored entities 250. In this case, 5ms is the required resolution.

As shown in 106, when the required percentile value range is less than the required resolution, Performance Monitoring method 100 is completed.

As shown in 105, when the required percentile value range is greater than the required resolution, a modified resolution model is calculated, for example by code instructions in Modify Resolution Module 212 executing on processor(s) 204.

Optionally, a modified resolution model is calculated with value ranges equal to the required resolution. Performance Measurement method 100 then continues with step 101 as described above, where the modified resolution model is transmitted. Optionally, Performance Measurement method 100 continues from 105 to 100, thereby initiating a new iteration of Performance Measurement method 100. Performance Measurement method 100 may iterate repeatedly until the required percentile value range has the required resolution. For example, if the required percentile value range changes during Performance Measurement method 100, as shown in 105, the resolution model may need to be modified repeatedly during a series of iterations.
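
Putting the preceding steps together, the following is a minimal illustrative sketch of the iteration, using the build_resolution_model and aggregate helpers sketched earlier; collect_distributions stands in for transmitting the resolution model to the monitored entities 250 and receiving their distribution models, and is a hypothetical placeholder:

    # Sketch of steps 101-106: iterate until the value range containing the
    # required percentile is no wider than the required resolution.
    def measure_percentile(collect_distributions, resolution_model,
                           percentile=0.99, required_resolution_ms=5):
        master = aggregate(collect_distributions(resolution_model))   # 101-102
        rank = sum(master) * percentile  # global rank, e.g. 355 * 0.99 -> 352nd value
        while True:
            # 103: locate the value range containing the rank-th value
            cumulative, idx = 0, 0
            for idx, count in enumerate(master):
                if cumulative + count >= rank:
                    break
                cumulative += count
            low, high = resolution_model[idx]
            if high - low <= required_resolution_ms:                  # 104
                return (low, high)                                    # 106: done
            # 105: finer value ranges covering only the located range; only
            # values within it are reported in the next iteration
            rank -= cumulative               # rank of the percentile within the range
            num_ranges = max(1, round((high - low) / required_resolution_ms))
            resolution_model = build_resolution_model(low, high, num_ranges)
            master = aggregate(collect_distributions(resolution_model))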

Optionally, Performance Measurement method 100 may be repeated, as described above in 105, for the performance measurements collected by monitored entities 250 during a single time interval. For example, Performance Measurement method 100 may be repeated for performance measurements collected by the monitored entities between 12:00 and 12:01 of a specific date until the required percentile value range has the required resolution.

Optionally, the transmitted resolution model of a given time interval is calculated in the previous time interval.

Optionally, Performance Measurement method 100 receives distribution models representing performance measurements collected during a time interval, for example by code instructions in Communications Module 213 instructing monitored entities 250 to collect performance measurements during a time interval.

Optionally, the resolution model is predetermined for a time interval. For example a required resolution model may be received as input to Percentile Calculator Module 211 from user input via I/O 202 from a network resource, and/or from storage 208.

Optionally, Performance Measurement Method 100 is iterated at constant time intervals, for example every 2 minutes, for example by code instructions from Aggregator module 210 executing on processor(s) 204 instructing to send a resolution model. The time interval, and the number of times to repeat Performance Measurement method 100, may be received to storage 208 from user input and/or from a network resource via I/O 202.

Optionally, when Performance Measurement Method 100 is repeated at constant time intervals as described above, the modified resolution model for each time interval is predefined, for example received from user input and/or from a network resource via I/O 202, and used as input to code instructions for Modify Resolution module 212 executing on processor(s) 204. The predefined resolution models may be according to a desired number of iterations. For example, when four iterations are required, each of the four resolution models may be predefined.

The predefined resolution model, as described above, may be according to a desired rate of change of the resolution model. For example, each resolution model may have 20% higher resolution than the resolution model of the preceding time interval.

Optionally, the required percentile value range may be predictively calculated according to a rate of change of the required percentile value range in previous iterations of Performance Measurement method 100, for example by code instructions from Percentile Calculator module 211 executing on processor(s) 204. For example, when the required percentile value ranges of the three immediately previous iterations are 235ms-240ms, 240ms-245ms and 245ms-250ms, the rate of change is calculated to be 5ms per time interval, and the resolution model of the current interval may comprise a required percentile value range of 250ms-255ms.
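
A minimal illustrative sketch of this rate-of-change prediction (names assumed, not from the patent), applied to the 235ms-250ms example above:

    # Sketch: predicting the next required percentile value range from the
    # average change of the ranges found in previous time intervals.
    def predict_next_range(previous_ranges):
        """previous_ranges: (low, high) tuples from successive time intervals."""
        lows = [low for low, _ in previous_ranges]
        rate = (lows[-1] - lows[0]) / (len(lows) - 1)   # change per interval
        width = previous_ranges[-1][1] - previous_ranges[-1][0]
        next_low = previous_ranges[-1][0] + rate
        return (next_low, next_low + width)

    predict_next_range([(235, 240), (240, 245), (245, 250)])  # -> (250.0, 255.0)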

Optionally, the required percentile value range and/or the size of value ranges may be predictively calculated according to an autoregressive model based on previous required percentile value ranges, for example by code instructions from Percentile Calculator module 211 executing on processor(s) 204.

Optionally, the required percentile value range and/or the size of value ranges may be predictively calculated according to machine learning based on previous required percentile value ranges, for example by code instructions from Percentile Calculator module 211 executing on processor(s) 204.

Optionally, the time interval may be calculated according to a rate of change of the required percentile value range in previous iterations of Performance Measurement method 100, for example by code instructions from Percentile Calculator module 211 executing on processor(s) 204. For example, in order for a master distribution model to have a required percentile value range within a predetermined range, the time interval is calculated according to the rate of change of the required percentile value range such that the required percentile value range will occur within a predetermined range during the calculated time interval.

Optionally, the value ranges of the resolution model and/or modified resolution model are not uniform in size, for example by code instructions in Modify Resolution Module 212 executing on processor(s) 204. For example, the value ranges in proximity to the required percentile value range may be smaller than value ranges not in proximity to the required percentile value range. In another example, the size of the value ranges is calculated logarithmically, where the size of the value ranges increases as they are more distant from the required percentile value range.
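
The exact formula is not specified here, so the following is only an illustrative sketch of one way to build such non-uniform value ranges, finest around the middle of the required percentile value range and growing wider with distance from it (the growth factor and all names are assumptions):

    # Sketch: non-uniform value ranges centered on the required percentile
    # value range; widths grow geometrically with distance from the center.
    def nonuniform_ranges(center_ms, finest_width_ms, ranges_per_side, growth=2.0):
        edges_above = [center_ms]
        edges_below = [center_ms]
        width = finest_width_ms
        for _ in range(ranges_per_side):
            edges_above.append(edges_above[-1] + width)
            edges_below.append(edges_below[-1] - width)
            width *= growth  # widths grow away from the center
        edges = sorted(set(edges_below + edges_above))
        return list(zip(edges[:-1], edges[1:]))

    # e.g. nonuniform_ranges(475, 5, 3) places the finest 5ms ranges
    # immediately around 475ms, the middle of the 450ms-500ms range.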

Optionally, the modified resolution model is only transmitted in response to received distribution models that represent at least one measured performance value within the required percentile value range, for example by code instructions in Modify Resolution Module 212 executing on processor(s) 204.
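
By way of non-limiting illustration, such a filter may be sketched as follows in Python, assuming each node's distribution model is represented as a dictionary mapping (low, high) value ranges to counts; the function name nodes_to_update and the node identifiers are assumptions made for the sketch.

# Minimal sketch: select only the nodes whose last distribution model reported
# at least one measured value inside the required percentile value range.
def nodes_to_update(node_distributions, required_range):
    req_low, req_high = required_range
    return [
        node for node, distribution in node_distributions.items()
        if any(count > 0 and low < req_high and high > req_low
               for (low, high), count in distribution.items())
    ]

print(nodes_to_update(
    {"node-a": {(400, 450): 3, (450, 500): 2}, "node-b": {(0, 50): 7}},
    (450, 500)))   # ['node-a']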

Optionally, the modified resolution model comprises only value ranges within the required percentile value range.

Optionally, when the number of measured performance values that the monitored entity needs to report is small, the measured values may be transmitted instead of a distribution model, for example by code instructions in Distribution Model Module 261 executing on processor(s) 204.
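
By way of non-limiting illustration, the choice between reporting raw values and a distribution model may be sketched as follows in Python; the threshold of 10 values and the report structure are assumptions made for the sketch.

# Minimal sketch: when only a few values were measured, report them directly;
# otherwise sort them into the value ranges of the current resolution model
# and report the resulting distribution model.
def build_report(measured_values, resolution_model, raw_threshold=10):
    if len(measured_values) <= raw_threshold:
        return {"type": "raw", "values": list(measured_values)}
    distribution = {value_range: 0 for value_range in resolution_model}
    for value in measured_values:
        for low, high in resolution_model:
            if low <= value < high:
                distribution[(low, high)] += 1
                break
    return {"type": "distribution", "model": distribution}

print(build_report([12, 48], [(0, 50), (50, 100)]))   # only 2 values, so raw values are sent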

Performance Measurement Method 100 is now demonstrated by way of example, where a modified resolution model is calculated with a required resolution of 5ms for the 99th percentile of a master distribution model representing measured response times of a plurality of monitored entities, as shown below in FIG. 3A, 3B, 4A, and 4B. Reference is now made to FIG. 3A, a histogram graphical representation of a master distribution model comprising 10 value ranges of 50ms each, according to some embodiments of the present invention. Reference is now also made to FIG. 3B, the master distribution model represented in FIG. 3A, according to some embodiments of the present invention. The master distribution model has a resolution of 50ms, and as shown in 301 the sum of the entries is 355. The rank of the 99th percentile is calculated as 355 × 0.99 = 351.45, which is rounded up to 352. As shown in 302, the 99th percentile is in the required percentile value range of 450ms-500ms.
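
By way of non-limiting illustration, the first pass of this example may be sketched as follows in Python; the per-range counts are placeholders chosen only so that they sum to 355, the actual counts being those shown in FIG. 3A and FIG. 3B.

import math

# Minimal sketch: locate the value range of the master distribution model that
# contains the 99th percentile. The model is an ordered list of
# ((low, high), count) entries with ranges in milliseconds.
master_model = [((r * 50, (r + 1) * 50), count)
                for r, count in enumerate([40, 60, 70, 65, 50, 30, 20, 10, 5, 5])]

def percentile_value_range(model, percentile):
    total = sum(count for _, count in model)   # 355 in this example
    rank = math.ceil(total * percentile)       # ceil(355 * 0.99) = ceil(351.45) = 352
    cumulative = 0
    for value_range, count in model:
        cumulative += count
        if cumulative >= rank:
            return value_range
    return model[-1][0]

print(percentile_value_range(master_model, 0.99))   # (450, 500)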

The required resolution is 5ms, so a modified resolution model is calculated and transmitted as described above in 105, with 10 value ranges of 5ms each that all fall within the required percentile value range of 450ms-500ms. Reference is now made to FIG. 4A, a histogram graphical representation of a master distribution model aggregated from distribution models received in response to transmitting the modified resolution model, according to some embodiments of the present invention. Reference is now also made to FIG. 4B, the master distribution model represented in FIG. 4A, according to some embodiments of the present invention.

As shown in 401, the received distribution model comprises data points representing 5 measured performance values, starting with the 350th value. The value range containing the 99th percentile is calculated in the following way. The rank of the 99th percentile has been identified as the 352nd value, and the modified resolution model comprises data points representing 5 measured performance values beginning with the 350th value. As shown in 402, the data point representing the value range 465ms-470ms contains the 352nd performance value. The value range 465ms-470ms therefore fulfills the requirement to report the 99th percentile with a resolution of 5ms.
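
By way of non-limiting illustration, the second pass of this example may be sketched as follows in Python; the per-range counts are placeholders consistent with the description above (5 values beginning with the 350th, the 352nd falling in 465ms-470ms), the actual counts being those shown in FIG. 4A and FIG. 4B.

# Minimal sketch: the modified resolution model covers 450-500ms in ten 5ms
# value ranges; walking the fine-grained counts from the rank preceding the
# range locates the value range holding the 99th percentile.
fine_model = [((450 + r * 5, 450 + (r + 1) * 5), count)
              for r, count in enumerate([0, 0, 0, 3, 1, 1, 0, 0, 0, 0])]

first_rank_in_range = 350   # the fine-grained distribution begins with the 350th value
percentile_rank = 352       # rank of the 99th percentile, from the first pass

cumulative = first_rank_in_range - 1   # 349 values precede the fine-grained ranges
for (low, high), count in fine_model:
    cumulative += count
    if cumulative >= percentile_rank:
        print(f"99th percentile lies in {low}ms-{high}ms")   # prints 465ms-470ms
        break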

The descriptions of the various embodiments of the present invention have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

It is expected that during the life of a patent maturing from this application many relevant monitoring servers will be developed and the scope of the term monitoring server is intended to include all such new technologies a priori.

As used herein the term "about" refers to ± 10 %. The terms "comprises", "comprising", "includes", "including", "having" and their conjugates mean "including but not limited to". This term encompasses the terms "consisting of" and "consisting essentially of". The phrase "consisting essentially of" means that the composition or method may include additional ingredients and/or steps, but only if the additional ingredients and/or steps do not materially alter the basic and novel characteristics of the claimed composition or method.

As used herein, the singular form "a", "an" and "the" include plural references unless the context clearly dictates otherwise. For example, the term "a compound" or "at least one compound" may include a plurality of compounds, including mixtures thereof.

The word "exemplary" is used herein to mean "serving as an example, instance or illustration". Any embodiment described as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments and/or to exclude the incorporation of features from other embodiments.

The word "optionally" is used herein to mean "is provided in some embodiments and not provided in other embodiments". Any particular embodiment of the invention may include a plurality of "optional" features unless such features conflict.

Throughout this application, various embodiments of this invention may be presented in a range format. It should be understood that the description in range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of the invention. Accordingly, the description of a range should be considered to have specifically disclosed all the possible subranges as well as individual numerical values within that range. For example, description of a range such as from 1 to 6 should be considered to have specifically disclosed subranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6 etc., as well as individual numbers within that range, for example, 1, 2, 3, 4, 5, and 6. This applies regardless of the breadth of the range.

Whenever a numerical range is indicated herein, it is meant to include any cited numeral (fractional or integral) within the indicated range. The phrases "ranging/ranges between" a first indicated number and a second indicated number and "ranging/ranges from" a first indicated number "to" a second indicated number are used herein interchangeably and are meant to include the first and second indicated numbers and all the fractional and integral numerals therebetween.

It is appreciated that certain features of the invention, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the invention, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable subcombination or as suitable in any other described embodiment of the invention. Certain features described in the context of various embodiments are not to be considered essential features of those embodiments, unless the embodiment is inoperative without those elements.

All publications, patents and patent applications mentioned in this specification are herein incorporated in their entirety by reference into the specification, to the same extent as if each individual publication, patent or patent application was specifically and individually indicated to be incorporated herein by reference. In addition, citation or identification of any reference in this application shall not be construed as an admission that such reference is available as prior art to the present invention. To the extent that section headings are used, they should not be construed as necessarily limiting.