METHODS AND APPARATUS FOR RUNNING AN ANALYTICS FUNCTION

Title:

METHODS AND APPARATUS FOR RUNNING AN ANALYTICS FUNCTION

Document Type and Number:

WIPO Patent Application WO/2017/028930

Kind Code:

Abstract:

A method in a distributed network for determining an optimal node on which to run an analytics function is disclosed. The method comprises determining a candidate list of potential nodes on which to run the analytics function. The method further comprises, for each node in the candidate list, calculating at least one parameter for the node relating to the cost of running the analytics function and determining the optimal node based on the at least one calculated parameter.

More Like This:

JP2012094193	NETWORK SYSTEM, COMMUNICATION CONTROL METHOD FOR NETWORK SYSTEM AND PROGRAM
WO/2003/075152	AUTOMATIC NETWORK LOAD BALANCING USING SELF-REPLICATING RESOURCES
JP2006501543	Common scheduler web service for distributed network environments

Inventors:

LARSSON TONY (SE)
SEYVET NICOLAS (SE)
MULAS VIELA IGNACIO (SE)

Application Number:

PCT/EP2015/069173

Publication Date:

February 23, 2017

Filing Date:

August 20, 2015

Export Citation:

Click for automatic bibliography generation Help

Assignee:

ERICSSON TELEFON AB L M (PUBL) (SE)

International Classes:

G06F9/50

Domestic Patent References:

WO2015032430A1

2015-03-12

Foreign References:

US20140372611A1

2014-12-18

Other References:

ARSLAN ENGIN ET AL: "Locality and Network-Aware Reduce Task Scheduling for Data-Intensive Applications", 2014 5TH INTERNATIONAL WORKSHOP ON DATA-INTENSIVE COMPUTING IN THE CLOUDS, IEEE, 21 November 2014 (2014-11-21), pages 17 - 24, XP032726756, DOI: 10.1109/DATACLOUD.2014.10
BALAJI PALANISAMY ET AL: "Purlieus: Locality-aware resource allocation for MapReduce in a cloud", HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS (SC), 2011 INTERNATIONAL CONFERENCE FOR, IEEE, 12 November 2011 (2011-11-12), pages 1 - 11, XP032081479, ISBN: 978-1-4503-0771-0
MANFU MA ET AL: "A Grid-distance Based Scheduling for Grid Resource Management", HIGH-PERFORMANCE COMPUTING IN ASIA-PACIFIC REGION, 2005. PROCEEDINGS. EIGHTH INTERNATIONAL CONFERENCE ON BEIJING, CHINA 30-03 NOV. 2005, PISCATAWAY, NJ, USA,IEEE, 30 November 2005 (2005-11-30), pages 1 - 6, XP010895507, ISBN: 978-0-7695-2486-3, DOI: 10.1109/HPCASIA.2005.4

Attorney, Agent or Firm:

MCPHERSON, Sofie (GB)

Download PDF:

View/Download PDF PDF Help

Claims:

CLAIMS

1 . A method in a distributed network for determining an optimal node on which to run an analytics function, the method comprising: determining a candidate list of potential nodes on which to run the analytics function (step 102);

for each node in the candidate list, calculating at least one parameter for the node relating to the cost of running the analytics function (step 104); and

determining the optimal node based on the at least one calculated parameter (step 106).

2. A method as claimed in claim 1 , wherein the step of determining a candidate list of potential nodes comprises determining a preliminary list of potential nodes based on the location of one or more data sources used in the analytics function.

3. A method as claimed in claim 2, wherein the candidate list is determined from the preliminary list using graph analysis methods.

4. A method as claimed in claim 3, wherein the graph analysis method comprises shortest path calculations.

5. A method as claimed in claim 4, wherein the analytics function requires data stored on at least a first node and a second node and the shortest path calculations determine the shortest path between the first node and the second node.

6. A method as claimed in claim 5, wherein the candidate list of potential nodes contains nodes along the shortest path between the first node and the second node.

7. A method as claimed in claim 5, wherein the analytics function further requires data stored on a third node and the shortest path calculations further determine the shortest path between the first node and the third node and the shortest path between the second node and the third node.

8. A method as claimed in claim 7, wherein the candidate list of potential nodes contains nodes along the shortest path between the third node and the first node and/or the third node and the second node.

9. A method as claimed in any one of claims 4 to 8, wherein the shortest path calculations are weighted according to one or more negotiated link costs between nodes in the network.

10. A method as claimed in any one of the preceding claims, further comprising calculating a total cost of computing the analytics function on each of the nodes in the candidate list, wherein the total cost is a combination of the values of one or more parameters calculated for the node and/or one or more neighbouring nodes.

1 1 . A method as claimed in claim 10 wherein the total cost is a weighted combination of the values of the one or more parameters, and wherein the step of determining the optimal node is based on the weighted combination.

12. A method as claimed in claim 1 1 , wherein the weighting reflects the frequency that data will need to be accessed on the one or more neighbouring nodes in order to compute the analytics function.

13. A method as claimed in claim 10 wherein the total cost is an average combination of the values of the one or more parameters calculated for the node and/or one or more neighbouring nodes, and wherein the step of determining the optimal node is based on the average combination.

14. A method as claimed in any one of claims 10 to 13, wherein the step of determining an optimal node comprises determining a prioritized list of nodes by ranking each candidate node according to its total cost and selecting an optimal node from the prioritized list.

15. A method as claimed in any one of the preceding claims, wherein the at least one parameter relates to at least one of:

- a node cost;

- a link cost;

- an uplink link cost;

- a downlink link cost;

- a measure of network utilization;

- the availability of computing resources at a node;

- the latency of one or more databases on a node;

- the speed at which data can be written to and/or read from one or more databases on a node; and

- the number of times a particular data source has to be accessed to run the analytics function.

16. A network node (300) in a distributed network, the network node comprising:

an analytics function location module (302) for determining an optimal node in the distributed network on which to run an analytics function, wherein the analytics function location module (302) is configured to: determine a candidate list of potential nodes on which to run the analytics function;

for each node in the candidate list, obtain at least one parameter relating to the cost of running the analytics function; and - determine the optimal node based on the at least one parameter.

17. A network node as claimed in claim 16, wherein the analytics function location module is configured to store and maintain a list of the locations of data sources in the network.

18. A network node as claimed in claim 17, wherein the analytics function location module is configured to receive one or more updates on the location of data sources in the network and update the list of locations of data sources accordingly.

19. A network node as in any one of claims 16, 17 and 18, wherein the analytics function location module is further configured to receive one or more parameters relating to the cost of running the analytics function.

20. A network node as in claim 19, wherein the analytics function location module is configured to receive the one or more parameters through gossip communication.

21 . A network node (400) in a distributed network, the network node comprising a processor (402) and a memory (404), said memory (404) containing instructions executable by said processor (402), whereby said network node (400) is operative to:

- determine a candidate list of potential nodes on which to run the analytics function; - calculate, for each node in the candidate list, at least one parameter for the node relating to the cost of running the analytics function; and

- determine the optimal node based on the at least one

calculated parameter.

22. A network node (500) in a distributed network, the network node comprising:

a first module (502) for determining a candidate list of potential nodes on which to run the analytics function;

a second module (504) for calculating, for each node in the candidate list, at least one parameter for the node relating to the cost of running the analytics function; and

a third module (506) for determining the optimal node based on the at least one calculated parameter.

23. A computer program which, when run on a computer, causes the computer to carry out a method according to any one of claims 1 to 15.

24. A computer program product comprising computer readable storage medium and a computer program according to claim 23 stored on the computer readable storage medium.

Description:

METHODS AND APPARATUS FOR RUNNING AN ANALYTICS FUNCTION

Technical field

This disclosure relates to a method and apparatus for running an analytics function. In particular, it relates to determining an optimal node in a distributed network on which to run an analytics function.

Background

Vast amounts of data is generated daily by people and their devices. One of the challenges of modern computing is how to deal with this data. In particular, when analytics clients (ACs) try to analyse data, the required data sources may be stored in different locations and this poses a logistical problem in distributed networks. Existing solutions to perform analytics in a cloud environment (e.g. a

distributed network) tend to be centrally located in data centres with unlimited resources, using frameworks such as Hadoop™ and Spark™. Alternatively they may be statistically provisioned/deployed in specific nodes. However, such frameworks can increase network overheads as data has to be copied from one or more storage nodes and moved through the network to a central node for processing.

Summary According to a first aspect, there is provided a method in a distributed network for determining an optimal node on which to run an analytics function. In a first step, the method comprises determining a candidate list of potential nodes on which to run the analytics function. The method further comprises, for each node in the candidate list, calculating at least one parameter for the node relating to the cost of running the analytics function, and determining the optimal node based on the at least one calculated parameter. Such a method has an advantage of minimizing the network overhead associated with processing analytics functions, by dynamically choosing which node in the network to run an analytics function on. Thus, instead of executing analytics functions at a designated or central node each time, this method allows the analytics function to be run on an optimal node in the network, where the optimal node can be chosen in order to reduce the network overhead associated with issues such as database latency, bottlenecks, actuation and data location. According to an embodiment, the step of determining a candidate list of potential nodes comprises determining a preliminary list of potential nodes based on the location of one or more data sources used in the analytics function. According to a second embodiment, the candidate list is determined from the preliminary list using graph analysis methods. The graph analysis method may comprise, for example, shortest path calculations.

According to a third embodiment, the analytics function requires data stored on at least a first node and a second node, and wherein the shortest path calculations determine the shortest path between the first node and the second node. In this embodiment, the candidate list of potential nodes contains nodes along the shortest path between the first node and the second node.

According to a further embodiment, the analytics function further requires data stored on a third node, and wherein the shortest path calculations further determine the shortest path between the first node and the third node and the shortest path between the second node and the third node. In this

embodiment, the candidate list of potential nodes also contains nodes along the shortest path between the third node and the first node and/or the third node and the second node.

In some embodiments, the shortest path calculations are weighted according to one or more negotiated link costs between nodes in the network.

In some embodiments, the method further comprises calculating a total cost of computing the analytics function on each of the nodes in the candidate list, wherein the total cost is a combination of the values of one or more

parameters calculated for the node and/or one or more neighbouring nodes.

In some embodiments, the total cost is a weighted combination of the values of the one or more parameters, and the step of determining the optimal node is based on the weighted combination. In such embodiments, the weighting may reflect the frequency that data will need to be accessed on the one or more neighbouring nodes in order to compute the analytics function.

In some embodiments, the total cost is an average combination of the values of the one or more parameters calculated for the node and/or one or more neighbouring nodes, and the step of determining the optimal node is based on the average combination.

In further embodiments, the step of determining an optimal node comprises determining a prioritized list of nodes by ranking each candidate node according to its total cost and selecting an optimal node from the prioritized list.

In any of the preceding embodiments, the at least one parameter can relate to at least one of:

- a node cost;

- a link cost;

- an uplink link cost; - a downlink link cost;

- a measure of network utilization;

- the availability of computing resources at a node;

- the latency of one or more databases on a node;

- the speed at which data can be written to and/or read from one or more databases on a node; and

- the number of times a particular data source has to be accessed to run the analytics function. According to a second aspect, there is provided a network node in a

distributed network. The network node comprises an analytics function location module for determining an optimal node in the distributed network on which to run an analytics function. The analytics function location module is configured to determine a candidate list of potential nodes on which to run the analytics function. The analytics function location module is further configured, for each node in the candidate list, to obtain at least one parameter relating to the cost of running the analytics function and to determine the optimal node based on the at least one parameter. According to an embodiment, the analytics function location module is configured to store and maintain a list of the locations of data sources in the network. The analytics function location module may be configured to receive one or more updates on the location of data sources in the network and update the list of locations of data sources accordingly.

In some embodiments, the analytics function location module is further configured to receive one or more parameters relating to the cost of running the analytics function. In some embodiments, the analytics function location module is configured to receive the one or more parameters through gossip communication. According to a third aspect, there is provided, a network node in a distributed network, the network node comprising a processor and a memory, said memory containing instructions executable by said processor. The network node is operative to determine a candidate list of potential nodes on which to run the analytics function. The network node is further operative to calculate, for each node in the candidate list, at least one parameter for the node relating to the cost of running the analytics function and determine the optimal node based on the at least one calculated parameter.

According to a fourth aspect there is provided a network node in a distributed network, the network node comprising a first module for determining a candidate list of potential nodes on which to run the analytics function; a second module for calculating, for each node in the candidate list, at least one parameter for the node relating to the cost of running the analytics function; and a third module for determining the optimal node based on the at least one calculated parameter.

According to a fifth aspect, there is provided a computer program which, when run on a computer, causes the computer to carry out a method according to any of the embodiments listed above.

According to a sixth embodiment, there is provided a computer program product comprising computer readable storage medium and a computer program according to the fifth aspect, stored on the computer readable storage medium.

Brief Description of the Drawings

Features, objects and advantages of the presently disclosed techniques will become apparent to those skilled in the art by reading the following detailed description where references will be made to the appended figures in which: Figure 1 illustrates an example of a method of determining an optimal node on which to run an analytics function, according to an embodiment; Figure 2 shows an example distributed network;

Figure 3 shows an example network node according to an embodiment;

Figure 4 shows an example network node according to another embodiment; and

Figure 5 shows an example network node according to a further embodiment. Detailed Description

Conventionally, in a distributed network, analytics functions that require data from disparate (and perhaps globally distributed) sources have tended to copy the data required for the calculation from each of the sources to a central node to run the analytics function. As the number of requests for performing analytics functions increases over time, the network overheads in copying data to and from a central node in this way also increases. In time this overhead is projected to reach unsustainable levels. New methods are therefore required to minimize the overhead associated with processing analytics functions. Figure 1 shows a method according to an embodiment, for determining an optimal node on which to run an analytics function. The method comprises determining a candidate list of potential nodes on which to run the analytics function, step 102. The method further comprises, for each node in the candidate list, calculating at least one parameter for the node relating to the cost of running the analytics function, step 104. In step 106, an optimal node is determined based on the at least one calculated parameter. In this way, instead of running an analytics function on a centrally located node, the method provides a mechanism for dynamically choosing which node to run the analytics function on. Executing an analytics function at the

"optimal" place in the network is crucial for performance reasons, for example to reduce the network overhead associated with database latency,

bottlenecks, actuation and data location.

It will therefore be appreciated that some of the aforementioned problems mentioned in the background section may be addressed in a distributed network through the use of a method 100 as described in Figure 1 .

In general, a distributed network consists of a number of nodes connected in various ways with a number of links. Such a distributed network can be illustrated as a graph, as shown in Figure 2. The circle labelled AC in Figure 2 represents an analytics client (AC). The other circles in Figure 2 represent nodes (or computational resources) that can potentially also run analytics clients or analytics functions. Each straight line represents a network link (or connection) between two nodes. Each node is labelled with a letter (a-y), according to the data stored on it, or streamed though it (e.g. the node labelled a, has data source 'a' on it, or streamed through it). Each node and each link is also labelled with a number representing the cost of using the respective nodes and links when running the analytics function. In an Internet of Things (loT) type scenario, such as the one in Figure 2, resources can be limited in some parts of the distributed network and this is reflected in the costs.

The costs of each node and link are dynamically allocated and can be calculated based on utilization of bandwidth, cpu, ram etc. The costs are dynamically updated to reflect the current status of the nodes and the network. The 'cost' in this sense can be a dimensionless number based on a scale agreed between the nodes in the network. Alternatively, the cost can be the actual monetary cost (e.g. in euros or dollars) of using the node, or another dimensional number, such as the CPU utilisation.

The following processes and services continuously run on the network in the background:

Nodes: All nodes run an internal process that, based on resource utilization (cpu, ram, i/o), calculates a cost for the node. This cost indicates how utilized the node is at the moment. A low cost means low utilization and a high cost means high utilization. The cost can be a relative, for example, the utilization can be ranked according to an agreed scale such as e.g. 1 to 10. The node and link costs are dynamically updated at each node. This can be done periodically, or based on threshold values for certain metrics (for example cpu, io, ram, nw etc).

Once calculated, the costs are shared throughout the network and

continuously updated by means of data transfer between neighbouring nodes. This can be done, for example, with a simple heartbeat protocol that periodically sends out a heartbeat with the nodes identifier. All neighbouring nodes are cached in the nodes. Also, there are available networking

implementations that keep track of this, for example gossip communication protocols. These algorithms are used in the point-to-point (P2P) networks, which makes them a perfect fit for distributed scenarios. Gossip communication protocols are a way that computers share information across a network using a form of random "peer selection". At specified time intervals, each node in the network picks another node at random and shares any new data or updates e.g. in this case, any updates on calculated node costs. The fact that the chosen nodes are random means that some nodes will receive the data more than once from two or more different nodes. However, gossip communication protocols are an efficient way of sharing data quickly and efficiently around a distributed network.

Link costs: Nodes that are directly connected and share a link continuously calculate and agree on a cost for that particular link. The link cost is directional such that there are separate link costs for the uplink and downlink. This is done based on the network metrics available such as bandwidth or latency. If the nodes are the vertices of the graph shown in Figure 2, then the link is the value attached to the edge formed between these two vertices. Link costs can be calculated using a Gather Apply Scatter model: o Gather: Receive information about adjacent vertexes. o Apply: Take computation from gather phase, o Scatter: Update data on adjacent edges. With the above mentioned processes running in the background, when an analytics client has an analytics function "f that it wants to execute, using data sources a, b, c, for example f(a, b, c), the method 100 as illustrated in Figure 1 can run in the network as follows. The method 100 may run on a node in the network that is configured to run an Analytics Function Location Service (AFLS). In order to determine an optimal node in the network for running an analytics function, the AFLS has access to one or more data sources, for example, the AFLS may have access to one or more of (i) a list of nodes in the network (ii) the costs of running the analytics function on each node (ii) a list of link costs for transferring data between each pair of nodes (iii) a list of databases and/or the data sources on each node. The AFLS may keep track of the data sources listed above and may update the data sources (for example, what data is stored on each node) based on updates from the network received, for example, through the gossip

communication protocol described above. In some embodiments, the step of determining a candidate list of potential nodes on which to run the analytics function 102, comprises determining a preliminary list of potential nodes based on the location of one or more data sources used in the analytics function. For example, for the analytics function f(a,b,c) the AFLS may generate a preliminary list of nodes based on the location of the data sources a, b and c. For example, the preliminary list of nodes may contain nodes that have a, b or c stored on them, or the candidate list may contain nodes that can accommodate the function "f and have access to a, b and/or c. In the example shown in Figure 2, for clarity, data source 'a' can be obtained from node a, data source 'b' can be obtained from node b and data source 'c' can be obtained from node c. In these examples, the data can be obtained from the respective nodes because, for example, the data is stored on the node, or because the data is being generated and streamed from the node. Therefore, in this example, the preliminary list contains nodes a, b and c.

According to some embodiments, the AFLS will then determine a candidate list from the preliminary list using graph analysis methods. For example, the graph analysis method may comprise shortest path (SP) calculations for all combinations of dependent sources, e.g. calculating all of the shortest paths between the different locations of the data sources a, b and c (a-b, a-c, b-c). In general, if the analytics function requires data stored on a first node, a second node and a third node, then the shortest path calculations determine the shortest path between the first node and the second node, the shortest path between the first node and the third node and the shortest path between the second node and the third node.

In the case that there is more than one node in the network that stores a particular data source (e.g. the data is repeated), then this may also involve calculating the shortest paths for each repeated data source, for example if there were a second source of data a, a' then the shortest path calculations may determine the shortest paths between (a-b, a-c, b-c, a'-b, a'-c').

In some embodiments, the shortest path calculations are calculated by the nodes of the network. For example, shortest path calculations can be initiated by the AFLS by sending ComputeSP multicast messages to the first nodes in each unique combination: i. message to a: get shortest path between a-b and a-c

ii. message to b: get shortest path between b-c

Once a node receives a ComputeSP message, it will return an OK message and store the AFLS return address, and start a shortest path calculation. In an alternative embodiment, the AFLS may calculate the shortest paths itself, based on information available to it about the network and network costs. In this embodiment, all information is sent to one node that collects a global view of the network. The information might be received at the node through a dedicated protocol that sends all updates to the central node, or alternatively, through the gossip protocol described above.

Shortest path computations can be performed using any already existing algorithm, for example such as Dijkstra's algorithm or the A ^* (A-star) search algorithm.

In some embodiments, the shortest path calculations calculate the shortest physical network path between the two nodes, for example, the path(s) that contains the fewest intermediary nodes. In alternative embodiments, the basic shortest path calculations can be modified such that the algorithm is effectively weighted according to the negotiated link costs between the nodes. For example, whilst the path (x->d- >e) in Figure 2, may physically be the same length as the path (x->y->e), the link cost between nodes d and e is higher than the link cost between nodes y and e. These link costs may be used as weightings in the shortest path calculations, such that the shortest path calculation would consider (x->y->e) to be 'shorter' than, or preferential to, the path (x->d->e). The results of the shortest path calculation are collected in a format that shows the entire path, including node and link costs. Referring again to Figure 2, the shortest paths for the combinations (a-b, a-c, b-c) are:

a->x->b (costs for links and nodes are also included)

a->x->y->c (costs for links and nodes are also included) b->x->y->c (costs for links and nodes are also included)

In embodiments where the shortest paths are calculated by the nodes themselves, the results of the shortest path calculations are then sent to the AFLS using a message such as a ResultSP message. The ResultSP message contains a list of all of the nodes along the shortest path between the two nodes (including the end nodes themselves). The AFLS collects all of the nodes along all of the shortest paths and this forms the candidate list of potential nodes on which to run the analytics function. The candidate list 102 may therefore comprise a list of nodes that appear along a shortest path between two nodes that store data needed for the analytics function. The candidate list is sorted so that each node only appears once. According to Figure 2, (a, x, y, b, c) are the unique nodes from the shortest paths between the nodes (a, b, c).

The candidate list above is then sorted, using one or more parameters. This may comprise calculating a total cost of computing the analytics function on each of the nodes in the candidate list, wherein the total cost is a combination of the values of one or more parameters calculated for the node and/or one or more neighbouring nodes.

As an example, the two parameters 'link cost' and 'node cost' can be used to put the candidates in a prioritized order.

First, sort using link costs: For each candidate (a, x, y, b, c) extract link cost to each data source (a, b, c) and weigh them together, by averaging them to compute a 'total cost'. Thus according to the example from figure 2:

• a: (link cost to a + link cost to b + link cost to c)/3 = v1 = (0+3+3)/3=6/3

• b: (link cost to a + link cost to b + link cost to c)/3 = v2 = (3+0+4)/3=7/3

• c: (link cost to a + link cost to b + link cost to c)/3 = v3 = (3+4+0)/3=7/3

• x: (link cost to a + link cost to b + link cost to c)/3 = v4 = (1 +2+2)/3=5/3 (lowest)

• y: (link cost to a + link cost to b + link cost to c)/3 = v5 = (2+3+1 )/3=6/3

The end result can then be used to sort the candidates according their total cost: x (5/3), a (6/3), y (6/3), b (7/3), c (7/3).

The sorted list above ranks the nodes in order of their link costs (the node with the lowest link cost being ranked the highest). The available resources on each node (i.e. the node cost) can then added to the calculations, by weighing the node cost together with the results from the previous step. This calculation can be a simple average operation such as: a: (6/3 + 3)/2 = 15/6

b: (7/3 + 0)12 = 7/6

c: (7/3 + 5)/2 = 22/6

· x: (5/3 + 0)/2 = 5/6

y: (6/3 + 1 )/2 = 9/3 Sorting this list of total costs (in the order of lowest cost to highest cost) gives: x, b, a, y, c. Thus in this example, based on optimising according to link costs and node costs, the optimal node on which to run the analytics function is node x and the second most optimal node is node b.

The analytics client can then execute the analytics function according to the prioritised list of nodes. In other examples, the total cost can also be a weighted combination of the values of the one or more parameters. For example, the link and node costs can be weighted according to how important they are for a particular network or analytics function. The formula to calculate the total cost of running the analytics function on a node a, Cost _Totai,a , would then be:

COSt _Total,a = Weight _node * COStnode, a + Weight| _ink * COStiink, a-

Where weight _n0de and weight| _ink are the weightings to be applied to the node and link costs respectively. This enables a higher cost to be put on nodes that already are overutilized, by making weight _n0de > weight| _ink-

The examples given above take node costs and link costs into consideration when ranking each node and determining an optimal node on which to calculate the analytics function. More generally, other parameters may also be used, for example, parameters relating to utilisation, such as a node cost or link cost or a measure of network utilization. Alternatively, the one or more parameters may relate to latency, for example, the availability of computing resources, the latency of one or more databases on a node or the speed at which data can be written to and/or read from one or more databases on a node. Costs for the links may be directional (i.e. different costs in different directions depending on load) and thus there may be different parameters describing uplink link cost and downlink link cost for each pair of nodes. Furthermore, there may be differences between the costs of reading and writing data as the cost in read might depend on the delay of a stream (network capacity) while the cost in write might depend on the disk type of the machine (SSD disk). The choice of which parameters to use may depend on a number of factors, for example, the parameters may be chosen on a network level, such that the AFLS uses the same parameters and cost calculations to determine the optimal nodes for all analytics functions running on the network. For example, if the nodes in a network are generally over-utilised, the equations and calculations above might be modified so as to reduce node utilisation. For instance, the AFLS may calculate the optimal node for each analytics function based solely on the node cost.

In some examples, the parameters may be determined on a network level (i.e. the same parameters may be used for all analytics functions as above) but the parameters chosen may be changed over time, for example, if the AFLS has been calculating the optimal node for each analytics function based solely on the node cost and, over time, node utilisation drops, it may then be preferable for the AFLS to stop calculating the optimal node based on node cost and to use another parameter instead, for example, link utilisation.

In alternative embodiments still, the parameter(s) may be chosen on a case by case basis for each analytics function. As such, the method may further comprise choosing one or more parameters for the analytics function to use to determine the optimal node on which to run the analytics function. For instance, if a first analytics function requires a node to frequently write to its database, it may be important to choose a node with a low database latency and thus database latency may be used to determine the optimal node.

However, if a second analytics function requires particularly resource intensive calculations, it may be more important to choose a node with sufficient memory, and thus a parameter based on CPU may be used to determine the optimal node. In general, different types of costs may be used, depending on the needs of the applications and agreed service level agreements (SLAs). For example, one application might have requirements on maximum latency and therefore a cost model that captures this aspect is needed. At the same time there might be another application that mainly cares about high availability, or a third application that mainly cares about the cost (in dollars) to run the application. All these requirements can be aggregated into different cost values at each node. It is then a matter of selecting the right cost value when executing the method. In some examples, the chosen parameters relate to potential bottlenecks resulting from the specific tasks that need to be performed in the analytics function. This might include the number of times a particular data source has to be accessed to run the analytics function. The algorithm is also applicable to write access, where the results need to be written in different storage parts. For example, a node might need to frequently update (write) its data and thus the optimal node will reduce the time it takes to write the data to file. As noted above, there may be different costs associated with reading and writing and thus the correct cost needs to be selected depending on whether the analytics function is predominantly reading or writing data.

Frequency of data access can also be incorporated into the algorithm by putting different weights for different data sources (small weight on high- frequency data sources, and higher weight on low-frequency data sources). This can then be used when doing the link cost calculations, in the sense that all link costs to a certain data source get an additional weight. The effect of this is that the analytics location protocol will try to give high-frequency data sources higher priority, and the optimal node to run the analytics function will be closer to those data sources.

Figure 3 shows a network node 300 in a distributed network, according to another embodiment. The network node 300 comprises an analytics function location module 302 for determining an optimal node in the distributed network on which to run an analytics function. The analytics function location module is configured to determine a candidate list of potential nodes on which to run the analytics function; for each node in the candidate list, obtain at least one parameter relating to the cost of running the analytics function; and determine the optimal node based on the at least one parameter.

In one embodiment of the network node, the analytics function location module 302 is configured to store and maintain a list of locations of data sources in the network. The analytics function location module may then be configured to receive one or more updates of the location of data sources in the network, for instance via the gossip communication protocol referred to above. The analytics function location module may then be configured to update its list of locations of data sources according to the received updates. The analytics function location module may also receive one or more parameters, from a second node in the network relating to the cost of running the analytics function on that node.

Figure 4 shows a network node 400 in a distributed network, according to another embodiment. The network node comprises a processor 402 and a memory 404, said memory containing instructions executable by said

processor. The network node is operative to determine a candidate list of potential nodes on which to run the analytics function. The network node is further operative to calculate, for each node in the candidate list, at least one parameter for the node relating to the cost of running the analytics function, and determine the optimal node based on the at least one calculated parameter. The network node may be further operative to carry out any optional steps of method 100.

Figure 5 illustrates functional units in another embodiment of a network node in a distributed network which may execute the method 100 of the present invention, for example according to computer readable instructions received from a computer program. It will be understood that the units illustrated in Figure 5 are software implemented functional units, and may be realised in any appropriate combination of software modules.

The network node comprises a first module 502 for determining a candidate list of potential nodes on which to run the analytics function. The network node further comprises a second module 504 for calculating, for each node in the candidate list, at least one parameter for the node relating to the cost of running the analytics function, and a third module 506 for determining the optimal node based on the at least one calculated parameter.

The network node 500 may comprise a fourth module for storing and maintaining a list of the locations of data sources in the network. The network node 500 may further comprise a fifth module for receiving one or more updates on the location of data sources in the network and updating the list of locations of data sources accordingly.

In some embodiments, the network node 500 may further comprise a sixth module for receiving one or more parameters relating to the cost of running the analytics function. Furthermore, the network node 500 may comprise a seventh module for receiving the one or more parameters through gossip communication.

According to another embodiment, there is provided a computer program which, when run on a computer, causes the computer to carry out a method according to any of the embodiments listed above. According to another embodiment, there is provided a computer program product comprising computer readable storage medium and a computer program according to the above, stored on the computer readable storage medium.

The embodiments described herein have the advantage of minimizing the network overhead associated with processing analytics functions, by

dynamically choosing which node in the network to run an analytics function on. This allows the analytics function to be run on an optimal node in the network, where the optimal node can be chosen in order to reduce the network overhead associated with issues such as database latency,

bottlenecks, actuation, data location and so on.

It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design many alternative embodiments without departing from the scope of the appended claims. The word "comprising" does not exclude the presence of elements or steps other than those listed in a claim, "a" or "an" does not exclude a plurality, and a single processor or other unit may fulfil the functions of several units recited in the claims. Any reference signs in the claims shall not be construed so as to limit their scope.

Previous Patent: SOLID HYDROGEN STORAGE SYSTEM

Next Patent: METHOD FOR GENERATING A SWITCHING SEQUENCE IN AN INDUSTRIAL SYSTEM, AND DEVICE