

Title:
DISTRIBUTED RESOURCE MANAGEMENT IN A DATA PROCESSING SYSTEM
Document Type and Number:
WIPO Patent Application WO/2020/204768
Kind Code:
A1
Abstract:
A system and method for distributing resources between a plurality of nodes (610, 620, 630, 640, 650) in a data processing system (600). In one embodiment, a node (610, 1200) collects first and second resource usage by first and second groups of processes, and determines whether the first and second resource usage is underspent with respect to a global share of resources between the first and second groups of processes. The node (610, 1200) offers to transfer first underspent resources of the first group of processes to a corresponding group of processes in another node of the plurality of nodes (610, 620, 630, 640, 650), receives at least one bid for the first underspent resources from the corresponding group of processes of the another node of the plurality of nodes (610, 620, 630, 640, 650), and selects a beneficiary node (640) for the first underspent resources based on an assessment.

Inventors:
SKÖLDSTRÖM PONTUS (SE)
SEDAGHAT MINA (SE)
TURULL DANIEL (SE)
Application Number:
PCT/SE2019/050286
Publication Date:
October 08, 2020
Filing Date:
March 29, 2019
Assignee:
ERICSSON TELEFON AB L M (SE)
International Classes:
G06F9/50
Foreign References:
US20160044702A12016-02-11
US20120272237A12012-10-25
US20090265712A12009-10-22
Other References:
None
Attorney, Agent or Firm:
SJÖBERG, Mats (SE)
Claims:
IN THE CLAIMS:

1. A node (610, 1200) in a data processing system (600) including a plurality of nodes (610, 620, 630, 640, 650), comprising

processing circuitry (1210), configured to:

collect first resource usage by a first group of processes and second resource usage by a second group of processes operable in said node (610, 1200);

determine whether said first resource usage and said second resource usage is underspent with respect to a global share of resources between said first group of processes and said second group of processes;

offer to transfer first underspent resources of said first group of processes to a corresponding group of processes in at least one other node of said plurality of nodes (610, 620, 630, 640, 650);

receive at least one bid for said first underspent resources from said corresponding group of processes of said at least one other node of said plurality of nodes (610, 620, 630, 640, 650);

assess said at least one bid for said first underspent resources from said corresponding group of processes of said at least one other node of said plurality of nodes (610, 620, 630, 640, 650);

select a beneficiary node (640) from said at least one other node of said plurality of nodes (610, 620, 630, 640, 650) for said first underspent resources based on said assessment; and

adjust a local share of resources for said first group of processes.

2. The node (610, 1200) as recited in Claim 1 wherein said processing circuitry (1210) is configured to allocate said first underspent resources to said beneficiary node (640).

3. The node (610, 1200) as recited in Claim 1 wherein said processing circuitry (1210) is configured to select said beneficiary node (640) and another beneficiary node (620) from said at least one other node of said plurality of nodes (610, 620, 630, 640, 650) for said first underspent resources based on said assessment.

4. The node (610, 1200) as recited in Claim 3 wherein said processing circuitry (1210) is configured to allocate said first underspent resources to said beneficiary node (640) and said another beneficiary node (620).

5. The node (610, 1200) as recited in Claim 4 wherein said processing circuitry (1210) is configured to allocate said first underspent resources about half to said beneficiary node (640) and another half to said another beneficiary node (620).

6. The node (610, 1200) as recited in Claim 1 wherein said global share of resources is applicable to said plurality of nodes (610, 620, 630, 640, 650).

7. The node (610, 1200) as recited in Claim 1 wherein said processing circuitry (1210) is configured to send an acceptance message to said beneficiary node (640).

8. The node (610, 1200) as recited in Claim 1 wherein said processing circuitry (1210) is configured to rank said at least one bid for said first underspent resources from said corresponding group of processes of said at least one other node of said plurality of nodes (610, 620, 630, 640, 650) prior to selecting said beneficiary node (640).

9. The node (610, 1200) as recited in Claim 8 where said rank depends on a full allocation of said first underspent resources to said at least one other node of said plurality of nodes (610, 620, 630, 640, 650) among said plurality of nodes (610, 620, 630, 640, 650).

10. The node (610, 1200) as recited in Claim 1 wherein said first resource usage and said second resource usage comprises a central processing unit (CPU) and/or a memory usage associated with said first group of resources and said second group of resources, respectively.

11. A method (1000) of operating a node (610, 1200) in a data processing system (600) including a plurality of nodes (610, 620, 630, 640, 650), comprising

collecting (1015) first resource usage by a first group of processes and second resource usage by a second group of processes operable in said node (610);

determining (1020) whether said first resource usage and said second resource usage is underspent with respect to a global share of resources between said first group of processes and said second group of processes;

offering (1025) to transfer first underspent resources of said first group of processes to a corresponding group of processes in at least one other node of said plurality of nodes (610, 620, 630, 640, 650);

receiving (1030) at least one bid for said first underspent resources from said corresponding group of processes of said at least one other node of said plurality of nodes (610, 620, 630, 640, 650);

assessing (1035) said at least one bid for said first underspent resources from said corresponding group of processes of said at least one other node of said plurality of nodes (610, 620, 630, 640, 650);

selecting (1040) a beneficiary node (640) from said at least one other node of said plurality of nodes (610, 620, 630, 640, 650) for said first underspent resources based on said assessment; and

adjusting (1060) a local share of resources for said first group of processes.

12. The method (1000) as recited in Claim 11 further comprising allocating (1055) said first underspent resources to said beneficiary node (640).

13. The method (1000) as recited in Claim 11 wherein said selecting (1040) comprises selecting said beneficiary node (640) and another beneficiary node (620) from said at least one other node of said plurality of nodes (610, 620, 630, 640, 650) for said first underspent resources based on said assessment.

14. The method (1000) as recited in Claim 13 further comprising allocating (1055) said first underspent resources to said beneficiary node (640) and said another beneficiary node (620).

15. The method (1000) as recited in Claim 14 wherein said allocating (1055) comprises allocating said first underspent resources about half to said beneficiary node (640) and another half to said another beneficiary node (620).

16. The method (1000) as recited in Claim 11 wherein said global share of resources is applicable to said plurality of nodes (610, 620, 630, 640, 650).

17. The method (1000) as recited in Claim 11 further comprising sending (1045) an acknowledgement message to said beneficiary node (640).

18. The method (1000) as recited in Claim 11 wherein said assessing (1035) comprises ranking said at least one bid for said first underspent resources from said corresponding group of processes of said at least one other node of said plurality of nodes (610, 620, 630, 640, 650) prior to selecting said beneficiary node (640).

19. The method (1000) as recited in Claim 18 where said rank depends on a full allocation of said first underspent resources to said at least one other node of said plurality of nodes (610, 620, 630, 640, 650) among said plurality of nodes (610, 620, 630, 640, 650).

20. The method (1000) as recited in Claim 11 wherein said first resource usage and said second resource usage comprises a central processing unit (CPU) and/or a memory usage associated with said first group of resources and said second group of resources, respectively.

21. A node (640, 1200) in a data processing system (600) including a plurality of nodes (610, 620, 630, 640, 650), comprising

processing circuitry (1210), configured to:

receive an offer for first underspent resources of a corresponding first group of processes from an offering node (610) to a first group of processes in said node (640, 1200);

provide a bid for said first underspent resources to said offering node (610);

receive an acknowledgement for a transfer of said first underspent resources from said offering node (610);

receive said transfer of said first underspent resources from said offering node (610) in response to said acknowledgement; and

adjust a local share of resources for said first group of processes.

22. The node (640, 1200) as recited in Claim 21 wherein said processing circuitry (1210) is configured to adjust a local share of resources for a second group of processes within said node (640, 1200).

23. The node (640, 1200) as recited in Claim 21 wherein said processing circuitry (1210) is configured to provide said bid for a portion of said first underspent resources to said offering node (610).

24. The node (640, 1200) as recited in Claim 21 wherein said processing circuitry (1210) is configured to send an acknowledgement of receipt of said first underspent resources to said offering node (610).

25. The node (640, 1200) as recited in Claim 21 wherein said processing circuitry (1210) is configured to send an acknowledgement of receipt of said first underspent resources to said offering node (610) to allow said offering node (610) to adjust a share of resources for said corresponding first group of processes.

26. A method (1100) of operating a node (640, 1200) in a data processing system (600) including a plurality of nodes (610, 620, 630, 640, 650), comprising

receiving (1115) an offer for first underspent resources of a corresponding first group of processes from an offering node (610) to a first group of processes in said node (640, 1200);

providing (1120) a bid for said first underspent resources to said offering node (610);

receiving (1125) an acknowledgement for a transfer of said first underspent resources from said offering node (610);

receiving (1130) said transfer of said first underspent resources from said offering node (610) in response to said acknowledgement; and

adjusting (1140) a local share of resources for said first group of processes.

27. The method (1100) as recited in Claim 26 wherein said adjusting (1140) comprises adjusting a local share of resources for a second group of processes within said node (640, 1200).

28. The method (1100) as recited in Claim 26 wherein said providing (1120) comprises providing said bid for a portion of said first underspent resources to said offering node (610).

29. The method (1100) as recited in Claim 26 further comprising sending (1135) an acknowledgement of receipt of said first underspent resources to said offering node (610).

30. The method (1100) as recited in Claim 26 further comprising sending (1135) an acknowledgement of receipt of said first underspent resources to said offering node (610) to allow said offering node (610) to adjust a share of resources for said corresponding first group of processes.

Description:
DISTRIBUTED RESOURCE MANAGEMENT IN A DATA PROCESSING SYSTEM

TECHNICAL FIELD

The present disclosure is directed, in general, to the field of data processing systems and, more particularly, to a system and method for distributing resources between a plurality of nodes in a data processing system.

BACKGROUND

In Linux, Control Groups (“Cgroups”) are a mechanism to limit, account for, and enforce resource usage on a set of processes. Cgroups allow a node in a data processing system to allocate resources such as central processing unit (“CPU”) time, system memory and network bandwidth among user-defined groups of tasks (processes) running on a system. This control facilitates fairness when sharing resources among multiple processes and nodes.

Cgroups, however, generally operate within a single Linux node. If multiple nodes are running in a cluster, it would be advantageous to set global (i.e., cluster-wide) resource limits among the same users running respective processes on multiple nodes in the cluster. There is currently no mechanism, however, to enforce and maintain resource limits on a global level.

It is highly desirable, therefore, to efficiently set global resource limits for a set of processes in a data processing system. A resource allocation process to set the global resource limits that addresses the aforementioned issues can enhance the efficiency with which resources are allocated in a data processing system including a data center, or the like.

SUMMARY

These and other problems are generally solved or circumvented, and technical advantages are generally achieved, by advantageous embodiments of the present disclosure for a system and method for distributing resources between a plurality of nodes in a data processing system. In one embodiment, a node, and related method, is configured to collect first resource usage by a first group of processes and second resource usage by a second group of processes operable in the node, and determine whether the first resource usage and the second resource usage is underspent with respect to a global share of resources between the first group of processes and the second group of processes. The node is also configured to offer to transfer first underspent resources of the first group of processes to a corresponding group of processes in at least one other node of the plurality of nodes, and receive at least one bid for the first underspent resources from the corresponding group of processes of the at least one other node of the plurality of nodes. The node is also configured to assess the at least one bid for the first underspent resources from the corresponding group of processes of the at least one other node of the plurality of nodes, select a beneficiary node from the at least one other node of the plurality of nodes for the first underspent resources based on the assessment, and adjust a local share of resources for the first group of processes.

In another embodiment, a node, and related method, is configured to receive an offer for first underspent resources of a corresponding first group of processes from an offering node to a first group of processes in said node, provide a bid for the first underspent resources to the offering node, and receive an acknowledgement for a transfer of the first underspent resources from the offering node. The node is also configured to receive the transfer of the first underspent resources from the offering node in response to the acknowledgement, and adjust a local share of resources for the first group of processes.

The foregoing has outlined rather broadly the features and technical advantages of the present disclosure in order that the detailed description of the disclosure that follows may be better understood. Additional features and advantages of the disclosure will be described hereinafter, which form the subject of the claims of the disclosure. It should be appreciated by those skilled in the art that the conception and specific embodiment disclosed may be readily utilized as a basis for modifying or designing other structures or processes for carrying out the same purposes of the present disclosure. It should also be realized by those skilled in the art that such equivalent constructions do not depart from the spirit and scope of the disclosure as set forth in the appended claims.

BRIEF DESCRIPTION OF THE DRAWINGS

For a more complete understanding of the present disclosure, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:

FIGURE 1 illustrates block diagrams of embodiments of communication nodes formed with and without Cgroups, respectively;

FIGURE 2 illustrates a block diagram of an embodiment of communication nodes demonstrating resource allocation such as CPU capacity between groups in a data processing system;

FIGURE 3 illustrates a block diagram of an embodiment of a communication node;

FIGURES 4 and 5 illustrate flowcharts of embodiments of methods of operating a data processing system;

FIGURE 6 illustrates a signaling diagram of an embodiment of a quota transfer of time-shared resources in a data processing system;

FIGURE 7 illustrates a signaling diagram of an embodiment of a quota transfer of space-shared resources in a data processing system;

FIGURE 8 illustrates a block diagram of an embodiment of communication nodes re-adjusting quotas for time-shared resources based on process utilizations to improve global fairness on a cluster level in a data processing system;

FIGURE 9 illustrates a block diagram of an embodiment of communication nodes re-adjusting quotas for space-shared resources in a data processing system;

FIGURES 10 and 11 illustrate flow diagrams of embodiments of a method of operating a data processing system; and

FIGURE 12 illustrates a block diagram of an embodiment of a communication node in a data processing system.

Corresponding numerals and symbols in the different figures generally refer to corresponding parts unless otherwise indicated, and may not be redescribed in the interest of brevity after the first instance. The FIGURES are drawn to illustrate the relevant aspects of exemplary embodiments.

DETAILED DESCRIPTION

The making and using of the present exemplary embodiments are discussed in detail below. It should be appreciated, however, that the embodiments provide many applicable inventive concepts that can be embodied in a wide variety of specific contexts. The specific embodiments discussed are merely illustrative of specific ways to make and use the systems, subsystems, and modules associated with a system and method to distribute resources among a plurality of nodes in a data processing system.

A system will be described herein with respect to exemplary embodiments in a specific context, namely, a node that allocates resources in a data processing system. While the principles will be described in the environment of a cloud data processing system, any environment such as a local data processing system that may benefit from such a system and method that enables these functionalities is well within the broad scope of the present disclosure. As contemplated herein, resources allocable by the system and method herein include, without limitation, servers, microprocessors and microprocessor cores and memory, bandwidth of communication systems, servers and portions thereof, transmitter power levels and associated receivers, and intervals of time over which such elements are enabled to operate.

In Linux, Cgroups are used to limit access to different resources (e.g., CPU, memory, block input/output (“I/O”)). In general, there are two types of resources, namely, time-shared resources and space-shared resources.

Time-shared resources are resources that are time-shared among different tenants and/or processes, and can be intermittently allocated to different tenants/processes. In practice, the assignment of these resources is based on scheduling algorithms that control access to the resource; for instance, a process can be given access to a CPU for a fraction of a time unit. In this case, an initial allocation is changed by adjusting the amount of time assigned to a process or group of processes.

For example, two processes can time share a CPU, where each can intermittently use 500 milliseconds over a time period of one second, so each process gets 50 percent (“%”) access to the CPU resource. In the case of time-shared resources, if the allocated resource is not used by a process (e.g., it only uses one millisecond of 500 milliseconds assigned), the remaining resource cannot be accumulated and claimed in the next time period (which in that case would have to be a second consisting of 499 milliseconds). Unused resources are therefore lost.
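By way of a non-limiting illustration only, the following Python sketch shows how such a 50/50 time share could be configured through the cgroup v2 interface. It assumes that the cgroup v2 filesystem is mounted at /sys/fs/cgroup, that the script runs with sufficient privileges, and that the group names are hypothetical; it is a sketch rather than a definitive implementation.

from pathlib import Path

CGROUP_ROOT = Path("/sys/fs/cgroup")

def set_cpu_quota(group: str, quota_us: int, period_us: int = 1_000_000) -> None:
    # cpu.max takes "QUOTA PERIOD" in microseconds; "500000 1000000" is 50% of one CPU.
    group_dir = CGROUP_ROOT / group
    group_dir.mkdir(exist_ok=True)
    (group_dir / "cpu.max").write_text(f"{quota_us} {period_us}")

def add_process(group: str, pid: int) -> None:
    # Moving a process into the group makes the limit apply to it.
    (CGROUP_ROOT / group / "cgroup.procs").write_text(str(pid))

# Each of two hypothetical groups may use 500 ms of CPU time per one-second period.
set_cpu_quota("group1", 500_000)
set_cpu_quota("group2", 500_000)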

Cgroups in Linux can control access to several different time-shared resources, as described in the article entitled “Control Group v2,” at https://www.kernel.org/doc/Documentation/cgroup-v2.txt, October 2015, which is incorporated herein by reference. For CPU shares of resources, the total amount of CPU time (from all the available CPUs) is split among processes/groups of processes. For block I/O shares, the total amount of block I/O time is split among processes/groups of processes. For example, if a block device (e.g., a hard drive) can perform 1000 read operations per period, a 50/50 split allows two processes to perform 500 operations each in the period. For network I/O shares, the total amount of network I/O is split among processes/groups of processes. For example, if a network card can send at 1000 megabits per second (“Mbit/s”), a 50/50 split allows two processes to send at 500 Mbit/s each. The network I/O sharing can be defined in slightly different manners, for example, either as limits on bits per second, packets per second, requests per second, or for specific processes (e.g., hypertext transfer protocol (“HTTP”) requests per second).

Space-shared resources are resources shared in terms of “space” (physical units) rather than in terms of time. Typically, these resources cannot easily be re-assigned to a different process/group of processes. For example, once memory is assigned to a process, it may not be readily re-assigned to another process without disrupting the original process that may have used it to store data. These resources are typically assigned in an absolute manner (e.g., one gigabyte of memory rather than 24% of memory).

The space-shared resource types that can be controlled by Cgroups include memory, wherein the total memory space can be shared among multiple processes for the duration of the process lifetime. For CPU cores, multiple CPU cores can be dedicated to a tenant or a process. For devices, it can be determined which hardware device(s) a process may access.
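As a non-limiting illustration of such space-shared controls, the following sketch (under the same assumptions as above regarding the cgroup v2 mount point and privileges; the group name and limit values are hypothetical) dedicates a fixed amount of memory and specific CPU cores to one group.

from pathlib import Path

group = Path("/sys/fs/cgroup/tenant_a")
group.mkdir(exist_ok=True)
(group / "memory.max").write_text(str(1 * 1024**3))   # hard limit of 1 GiB of memory
# Dedicating cores 0 and 1 assumes the cpuset controller is enabled for the parent cgroup.
(group / "cpuset.cpus").write_text("0-1")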

There are two main motivations for using Cgroups to limit and control allocated resources to processes or process groups. One is to facilitate fair usage among groups of processes of a data processing system, and another is to limit resource use to improve quality of service. Typically, when running multiple processes on a single machine, all of the processes get a fair share of system resources, particularly CPU time. It should be noted that fair usage does not necessarily mean equal usage, but rather getting resources with respect to a process’s priority, number of processes, and other characteristics, which can be achieved using nice values and different scheduling settings, as described in the article entitled “Changing Priority on Linux Processes,” at https://www.nixtutor.com/linux/changing-priority-on-linux-processes, April 2009, which is incorporated herein by reference. For example, if ten processes are fully operational, they each get a fair share of 1/10th of the total CPU time available. When having multiple groups of processes, ensuring fairness among processes does not guarantee fairness among the groups. This becomes an issue when one group of processes runs only a single process while another group of processes runs 99 processes. In this case, the first group gets only one percent of CPU time and the other group gets 99% of CPU time.

To facilitate fairness in consumption (rather than fairness among processes), the common practice is to create two Cgroups, one for each group of processes, and assign processes to the groups. Quotas are then assigned to the groups to facilitate the fairness. In that case, a 50/50 share of resources can be established between two groups, regardless of their number of processes. At the time of contention, when the processes are all competing for the CPU resource, the quotas act as limits to ensure fairness among multiple groups. However, if one group is not using its assigned CPU shares, the excess CPU time can be used by the other groups if they can consume the resource.
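A non-limiting sketch of this practice, again assuming the cgroup v2 interface with hypothetical group names and process identifiers, assigns equal CPU weights to two groups so that each group receives roughly half of the CPU time under contention regardless of its number of processes.

from pathlib import Path

CGROUP_ROOT = Path("/sys/fs/cgroup")

def create_group(name: str, weight: int, pids: list[int]) -> None:
    group_dir = CGROUP_ROOT / name
    group_dir.mkdir(exist_ok=True)
    (group_dir / "cpu.weight").write_text(str(weight))      # relative share under contention
    for pid in pids:
        (group_dir / "cgroup.procs").write_text(str(pid))   # place each process in the group

# One group runs a single process, the other runs 99 processes; with equal
# weights both groups still receive about 50% of CPU time when contending.
create_group("group_a", 100, [1234])
create_group("group_b", 100, list(range(2000, 2099)))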

Turning now to FIGURE 1, illustrated are block diagrams of embodiments of communication nodes (e.g., Linux nodes) 110, 150 formed without and with Cgroups, respectively. In either case, the Linux nodes 110, 150 are formed with five processes (designated “Process 1, ..., Process 5”). In the case of the Linux node 110 formed without the Cgroups, each process is allocated an equal 20% share of a resource. In the case of the Linux node 150 formed with the Cgroups, the Processes 1 and 2 in a first Cgroup 160 can each be allocated a larger 25% share of the resource, and the Processes 3, 4, and 5 in a second Cgroup 170 can each be allocated only a 17% share of the remaining resource. By incorporating the first and second Cgroups 160, 170, fairness between groups can be established rather than fairness just between processes, which can thereby allocate 50% of the resource to each of the two Cgroups.

Resources can also be limited to facilitate quality of service. When two processes share resources on a single machine, they can compete over the shared resources. The competition may lead to contention and thus impact their performance. In this scenario, an administrator may want to isolate the processes to prevent interference and the negative impacts on the processes’ performance. For example, when running a critical and a non-critical process on the same machine, there is a risk that the non-critical process would consume a large amount of CPU or memory that is needed by the more critical process, causing the critical process to fail or under-perform. In this case, two groups are created, and critical processes are placed in one group. Limits are then put on the non-critical process, ensuring that resources required for the critical processes are available when needed.

To illustrate a problem with existing solutions for global Cgroups and quotas for time-shared resources: Cgroups can only operate within the scope of a single Linux kernel. In other words, there is no mechanism to distribute groups and enforce access control on processes running on multiple nodes, each with its own Linux kernel.

Assume that it is advantageous to facilitate a global CPU balance of 80:20 between a first Cgroup and a second Cgroup running over two nodes. Using Cgroups, 80%:20% quota limits can be assigned, locally, for each process group on each node. However, this local assignment may not necessarily serve the global 80%:20% target.

Turning now to FIGURE 2, illustrated is a block diagram of an embodiment of communication nodes (or nodes) demonstrating resource allocation such as CPU capacity between groups in a data processing system 200. As shown in FIGURE 2, a first Cgroup (referred to as a Group 1 210) is allocated an 80% global share of the resource and a second Cgroup (referred to as a Group 2 220) is allocated a 20% global share of the resource. Group 1 210A in a first node (designated Node A 230) is also allocated an 80% local share of the resource and Group 2 220A in the Node A 230 is also allocated a 20% local share of the resource, both consistent with the global shares. A first process P1 and a second process P2 are using 60% and 20%, respectively, of the Group 1 210A local share of resources for the Node A 230. Also, a third process P3 and a fourth process P4 are using 20% and 0%, respectively, of the Group 2 220A local share of resources for the Node A 230. Group 1 210B in a second node (designated Node B 240) is also allocated an 80% local share of the resource and Group 2 220B in the Node B 240 is also allocated a 20% local share of the resource, both consistent with the global shares. A fifth process P5 is using 10% of the Group 1 210B local share of resources for the Node B 240. Also, a sixth process P6 and a seventh process P7 are using 70% and 20%, respectively, of the Group 2 220B local share of resources for the Node B 240.

Thus, the fifth process P5 is not fully using its share, as it is only consuming 10% of the total 80% of its quota. This allows the sixth and seventh processes P6, P7 to use the remainder (70% + 20%) of the resource. As illustrated in FIGURE 2, this distribution of resource shares leads to a global usage ratio of 45%:55% between Group 1 210 and Group 2 220, which is far from the global target of 80%:20%. The resulting actual use is 45% for Group 1 210 and 55% for Group 2 220. This example shows that local control of resources does not necessarily lead to a desired global ratio, as processes within a process group may underspend on one node, while other processes of the same process group running on another node may be constrained even though, as a whole, the group needs more resources.
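The global ratio in this example can be verified with the following short, purely illustrative calculation using the numbers of FIGURE 2.

usage = {
    "node_a": {"group1": 60 + 20,   # processes P1 + P2
               "group2": 20 + 0},   # processes P3 + P4
    "node_b": {"group1": 10,        # process P5
               "group2": 70 + 20},  # processes P6 + P7
}
total_capacity = 100 * len(usage)   # two nodes of 100% each

for group in ("group1", "group2"):
    used = sum(node[group] for node in usage.values())
    print(group, f"{used / total_capacity:.0%}")   # group1 45%, group2 55%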

As introduced herein, global ratios are improved and accommodated by allowing resource quotas for a process group to be transferred and re-adjusted between different nodes in a data processing system. Each node keeps track of used resources and determines whether resources are being over- or underspent by local groups. When underspending is detected, the node (an offering node) then offers to transfer its quota from its local process groups to the same process groups running on beneficiary nodes. The proposed mechanism keeps track of, controls, and transfers the quota in a distributed and peer-to-peer fashion without the need for centralized control.

The process improves the distribution of resource usage between different applications running on multiple nodes in a cluster given a global target, and allows process groups to make more efficient use of their assigned global quotas. It also extends Cgroups across multiple machines in a cluster. Re-balancing of group quotas between multiple nodes in a cluster allows a more accurate global enforcement of a target global quota.

Turning now to FIGURE 3, illustrated is a block diagram of an embodiment of a communication node (or node) 310. The node 310 includes one or more Cgroups (one of which is designated 320) that contain one or more processes (designated P1, ..., P3) that consume node resources (e.g., CPU, memory, and/or network resources).

A usage monitor 330 collects resource usage for processes running within each Cgroup 320, and notifies a controller 340 when the usage exceeds its fair share according to predefined policies defined by a system administrator. A communication module 350 is responsible for sending and receiving offers to and from other nodes, as described hereinbelow.

The controller 340 enforces the resource limits based on adjusted quotas. The controller 340 may need to preempt resources (probably from overspent tenants) to schedule and allocate resources based on the new quota assignment. To have a distributed Cgroup, a data processing system usually contains at least two nodes.

The resource balancing mechanism (for flexible resources such as inter-tenant, inter-host) runs on each node and keeps track of used resources, and determines whether resources are being over- or underspent by local groups. In case of underspending, a process group can offer to transfer its quota to its affiliated process groups running on beneficiary nodes. The communication is done through the communication module.

As used herein, “overspending” refers to a group using more resources than a local quota and “underspending” refers to a group using less resources than a local quota. A “beneficiary group” refers to a group in a “beneficiary node” that would benefit from receiving additional quota from an offering node.
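The following minimal helper illustrates this terminology; the tolerance parameter and the example figures are assumptions for illustration only.

def classify(usage: float, quota: float, tolerance: float = 0.0) -> str:
    # Compare measured usage of a group against its local quota.
    if usage > quota + tolerance:
        return "overspending"
    if usage < quota - tolerance:
        return "underspending"
    return "balanced"

print(classify(usage=0.10, quota=0.60))   # underspending: unused quota could be offered
print(classify(usage=0.45, quota=0.20))   # overspending: the group could benefit from a transfer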

Turning now to FIGURES 4 and 5, illustrated are flowcharts of embodiments of methods of operating a data processing system. FIGURE 4 illustrates a method 400 operable in an offering node of time-shared resources, and FIGURE 5 illustrates a method 500 operable in a beneficiary node for the time-shared resources. The following methods periodically or aperiodically run on each node.

Beginning with the method 400 of FIGURE 4, the offering node begins a monitoring cycle at a step or module 410. At a step or module 415, the offering node monitors local usage per process group and assesses usage with respect to a local target ratio (e.g., configured group ratios).

At a decisional step or module 420, the offering node determines if a process group is underspent with respect to the local target ratio therefor. If a process group is not underspent, the offering node determines if all process groups have been checked at a decisional step or module 425. If all process groups have not been checked, the offering node checks the next process group at a step or module 470 and monitors and assesses the local usage per the next process group at the step or module 415. If all process groups have been checked (per decisional step or module 425), the offering node waits until the next monitoring cycle at a step or module 480 and starts the monitoring cycle over again at the step or module 410.

Returning to the decisional step or module 420, if a process group is underspent, then the offering node offers the group’s unused quota (underspent resources) to the same group running on other nodes (beneficiary nodes) at a step or module 430. At a step or module 435, a smoothing mechanism can be applied by the offering node to avoid making radical changes by offering only a fraction of the unused quota per period. If, for instance, there is a 70% imbalance, then a maximum of 10% of the quota could be offered per period.

At a decisional step or module 440, the offering node determines if a quota is lower than a minimum quota to avoid complete starvation of a process group. If the quota is already lower than a minimum, then the offering node maintains the minimal quota at a step or module 475. Thereafter, the offering node checks for the next process group at a step or module 470 and monitors and assesses the local usage per the next process group at the step or module 415.

If the quota is not lower than a minimum (per decisional step or module 440), the offering node listens for a response from one or more beneficiary nodes providing incoming bids for the underspent resources until a timeout at a step or module 445. If there is a beneficiary node that can benefit from exchanging a quota, the beneficiary node can bid on the offer. For a beneficiary node to be able to bid for more quota for a particular group, it should be able to reduce the quota of other local groups. If the beneficiary node is able to bid, it sends a bid including how much quota it is willing to accept and between which groups.

At a step or module 450, the offering node ranks the incoming bids based on how close they are to the offer, and selects the bid closest to a match. In accordance therewith, the offering node may select multiple bids to transfer the underspent resources. The offering node then sends a message to the selected beneficiary node(s) accepting the bid(s) for quota transfer at a step or module 455. At a step or module 460, the offering node receives a transfer confirmation from the beneficiary node(s). The confirmation is preferable to avoid “selling” the same quota twice to different beneficiary nodes. After the exchange, the offering node (and beneficiary node(s)) adjusts the process group quotas to match the new values at a step or module 465. The offering node then checks for the next process group at the step or module 470 and monitors and assesses the local usage per the next process group at the step or module 415. The method 400 will iterate until many, if not all, of the process groups have been checked.
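A condensed, non-limiting sketch of this offering-node cycle follows. The message transport (the comm object), the bid format, the smoothing fraction and the minimum quota are assumptions for illustration rather than a definitive implementation of the method 400.

MIN_QUOTA = 0.05        # assumed floor to avoid starving a group (decisional step 440)
OFFER_FRACTION = 0.10   # assumed smoothing: offer at most 10% of the quota per cycle (step 435)

def monitoring_cycle(groups, comm):
    for group in groups:                                       # steps 415 and 470
        unused = group.local_quota - group.usage
        if unused <= 0:                                        # step 420: group is not underspent
            continue
        offer = min(unused, OFFER_FRACTION * group.local_quota)
        if group.local_quota - offer < MIN_QUOTA:              # step 440: keep a minimal quota (475)
            continue
        comm.broadcast_offer(group.name, offer)                # step 430: offer the unused quota
        bids = comm.collect_bids(timeout=1.0)                  # step 445: listen for bids until a timeout
        if not bids:
            continue
        best = min(bids, key=lambda b: abs(b.amount - offer))  # step 450: rank bids, closest match wins
        comm.accept(best)                                      # step 455: accept the selected bid
        if comm.wait_confirmation(best):                       # step 460: confirmation avoids double transfer
            group.local_quota -= best.amount                   # step 465: adjust the local quota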

Turning now to the method 500 of FIGURE 5, a beneficiary node begins a monitoring cycle at a step or module 505. At a step or module 510, the beneficiary node waits for an incoming offer of underspent resources for a process group from an offering node. At a decisional step or module 520, the beneficiary node determines if a received offer benefits a process group thereof. If the received offer does not benefit one of the process groups, the beneficiary node waits for another incoming offer at the step or module 510, otherwise the beneficiary node bids on the offer at a step or module 530.

For the beneficiary node to be able to bid for more quota for a particular group, it should be able to reduce the quota of other local groups.

At a step or module 540, the beneficiary node sends a bid to the offering node including how much quota it is willing to accept, and between which groups. The beneficiary node then waits for an answer to the bid from the offering node at a step or module 550.

At a decisional step or module 560 (and in conjunction with the answer from the offering node), the beneficiary node determines if the bid was accepted by the offering node. If the bid was not accepted, the beneficiary node waits for another incoming offer at the step or module 510, otherwise the beneficiary node confirms the transfer of the underspent resource at a step or module 570. The confirmation is preferable to avoid transferring the same quota twice to different beneficiary nodes.
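A corresponding non-limiting sketch of the beneficiary-node side follows. The offer and bid objects, and the rule used to free quota from another local group, are assumptions for illustration rather than a definitive implementation of the method 500.

def beneficiary_loop(groups, comm):
    while True:
        offer = comm.wait_for_offer()                            # step 510: wait for an incoming offer
        group = groups.get(offer.group_name)
        if group is None or group.usage < group.local_quota:     # step 520: offer brings no benefit
            continue
        # A bid is only possible if quota can be reduced for another local group (step 530).
        donor = next((g for g in groups.values()
                      if g is not group and g.local_quota - g.usage >= offer.amount), None)
        if donor is None:
            continue
        bid = comm.send_bid(offer, amount=offer.amount,          # step 540: state the accepted amount
                            reduce_from=donor.name)              # ... and between which groups
        if not comm.wait_for_acceptance(bid):                    # steps 550/560: bid not accepted
            continue
        comm.confirm_transfer(bid)                               # step 570: confirm to avoid double transfer
        group.local_quota += offer.amount                        # step 580: adjust the local quotas
        donor.local_quota -= offer.amount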

After the exchange, the beneficiary node (and offering node) adjusts the process group quotas to match the new values at a step or module 580. The beneficiary node then waits for another incoming offer at the step or module 510. The method 500 will continue as the beneficiary node is monitoring the system for new incoming offers.

Turning now to FIGURE 6, illustrated is a signaling diagram of an embodiment of a quota transfer of time-shared resources in a data processing system 600. In this example, an overspending of resources is detected for groups 2 and 3 in a first node Nb1 610. The first node 610 has detected that groups 2 and 3 are overspending resources by 40% and 20%, respectively, and group 1 is not fully utilizing the 60% of resources that were assigned. The data processing system 600 also includes a second node Nb2 620, a third node Nb3 630, a fourth node Nb4 640 and a fifth node Nb5 650.

As mentioned above, the first node Nb1 610 detects local usage (generally designated 660) and determines that groups 2 and 3 are overspending resources by 40% and 20%, respectively, and group 1 is not fully utilizing the 60% of resources that were assigned. The first node Nb1 610 extends an offer (generally designated 665) to transfer 60% of the quota from group 1 and in return get a 40% quota for group 2 and 20% for group 3. The first node Nb1 thereafter waits for bids (generally designated 670) within a timeout period (generally designated 675).

The second node Nb2 620 extends a bid (generally designated 680) to receive 40% of group 1 resources from the first node Nb1 610 and transfer 40% of group 2 resources to the first node Nb1 610. The fourth node Nb4 640 extends a bid (generally designated 685) to receive 60% of group 1 resources from the first node Nb1 610 and transfer 40% of group 2 resources and 20% of group 3 resources to the first node Nb1 610. Since the bid from the fourth node Nb4 640 more closely matches the criteria from the first node Nb1 610, the first node Nb1 610 transfers the group 1 resources to the fourth node Nb4 640 in exchange for receiving the group 2 and group 3 resources therefrom, represented by a sell/buy exchange (generally designated 690) to confirm the resource transfer.
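The ranking of these two bids can be illustrated as follows; the distance metric (sum of absolute differences from the offer) is an assumption used only to show why the bid of the fourth node Nb4 640 is selected.

offer = {"group1": 60, "group2": 40, "group3": 20}   # quota offered/requested by Nb1
bids = {
    "Nb2": {"group1": 40, "group2": 40, "group3": 0},
    "Nb4": {"group1": 60, "group2": 40, "group3": 20},
}

def distance(bid):
    # How far the bid is from the offer, summed over all groups.
    return sum(abs(offer[g] - bid.get(g, 0)) for g in offer)

best = min(bids, key=lambda node: distance(bids[node]))
print(best)   # Nb4: it matches the offer exactly and is therefore ranked first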

The first node Nb1 610 and the fourth node Nb4 640 adjust their respective quotas for the group 1, group 2 and group 3 resources. As indicated generally by 695, the first node Nb1 610 increases an allocation to group 2 (+40%) and to group 3 (+20%), and decreases an allocation to group 1 (-60%). Of course, the fourth node Nb4 640 will perform the opposite re-allocation to its corresponding group 1, group 2 and group 3 resources.

Turning now to FIGURE 7, illustrated is a signaling diagram of an embodiment of a quota transfer of space-shared resources in a data processing system 700. The mechanism for space-shared resources is similar to that for time-shared resources, with the exception of the trigger: a node that wishes to receive an extra resource quota initiates the exchange. For space-shared resources, a first node Nb1 710 that wishes to receive an extra quota starts the exchange with a second node Nb2 720, a third node Nb3 730, a fourth node Nb4 740 and a fifth node Nb5 750.

The first node Nb1 710 monitors its groups and detects if a group is getting close to exhausting its allocated quota. In this example, the first node Nb1 710 detects that group 1 is about to reach its memory threshold, i.e., a resource quota limit has been reached (generally designated 760). Additionally, the first node Nb1 determines whether it could accommodate an increase in quota for the specific resource. In this example, the first node Nb1 checks if it has any free memory which could be allocated to group 1 if the quota was increased.

The first node Nb1 710 extends a request (generally designated 765) for additional group 1 memory resources to the second node Nb2 720, the third node Nb3 730, the fourth node Nb4 740 and the fifth node Nb5 750. The second node Nb2 720, the third node Nb3 730, the fourth node Nb4 740 and the fifth node Nb5 750 each calculate how much of the quota they could transfer before they reach the threshold for group 1 resources. The first node Nb1 thereafter waits for offers (generally designated 770) within a timeout period (generally designated 775).

The second node Nb2 720 extends an offer (generally designated 780) to transfer 200 megabytes (“Mb”) of memory to the first node Nb1 710. The fourth node Nb4 740 extends an offer (generally designated 785) to transfer 800 Mb of memory to the first node Nb1 710. The first node Nb1 710 ranks the incoming offers based on the amount of resources offered and on the amount of available resources. In accordance therewith, the first node Nb1 710 may take into account which offer has the least negative impact on the other nodes. This may involve combining multiple offers from different offering nodes.

In this case, the first node Nb1 710 accepts the offer from the fourth node Nb4 740, represented by a buy/sell exchange (generally designated 790), for the 800 Mb of memory. The first node Nb1 710 and the fourth node Nb4 740 adjust their respective quotas. As indicated generally by 795, the first node Nb1 710 increases a memory allocation of group 1 by 800 Mb. Of course, the fourth node Nb4 740 will perform the opposite re-allocation to its corresponding group 1 resources.
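Selection among the incoming memory offers can be sketched as follows; the requested amount and the strategy of taking the largest offers first (possibly combining several) are illustrative assumptions only.

requested_mb = 800                    # assumed additional group 1 memory needed by Nb1
offers = {"Nb2": 200, "Nb4": 800}     # megabytes of group 1 quota offered (FIGURE 7)

granted, chosen = 0, []
for node, amount in sorted(offers.items(), key=lambda kv: kv[1], reverse=True):
    if granted >= requested_mb:
        break
    chosen.append(node)
    granted += amount

print(chosen, granted)   # ['Nb4'] 800 -- the single largest offer covers the request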

Turning now to FIGURE 8, illustrated is a block diagram of an embodiment of communication nodes (or nodes) re-adjusting quotas for time-shared resources based on process utilizations to improve global fairness on a cluster level in a data processing system 800. As shown in FIGURE 8, a first Cgroup (referred to as a Group 1 810) is allocated an 80% global share of the resource and a second Cgroup (referred to as a Group 2 820) is allocated a 20% global share of the resource. Group 1 810A in a first node (designated Node A 830) is allocated a 100% local share of the resource and Group 2 820A in the Node A 830 is allocated a 0% local share of the resource. A first process P1 and a second process P2 are allocated 70% and 30%, respectively, of the Group 1 810A local share of resources for the Node A 830. Also, a third process P3 and a fourth process P4 are allocated 0% and 0%, respectively, of the Group 2 820A local share of resources for the Node A 830.

Group 1 810B in a second node (designated Node B 840) is allocated a 60% local share of the resource and Group 2 820B in the Node B 840 is allocated a 40% local share of the resource. A fifth process P5 is allocated 10% of the Group 1 810B local share of resources for the Node B 840. Also, a sixth process P6 and a seventh process P7 are allocated 70% and 20%, respectively, of the Group 2 820B local share of resources for the Node B 840.

From the perspective of global Cgroups for time-shared resources for the example described previously hereinabove with reference to FIGURE 2, a global target ratio of 80%:20% can be improved by transferring resource quotas between process groups running on different nodes. Transferring 20% of the assigned quota from Group 1 810B in the Node B 840 to the corresponding Group 1 810A in the Node A 830 improves global fairness to 55%:45% from 45%:55%, closer to the desired global ratio (80%:20%).

Recall that Group 1 810 and Group 2 820 in each of the Node A 830 and the Node B 840 originally had an 80%:20% split of resources, as illustrated in FIGURE 2. In a larger example with more nodes, or with different goals, the gain can be more substantial.

Turning now to FIGURE 9, illustrated is a block diagram of an embodiment of communication nodes (or nodes) re-adjusting quotas for space-shared resources such as memory in a data processing system 900. The space-shared resources are transferable within global Cgroups. The concept of fairness and resource ratios is a little more complicated for space-shared resources such as memory. The space-shared resources are usually assigned and dedicated to a specific group and, being physical resources, their use cannot readily be multiplexed in time. However, the idea of a quota transfer is still beneficial to improve resource utilization, both within a process group and for a cluster. By transferring a quota between processes in the same group, the global quota for a process group can be utilized where it is needed. This also improves cluster utilization, as the resources that were bound to a quota can be used by other groups.

In the example illustrated in FIGURE 9, Group 1 910A in Node A 920 illustrates starvation thereon wherein first and second processes P1, P2 exhibit 75% use, the 75% being the sum of 700 Mb and 50 Mb use out of a one gigabyte (“Gb”) allocation. In Node B 930, a fifth process P5 for the Group 1 910B utilizes 100 Mb of the 1 Gb, which is only 10% use. After the transfer of 800 Mb from the Node B 930, the Node A 920 now has a Group 1 910A allocation of 1.6 Gb and utilizes in the first process P1 700 Mb and in the second process P2 50 Mb, which is 47% use. In the Node B 930 after the transfer, the Group 1 910B has an allocation of 0.3 Gb, of which the fifth process P5 utilizes 100 Mb representing a 33% use, which is closer to balanced usage.

The quota transfer in this case depends on the beneficiary node having enough of the considered resources available to increase the quota for the group, and reclaiming resource limits on the offering node without affecting running processes. In this scenario, the goal is not only to reach fairness between different groups, but rather to efficiently use the available quota within a process group that is spread over multiple nodes. This also makes it less likely for any of the processes to fully consume their available memory, which could result in processes crashing or being terminated by the underlying system.

Turning now to FIGURE 10, illustrated is a flow diagram of an embodiment of a method 1000 of operating a data processing system. The method 1000 is operable in a communication node (e.g., node 610 of FIGURE 6) of a data processing system (e.g., including a data center, see 600 of FIGURE 6) including a plurality of nodes (e.g., nodes 610, 620, 630, 640, 650 of FIGURE 6) and begins at a start step or module 1010. At a step or module 1015, the node collects first resource usage by a first group of processes and second resource usage by a second group of processes operable in the node. The first resource usage and the second resource usage may include a central processing unit (“CPU”) usage and/or a memory usage associated with the first group of processes and the second group of processes, respectively.

At a decisional step or module 1020, the node determines whether the first resource usage and/or the second resource usage is underspent with respect to a global share of resources between the first group of processes and the second group of processes. The global share of resources may be applicable to the plurality of nodes. If the first resource usage and the second resource usage are not underspent, then the method 1000 returns to the step or module 1015, otherwise the method continues to the step or module 1025.

Assuming for the sake of illustration that the first resource usage is underspent (but the second resource usage is not underspent), the node offers to transfer first underspent resources of the first group of processes to a corresponding group of processes in at least one other node of the plurality of nodes at the step or module 1025. At a step or module 1030, the node receives at least one bid (e.g., within a timeout period) for the first underspent resources from the corresponding group of processes of the at least one other node of the plurality of nodes. The node assesses the at least one bid for the first underspent resources from the corresponding group of processes of the at least one other node of the plurality of nodes at a step or module 1035. If the node receives at least two bids, the node may rank the bids for the first underspent resources from the corresponding group of processes of the at least one other node of the plurality of nodes prior to selecting a beneficiary node(s). The rank may depend on a full allocation of the first underspent resources to the at least one other node of the plurality of nodes among the plurality of nodes.

At a step or module 1040, the node selects a beneficiary node(s) from the at least one other node of the plurality of nodes for the first underspent resources based on the assessment and sends an acknowledgement message to the beneficiary node(s) at a step or module 1045. At a step or module 1050, the node receives a confirmation message from the beneficiary node(s) and allocates the first underspent resources to the beneficiary node(s) at a step or module 1055. In accordance with the assessment and allocation, the node may select first and second beneficiary nodes for, e.g., about half of the first underspent resources and allocate the first underspent resources thereto. At a step or module 1060, the node adjusts a local share of resources for the first group of processes. The method 1000 ends at an end step or module 1065.

It should be understood that an analogous method 1000 can be applied to a node that is overspent with respect to a resource usage. In such a case, the node makes a request for underspent resources from another node(s) of the plurality of nodes in an attempt to receive the same from the another node(s) to cover the node's overspending. As such, the steps and modules described above would be performed in a reverse manner.

Turning now to FIGURE 11, illustrated is a flow diagram of an embodiment of a method 1100 of operating a data processing system. The method 1100 is operable in a communication node (e.g., node 640 of FIGURE 6) of a data processing system (e.g., including a data center, see 600 of FIGURE 6) including a plurality of nodes (e.g., nodes 610, 620, 630, 640, 650 of FIGURE 6) and begins at a start step or module 1110. At a step or module 1115, the node receives an offer for first underspent resources of a corresponding first group of processes from an offering node to a first group of processes in the node. The node provides a bid for the first underspent resources (or a portion thereof) to the offering node at a step or module 1120. At a step or module 1125, the node receives an acknowledgement for a transfer of the first underspent resources from the offering node.

The node receives the transfer of the first underspent resources from the offering node in response to the acknowledgement at a step or module 1130. The node sends an acknowledgement of receipt of the first underspent resources to the offering node to, for instance, allow the offering node to adjust a share of resources for the corresponding first group of processes at a step or module 1135. At a step or module 1140, the node adjusts a local share of resources for the first group of processes. In accordance therewith, the node may adjust a local share of resources for other groups of processes within the node. The method 1100 ends at an end step or module 1145.

It should be understood that an analogous method 1100 can be applied to a node that is overspent with respect to a resource usage. In such a case, the node may receive a request for underspent resources from another node(s) of the plurality of nodes in an attempt to receive the same from the another node(s) to cover the node's overspending. As such, the steps and modules described above would be performed in a reverse manner.

Turning now to FIGURE 12, illustrated is a block diagram of an embodiment of a communication node 1200 in a data processing system. The communication node 1200 may be a server configured to perform functions of allocating resources in a data processing system. The communication node 1200 includes a processor (or processing circuitry) 1210, a memory 1220 and a communication interface 1230. The communication node 1200 may also include an antenna(s) 1240 depending on the type of device such as a server with wireless communication capability. In particular embodiments, some or all of the functionality described herein may be provided by, without limitation, machine type communication (“MTC”) and machine-to-machine (“M2M”) devices, a radio base station, a radio network controller, and a data center (e.g., computer(s) that form a data center).

The functionality of the communication node 1200 may be provided by the processor 1210 executing instructions stored on a computer-readable medium, such as the memory 1220 shown in FIGURE 12. Alternative embodiments of the communication node 1200 may include additional components (such as the interfaces, devices and circuits mentioned above) beyond those shown in FIGURE 12 that may be responsible for providing certain aspects of the device’s functionality, including any of the functionality to support the solution described herein.

The processor 1210 (or processors), which may be implemented with one or a plurality of processing devices, performs functions associated with its operation including, without limitation, performing the operations of allocating communication and data processing resources, decoding of individual bits forming a communication message, formatting of information and overall control of a respective communication node 1200. Exemplary functions related to management of communication resources include, without limitation, hardware installation, traffic management, performance data analysis, configuration management, security, billing and the like. The processor 1210 may be of any type suitable to the local application environment, and may include one or more of general-purpose computers, special purpose computers, microprocessors, digital signal processors (“DSPs”), field-programmable gate arrays (“FPGAs”), application-specific integrated circuits (“ASICs”), and processors based on a multi-core processor architecture, as non-limiting examples. The processor 1210 may include, without limitation, application processing circuitry. In some embodiments, the application processing circuitry may be on separate chipsets. In alternative embodiments, part or all of the application processing circuitry may be combined into one chipset, and other application circuitry may be on a separate chipset. In still alternative embodiments, part or all of the application processing circuitry may be on the same chipset, and other application processing circuitry may be on a separate chipset. In yet other alternative embodiments, part or all of the application processing circuitry may be combined in the same chipset.

The processor 1210 may be configured to perform any operations described herein. The operations as performed by the processor 1210 may include processing information obtained by the processor by, for example, converting the obtained information into other information, comparing the obtained information or converted information to information stored in the respective device, and/or performing one or more operations based on the obtained information or converted information, and, as a result of the processing, making a determination.

The memory 1220 (or memories) may be one or more memories and of any type suitable to the local application environment, and may be implemented using any suitable volatile or nonvolatile data storage technology such as a semiconductor-based memory device, a magnetic memory device and system, an optical memory device and system, fixed memory and removable memory. The programs stored in the memory 1220 may include program instructions or computer program code that, when executed by an associated processor, enable the respective communication node 1200 to perform its intended tasks. Of course, the memory 1220 may form a data buffer for data transmitted to and from the same. Exemplary embodiments of the system, subsystems, and modules as described herein may be implemented, at least in part, by computer software executable by the processor 1210, or by hardware, or by combinations thereof.

The communication interface 1230 modulates information onto a carrier waveform for transmission by the respective communication node 1200 to another communication node. The respective communication interface 1230 also demodulates information received from another server for further processing. The communication interface 1230 can support duplex operation for the respective server 1200, and supports communication with a core network. The antenna 1240 (antennas), when applicable, may be any type of antenna capable of transmitting and receiving data and/or signals wirelessly. In some embodiments, the antenna 1240 may include one or more omni-directional, sector or panel antennas operable to transmit/receive radio signals between, for example, 2 gigahertz (“GHz”) and 66 GHz. An omni-directional antenna may be used to transmit/receive radio signals in any direction, a sector antenna may be used to transmit/receive radio signals from devices within a particular area, and a panel antenna may be a line of sight antenna used to transmit/receive radio signals in a relatively straight line. While the antenna 1240 facilitates wireless communication for the communication node 1200, the communication node 1200 may also communicate via a wired communication path via the communication interface 1230 and, in such instances, the antenna 1240 may not be necessary. The subsystems as introduced above with respect to the preceding FIGURES may be embodied in the communication node 1200 performed by, for instance, the processor 1210 in conjunction with the memory 1220.

Thus, a system and method has been introduced herein to allocate resources between a plurality of nodes in a data processing system. The system and method can be performed in real-time, taking into account multiple criteria, to allocate resources within the data processing system.

The foregoing description of embodiments of the present proposed solution has been presented for the purpose of illustration and description. It is not intended to be exhaustive or to limit the proposed solution to the present form disclosed. Alterations, modifications and variations can be made without departing from the spirit and scope of the present proposed solution.

As described above, the exemplary embodiment provides both a method and corresponding system consisting of various modules providing functionality for performing the steps of the method. The modules may be implemented as hardware (embodied in one or more chips including an integrated circuit such as an application specific integrated circuit), or may be implemented as software or firmware for execution by a processor. In particular, in the case of firmware or software, the exemplary embodiment can be provided as a computer program product including a computer readable storage medium embodying computer program code (i.e., software or firmware) thereon for execution by the computer processor. The computer readable storage medium may be non-transitory (e.g., magnetic disks; optical disks; read only memory; flash memory devices; phase-change memory) or transitory (e.g., electrical, optical, acoustical or other forms of propagated signals, such as carrier waves, infrared signals, digital signals, etc.). The coupling of a processor and other components is typically through one or more busses or bridges (also termed bus controllers). The storage device and signals carrying digital traffic respectively represent one or more non-transitory or transitory computer readable storage medium. Thus, the storage device of a given electronic device typically stores code and/or data for execution on the set of one or more processors of that electronic device such as a controller.

Although the embodiments and its advantages have been described in detail, it should be understood that various changes, substitutions, and alterations can be made herein without departing from the spirit and scope thereof as defined by the appended claims. For example, many of the features and functions discussed above can be implemented in software, hardware, or firmware, or a combination thereof. Also, many of the features, functions, and steps of operating the same may be reordered, omitted, added, etc., and still fall within the broad scope of the various embodiments.

Moreover, the scope of the various embodiments is not intended to be limited to the particular embodiments of the process, machine, manufacture, composition of matter, means, methods and steps described in the specification. As one of ordinary skill in the art will readily appreciate from the disclosure, processes, machines, manufacture, compositions of matter, means, methods, or steps, presently existing or later to be developed, that perform substantially the same function or achieve substantially the same result as the corresponding embodiments described herein may be utilized as well.

Accordingly, the appended claims are intended to include within their scope such processes, machines, manufacture, compositions of matter, means, methods, or steps.