

Title:
LEARNING MACHINE LEARNING INCENTIVES BY GRADIENT DESCENT FOR AGENT COOPERATION IN A DISTRIBUTED MULTI-AGENT SYSTEM
Document Type and Number:
WIPO Patent Application WO/2021/156441
Kind Code:
A1
Abstract:
Machine learning techniques for multi-agent systems in which agents interact whilst performing their respective tasks. The techniques enable agents to learn to cooperate with one another, in particular by mixing incentives, in a way that improves their collective efficiency.

Inventors:
GEMP IAN MICHAEL (GB)
Application Number:
PCT/EP2021/052813
Publication Date:
August 12, 2021
Filing Date:
February 05, 2021
Assignee:
DEEPMIND TECH LTD (GB)
International Classes:
G06N3/04; G06N3/00; G06N3/08; G06N20/20
Foreign References:
US20190354867A12019-11-21
Other References:
HONGBING WANG ET AL: "Integrating Reinforcement Learning with Multi-Agent Techniques for Adaptive Service Composition", ACM TRANSACTIONS ON AUTONOMOUS AND ADAPTIVE SYSTEMS, ASSOCIATION FOR COMPUTING MACHINERY, INC., NEW YORK, NY, US, vol. 12, no. 2, 25 May 2017 (2017-05-25), pages 1 - 42, XP058339099, ISSN: 1556-4665, DOI: 10.1145/3058592
LEI LEI ET AL: "Deep Reinforcement Learning for Autonomous Internet of Things: Model, Applications and Challenges", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 22 July 2019 (2019-07-22), XP081445712
Attorney, Agent or Firm:
FISH & RICHARDSON P.C. (DE)
Claims:
CLAIMS:

1. A computer-implemented method of training a first machine learning system to select actions to be performed by a first agent of a group of agents to control the first agent to perform a task in an environment, wherein whilst performing the task the first agent interacts with one or more other agents of the group of agents in the environment respectively controlled by one or more other machine learning systems to perform one or more other tasks, the method comprising:
receiving, from each of the other machine learning systems, a respective machine learning objective-defining value used for training the other machine learning system;
determining a combined objective-defining value from a combination of a first machine learning objective-defining value for the first machine learning system and the machine learning objective-defining values received from the other machine learning systems, wherein the combination is defined by a set of mixing parameters;
training the first machine learning system using the combined objective-defining value; and
adjusting the set of mixing parameters using gradient descent to optimize an efficiency estimate, wherein the efficiency estimate is dependent upon a rate of change of the combined objective-defining value with time.

2. A method as claimed in claim 1, wherein the efficiency estimate comprises a cost which has a higher value when the combined objective-defining value is worsening with time than when the combined objective-defining value is improving with time.

3. A method as claimed in claim 1 or 2, wherein adjusting the set of mixing parameters using gradient descent comprises determining a set of gradients of the efficiency estimate with respect to the set of mixing parameters, and adjusting the set of mixing parameters using the set of gradients.

4. A method as claimed in claim 3, wherein determining the set of gradients of the efficiency estimate includes adding a regularization term to the set of gradients to inhibit adjusting the set of mixing parameters away from the first machine learning objective-defining value.

5. A method as claimed in claim 3 or 4, wherein determining the set of gradients of the efficiency estimate comprises applying a trial modification to the set of mixing parameters to determine a trial set of mixing parameters, and determining the efficiency estimate using a trial combined objective-defining value for the first machine learning system where the trial combined objective-defining value is defined by the trial set of mixing parameters.

6. A method as claimed in claim 5, wherein the trial modification to the set of mixing parameters defines a direction, and wherein adjusting the set of mixing parameters using the set of gradients comprises adjusting the set of mixing parameters in the opposite direction in response to the combined objective-defining value worsening while the trial modification to the set of mixing parameters is applied.

7. A method as claimed in claim 5 or 6, wherein determining the efficiency estimate comprises estimating a rate of change of the trial combined objective-defining value with time by determining a change in the trial combined objective-defining value over multiple machine learning time steps.

8. A method as claimed in claim 7, wherein determining the efficiency estimate comprises determining a difference between first and second mean returns from the trial combined objective-defining value at the respective start and end of a trial period.

9. A method as claimed in any preceding claim, wherein determining the combined objective-defining value comprises determining a linear combination of the first machine learning objective-defining value and the machine learning objective-defining values received from the other machine learning systems each weighted by a respective one of the mixing parameters.

10. A method as claimed in any preceding claim, further comprising sending the first machine learning objective-defining value to each of the one or more other machine learning systems.

11. A method as claimed in any preceding claim, wherein the first machine learning system includes a policy neural network to receive observations of the environment and to select the actions to be performed by the first agent in response to the observations, wherein the policy neural network has a plurality of policy neural network parameters, and wherein training the first machine learning system comprises adjusting the policy neural network parameters using gradient descent to optimize an objective defined by the combined objective-defining value.

12. A method as claimed in any preceding claim, wherein the first machine learning objective-defining value and the machine learning objective-defining values from the other machine learning systems each comprise the value of a loss function dependent on, or the value of a reward received in response to, respectively, an action of the first agent and an action of each of the one or more other agents.

13. A method as claimed in any preceding claim, wherein the efficiency estimate is a component of a cost of inefficiency of the group of agents performing their respective tasks in the environment.

14. A method as claimed in any preceding claim, wherein the method is also implemented by each of the one or more other machine learning systems.

15. A method as claimed in any preceding claim wherein the first machine learning system is a first reinforcement learning system, and wherein each of the other machine learning systems is a reinforcement learning system.

16. A method as claimed in any one of claims 1-15, wherein each agent of the group of agents comprises a robot or autonomous vehicle, wherein each of the tasks comprises navigating a path through the environment from a start point to an end point, and wherein the first machine learning objective-defining value is dependent on an estimated time or distance for the agent physically to move from the start point to the end point.

17. A method as claimed in any one of claims 1-15, wherein the environment is a data packet communications network environment, wherein each agent of the group of agents comprises a router to route packets of data over the communications network, and wherein the first machine learning objective-defining value is dependent on a routing metric for a path from the router to a next or further node in the data packet communications network.

18. A method as claimed in any one of claims 1-15, wherein the environment is an electrical power distribution environment, wherein each agent of the group of agents is configured to control routing of electrical power from a node associated with the agent to one or more other nodes over one or more power distribution links, and wherein the first machine learning objective-defining value is dependent on one or both of a loss and a frequency or phase mismatch over the one or more power distribution links.

19. A method as claimed in any one of claims 1-15, wherein the environment is a plant or service facility, wherein each agent of the group of agents is configured to control an item of equipment in the plant or service facility, and wherein the first machine learning objective-defining value is dependent on resource usage by the plant or service facility.

20. One or more computer-readable storage media storing instructions that when executed by one or more computers cause the one or more computers to perform the operations of the method of any one of claims 1-19.

21. A system comprising one or more computers and one or more storage devices storing instructions that when executed by one or more computers cause the one or more computers to perform the operations of the method of any one of claims 1-19.

22. A system comprising a group of machine learning systems each to control a respective one of a group of agents, wherein each of the machine learning systems is configured to implement the method of any one of claims 1-19.

Description:
LEARNING MACHINE LEARNING INCENTIVES BY GRADIENT DESCENT FOR AGENT COOPERATION IN A DISTRIBUTED MULTI-AGENT SYSTEM

BACKGROUND

[0001] This specification relates to machine learning, and in example implementations to reinforcement learning.

[0002] In a reinforcement learning system, an agent interacts with an environment by performing actions that are selected by the reinforcement learning system in response to receiving observations that characterize the current state of the environment.

[0003] Some reinforcement learning systems select the action to be performed by the agent in response to receiving a given observation in accordance with an output of a neural network.

[0004] Neural networks are machine learning models that employ one or more layers of nonlinear units to predict an output for a received input. Some neural networks are deep neural networks that include one or more hidden layers in addition to an output layer. The output of each hidden layer is used as input to the next layer in the network, i.e., the next hidden layer or the output layer. Each layer of the network generates an output from a received input in accordance with current values of a respective set of parameters.

SUMMARY

[0005] This specification describes machine learning technologies which enable agents to learn to cooperate with one another in a way that improves their collective efficiency. More particularly, an agent learns to modify its goal based on the behavior of other agents, enabling a better overall result to be achieved than if each agent pursued an individual “selfish” goal. In implementations the agents operate in a real-world environment.

[0006] There is described a computer-implemented method of training a first machine learning, e.g. reinforcement learning, system to select actions to be performed by a first agent of a group of agents to control the first agent to perform a task in an environment. Whilst performing the task the first agent interacts, directly or indirectly, with one or more other agents of a group of agents in the environment, respectively controlled by one or more other machine learning, e.g. reinforcement learning, systems to perform one or more other tasks. The tasks may all have the same character e.g. they may all be routing tasks. The interaction is typically such that the ability of the agent to perform the task in the environment is affected by the one or more other agents performing the one or more other tasks in the environment.

[0007] The method may comprise receiving, e.g. at the first machine learning system, from each of the other machine learning systems, a respective machine learning, e.g. reinforcement learning, objective-defining value used for training the other machine learning system. The objective-defining value, which may be referred to as an "incentive", is used to define a training objective and may be e.g. a reward value or loss.

[0008] The method may further comprise determining a perturbed, i.e. combined, objective-defining value from a combination of a first machine learning, e.g. reinforcement learning, objective-defining value for the first machine learning system and the objective-defining values received from the other machine learning systems. The combination may be defined by a set of mixing parameters (later Λ_i). The objective-defining values may be rewards and the combined objective-defining value may be a combined reward.

[0009] The method may further comprise training the first machine learning system using the combined objective-defining value, in particular training the first machine learning system using a machine learning technique to optimize the combined objective-defining value.

[0010] The method may further comprise adjusting the set of mixing parameters using gradient descent to optimize an (in)efficiency estimate metric (later ρ_i), in particular an (in)efficiency estimate metric for the first machine learning system, which in implementations represents a contribution of the first machine learning system to an overall (in)efficiency estimate for the group of agents. The adjusting may be performed concurrently with the machine learning. In implementations the (in)efficiency estimate metric is dependent upon a rate of change of the combined objective-defining value with time. The combined objective-defining value may be summed or averaged e.g. over multiple action-selection time steps or over a machine learning episode, and/or discounted, before determining the rate of change.

[0011] Thus the rate of change of the combined objective-defining value with time may have a value which is used as a metric which the system learns to improve.

[0012] This metric may be understood as relating to an efficiency, in implementations specifically an inefficiency, of the group of agents. The method effectively uses this metric to set a learning goal for the first agent, i.e. to define a combined objective value for the first machine learning system, so that the first agent is incentivized to learn to cooperate with the other agents.

[0013] Each of the agents may be incentivized in this way. Thus in implementations each of the machine learning systems implements the above-described method.

[0014] In implementations the efficiency estimate metric represents a contribution of the first machine learning system i.e. first agent to an overall (in)efficiency of the group of agents. This overall (in)efficiency may represent a so-called price of anarchy for the group of agents i.e. a difference between, or ratio of, an equilibrium value of a total loss for the group of agents without any mixing, and the smallest total loss that could be achieved with fully cooperative agents.

[0015] When the objective-defining values of the other agents are mixed with those of the first agent the goal of the first agent is modified. The modification of the mix depends on the rate of change of the combined objective-defining value with time. This can allow the group of agents to escape a Nash equilibrium, where each agent has optimized its "selfish" goal but where the overall performance of the group of agents, e.g. as measured by their collective losses or rewards, could be improved were the goals of the agents to change. For example, this can allow the group of agents, collectively, to escape Braess's "paradox", a non-intuitive effect in which providing an additional resource which all agents selfishly use results in decreased overall performance.

[0016] Thus in implementations the efficiency estimate comprises a cost or "price", and optimizing the efficiency estimate involves minimizing this cost. The cost may have a higher value when the (average) combined objective-defining value is worsening with time than when the (average) combined objective-defining value is improving with time. Thus the efficiency estimate may have a relatively lower or zero value when the (average) combined objective-defining value is improving with time, e.g. reducing loss or increasing reward, and a relatively higher value when the (average) combined objective-defining value is worsening with time. Since in implementations a gradient of the (average) efficiency estimate with respect to the mixing parameters depends on the value of the efficiency estimate, this gradient may vary in the same way as the value of the efficiency estimate.

[0017] Thus the set of mixing parameters may be adjusted more when the (average) combined objective-defining value is worsening with time and less or not at all when it is improving, e.g. by applying a ReLU (rectified linear unit) function to the (average) rate of change of the combined objective-defining value with time. As described later, the efficiency estimate may be a component (for the first agent) of a price of anarchy or price of stability for the group of agents, that is a price of inefficiency of the machine learning e.g. reinforcement learning.

[0018] Adjusting the set of mixing parameters using gradient descent may comprise determining a set of gradients (later ∇_Λi) of the efficiency estimate with respect to the set of mixing parameters (later Λ_i), and adjusting the set of mixing parameters using the set of gradients. This may comprise adjusting Λ_i by −η∇_Λi where η is a learning rate; it need not involve determining the efficiency estimate explicitly.

[0019] The set of gradients of the efficiency estimate may be determined “locally” to the current action selection policies of the group of agents, in particular to a current action selection policy of the first reinforcement learning system.

[0020] In some implementations determining the set of gradients of the efficiency estimate includes adding a regularization term (later, the ν-term) to the set of gradients to inhibit adjusting the set of mixing parameters (weights) away from the first reinforcement learning objective-defining value e.g. away from their initial values. This can help to reduce the risk of inequity amongst the agents.

[0021] In implementations determining, more particularly stochastically estimating, the set of gradients of the efficiency estimate comprises applying a trial modification to the set of mixing parameters to determine a trial set of mixing parameters. The efficiency estimate is then determined using a trial i.e. perturbed combined objective-defining value for the first reinforcement learning system, defined by the trial set of mixing parameters. The trial modification may be in a direction (in a space defined by the set of mixing parameters) chosen by sampling e.g. from a unit sphere; the set of gradients may then be in the same direction. The set of mixing parameters may be adjusted in the opposite direction if the (mean) combined objective-defining value e.g. mean return, worsens while the trial modification to the set of mixing parameters is applied.

[0022] Determining the efficiency estimate may comprise estimating a rate of change of the (average) trial perturbed objective-defining value with time. This may involve determining a change in the trial combined objective-defining value over multiple machine learning time steps, for example over one or more machine learning, e.g. reinforcement learning, episodes. This may be a stochastic estimate e.g. determined by a finite difference method. For example determining the efficiency estimate may comprise determining a difference between first and second mean returns of the trial combined objective-defining value at the start and end of a trial period.

[0023] In some implementations determining the combined objective-defining value comprises determining a weighted linear combination of the first reinforcement learning objective-defining value and the reinforcement learning objective-defining values received from the other reinforcement learning systems. In other implementations the objective-defining values may be combined in a more complex manner.

[0024] Implementations of the method may also involve sending the objective-defining value for the first machine learning system, e.g. a reward received by the first machine learning system, to each of the one or more other machine learning e.g. reinforcement learning systems. This may be done at each machine learning time step. In implementations the reward received by the first machine learning system is sent to each other machine learning system j weighted by a respective weight (later Λ_ij) for machine learning system j. Alternatively the rewards may be broadcast without applying a weight and the weights may be shared e.g. machine learning system i may receive a mixing weight for each reward it receives from each of the other machine learning systems.

[0025] In implementations the first machine learning system is a reinforcement learning system that includes an action selection policy neural network configured to receive observations of the environment and to select the actions to be performed by the first agent in response to the observations. The policy neural network may have a plurality of policy neural network parameters and training the first reinforcement learning system may comprise adjusting the policy neural network parameters using gradient descent to optimize the combined objective-defining value. The first reinforcement learning objective-defining value and the reinforcement learning objective-defining values from the other reinforcement learning systems may each comprise the value of a loss function or reward dependent upon an action, respectively, of the first agent and of the one or more other agents.

[0026] The first and other reinforcement learning systems may implement any type of reinforcement learning; they may even implement different reinforcement learning algorithms. For example, one or more of the reinforcement learning systems may be: a policy-based system such as an Advantage Actor Critic system (Mnih et al. 2016) which parameterizes a stochastic action-selection policy i.e. determines parameters of a distribution over possible actions, and optionally parameterizes a state value function; or a Q-learning system, such as a Deep Q-learning Network (DQN) system or Double-DQN system, in which the output approximates an action-value function, and optionally a value of a state, for selecting an action. One or more of the reinforcement learning systems may be a distributed reinforcement learning system such as IMPALA, Espeholt et al., arXiv:1802.01561. One or more of the reinforcement learning systems may have a policy neural network with an action selection output which directly defines the action to be performed by the agent, such as DDPG, arXiv:1509.02971, e.g., by defining the values of torques that should be applied to the joints of a robotic agent or the values of accelerations to be applied to a robot or vehicle drive.

[0027] For example, in one implementation, one or more of the reinforcement learning systems may implement a reinforcement learning technique which trains an action selection policy neural network using an actor-critic technique. The action selection policy neural network, or another neural network, may be configured to generate a value estimate in addition to an action selection output. The value estimate represents an estimate of a return e.g. a time-discounted return that would result, given the current state of the environment, from selecting future actions performed by the agent in accordance with the current values of the action selection network parameters. The reinforcement learning may train the action selection policy network using gradients of a reinforcement learning objective function ℒ_RL given e.g. by:

ℒ_RL = ℒ_PG + aℒ_V + bℒ_H

where ℒ_PG is a policy gradient loss, ℒ_V is a value estimate loss, ℒ_H is an entropy regularization loss, a and b are positive constant values, E_{s_t∼π}[·] refers to the expected value with respect to the current action selection policy (i.e., defined by the current values of the action selection policy network parameters θ), V(s_t, θ) refers to the value estimate generated by the action selection policy network for observation s_t, H(π(·|s_t, θ)) is a regularization term that refers to the entropy of the probability distribution over possible actions generated by the action selection network for observation s_t, and R_t refers to an n-step look-ahead return, based on the combined reward, e.g., given by:

R_t = Σ_{i=1..n} γ^(i−1) r_{t+i} + γ^n V(s_{t+n}, θ)

where γ is a discount factor between 0 and 1, r_{t+i} is the (combined) reward received at time step t + i, and V(s_{t+n}, θ) refers to the value estimate at time step t + n. In some variants of this approach one or both of the aℒ_V and bℒ_H terms in ℒ_RL may be omitted.

[0028] In some implementations each of the machine learning, e.g. reinforcement learning, systems may implement a method as described above. Each of the agents may be performing the same task (i.e. the same type of task), in a shared environment. Here "same task" means a task of the same type, e.g. a routing task, but typically not an identical task (e.g. because the agents have different start/end nodes to route between).

[0029] The subject matter described in this specification can be implemented in particular embodiments so as to realize one or more of the following advantages.

[0030] Some implementations of the described methods address a problem encountered when agents interact when learning to perform tasks in a shared environment: the agents can learn to perform the tasks, but overall less efficiently than if they were to cooperate. Cooperation could be imposed by a centralized authority, but this imposes a management overhead and results in a single point of failure. The described techniques enable agents in a decentralized system to learn to cooperate to solve a task efficiently, for example faster or consuming fewer resources than would otherwise be the case. In particular the described techniques enable agents to learn to cooperate with minimal communication between the agents. In multi-agent systems communications amongst all the agents can be a particular burden. The described techniques can significantly reduce the amount of data which needs to be communicated compared to e.g. sharing high-dimensional observations or actions, thus reducing communications bandwidth and consequently power consumption, which is especially useful for mobile agents.

[0031] In implementations this is achieved by each agent automatically learning to adjust a goal or incentive used by its machine learning. This is done by mixing its incentives with those of other agents in such a way that, collectively, the agents learn to cooperate and perform their individual respective tasks more efficiently. Gradient-based learning is used to minimize a decentralized measure of inefficiency, and hence an agent is able to adjust its individual incentive to improve the collective efficiency.

[0032] Broadly, each agent pays attention to the goals of the other agents by receiving scalar values indicating values of their loss functions or rewards e.g. by message passing. An agent can thus learn to deviate from "selfish" behavior, thereby enhancing overall efficiency. This can also allow asymmetric policies to develop where if, say, one agent finds a short-cut the others avoid it so that it does not become congested. In this way implementations of the system can learn to avoid Braess's paradox. In some contexts the approach may allow different agents to specialize, each focusing on controlling the losses it can decrease.

[0033] The described techniques are applicable to many different kinds of routing and other problems. For example in the context of agents controlling autonomous vehicles more efficient routes can be learned, reducing overall energy consumption and speeding up average or maximum journey times. In the context of an electrical power grid, the techniques can increase network stability and efficiency of power transfer, particularly as grids evolve to be more decentralized, including more sources of renewable power. In the context of a computer network the efficiency of packet data communications can be increased e.g. increasing bandwidth, reducing packet latency, or increasing packet transmission reliability. In the context of a manufacturing plant, agents can learn to cooperate to control items of equipment so that the plant runs efficiently.

[0034] The details of one or more embodiments of the subject matter of this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.

BRIEF DESCRIPTION OF THE DRAWINGS

[0035] FIG. 1 shows a multi-agent machine learning system.

[0036] FIG. 2 shows an example process for training one of the machine learning systems of FIG. 1.

[0037] FIG. 3 shows an example process for one of the mix adjust systems of FIG. 1.

[0038] FIGS. 4a-4c illustrate operation of the system of FIG. 1 in a toy environment.

[0039] Like reference numbers and designations in the drawings indicate like elements.

DETAILED DESCRIPTION

[0040] This specification describes techniques for learning in multi-agent systems. Such systems are common in the real world and may include, for example, autonomous vehicles such as robots which interact whilst performing a task (e.g. warehouse robots), factory or plant automation systems, and computer systems. In such cases the agents may be the robots, items of equipment in the factory or plant, or software agents in a computer system which e.g. control the allocation of tasks to items of hardware or the routing of data on a communications network.

[0041] The agents may use machine learning to learn to perform the task(s). Typically machine learning techniques independently aim to optimize an objective, but this can lead to inefficiencies at the group level. One approach to addressing such problems is to use a central coordinator, but this lacks scalability and provides a single point of failure. However decentralized techniques typically involve communication of significant amounts of data, e.g. describing observations or actions of the individual agents, and this raises problems of communication bandwidth and also excessive power consumption, as such communications, which are often wireless, require energy. This specification describes techniques for multi-agent learning which are both bandwidth and power efficient, and with which agents are able to learn to cooperate to perform a task efficiently.

[0042] Figure 1 shows a multi-agent machine learning system comprising multiple machine learning systems 100a,i,n each of which is configured to control a respective agent 102a,i,n. Each machine learning system 100a,i,n may be implemented as computer programs on one or more computers in one or more locations. In implementations, but not essentially, the machine learning systems are reinforcement learning systems.

[0043] The agents 102a,i,n operate in a common environment 104 each to perform a respective task. In general, how one agent performs its task affects how another of the agents is able to perform its task. For convenience machine learning system 100i and agent 102i are described; the others are similar.

[0044] At each of multiple action-selection time steps machine learning system 100i selects an action a_i to be performed by agent 102i in response to an observation o_i characterizing a state of the environment. The observation may include an image of the environment and/or other sensor or input data from the environment. Such observations are typically pre-processed e.g. by one or more convolutional neural network layers, and/or one or more recurrent neural network layers.

[0045] The machine learning system 100i may also receive a reward r_i as a result of performing the action a_i. In general the reward is a numerical value, i.e. a scalar, and may be based on any event or aspect of the environment. For example, the reward r_i may indicate whether the agent 102i has accomplished the task (e.g., for a manipulation task or navigating to a target location in the environment), or progress of the agent 102i towards accomplishing the task. Interactions of the agent 102i with the environment to perform the task may define episodes. An episode may end with a terminal state e.g. indicating whether the task was performed or not, or after a specified number of action-selection time steps.

[0046] The machine learning system 100i is configured to broadcast the reward r_i it receives at each action-selection time step to each of the other machine learning systems, e.g. via a wired or wireless communications link. Similarly at each action selection time step machine learning system 100i receives the rewards, denoted r_j, broadcast by each of the other machine learning systems, e.g. via the same or another wired or wireless communications link. More specifically, as described later machine learning system 100i transmits reward r_i weighted by a respective mixing parameter for the recipient machine, to each other machine learning system, and receives rewards r_j weighted by respective mixing parameters.

[0047] The machine learning system 100i includes a training engine 130i which is configured to use the rewards r_i it receives, and the (weighted) rewards r_j received by the other machine learning systems, to train the machine learning system 100i to select actions for performing the task. More particularly the training engine 130i is configured to combine the reward r_i and rewards r_j at each action selection time step, e.g. in a linear combination, and to use the combined rewards to train the machine learning system 100i to perform the task.

[0048] Although some implementations train the machine learning system 100i using the combined reward, in principle other objective-defining values could be shared, combined, and used for training. For example some implementations may share the value of a return, i.e. a cumulative reward such as a time-discounted sum of rewards, or the value of a modified reward, or the value of a reinforcement learning loss function. In general each of these example values is a single scalar quantity. Implementations in which the objective-defining values are rewards are described merely as an example.

[0049] The multi-agent machine learning system described herein may use any suitable machine learning technique to train the machine learning system 100i using the combined objective-defining value, e.g. the combined rewards - the techniques described herein are not dependent on the use of any particular training technique. Some example implementations of the system use reinforcement learning.

[0050] In implementations the actions performed by agent 102i are selected by an action selection policy neural network 110i, which is trained by the training engine 130i using a machine learning technique e.g. a reinforcement learning technique. The action selection policy neural network 110i is trained using the combined objective-defining value, e.g. the combined rewards, as a training objective.

[0051] An output of the action selection policy neural network 110i may comprise, for example, action selection scores according to which the action is selected, or the output may determine the action directly, or the output may parameterize a distribution e.g. a Gaussian distribution, according to which the action may be selected stochastically. Depending upon the implementation the machine learning system 100i may include more components than those shown in Figure 1, e.g. a value function neural network.
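By way of illustration only, the following Python sketch shows one of the output options described above, namely converting action selection scores into a stochastically selected action; the function and variable names are assumptions introduced for this example and are not taken from the specification.

```python
import numpy as np

def select_action(action_scores, rng=None):
    """Softmax over action selection scores, then sample an action index."""
    rng = np.random.default_rng() if rng is None else rng
    scores = np.asarray(action_scores, dtype=float)
    probs = np.exp(scores - scores.max())   # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

action = select_action([2.0, 0.5, -1.0])    # e.g. scores for three candidate actions
```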

[0052] The combination of the reward r_i and the rewards r_j is made according to a set of mixing parameters, which may be represented as a vector of weights for agent 102i, Λ_i, with one weight for agent 102i and one weight for each of the other agents. Each weight may be e.g. in the range [0,1] and the weights may be constrained to sum to 1.

[0053] The weights Λ_i for agent 102i may be determined by a mix adjust system 120i. The mix adjust system 120i may adjust the weights for the rewards, i.e. the mixing parameters, so as to encourage the agents to cooperate. In implementations this is done by using gradient descent to optimize a metric which may be described as an efficiency estimate, more particularly by using gradient descent to minimize an inefficiency estimate. In implementations the metric is dependent upon a rate of change of the combined reward with time, in particular when averaged over multiple action-selection time steps e.g. when averaged over an episode. In some implementations the metric is dependent upon a rate of change of the return, e.g. a discounted return, from an episode (where the return is evaluated using the combined reward). In general the efficiency estimate metric may be dependent upon an averaged combined objective-defining value, e.g. average combined reward, or upon the return.

[0054] In implementations a trial modification, e.g. a small perturbation, is made to the set of mixing parameters. The trial modification may have a direction defined, effectively, by a change in direction of the vector of weights. If the averaged combined reward, or return, is increasing i.e. the combined loss is decreasing, the trial modification is maintained, otherwise the mixing parameters are adjusted, in particular in the opposite direction to that of the trial modification. When the mixing parameters are adjusted they may be adjusted by an amount dependent on a gradient of the efficiency estimate metric with respect to the mixing parameters e.g. by the product of this gradient and a learning step size.

[0055] When repeated this process encourages cooperation between the agents. Use of the combined reward enables one agent to donate part of that agent’s reward to another or, equivalently, it allows the agents to share losses, which allows one agent to learn to help another in performing its task.

[0056] In implementations the (in)efficiency estimate represented by the metric, which is later denoted ρ_i, is an estimate of a contribution of the first machine learning system 100i, i.e. of agent 102i, to an upper bound on a so-called price of anarchy for the multi-agent system. The price of anarchy may be defined as a difference between, or ratio of, a sum of losses of the agents and that which could be achieved by fully cooperative agents. In implementations the price of anarchy is "local" in the sense that it relates to a local region of the strategy space i.e. that defined by the set of mixing parameters for agent 102i.

[0057] In implementations the mixing parameters are adjusted to minimize the (in)efficiency estimate represented by the metric. This results in the first machine learning system 100i having a combined objective-defining value which defines a learning objective for the first machine learning system 100i that enables the first machine learning system 100i to learn to select actions which take account of actions selected by (i.e. the action selection policies of) the other machine learning systems. That is, agent 102i learns to cooperate with the other agents.

[0058] When the technique is implemented by each agent, the behavior of the group of agents as a whole converges to behavior, i.e. a set of action selection policies for the agents, which tends to minimize the price of anarchy. That is, the action selection policies for the agents converge towards achieving collectively for the group, a sum of rewards (to be maximized), or losses (to be minimized), which approaches that which would be achieved by fully cooperative agents.

[0059] In implementations this can be achieved by each machine learning system transmitting or broadcasting just a single scalar value, its machine learning objective-defining value, e.g. reward or return, to each of the other machine learning systems. Thus the agents can learn to cooperate with a relatively small additional power and communications bandwidth.

[0060] Depending upon the environment, where the agents are fully cooperative maximizing the collective reward can result in solutions which are not egalitarian. For example one or more agents may sacrifice themselves by taking no reward to improve the collective reward. This is typically undesirable in an engineering context, e.g. in a communications system it could result in no data packets being transmitted. The techniques described herein appear to avoid such undesirable scenarios.

[0061] In some implementations the multi-agent machine learning system of Figure 1 is a fully distributed, peer-to-peer system, lacking a master, coordinating node. This provides robustness as there is no single point of failure. In some implementations functions of the mix adjust system 120i for each machine learning system 100i may be performed by a shared, coordinating system (not shown in Figure 1). Then the coordinating system may receive an averaged combined reward or return from each machine learning system e.g. at the end of each episode and, periodically, a set of mixing parameters from each of the machine learning systems.

[0062] Figure 2 is a flow diagram of an example process for training the action selection policy neural network 110i of machine learning system 100i. The process of Figure 2 may be implemented on each of the machine learning systems 100a,i,n of Figure 1. The process uses a set of mixing parameters from mix adjust system 120i, here a vector of weights, Λ_i, for agent 102i. The steps shown in Figure 2 may be performed for each action selection time step, repeatedly e.g. until the end of a training episode.

[0063] The process begins with machine learning system 100i receiving an observation o_i characterizing a current state of the environment (step 200). The observation may comprise e.g. data from one or more sensors of or inputs to the agent, e.g. image data from a camera and/or motion data representing motion of the agent in the environment, and may include data representing an instruction provided to the agent 102i. The observation may be preprocessed e.g. by one or more neural network systems, and is then processed by the action selection policy neural network 110i (step 202) to determine an action a_i for agent 102i to perform.

[0064] In response to the action a_i the environment transitions to a new state and the machine learning system 100i receives another observation o_i characterizing the new state of the environment and a reward r_i, e.g. a numeric value, as a result of agent 102i performing the action (step 204). The reward r_i may indicate that the new state is closer to one in which the task of agent 102i is completed.

[0065] The machine learning system 100i then broadcasts, i.e. transmits, the reward r_i to the other machine learning systems and receives a corresponding reward r_j from each of the other machine learning systems (step 206). In implementations the reward r_i is transmitted to each of the other machine learning systems weighted by a respective weight, i.e. machine learning system 100i transmits a correspondingly weighted reward to machine learning system j. The reward and the weight may be transmitted as a single scalar product to machine learning system j, or separately. Similarly machine learning system 100i receives weighted reward Λ_ij r_j from machine learning system j. This involves transmitting a set of n scalar values. In some other implementations the rewards and mixing weights may be shared separately (transmitting 2n scalars). A combined reward r̃_i is then determined by machine learning system 100i by combining the reward r_i and rewards r_j, each weighted according to the set of mixing parameters, e.g. according to r̃_i = Σ_j Λ_ij r_j, where Λ_ij is the weight of the j-th reward for agent 102i and the sum includes a weighted reward r_i (step 208).
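By way of illustration only, a minimal Python sketch of steps 206-208 for one agent is given below: sending its reward pre-weighted for each recipient, and forming the combined reward from the already-weighted rewards it receives. The function names, the example weights and the example rewards are assumptions for the example, not values from the specification.

```python
import numpy as np

def outgoing_messages(r_i, recipient_weights):
    """One scalar per recipient j: reward r_i pre-weighted by the weight agent j uses for it."""
    return {j: w * r_i for j, w in recipient_weights.items()}

def combined_reward(own_weight, r_i, weighted_rewards_received):
    """r~_i = own_weight * r_i + sum over j of the already-weighted rewards from agents j."""
    return own_weight * r_i + sum(weighted_rewards_received.values())

# Example for agent 0 of three agents, with assumed mixing weights Lambda_0 = [0.9, 0.05, 0.05]:
Lambda_0 = np.array([0.9, 0.05, 0.05])
r_0 = 1.0
messages = outgoing_messages(r_0, {1: 0.3, 2: 0.2})        # weights agents 1 and 2 use for r_0
received = {1: 0.05 * 0.2, 2: 0.05 * (-0.5)}               # Lambda_0j * r_j sent by agents 1 and 2
r_tilde_0 = combined_reward(Lambda_0[0], r_0, received)    # 0.9*1.0 + 0.01 - 0.025 = 0.885
```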

[0066] The process then performs an iteration of a machine learning technique using as a reward the combined reward r̃_i (step 210). For example the combined reward r̃_i may be used to define a loss function in a TD (temporal difference) learning method or a policy gradient loss of a policy gradient learning technique. The training engine 130i may then train the machine learning system 100i, in particular the action selection policy neural network 110i, to minimize this loss. In some implementations the machine learning technique is a model-based or model-free reinforcement learning technique. Merely by way of example the reinforcement learning technique may be an actor-critic technique.
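By way of illustration only, the following Python sketch shows how a combined-reward actor-critic loss of the general three-term form described above (and the n-step look-ahead return based on the combined rewards) could be assembled for a single time step. It is a scalar arithmetic sketch only; the coefficients, names and example values are assumptions and an actual implementation would use an automatic-differentiation framework.

```python
import numpy as np

def n_step_return(combined_rewards, bootstrap_value, gamma=0.99):
    """R_t = sum_{i=1..n} gamma^(i-1) * r_{t+i} + gamma^n * V(s_{t+n}, theta)."""
    n = len(combined_rewards)
    ret = (gamma ** n) * bootstrap_value
    for i, r in enumerate(combined_rewards, start=1):
        ret += (gamma ** (i - 1)) * r
    return ret

def actor_critic_loss(log_prob, value, entropy, n_step_ret, a=0.5, b=0.01):
    """Single-step sketch of a three-term loss L_PG + a*L_V + b*L_H."""
    advantage = n_step_ret - value
    loss_pg = -log_prob * advantage      # policy gradient term
    loss_v = advantage ** 2              # value estimate (critic) term
    loss_h = -entropy                    # entropy regularization term
    return loss_pg + a * loss_v + b * loss_h

R_t = n_step_return([0.1, 0.0, 0.2, 0.0, 1.0], bootstrap_value=0.3)
loss = actor_critic_loss(log_prob=-1.2, value=0.4, entropy=0.9, n_step_ret=R_t)
```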

[0067] The process loops back to process the next observation (step 202) until the end of an episode is reached. The process then determines a return, g, (step 212) e.g. from a sum or average, optionally time-discounted, of the combined reward received at each of the action-selection time steps. Each training episode may define a mixing trial time step of a mix adjust process as described below.

[0068] Figure 3 is a flow diagram of an example process for adjusting the set of mixing parameters used for determining the combined reward in the process of Figure 2. The process of Figure 3 may be performed by the mix adjust system 120i of machine learning system 100i, for agent 102i, and may be implemented on each of the machine learning systems 100a,i,n; or the process may be performed by a separate component of the multi-agent machine learning system.

[0069] Initially, at step 300, the process may determine a direction in which to adjust the set of mixing parameters. For an n-dimensional vector of weights, i.e. for n machine learning systems, this may comprise randomly selecting an n-dimensional direction a_i e.g. drawn uniformly from a unit n-sphere. The process may then determine a perturbed, trial set of mixing parameters, Λ̃_i, perturbed in the direction of a_i; the trial set of mixing parameters may be subject to a constraint that the perturbed weights sum to 1. For example the process may determine Λ̃_i = softmax(log(Λ_i) + δ a_i) where δ may be a fixed scalar step size. Optionally the process may also determine a random trial duration e.g. constrained between upper and lower duration limits, as a number of mixing trial time steps, i.e. training episodes, to be performed.
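By way of illustration only, a Python sketch of step 300 is shown below: sampling a direction on the unit n-sphere and perturbing the mixing weights in logit space so that they still sum to 1. The step size, the initial "selfish" weight values and the index ordering are assumptions for the example.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def perturb_mixing_weights(Lambda_i, delta=0.1, rng=None):
    """Trial weights: sample a direction on the unit n-sphere, perturb in logit space."""
    rng = np.random.default_rng() if rng is None else rng
    a_i = rng.normal(size=Lambda_i.shape)
    a_i /= np.linalg.norm(a_i)                       # uniform direction on the unit n-sphere
    trial = softmax(np.log(Lambda_i) + delta * a_i)  # perturbed weights still sum to 1
    return trial, a_i

n = 4
Lambda_i = np.full(n, 0.01 / (n - 1))   # assumed near-"selfish" initialization
Lambda_i[0] = 0.99                      # index 0 taken as agent i's own reward
trial_Lambda_i, a_i = perturb_mixing_weights(Lambda_i)
```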

[0070] The process may then monitor the return from each training episode for the duration of the trial, using this to determine a noisy estimate of a set of gradients of the efficiency estimate metric with respect to the set of mixing parameters.

[0071] In more detail, the returns, g, from each episode are averaged over the duration of the trial to provide a mean return, G (step 302). When the trial is complete the process then determines a value for the efficiency estimate from a change in the mean return (step 304). As mentioned above, the efficiency estimate, ρ_i, may represent a contribution of agent 102i to an upper bound on the price of anarchy for the system. The efficiency estimate, ρ_i, may be determined as

ρ_i = ReLU((G_b − G)/τ + ε)

where G_b is a baseline mean return at the start of the trial, G is the mean return at the end of the trial, τ is the trial duration in mixing trial time steps, ReLU(·) is a rectifier function (with a value of zero if the argument is less than zero), and ε is a hyperparameter. The value of ε shifts the origin of the rectifier function; the value may be zero, or greater than zero (and may be chosen according to the size of G).
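By way of illustration only, a minimal Python sketch of step 304 under the form of the efficiency estimate set out above (which is itself a reconstruction from the surrounding definitions) is:

```python
def inefficiency_estimate(G_baseline, G_end, trial_len, eps=0.0):
    """rho_i = ReLU((G_b - G)/tau + eps): non-zero only when the mean return worsened."""
    return max(0.0, (G_baseline - G_end) / trial_len + eps)

print(inefficiency_estimate(G_baseline=5.0, G_end=4.0, trial_len=10))  # 0.1 (return worsened)
print(inefficiency_estimate(G_baseline=5.0, G_end=6.0, trial_len=10))  # 0.0 (return improved)
```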

[0072] The process then determines an estimated set of gradients of the efficiency estimate with respect to the set of mixing parameters (step 306). The estimated set of gradients, ∇_Λi, may have a value dependent on the efficiency estimate, in the direction of the perturbation a_i. For example the estimated set of gradients may be determined from

∇_Λi = ρ_i a_i + ν (e_i ⊘ Λ_i)

where e_i is a vector of all 1s and ⊘ denotes elementwise division. The value of ν is a hyperparameter and may be zero; where it is not zero it may be small e.g. of order 10⁻⁶. The optional ν-term may be considered a regularization term and encourages the values of Λ_i, i.e. the weights, back towards their initial values.

[0073] Determining a gradient of the efficiency estimate metric with respect to the set of mixing parameters, by making a perturbed, trial modification to the set of mixing parameters, facilitates determining how, and in which direction, to adjust the mixing parameters. More particularly (disregarding ε), if the mean return increases during the trial, the value of the rectifier function is zero, and the gradients are zero, and the trial set of mixing parameters may be retained as an updated set of mixing parameters. If the mean return decreases during the trial the mix adjust system 120i may adjust the mix in a direction opposite to the perturbation.

[0074] Thus at step 308 the process updates the set of mixing parameters according to the estimated set of gradients. In some implementations the set of mixing parameters, Λ_i, may be adjusted according to

Λ_i ← softmax(log(Λ_i) − η ∇_Λi)

where η is a hyperparameter defining a mix adjust learning step size. Optionally Λ_i may be clipped in logit space by a lower and/or upper bound; the bounds may have values e.g. in the range 1 to 10.
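By way of illustration only, a Python sketch of steps 306-308 is given below, using the gradient estimate and logit-space update in the (reconstructed, assumed) form set out above; the specific hyperparameter values are assumptions for the example.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def update_mixing_weights(Lambda_i, a_i, rho_i, eta=0.1, nu=1e-6, logit_bound=10.0):
    """One mix-adjust step: gradient estimate in the perturbation direction, logit-space update."""
    grad = rho_i * a_i + nu * (np.ones_like(Lambda_i) / Lambda_i)  # optional nu regularizer
    logits = np.clip(np.log(Lambda_i) - eta * grad, -logit_bound, logit_bound)
    return softmax(logits)

# If rho_i > 0 (the mean return worsened during the trial) the mix moves opposite to the trial
# direction a_i; if rho_i == 0 it is left (almost) unchanged.
```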

[0075] In some implementations Λ_i may be initialized before the process of Figure 3 so that the weight of reward r_i is close to unity and the weights of rewards r_j are small. For example Λ_i may be initialized with Λ_ii = 0.99 and Λ_ij = 0.01/(n − 1) for j ≠ i, so that the weights sum to 1. Thus the process may start with an agent behaving "selfishly", but initializing with non-zero weights Λ_ij can facilitate learning cooperation. Initializing with "selfish" behavior may help reduce the risk of the group of agents finding an efficient, but inequitable equilibrium.

[0076] In a variant of the above described process an egalitarian price of anarchy may be optimized. For example to minimize an egalitarian price of anarchy the following process may be used.

[0077] A combined reward among all agents is estimated. For example, each agent may send its combined reward to a coordinator (which may be one of the agents) which may then communicate to the group which agent has the lowest combined reward, agent "min". Alternatively, a coordinator may be avoided if each agent sends its combined reward to each other agent and each agent determines the agent with the minimum combined reward independently.

[0078] Each agent may then select a random perturbation, perturb their Λ_i, and monitor the combined reward of agent "min" for the duration of the trial (a trial length may be agreed upon by all agents). At the end of the trial each agent may then modify their Λ_i according to whether the combined reward of agent "min" increased or decreased during the trial. For example if the combined reward of agent "min" increased (or stayed the same) the agents may each keep their Λ_i unchanged. If the combined reward of agent "min" decreased the agents may each modify their Λ_i, in particular in a direction opposite to that of their perturbation, e.g. as previously described.
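By way of illustration only, a minimal Python sketch of the egalitarian variant follows: each agent identifies the worst-off ("min") agent from the shared averaged combined rewards and keeps its trial perturbation only if that agent's combined reward did not fall. The message content and example values are assumptions for the example.

```python
def min_agent(avg_combined_reward_by_agent):
    """Identify the worst-off agent from the shared averaged combined rewards."""
    return min(avg_combined_reward_by_agent, key=avg_combined_reward_by_agent.get)

def keep_trial_mix(min_reward_before, min_reward_after):
    """Keep the perturbed mix only if the worst-off agent's combined reward did not fall."""
    return min_reward_after >= min_reward_before

before = {0: 1.2, 1: 0.8, 2: 0.1}
worst = min_agent(before)                    # agent 2
keep = keep_trial_mix(before[worst], 0.05)   # False: revert, i.e. move opposite to the perturbation
```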

[0079] In some applications each agent of the group of agents comprises a robot or autonomous or semi-autonomous land or air or sea vehicle. Each of the tasks may comprise navigating a path through a physical environment from a start point to an end point. For example the start point may be a present location of the agent; the end point may be a destination of the agent. The first or each machine learning, e.g. reinforcement learning, objective-defining value may be dependent on an estimated time or distance for the first or each agent to physically move from the start point to the end point. For example a machine learning, e.g. reinforcement learning, objective may minimize an expected delay or maximize an expected reward or return (cumulative reward) dependent upon speed of movement of the agent; or a machine learning, e.g. reinforcement learning, objective may minimize an expected length of journey.

[0080] The first or each machine learning, e.g. reinforcement learning, system may receive observations which relate to movement of the agent along a path from the start point to the end point, such as average traffic speed for routes on a map through which the agents navigate. The observations may be discrete, e.g. from junctions, or they may extend over part or all of the length of a route; they may include data such as congestion indicators. This information may be provided from a remote source. The observations may optionally include map data defining a map of routes which include more than the start point and the end point e.g. of a local region.

[0081] In some applications the first or each machine learning, e.g. reinforcement learning, system may be or be combined with an e.g. recurrent, route-planning reinforcement learning neural network, and optionally including memory for storing routing map data.

[0082] The actions performed by the first or each machine learning, e.g. reinforcement learning, system may include navigation actions to select between different routes to the same end point. For example the actions may include steering or other direction control actions for an agent.

[0083] As previously noted, the rewards/returns may be dependent upon a time or distance between nodes of a route and/or between the start and end points. The routes may be defined, e.g. by roads, or in the case of a warehouse by gaps between stored goods, or the routes may be in free space e.g. for drone agents. The agents may comprise robots or vehicles performing a task such as warehouse, logistics, or factory automation, e.g. collecting, placing, or moving stored goods or goods or parts of goods during their manufacture; or the task performed by the agents may comprise package delivery control.

[0084] In some applications the machine learning, e.g. reinforcement learning, system(s) may be configured to control traffic signals, e.g. at junctions, in a similar way to that described above, to control the traffic flow of pedestrian traffic or human-controlled vehicles. The implementation details of such systems may be as previously described for robots and autonomous vehicles.

[0085] In some further related applications the technique is applied to simulations of such systems. For example such a simulation may be used to design a route network such as a road network or warehouse or factory layout.

[0086] In general the observations may include, for example, one or more of images, object position data, and sensor data to capture observations as the agent interacts with the environment, for example sensor data from an image, distance, or position sensor or from an actuator. In the case of a robot or other mechanical agent or vehicle the observations may similarly include one or more of the position, linear or angular velocity, force, torque or acceleration, and global or relative pose of one or more parts of the agent. The observations may be defined in 1, 2 or 3 dimensions, and may be absolute and/or relative observations. For example in the case of a robot the observations may include data characterizing the current state of the robot, e.g., one or more of: joint position, joint velocity, joint force, torque or acceleration, and global or relative pose of a part of the robot such as an arm and/or of an item held by the robot. The observations may also include, for example, sensed electronic signals such as motor current or a temperature signal; and/or image or video data for example from a camera or a LIDAR sensor, e.g., data from sensors of the agent or data from sensors that are located separately from the agent in the environment.

[0087] In these implementations, the actions may be control inputs to control the robot, e.g., torques for the joints of the robot or higher-level control commands; or to control the autonomous or semi-autonomous land or air or sea vehicle, e.g., torques to the control surface or other control elements of the vehicle or higher-level control commands; or e.g. motor control data. In other words, the actions can include for example, position, velocity, or force/torque/acceleration data for one or more joints of a robot or parts of another mechanical agent. Action data may include data for these actions and/or electronic control data such as motor control data, or more generally data for controlling one or more electronic devices within the environment the control of which has an effect on the observed state of the environment. For example in the case of an autonomous or semi-autonomous land or air or sea vehicle the actions may include actions to control navigation e.g. steering, and movement e.g. braking and/or acceleration of the vehicle.

[0088] For example the simulated environment may be a simulation of a robot or vehicle agent and the reinforcement learning system may be trained on the simulation. For example, the simulated environment may be a motion simulation environment, e.g., a driving simulation or a flight simulation, and the agent is a simulated vehicle navigating through the motion simulation. In these implementations, the actions may be control inputs to control the simulated user or simulated vehicle. A simulated environment can be useful for training a reinforcement learning system before using the system in the real world. In another example, the simulated environment may be a video game and the agent may be a simulated user playing the video game. Generally in the case of a simulated environment the observations may include simulated versions of one or more of the previously described observations or types of observations and the actions may include simulated versions of one or more of the previously described actions or types of actions.

[0089] In the case of an electronic agent the observations may include data from one or more sensors monitoring part of a plant or service facility such as current, voltage, power, temperature and other sensors and/or electronic signals representing the functioning of electronic and/or mechanical items of equipment. In some applications the agent may control actions in a real-world environment including items of equipment, for example in a facility such as: a data center, server farm, or grid mains power or water distribution system, or in a manufacturing plant or service facility. The observations may then relate to operation of the plant or facility. For example additionally or alternatively to those described previously they may include observations of power or water usage by equipment, or observations of power generation or distribution control, or observations of usage of a resource or of waste production. The agent may control actions in the environment to increase efficiency, for example by reducing resource usage, and/or reduce the environmental impact of operations in the environment, for example by reducing waste. For example the agent may control electrical or other power consumption, or water use, in the facility and/or a temperature of the facility and/or items of equipment within the facility. The actions may include actions controlling or imposing operating conditions on items of equipment of the plant/facility, and/or actions that result in changes to settings in the operation of the plant/facility e.g. to adjust or turn on/off components of the plant/facility.
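As an illustrative sketch only, and not the described system, the following shows one way an objective-defining value for such a facility-control agent could trade off power usage against a temperature setpoint; the cost form, weights and setpoint are assumptions.

```python
# Illustrative sketch: a simple objective-defining value for a facility-control
# agent that penalizes energy use and deviation from a temperature setpoint.
def facility_cost(power_kw: float, temperature_c: float,
                  temperature_setpoint_c: float = 22.0,
                  power_weight: float = 1.0,
                  temperature_weight: float = 0.5) -> float:
    """Lower is better."""
    temperature_error = abs(temperature_c - temperature_setpoint_c)
    return power_weight * power_kw + temperature_weight * temperature_error

# Example: cost for one observation of the plant.
cost = facility_cost(power_kw=120.0, temperature_c=24.5)
```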

[0090] In some further applications, the environment is a real-world environment and the agent manages distribution of tasks across computing resources e.g. on a mobile device and/or in a data center. In these implementations, the actions may include assigning tasks to particular computing resources.
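The following sketch is illustrative only: it shows the kind of "assign task to resource" actions described above, using a simple greedy baseline in place of a learned policy; the function and its load-balancing heuristic are assumptions, not part of the specification.

```python
# Illustrative sketch: assigning tasks to computing resources. A learned policy
# would output these assignments as actions; a greedy baseline is shown here.
from typing import List

def assign_tasks(task_loads: List[float], num_resources: int) -> List[int]:
    """Assign each task to the currently least-loaded resource."""
    resource_load = [0.0] * num_resources
    assignments = []
    for load in task_loads:
        target = min(range(num_resources), key=lambda r: resource_load[r])
        assignments.append(target)
        resource_load[target] += load
    return assignments

print(assign_tasks([3.0, 1.0, 2.5, 0.5], num_resources=2))  # e.g. [0, 1, 1, 0]
```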

[0091] In another application the environment is a data packet communications network environment, and each agent of the group of agents comprises a router to route packets of data over the communications network. The first or each machine learning, e.g. reinforcement learning, objective-defining value may then be dependent on a routing metric for a path from the router to a next or further node in the data packet communications network, e.g. an estimated time for a group of one or more routed data packets to travel from the router to a next or further node in the data packet communications network. The observations may comprise e.g. observations of a routing table which includes the routing metrics. A routing metric may comprise a metric of one or more of path length, bandwidth, load, hop count, path cost, delay, maximum transmission unit (MTU), and reliability. Optionally the observations may include a route map including more than just the node at which the router is located and a next node; optionally the first or each reinforcement learning system may comprise a recurrent neural network, and optionally memory for storing routing map data.
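By way of a hedged illustration, and not the described implementation, the sketch below derives a per-router objective-defining value as a weighted combination of routing-table metrics such as delay, hop count and load; the metric names and weights are assumptions.

```python
# Illustrative sketch: deriving a per-router objective-defining value from
# routing-table metrics. Lower is better. Field names and weights are assumptions.
from typing import Dict

def route_objective(route_metrics: Dict[str, float],
                    weights: Dict[str, float] = None) -> float:
    """Weighted combination of routing metrics for the path to the next node."""
    weights = weights or {"delay_ms": 1.0, "hop_count": 0.1, "load": 0.5}
    return sum(weights[name] * route_metrics.get(name, 0.0) for name in weights)

routing_table_entry = {"delay_ms": 12.0, "hop_count": 3, "load": 0.4}
objective = route_objective(routing_table_entry)  # estimated "cost" of the route
```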

[0092] In another application the environment is an electrical power distribution environment. As power grids become more decentralized, for example because of the addition of multiple smaller-capacity, potentially intermittent renewable power generators, the additional interconnections amongst the power generators and consumers can destabilize the grid, and a significant proportion of links can be subject to Braess’s paradox, in which adding capacity can cause overload of a link, particularly because of phase differences between connected points.

[0093] In such an environment each agent of the group of agents may be configured to control routing of electrical power from a node associated with the agent to one or more other nodes over one or more power distribution links, e.g. in a “smart grid”. The first or each machine learning, e.g. reinforcement learning, objective-defining value may then be dependent on one or both of a loss and a frequency or phase mismatch over the one or more power distribution links. The observations may comprise e.g. observations of routing metrics such as capacity, resistance, impedance, loss, frequency or phase associated with one or more connections between nodes of a power grid. The actions may comprise control actions to control the routing of electrical power between the nodes. Optionally the first or each machine learning, e.g. reinforcement learning, system may be provided with observations of a routing table, similar to a computer network routing table, and including one or more of the routing metrics e.g. for each link. Such an approach may also be used in simulation to facilitate design or testing of a power grid, e.g. to determine locations for additional interconnects between nodes.
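The following is an illustrative sketch only: one possible objective-defining value for a grid-node agent that depends on both transmission loss and phase mismatch over its links, as described above; the link representation, weights and cosine penalty are assumptions.

```python
# Illustrative sketch: objective-defining value for a grid-node agent depending
# on transmission loss and phase mismatch over its power distribution links.
import math
from typing import List, Tuple

def grid_node_objective(links: List[Tuple[float, float]],
                        loss_weight: float = 1.0,
                        phase_weight: float = 10.0) -> float:
    """links: (loss_fraction, phase_mismatch_radians) per link. Lower is better."""
    total = 0.0
    for loss_fraction, phase_mismatch in links:
        total += loss_weight * loss_fraction
        total += phase_weight * (1.0 - math.cos(phase_mismatch))  # 0 when phases match
    return total

objective = grid_node_objective([(0.02, 0.05), (0.04, 0.20)])
```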

[0094] The agents may further comprise static or mobile software agents i.e. computer programs configured to operate autonomously and/or with other software agents or people to perform a task. For example the environment may be an integrated circuit routing environment and each agent of the group of agents may be configured to perform a routing task for routing interconnection lines of an integrated circuit such as an ASIC. The first or each machine learning, e.g. reinforcement learning, objective-defining value may be dependent on one or more routing metrics such as an interconnect resistance, capacitance, impedance, loss, speed or propagation delay, physical line parameters such as width, thickness or geometry, and design rules. The objectives may include one or more objectives relating to a global property of the routed circuitry e.g. component density, operating speed, power consumption, material usage, or a cooling requirement. The actions may comprise component placing actions e.g. to define a component position or orientation and/or interconnect routing actions e.g. interconnect selection and/or placement actions.
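As an illustrative sketch rather than the described system, the following combines per-interconnect metrics with a global property (total routed length) into a single objective-defining value for an integrated-circuit routing agent; the parameter names and weights are assumptions.

```python
# Illustrative sketch: objective-defining value for an IC routing agent combining
# per-interconnect propagation delay with total routed wire length. Lower is better.
from typing import List, Dict

def routing_objective(interconnects: List[Dict[str, float]],
                      delay_weight: float = 1.0,
                      length_weight: float = 0.01) -> float:
    total_delay = sum(net["propagation_delay_ps"] for net in interconnects)
    total_length = sum(net["length_um"] for net in interconnects)
    return delay_weight * total_delay + length_weight * total_length

nets = [{"propagation_delay_ps": 35.0, "length_um": 120.0},
        {"propagation_delay_ps": 50.0, "length_um": 300.0}]
objective = routing_objective(nets)
```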

[0095] Figure 4 illustrates operation of the system in the toy environment shown in Figure 4a. In this environment a predator 400 chases two prey agents 402, 404 around a table; prey 402 can only escape if prey 404 simultaneously moves out of the way, as shown. The observations are the distances around the table of each prey from the predator and from the other prey. The prey receive 0 reward if they choose not to move, -0.01 if they attempt to move, and -1 if the predator is adjacent to them after moving. The agents used a version of the “REINFORCE” policy gradient reinforcement learning technique. Figure 4b shows a graph of reward against training epochs for the described technique 450, an optimal (fully cooperative) technique 452, and a conventional “selfish” policy gradient technique 454 implemented by each prey (the shaded region shows ± one standard deviation). Figure 4b shows that the described technique approaches the optimal performance. Figure 4c shows, on respective axes, the attention weight of each prey agent to its own reinforcement learning loss (reward). Each agent starts with a weight for its own loss (reward) close to 1, at star 410. As learning progresses each agent learns to de-weight its own loss (reward), thus putting more weight on the other agent’s loss (reward), ending at a mix of losses indicated by star 412.
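The following minimal sketch is not the patented implementation; it only illustrates the idea that each prey agent trains on a learned mixture of its own loss and the other agent’s loss, and that the mixing parameters are themselves adjusted by gradient descent on an efficiency cost. The softmax parameterization, learning rate and finite-difference gradient estimate are all assumptions for illustration.

```python
# Minimal, assumption-laden sketch of mixing two agents' losses with learned
# mixing weights; not the described implementation.
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

# Per-agent logits over {weight on own loss, weight on other agent's loss},
# starting near (1, 0), i.e. close to a purely "selfish" agent (star 410).
mixing_logits = np.array([3.0, 0.0])

def combined_loss(own_loss: float, other_loss: float) -> float:
    """Loss actually used to train this agent's policy: a learned mixture."""
    w_own, w_other = softmax(mixing_logits)
    return w_own * own_loss + w_other * other_loss

# Example: mix this agent's policy gradient loss with the other prey's loss.
mixed = combined_loss(own_loss=0.5, other_loss=0.8)

def update_mixing(efficiency_cost_fn, lr: float = 0.05, eps: float = 1e-3):
    """Adjust the mixing logits by gradient descent on an efficiency cost that is
    high when the combined loss is worsening over time (finite-difference estimate
    used here purely for illustration)."""
    global mixing_logits
    grad = np.zeros_like(mixing_logits)
    for k in range(mixing_logits.size):
        bumped = mixing_logits.copy()
        bumped[k] += eps
        grad[k] = (efficiency_cost_fn(bumped) - efficiency_cost_fn(mixing_logits)) / eps
    mixing_logits = mixing_logits - lr * grad
```

In this sketch, repeated calls to update_mixing would move the logits away from the purely selfish starting point, analogous to the trajectory from star 410 to star 412 in Figure 4c.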

[0096] For a system of one or more computers to be configured to perform particular operations or actions means that the system has installed on it software, firmware, hardware, or a combination of them that in operation cause the system to perform the operations or actions. For one or more computer programs to be configured to perform particular operations or actions means that the one or more programs include instructions that, when executed by data processing apparatus, cause the apparatus to perform the operations or actions.

[0097] Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly-embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non-transitory program carrier for execution by, or to control the operation of, data processing apparatus. Alternatively or in addition, the program instructions can be encoded on an artificially-generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus. The computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.

[0098] The term “data processing apparatus” refers to data processing hardware and encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can also be or further include special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit). The apparatus can optionally include, in addition to hardware, code that creates an execution environment for computer programs, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.

[0099] A computer program (which may also be referred to or described as a program, software, a software application, a module, a software module, a script, or code) can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, e.g., files that store one or more modules, sub-programs, or portions of code. A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.

[0100] The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).

[0101] Computers suitable for the execution of a computer program can be based on, by way of example, general or special purpose microprocessors or both, or any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory or a random access memory or both. The elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data.

Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device, e.g., a universal serial bus (USB) flash drive, to name just a few.

[0102] Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.

[0103] To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user’s client device in response to requests received from the web browser.

[0104] Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back-end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front-end component, e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (“LAN”) and a wide area network (“WAN”), e.g., the Internet.

[0105] The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

[0106] While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.

[0107] Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

[0108] Particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In certain implementations, multitasking and parallel processing may be advantageous.

[0109] What is claimed is: