Title:
A SYSTEM AND FRAMEWORK FOR OPTIMAL DECISION MAKING IN THE PRESENCE OF NON-STATIONARY OPPONENTS
Document Type and Number:
WIPO Patent Application WO/2023/072385
Kind Code:
A1
Abstract:
This disclosure relates to optimal decision making in the presence of non-stationary opponents. Additionally, there is a decision-making apparatus capable of doing the same. The disclosure includes a training apparatus for forming an algorithm to solve a multi-player problem, the apparatus comprising one or more processors configured to: (i) receive, for each of a plurality of players, a respective strategy and adopt that strategy as a current strategy for the player; and (ii) repeatedly refine the current strategy for each player by iteratively performing, for each of a plurality of stages, the steps of: (a) for each of the players, computing a loss attributable to the player implementing the adopted current strategy during the respective stage in comparison to that player implementing a reference strategy during that stage; (b) computing an aggregated loss over all the players; and (c) updating the current strategies of the players in dependence on the aggregated loss; the processors further being configured to: (iii) alter the reference strategy in response to a deviation between the players' current strategies on a present iteration of step (ii) and a previous iteration of step (i).

Inventors:
YANG YAODONG (DE)
DINH LE CONG (DE)
MGUNI DAVID (DE)
Application Number:
PCT/EP2021/079853
Publication Date:
May 04, 2023
Filing Date:
October 27, 2021
Assignee:
HUAWEI TECH CO LTD (CN)
YANG YAODONG (DE)
International Classes:
A63F13/55; A63F13/67; G06Q10/00; G06Q50/06
Foreign References:
US20210178261A1 (2021-06-17)
Other References:
LE CONG DINH ET AL: "Online Double Oracle", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 4 June 2021 (2021-06-04), XP081979620
STEPHEN MCALEER ET AL: "XDO: A Double Oracle Algorithm for Extensive-Form Games", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 11 March 2021 (2021-03-11), XP081909342
MAX OLAN SMITH ET AL: "Iterative Empirical Game Solving via Single Policy Best Response", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 3 June 2021 (2021-06-03), XP081984880
MCMAHAN H B ET AL: "Planning in the presence of Cost Functions Controlled by an Adversary", MACHINE LEARNING. PROCEEDINGS OF THE INTERNATIONAL CONFERENCE, 1 January 2003 (2003-01-01), pages 536 - 543, XP008098344
Attorney, Agent or Firm:
KREUZ, Georg M. (DE)
Claims:
CLAIMS

1. A training apparatus for forming an algorithm to solve a multi-player problem, the apparatus comprising one or more processors configured to:

(i) receive, for each of a plurality of players, a respective strategy and adopt that strategy as a current strategy for the player; and

(ii) repeatedly refine the current strategy for each player by iteratively performing, for each of a plurality of stages the steps of:

(a) for each of the players, computing a loss attributable to the player implementing the adopted current strategy during the respective stage in comparison to that player implementing a reference strategy during that stage;

(b) computing an aggregated loss over all the players; and

(c) updating the current strategies of the players in dependence on the aggregated loss; the processors further being configured to:

(iii) alter the reference strategy in response to a deviation between the players' current strategies on a present iteration of step (ii) and a previous iteration of step (i).

2. The training apparatus of claim 1, wherein each stage of the plurality of stages is a time stage.

3. The training apparatus of any of the preceding claims, wherein the aggregated loss is computed over a plurality of stages.

4. The training apparatus according to claim 1 or 2, wherein the aggregated loss is computed over a single stage.

5. The training apparatus according to any one of the preceding claims, wherein the aggregated loss is an average of aggregated losses.

6. The training apparatus according to any one of the preceding claims, wherein the previous iteration in step (iii) is the iteration immediately preceding the present step (i).

7. The training apparatus of any one of the preceding claims, wherein the reference strategy includes a subset of all possible pure strategies.

8. The training apparatus of claim 7, wherein a pure strategy is a strategy that can be adopted independent of a problem state.

9. The training apparatus of claim 8, wherein the reference strategy is further altered based on the best response calculated for each stage.

10. A training apparatus according to any one of the preceding claims, wherein the multiplayer problem is performed during the operation of a control system for an industrial apparatus.

11. A training apparatus as claimed in claim 1, wherein a loss is computed according to a loss function and the one or more of the processors are configured to adversarially alter the loss function between at least some of the iterations of step (ii).

12. A training apparatus according to claim 11, wherein the processors are configured to alter the reference strategy in response to the altered loss function.

13. The training apparatus of any one of the preceding claims, wherein the one or more processors are further configured to calculate a best response strategy that can be performed in each stage, by computing a local minimum of the computed losses attributable to the players.

14. A decision-making apparatus for solving a multi-player problem, the apparatus comprising one or more processors configured to, for each of a plurality of players, implement steps (i), (ii) and (iii) and adopt as a solution to the problem a solution determined according to the respective current strategies after multiple iterations of step (ii).

15. The decision-making apparatus according to claim 14, wherein the multi-player problem is a two-player zero-sum game.

Description:
A SYSTEM AND FRAMEWORK FOR OPTIMAL DECISION MAKING IN THE PRESENCE OF NON-STATIONARY OPPONENTS

FIELD OF THE INVENTION

This invention relates to optimal decision making in the presence of non-stationary opponents. Additionally, there is a decision-making apparatus capable of doing the same.

BACKGROUND

Understanding games with large action spaces is a critical topic in a variety of fields including but not limited to economics, operations research and artificial intelligence. A key challenge to solving large games is to compute a Nash equilibrium (NE) in which no player will be better off by deviation.

Many real-world applications require a player to maintain high average performance while learning the optimal strategy. These games often comprise a large number of dimensions, which makes traditional linear programming and online learning infeasible and requires scalable methods to solve them.

A further difficulty in determining the optimal strategy is that the optimal strategy can also heavily depend on the opponent's behaviour; thus, the usual Nash equilibrium concept cannot be fully adapted to this scenario. One example of this is the game Rock-Paper-Scissors. In brief, each player selects one of rock, paper or scissors at the same time as the other player selects one from the same group. The selections are then compared to see which player wins, based on the rules that rock beats scissors but loses to paper, paper beats rock but loses to scissors, and scissors beats paper but loses to rock. In this game, against an opponent who always chooses Rock, choosing Paper consistently is the optimal strategy rather than playing the NE (1/3, 1/3, 1/3) strategy in which each of the possible selections is played one third of the time.
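For concreteness, and assuming the ordering (Rock, Paper, Scissors) for both players, the row player's payoff matrix for this game can be written as:

```latex
A \;=\; \begin{pmatrix} 0 & -1 & \phantom{-}1 \\ \phantom{-}1 & 0 & -1 \\ -1 & \phantom{-}1 & 0 \end{pmatrix},
\qquad
\pi_{\mathrm{NE}} = c_{\mathrm{NE}} = \left(\tfrac{1}{3}, \tfrac{1}{3}, \tfrac{1}{3}\right).
```

Against a column player who always plays Rock (the first column), the pure strategy Paper earns the row player +1 every round, whereas the NE strategy only earns 0 in expectation, which is the point made above.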

Traditional online learning methods cannot be directly applied to achieve a no-regret algorithm in real-world games such as Poker or Go, because the number of pure strategies is huge; Go, for example, has more actions (pure strategies) than there are atoms in the universe, which causes computational hardness and a large regret bound. In other words, finding a NE is generally intractable in games; computing a two-player NE is known to be PPAD (Polynomial Parity Arguments on Directed graphs) hard. An exception to the difficulty of computing the NE in a two-player game is the two-player zero-sum game, where the NE can be tractably solved by a linear program (LP). Despite the polynomial-time complexity of solving an LP, LP solvers are not adequate in games with prohibitively large action spaces. As a result, for games with large action spaces, the focus shifts towards finding efficient approximate solutions.

One such solution is a Double Oracle (DO) algorithm and its variation Policy Space Response Oracles (PSRO). These are powerful approaches to finding approximate NE solutions in games where the support of a NE is relatively small. The support of a strategy is the set of pure strategies used to construct that strategy. In the dynamics of DO, players initially begin with a limited number of strategies (a strategy pool), and can thus play only a sub-game of the original game. Then, at each iteration, a best-response strategy to the NE of the last sub-game, which is assumed to be given by an Oracle, will be added into each agent’s strategy pool. The process stops when the best response is already in the strategy pool or the performance improvement becomes trivial. This is best seen in figure 1. When an exact best-response strategy is not achievable, an approximate solution is often adopted. For example, PSRO applies reinforcement learning (RL) oracles to compute the best response. The algorithm below shows an example of a known Double Oracle algorithm.
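For illustration only, the following is a minimal Python sketch of this generic Double Oracle loop for a two-player zero-sum matrix game; it is not the Algorithm 1 referred to below, and the linear-programming helper, the single-strategy starting pools and the stopping rule are simplifying assumptions.

```python
import numpy as np
from scipy.optimize import linprog


def lp_maximin(A):
    """Row player's maximin (NE) mixed strategy for the zero-sum game x^T A y."""
    n, m = A.shape
    c = np.r_[np.zeros(n), -1.0]                    # variables: x_1..x_n, v; minimise -v
    A_ub = np.hstack([-A.T, np.ones((m, 1))])       # for every column j: v - x^T A[:, j] <= 0
    A_eq = np.r_[np.ones(n), 0.0].reshape(1, -1)    # sum_i x_i = 1
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(m), A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, 1)] * n + [(None, None)])
    return res.x[:n]


def double_oracle(A, max_iters=100):
    """Grow restricted strategy pools until the best responses are already inside them."""
    rows, cols = {0}, {0}                           # strategy pools: start from one pure strategy each
    for _ in range(max_iters):
        R, C = sorted(rows), sorted(cols)
        sub = A[np.ix_(R, C)]                       # current sub-game
        x = lp_maximin(sub)                         # row NE of the sub-game
        y = lp_maximin(-sub.T)                      # column NE of the sub-game
        br_row = int(np.argmax(A[:, C] @ y))        # exact best response to the column NE
        br_col = int(np.argmin(x @ A[R, :]))        # exact best response to the row NE
        if br_row in rows and br_col in cols:       # nothing new to add: stop
            break
        rows.add(br_row)
        cols.add(br_col)
    return R, C, x, y
```

On the Rock-Paper-Scissors payoff matrix given earlier, this loop grows both pools to all three pure strategies and returns the uniform equilibrium.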

The pseudocode of the known Double Oracle method is given as Algorithm 1. The Double Oracle method approximates the Nash equilibrium in large zero-sum games by iteratively solving a series of sub-games (i.e. games with a restricted set of pure strategies). Specifically, at time step t, the DO learner solves the NE of a sub-game G_t. Since the sets of pure strategies of the sub-game G_t = (Π_t, C_t) are often much smaller than the total number of strategies in the original game, the NE of the sub-game can be easily solved in line 5 of Algorithm 1. Based on the NE of the sub-game (π*_t, c*_t), each player finds a best response to the NE (line 6 of Algorithm 1) and expands their strategy set (line 7). PSRO methods are variations of the DO method in which reinforcement learning (RL) methods are adopted to approximate the best response strategy. For games where Assumption 1, discussed below, does not hold, DO methods revert to attempting to solve the original game with the complete strategy set, and therefore no advantage is achieved over LP solutions.

In addition, this traditional solution of the Double Oracle approach has a number of other challenges that need to be overcome. One challenge is that the Double Oracle approach requires players to collaborate with each other in the restricted games (sub-games) of a known environment. As such, the approach will fail in adversarial environments when the players are not cooperating and instead are attempting to best each other. In other words, both players must follow the same learning dynamics such as fictitious play (FP). This contradicts many real-world scenarios where an opponent can choose to play any (non-stationary) strategy that is available in the sub-game.

A further challenge is that the outcome of the Double Oracle approach is only an approximation of the NE. This can be useful in protecting against the worst outcomes against a perfect opponent; however, in many games the opponent is rarely perfect, and NE is not the optimal strategy. Therefore, the Double Oracle approach lacks a guarantee of good performance. This is best described by returning to the example of Rock-Paper-Scissors.

In this example, if the column player is not a perfect opponent (i.e., plays (Rock, Paper, Scissors) = (1/2, 1/2, 0) instead of the NE (1/3, 1/3, 1/3)), then the NE is not optimal for the row player. Instead, the optimal strategy is to play (R, P, S) = (0, 1, 0). The matrix of strategies for each iteration of rock, paper, scissors can be seen in the table of figure 2a. Figure 2b then demonstrates an example of the Double Oracle employed on the game of rock, paper, scissors. Here it can be seen that a starting strategy is chosen in which both players play rock, R = (1, 0, 0). In the first iteration the game is resolved such that the players should play paper, P = (0, 1, 0), to counter the initial strategy. In the second iteration, the strategy scissors is played, which is not the optimal solution.

Unfortunately, the Double Oracle algorithm above cannot learn the optimal strategy while achieving a payoff as good as that of a best response. This is because DO methods are not rational: they do not provide a learning scheme that ensures an adversary can be exploited (and thus they do not achieve the no-regret property). While a NE strategy guarantees the best performance of a player in the worst-case scenario, thus protecting the player in the worst case, it is too pessimistic when the opponent is not rational. For example, if an opponent continually plays rock, again and again, the optimal strategy is not to play each of rock, paper and scissors evenly, but instead to play paper continually until the opponent changes tactics, in order to achieve the greatest reward.

Solving the NE in large-scale games is demanding. An alternative approach is to consider learning-based methods. By playing the same game repeatedly, a learning algorithm can approximate the NE asymptotically. A common metric to quantify the performance of a learning algorithm is to compare its cumulative payoff with that of the best fixed strategy in hindsight of the outcome. This is called (external) regret.

Below is the basis for a no-regret algorithm. In this definition, c_1, c_2, ... is a sequence of mixed strategies played by the column player (first player), and an algorithm of the row player (second player) that generates a sequence of mixed strategies π_1, π_2, ... is called a no-regret algorithm if the following holds.
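Writing A for the row player's loss matrix, so that π^T A c is the expected loss when the row player plays π against the column strategy c (the choice of a loss rather than payoff matrix is a notational assumption), a standard way to state this condition is:

```latex
\frac{1}{T}\left(\sum_{t=1}^{T} \pi_t^{\top} A\, c_t \;-\; \min_{\pi \in \Pi} \sum_{t=1}^{T} \pi^{\top} A\, c_t\right) \;\longrightarrow\; 0
\quad \text{as } T \to \infty .
```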

In the above expression T may be a stage comprising a number of stages or time steps t.

No-regret algorithms are of great interest in two-player zero-sum games since if both players follow a no-regret algorithm (not necessarily the same one), the average strategies of both players converge to a NE of the game. For example, a well-known learning algorithm for games that has the no-regret property is the Multiplicative Weights Update (MWU) algorithm. This MWU algorithm is described as:
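For n pure strategies (experts) with loss vector l_t at iteration t, a standard exponential-weights form of this update is:

```latex
\pi_{t+1}[i] \;=\; \frac{\pi_t[i]\, e^{-\mu_t\, l_t[i]}}{\sum_{j=1}^{n} \pi_t[j]\, e^{-\mu_t\, l_t[j]}},
\qquad i = 1, \dots, n .
```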

Where μ_t > 0 is a step-size parameter, π_0 = [1/n, ..., 1/n] and n is the number of pure strategies. The number of pure strategies can also be thought of as the number of experts. The regret of the MWU can be bounded by an expression of order √(T log n). Intuitively, the MWU algorithm functions by putting larger weights on experts who have lower losses in the long run. In other words, the more successful an expert is, the more their chosen strategy will be favoured. Thus, compared to the best fixed strategy in hindsight (i.e., the expert with the lowest average loss), the MWU can achieve the no-regret property. However, since the MWU requires that the whole pure strategy set is updated in each iteration, it is not possible to use it on large games, where the number of pure strategies is too large.
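As a concrete illustration, a minimal NumPy sketch of this update is given below; the three-expert example, the loss values and the fixed step size are arbitrary choices for illustration.

```python
import numpy as np


def mwu_update(pi, loss, mu):
    """One Multiplicative Weights Update step: down-weight experts in
    proportion to the loss they have just incurred, then re-normalise."""
    weights = pi * np.exp(-mu * loss)
    return weights / weights.sum()


pi = np.ones(3) / 3                       # uniform start over three experts
mu = 0.1                                  # fixed step size (an arbitrary choice)
for loss in ([1.0, 0.0, 0.5], [1.0, 0.2, 0.4], [0.9, 0.1, 0.6]):
    pi = mwu_update(pi, np.asarray(loss), mu)
print(pi)                                 # weight shifts towards the second expert, whose losses are lowest
```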

When the game size is large, e.g., the number of strategies is large, if both players follow no-regret algorithms then they will be guaranteed to converge to the NE in a zero-sum game. However, the regret bound of popular no-regret algorithms usually depends on the game size. As a result, directly applying no-regret algorithms, though rational, cannot solve large-scale zero-sum games.

It is therefore an objective of the training apparatus disclosed herein to solve the above problems by maintaining an average payoff that is as good as the best response in hindsight (i.e., no-regret property).

SUMMARY

In a first aspect of the present disclosure, there is provided a training apparatus for forming an algorithm to solve a multi-player problem, the apparatus comprising one or more processors configured to: (i) receive, for each of a plurality of players, a respective strategy and adopt that strategy as a current strategy for the player; and (ii) repeatedly refine the current strategy for each player by iteratively performing, for each of a plurality of stages, the steps of: (a) for each of the players, computing a loss attributable to the player implementing the adopted current strategy during the respective stage in comparison to that player implementing a reference strategy during that stage; (b) computing an aggregated loss over all the players; and (c) updating the current strategies of the players in dependence on the aggregated loss; the processors further being configured to: (iii) alter the reference strategy in response to a deviation between the players' current strategies on a present iteration of step (ii) and a previous iteration of step (i). This configuration allows a multi-player problem to be solved in an optimised manner to achieve an average payoff that is as good as the best response in hindsight. It also guarantees the average performance for the player while optimizing the strategy set even when the game size is large.

The training apparatus wherein each stage of the plurality of stages is a time stage. The training apparatus above, wherein the aggregated loss is computed over a plurality of stages. This allows increased accuracy of the optimal solution to the multi-player problem.

The training apparatus above, wherein the aggregated loss is computed over a single stage. This allows the processing and time needed to solve the problem to be reduced.

The training apparatus as described above, wherein the aggregated loss may be an average of aggregated losses. This increases the accuracy of the computed aggregated losses.

The training apparatus as above, wherein the previous iteration in step (iii) is the iteration immediately preceding the present step (i). This allows the algorithm to learn in an incremental manner with a series of small changes and thus allows for fine tuning of the strategy set.

The training apparatus above, wherein the reference strategy may include a subset of all possible pure strategies. This reduces the computation time as only a subset of strategies may be considered.

The training apparatus may use a pure strategy where a pure strategy is a strategy that can be adopted independent of a problem state. This means the player/user can begin to solve the multi-player problem with a strategy that can always be played.

The training apparatus as described above, wherein the reference strategy is further altered based on the best response calculated for each stage. This further optimises the reference strategy incrementally allowing the rewards to be maximised for further stages.

In the training apparatus as above, the multi-player problem may be performed during the operation of a control system for an industrial apparatus. This provides the advantage that the multi-player problem may be a complex industrial application where, for example, players could be likened to nodes of a system. This therefore allows complex systems to be solved so as to optimise their running and provide a cost-effective service.

In the above training apparatus, a loss may be computed according to a loss function and the one or more processors are configured to adversarially alter the loss function between at least some of the iterations of step (ii). This makes it easier to solve for the NE by optimising the strategies of all players. In another aspect, the processors may be configured to alter the reference strategy in response to the altered loss function. This allows the player to always have an optimised strategy available for continued play and allows for faster solving.

In the training apparatus described above, the one or more processors may be further configured to calculate a best response strategy that can be performed in each stage, by computing a local minimum of the computed losses attributable to the players. This makes it easier to solve for the NE by optimising the strategies of all players.

In another aspect of this disclosure, there is provided a decision-making apparatus for solving a multi-player problem, the apparatus comprising one or more processors configured to, for each of a plurality of players, implement steps (i), (ii) and (iii) and adopt as a solution to the problem a solution determined according to the respective current strategies after multiple iterations of step (ii). This allows a multi-player problem to be solved in an optimised manner to achieve an average payoff that is as good as the best response in hindsight. It further guarantees the average performance for the player while optimizing the strategy set even when the game size is large.

In the decision-making apparatus the multi-player problem may be a two-player zero-sum game. This ensures that the solution will converge to the NE after a series of iterations and thus make it easier to optimally solve.

BRIEF DESCRIPTION OF THE FIGURES

The present invention will now be described by way of example with reference to the accompanying drawings. In the drawings:

Figure 1 shows an example of a traditional solution using a Double Oracle algorithm.

Figure 2a illustrates the rewards and losses for each player in a game of rock, paper, scissors based on their chosen strategy using a Double Oracle algorithm.

Figure 2b illustrates a method of playing rock, paper, scissors based on an initial strategy using a Double Oracle algorithm.

Figure 3 illustrates an Online Markov Decision Process (MDP) with a changing reward/loss function.

Figure 4 illustrates the problem solver of the present disclosure applied to a robot laser tag environment to solve for the optimal paths moving from point s to each of points g.

Figure 5a illustrates a comparison of the convergence of known solvers and the Online Double Oracle approach of the present disclosure for the game of Leduc poker.

Figure 5b illustrates a comparison of the expected payoff after a number of iterations for a PSRO approach and an Online Single Oracle approach for the game of Leduc poker.

Figure 6a illustrates a comparison of the convergence of known solvers and the Online Double Oracle approach of the present disclosure for the game of Kuhn poker.

Figure 6b illustrates a comparison of the expected payoff after a number of iterations for a PSRO approach and an Online Single Oracle approach for the game of Kuhn poker.

DETAILED DESCRIPTION OF THE INVENTION

Embodiments of the training apparatus to be described below are intended to solve a multi-player problem in an optimised manner to achieve an average payoff that is as good as the best response in hindsight. A further goal of this disclosure is to propose a learning algorithm, employed by a training apparatus, that can guarantee the average performance while optimizing the strategy set.

This disclosure introduces a training apparatus for forming an algorithm capable of achieving high average performance in games of large dimension. Traditional linear programming and online learning methods are infeasible for these games, and thus scalable methods to solve them are desirable.

In this disclosure there is described an apparatus capable of performing an online learning method that can work efficiently in large and even infinite action spaces. Specifically, the training apparatus of this disclosure includes processors configured to execute an algorithm containing a number of steps. The steps performed by this apparatus differ from those known in a number of ways: firstly, the double oracle method is used to select the effective strategy set (i.e., the effective strategy set is much smaller than the pure strategy set, where a pure strategy is a strategy that can be adopted independent of a problem state), and secondly, the algorithm guarantees the average performance by maintaining the no-regret property. The training apparatus of the present invention introduces at least three new concepts. The first of these concepts is that the apparatus uses an Online Single Oracle that maintains a no-regret performance while optimizing the number of experts used in large games. Compared to DO methods this provides the advantage that the assumption of coordination between players can be relaxed and a regret bound can be provided. In other words, the players need not cooperate with each other. In addition, compared to traditional no-regret algorithms, the present disclosure can be efficiently applied to large games.

Secondly, when learning to co-adapt to each type of opponent, the method employed by the training apparatus of this disclosure, e.g. the Online Double Oracle, is a state-of-the-art solver for two-player zero-sum games.

Thirdly, the method employed by the training apparatus of this disclosure can efficiently learn in Online Markov Decision Processes using a loss function that changes adversarially. In other words, the apparatus can perform its function even when there is an adversarial dynamic between the players. A loss may be computed according to a loss function, and the one or more processors of the apparatus disclosed herein may be configured to adversarially alter the loss function between at least some of the iterations used to update the current strategy, or within each stage/time window discussed below.

The method actioned by the algorithm of the training apparatus of this disclosure may be based on the following Assumption 1, namely that the game has a Nash equilibrium whose support is small relative to the full set of pure strategies. This assumption also applies to DO and PSRO methods.

Assumption 1 holds in many settings. In symmetric games with random entries, it has been shown that the expected support size of the NE will be (1/2 + o(1))n, where n is the game size; this means the support size of a NE strategy is only about half the game size. In asymmetric games (e.g., where n ≫ m), we provide the following lemma under which Assumption 1 also holds. In Assumption 1, a smaller game within the larger game is considered, and thus the number of initial pure strategies is smaller than the total number of pure strategies that are available in the full game. As such, the support size of the NEs is reduced in comparison to the support size of the complete game. The support of a strategy is the set of pure strategies used to construct that strategy. In the above assumption, the n × m matrix describes a subset of strategies of the complete game.

Given the above assumption, the training apparatus of this disclosure may form an algorithm that includes a no-regret property and a starting strategy set that is a small effective strategy set. In this case, a small effective strategy set is a number of strategies that form a subset of all the possible pure strategies.

The algorithm used by the training apparatus is Algorithm 3, an Online Single Oracle (OSO). This algorithm can be thought of as an online counterpart of the Single Oracle in DO, but it achieves a no-regret property, which is not possible in DO algorithms. A key advantage of the Online Single Oracle algorithm is that the regret bound does not depend on the size of a player's total strategy set. Instead, the regret bound depends on the size of the effective strategy set. The effective strategy set is linearly dependent on the size of the support of the NE.

In contrast to classical no-regret algorithms such as MWU, where the whole set of pure strategies must be considered at each iteration (Equation 5), the Online Single Oracle of the current disclosure only considers a subset of the whole strategy set. The key operation in the algorithm is that, at each iteration t, the OSO only considers adding a new strategy to the starting pure strategy set if it is the best response to the average loss in a time window (defined later). As such, the OSO algorithm has the advantageous effects that exploration costs can be saved and computation time can be reduced. This is achieved by ignoring the pure strategies that have not been the best response to any of the average losses L observed so far and instead focusing on the strategies that have been effective. This is advantageous for the training apparatus and assists in the learning process, especially when finding the optimal solution for extremely large games.

The pseudocode of the OSO is given as Algorithm 3. The OSO algorithm may be initialised with a random subset Π_0 of strategies from the original complete strategy set Π. In some cases, the algorithm may start from only one pure strategy, as in line 2 of Algorithm 3. The strategy subset Π_t is the effective strategy set at time t, and a period of consecutive iterations may be defined as a stage or time window T. In this time window T the set of effective strategies may remain fixed for each iteration. At iteration t, the algorithm updates π_t at line 5, where only the effective strategy set Π_t, rather than the complete set Π, is considered; and the best response is computed against the average loss L within the current stage (time window T) (line 9). In line 7 of the algorithm, a new best response that is not in the existing effective strategy set is added to the strategy set of that iteration. Despite the design of effective strategy sets, the exact best response of the OSO is computed with respect to the whole strategy set Π. However, this can be relaxed by approximating the best responses. Lines 4 to 9 of Algorithm 3 describe how the size of the expert set in online learning is reduced by using a best response oracle. This is efficient in large games compared to the MWU discussed above.
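For illustration, the following is a minimal Python sketch of these OSO dynamics for a row player with loss matrix A; the uniform re-initialisation at the start of each time window, the fixed MWU step size and the exact argmin best-response oracle are simplifying assumptions.

```python
import numpy as np


class OnlineSingleOracle:
    """Schematic Online Single Oracle for a row player with losses A[i, j] in [0, 1]."""

    def __init__(self, A, mu=0.1):
        self.A = np.asarray(A, dtype=float)
        self.mu = mu
        self.effective = [0]                 # effective strategy set: start from one pure strategy
        self.pi = np.ones(1)                 # MWU weights over the effective set only
        self.window_losses = []              # loss vectors observed in the current time window

    def act(self):
        """Current mixed strategy over all pure strategies (zero outside the effective set)."""
        full = np.zeros(self.A.shape[0])
        full[self.effective] = self.pi
        return full

    def observe(self, c):
        """Update after the opponent plays the mixed strategy c."""
        loss = self.A @ c                    # loss of every pure strategy this round
        self.window_losses.append(loss)
        w = self.pi * np.exp(-self.mu * loss[self.effective])
        self.pi = w / w.sum()                # MWU step restricted to the effective set
        avg = np.mean(self.window_losses, axis=0)
        best = int(np.argmin(avg))           # best response to the average loss in this window
        if best not in self.effective:       # new strategy found: expand the set, open a new window
            self.effective.append(best)
            self.pi = np.ones(len(self.effective)) / len(self.effective)
            self.window_losses = []
```

The point of the construction is that the regret analysis only involves how large the effective set eventually becomes, not the number of rows of A.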

The regret bound of the OSO will now be described further. In the regret bound of the algorithm, the variables l_1, l_2, ..., l_t are a sequence of loss vectors played by an adversary, and ⟨·, ·⟩ is the dot product. The OSO is a no-regret algorithm whose regret bound depends on k, the size of the effective strategy set in the final time window, rather than on the size of the full strategy set. This demonstrates that the time complexity of the OSO is of the same order as that of the known algorithms; however, there is the added benefit that the OSO possesses the no-regret characteristic. Time complexity in this regard is the time that may be taken to compute a solution.

Another aspect of this disclosure is a two-player zero-sum game solver in the form of the Online Double Oracle (ODO). The Online Double Oracle is obtained when both players in the game employ the Online Single Oracle algorithm. In this case, since both players employ a no-regret algorithm (not necessarily the same one), the average strategies of both players will eventually converge to the NE when the game played is a two-player zero-sum game. Compared to traditional Double Oracle algorithms, the ODO does not need to compute the Nash equilibrium in each sub-game.

In addition, ODO produces rational players since both players can play optimally to exploit the adversary and achieve the no-regret property. This leads to reduced losses for each player while the algorithm learns the optimal strategy.
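Continuing the sketch above, Online Double Oracle then amounts to letting two such learners play against each other; the Rock-Paper-Scissors losses, the [0, 1] rescaling and the number of rounds below are arbitrary choices made for illustration.

```python
import numpy as np

# Row player's loss matrix for Rock-Paper-Scissors, rescaled to [0, 1]:
# loss = (1 - payoff) / 2, so a win costs 0, a draw 0.5 and a defeat 1.
payoff = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]], dtype=float)
A_row = (1.0 - payoff) / 2.0
A_col = 1.0 - A_row.T                      # the column player's loss matrix in the same game

row, col = OnlineSingleOracle(A_row), OnlineSingleOracle(A_col)
avg_row = np.zeros(3)
for _ in range(2000):
    x, y = row.act(), col.act()
    row.observe(y)
    col.observe(x)
    avg_row += x
print(avg_row / 2000)                      # the time-averaged strategy drifts towards the NE (1/3, 1/3, 1/3)
```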

If both players employ the Online Single Oracle (OSO) algorithm, then the average strategies of both players will converge to the NE with the following rate:

Where k1 and k2 are the sizes of the effective strategy sets of players one and two respectively, and E_T is a turning point on the graph of the optimal strategy for both players. The turning point when both players are playing optimally, as discussed above, represents a saddle point of the convex/concave function. The one or more processors of the apparatus may be configured to calculate a best response strategy that can be performed in each stage by computing a local minimum of the computed losses attributable to the players. This local minimum may be the saddle point discussed above.

In a situation where both players follow OSO algorithms with a less frequent best response (e.g. a response which is sub-optimal), the convergence rate to the Nash equilibrium is given by a similar expression.

The training apparatus can be used to form the above algorithm to solve a multi-player problem. In the apparatus, one or more processors may be configured to, in general terms, receive an initial respective strategy for each player. The effective strategy set in Algorithm 3 is an example of the reference strategy set for each player. The reference strategy may include a subset of all possible pure strategies. A pure strategy may be a strategy that can be adopted independent of a problem state; for example, a pure strategy may be a strategy that is always playable. Each player adopts a respective strategy as the initial/current strategy. The current strategy set is then refined in the algorithm by performing a number of steps. As seen in line 9 of Algorithm 3, the average loss for each player is computed during a first time window containing at least one iteration.

The average loss may be the negative consequence, or cost to a player, of the player employing the current strategy during a stage T. A stage may be a time stage; this may also be called a time window T. The loss may be found by comparing the best response to the result achieved by employing the current strategy. In addition, the loss may be computed over a plurality of stages or over a single stage.

Once the loss has been computed for each player, the aggregated loss over all players may be computed. This aggregated loss may be described by the loss vectors of each player in an adversarial situation, and may be used to find the no-regret bound of the algorithm. This is described in lines 4 to 9 of Algorithm 3. The aggregated loss may be computed from an average of aggregated losses of each player, or of a single player, over multiple iterations. A loss may be computed according to the loss function, and the one or more processors may be configured to adversarially alter the loss function between at least some of the iterations of the stage, altering the current strategy. The processors may also be configured to alter the reference strategy in response to the altered loss function.

Using the aggregated loss of all players, the set of reference strategies may then be updated. This updating may involve including an additional strategy as part of the reference strategy set in order to provide an optimal strategy for responding to the current situation in the current iteration. In other words, the reference strategy, or reference strategy set if more than one strategy is present, may be altered in response to the deviation between the current player's strategy during the present iteration of refining the current strategy, as discussed above, and the previous iteration in which the current strategy was refined. The reference strategy may also be altered based on the best response calculated for each stage/time window T.

The Online Oracle methods described herein can be extended to Online Markov Decision Processes to achieve no-regret performance against an adversarially changing loss function (and thus a changing reward function). This can be seen in figure 3. Markov Decision Processes (MDPs) are discrete-time stochastic control processes. In other words, they are a mathematical framework for modelling decision making in situations where an environment changes randomly in response to the actions/decisions/strategy performed by the player. Markov decision process problems (MDPs) assume a finite number of states and actions. At each time step the agent observes a state and executes an action, which incurs intermediate costs to be minimized. This is best demonstrated by figure 3, in which an action A_t of the learner/agent/player influences the environment state X_t and the reward function R_t. The changes made by the action change the state of the environment and thus the reward/loss function for future iterations by the learner/agent/player.

A further example of an Online Markov Decision Process with a changing reward/loss function can be seen in figure 4. Figure 4 uses the example of a robot laser tag environment. In part A of figure 4, the blue lines show the optimal trajectories for a robot travelling from a start location (s) to one of three goal locations (g). The opponent can put a sensor in one of 4 locations (x), facing one of 8 directions. The widths of the blue line trajectories correspond to the probability that the robot takes a given path. In parts B, C, D and E of figure 4, the optimal opponent strategy may be randomised among the possible placements of the sensors, each placement of the sensor producing one of the four fields of view. The algorithm herein solves for the best paths for the robot to take to minimise the loss (e.g. the chance of being seen when moving from s to g).

Ideally the cost of an action to the player in MDPs should be minimised (e.g. an incorrect move played by a player leads to an adverse or undesirable outcome), while the reward to a player should be maximised.

OMDPs are further considered, where at each round t ∈ N an adversary can choose the loss function l_t based on the player's/agent's history {π_1, π_2, ..., π_{t-1}}. Considered herein are OMDPs with a finite state space S, a finite action set A at each state, and a fixed transition model P. The player's starting state, x_1, is distributed according to some distribution p_0 over the finite state space S. At time t, and given state x_t ∈ S, the player chooses an action a_t ∈ A. Then the player moves to a new random state x_{t+1}, which is determined by the fixed transition model P(x_{t+1} | x_t, a_t). Simultaneously, the player receives an intermediate loss l_t(x_t, a_t), in which the loss function l_t : S × A → R is bounded in [0, 1]^{|A| × |S|} and chosen by the adversary from a simplex, where {l_1, l_2, ..., l_T} are the loss vectors of the adversary. A zero-sum game setting is assumed, in which the adversary receives the loss of -l_t(x_t, a_t) at round t, and the popular full-information feedback setting is considered. This means that the player/agent can observe the loss function l_t after each round t.

The goal of the player is therefore to have minimum regret with respect to the best fixed policy in hindsight:
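Using the notation above, with (x_t, a_t) the state-action pairs generated by the agent's policies π_1, ..., π_T and (x_t^π, a_t^π) those generated by a fixed policy π, one common way to write this regret is:

```latex
R_T \;=\; \mathbb{E}\left[\sum_{t=1}^{T} l_t\!\left(x_t, a_t\right)\right]
\;-\; \min_{\pi}\; \mathbb{E}\left[\sum_{t=1}^{T} l_t\!\left(x_t^{\pi}, a_t^{\pi}\right)\right].
```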

Where l_t denotes the loss function at time t while the agent follows π_1, ..., π_t, and the second term is the adaptive loss incurred by the best fixed policy π of the agent.

A further aspect of this disclosure is the MDP-Online Oracle Expert (MDP-OOE) algorithm. MDP-OOE maintains a set of effective strategies A_t^s in each state. In each iteration, the best response with respect to the average loss function is calculated. If all the actions in the best response are included in the current effective strategy set A_t^s for each state, then the algorithm continues with the current set A_t^s in each state. Otherwise, the algorithm updates the set of effective strategies in steps 8 and 9 of Algorithm 2 below. The period of consecutive iterations in which the set of effective strategies A_t^s stays fixed may be thought of as one stage or time window T_j. Intuitively, since both the agent and the adversary use a no-regret algorithm to play, the average strategy of both players will converge to the NE of the game. Under the small NE support size assumption, the size of the effective strategy set for the agent is also small compared to the whole pure strategy set; the algorithm ignores the pure strategies with poor average performance and only considers the strategies with high average performance.

The performance of the Online Oracle method can be guaranteed in the following theorem:

When the player uses Algorithm 2 in the online MDP setting, the regret of the system can be bounded by the above relationship.

The Online Oracle algorithm maintains a set of effective strategies A_t^s in each state; this is seen in line 1 of Algorithm 2. In each iteration, the best response with respect to the average loss function is calculated. If all the actions in the best response are included in the current effective strategy set A_t^s for each state, then the algorithm continues with the current set A_t^s in each state. Otherwise, the algorithm updates the set of effective strategies in steps 14 and 15 of Algorithm 2. Again, the period of consecutive iterations in which the set of effective strategies A_t^s stays fixed may be thought of as one stage or time window T_j.

Intuitively, since both the agent and the adversary use a no-regret algorithm to play, the average strategy of both players will converge to the NE of the game. Under the small NE support size assumption discussed above, the size of the effective strategy set for the agent is also small compared to the whole pure strategy set. The Online Oracle algorithm ignores the pure strategies with poor average performance and only considers those with high average performance.
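A minimal Python sketch of just this per-state bookkeeping is given below; the dictionary representation, the average-loss tables and the decision to reset the window whenever any state gains an action are assumptions made for illustration, and the MWU-style policy update that runs inside each window is omitted.

```python
import numpy as np


def update_effective_sets(effective, avg_loss):
    """One bookkeeping step of MDP-OOE style effective action sets.

    effective : dict mapping state -> list of action indices currently kept
    avg_loss  : dict mapping state -> np.ndarray of average losses per action,
                accumulated over the current time window
    Returns the (possibly expanded) sets and a flag indicating whether a new
    time window should be started.
    """
    new_window = False
    for state, losses in avg_loss.items():
        best = int(np.argmin(losses))            # best-response action at this state
        if best not in effective[state]:         # the best response uses a new action
            effective[state].append(best)
            new_window = True                    # any expansion opens a new time window
    return effective, new_window
```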

Figures 5 and 6 show the performance of the Online Double Oracle algorithm of this disclosure compared to state-of-the-art algorithms in solving two-player zero-sum normal-form games.

Figures 5a to 6b demonstrate experimental results comparing known solvers to the Online Double and Single Oracle approaches disclosed herein. As can be seen in figure 5a, the algorithms are applied to Leduc poker, which has a known number of 3^936 pure strategies. These figures show that the algorithm of this disclosure converges to the NE faster than the solvers of the state of the art (PSRO, Rectified PSRO). The same can be said for Kuhn poker, shown in figures 6a and 6b, which has a known number of 2^12 pure strategies.

Figures 5b and 6b demonstrate the performance comparison in terms of the average payoff when an imperfect opponent is exploited. An imperfect opponent is an opponent who does not play the Nash equilibrium. These figures use the Online Single Oracle as the example and show the reduced number of iterations needed to achieve an optimal strategy.

The table below summarises some known game solvers compared to the solver of the present disclosure (Online Double Oracle).

Properties of existing solvers are shown for two-player zero-sum games with an n × m payoff matrix A. In the worst case, Double Oracle has to solve all the sub-games until reaching the full-game solution. As such, the complexity is one order of magnitude larger than solving using a Linear Program. Since PSRO uses an approximate best response, the total time complexity is unknown, as seen in the table. It is noted that the regret bound of the ODO of the present disclosure cannot be directly compared with the time complexity of DO.

The above algorithms may be implemented on a training apparatus that may comprise one or more processors. The one or more, or at least one, processors may be configured to execute any of the algorithms discussed herein in order to solve a multi-player problem. The multi-player problem may be the control of an industrial apparatus. A non-limiting example of an industrial apparatus may be a traffic light system for a city, or a water management system for redirecting the flow of water.

The present disclosure also relates to a decision-making apparatus for solving the above-mentioned multi-player problem using the above-mentioned algorithms. The decision-making apparatus may comprise one or more processors configured as described above to, for each of a plurality of players, implement the steps discussed above in relation to Algorithm 3. These are: receiving an initial respective strategy for each player. The effective strategy set in Algorithm 3 is an example of the reference strategy set for each player. The reference strategy may include a subset of all possible pure strategies. A pure strategy may be a strategy that can be adopted independent of a problem state; for example, a pure strategy may be a strategy that is always playable. Each player adopts a respective strategy as the initial/current strategy. The current strategy set is then refined in the algorithm by performing a number of steps. As seen in line 9 of Algorithm 3, the average loss for each player is computed during a first time window containing at least one iteration.

The average loss may be the negative consequence, or cost to a player, of the player employing the current strategy during a stage T. A stage may be a time stage; this may also be called a time window T. The loss may be found by comparing the best response to the result achieved by employing the current strategy. In addition, the loss may be computed over a plurality of stages or over a single stage.

Once the loss has been computed for each player, the aggregated loss over all players may be computed. This aggregated loss may be described by the loss vectors of each player in an adversarial situation, and may be used to find the no-regret bound of the algorithm. This is described in lines 4 to 9 of Algorithm 3. The aggregated loss may be computed from an average of aggregated losses of each player, or of a single player, over multiple iterations. A loss may be computed according to the loss function, and the one or more processors may be configured to adversarially alter the loss function between at least some of the iterations of the stage, altering the current strategy. The processors may also be configured to alter the reference strategy in response to the altered loss function.

Using the aggregated loss of all players, the set of reference strategies may then be updated. This updating may involve including an additional strategy as part of the reference strategy set in order to provide an optimal strategy for responding to the current situation in the current iteration. In other words, the reference strategy, or reference strategy set if more than one strategy is present, may be altered in response to the deviation between the current player's strategy during the present iteration of refining the current strategy, as discussed above, and the previous iteration in which the current strategy was refined. The reference strategy may also be altered based on the best response calculated for each stage/time window T_j.

The decision-making apparatus may further adopt, as a solution to the multi-player problem, a solution determined according to the respective current strategies after the multiple iterations of the previous stage or the current stage.

In summary, the present disclosure provides a scalable learning method and apparatus for achieving high average performance in games with large dimensions.

The disclosure further provides an Online Single Oracle algorithm that combines online learning with an oracle framework to reduce the size of the expert set and efficiently learn in large games. The disclosure also provides a state-of-the-art solver in the form of the Online Double Oracle algorithm to be used in large two-player zero-sum games, and an MDP-Online Oracle Expert algorithm, an extension of the Online Single Oracle, that can efficiently learn in Online Markov Decision Processes with a non-oblivious adversary.

The applicant hereby discloses in isolation each individual feature described herein and any combination of two or more such features, to the extent that such features or combinations are capable of being carried out based on the present specification as a whole in the light of the common general knowledge of a person skilled in the art, irrespective of whether such features or combinations of features solve any problems disclosed herein, and without limitation to the scope of the claims. The applicant indicates that aspects of the present invention may consist of any such individual feature or combination of features. In view of the foregoing description, it will be evident to a person skilled in the art that various modifications may be made within the scope of the invention.