

Title:
REWARD ESTIMATION FOR A TARGET POLICY
Document Type and Number:
WIPO Patent Application WO/2022/199792
Kind Code:
A1
Abstract:
A computer implemented method (100) is disclosed for improving the accuracy of a reward estimator for a target policy, wherein the target policy is for managing a communication network environment that is operable to perform a task. The method comprises obtaining a training dataset comprising records of task performance by the environment during a period of management according to a reference policy (110), and generating, based on the training dataset, a propensity model that estimates the probability of selection by the reference policy of a particular action given a particular observed context (120). The method further comprises initiating the reward estimator (130), wherein the reward estimator comprises a Machine Learning model operable to estimate reward value given a particular observed context and selected action, and setting a value of a propensity impact parameter according to a feature of at least the training dataset or the reference policy (140). The method further comprises using the records of task performance in the training dataset to update the values of the reward estimator parameters so as to minimize a loss function (150) based on differences between observed reward from the training dataset and reward estimated by the reward estimator for given pairs of observed context and action selected by the reference policy (150a), and adjusting a magnitude of the weighting of each difference according to the impact parameter.

Inventors:
VANNELLA FILIPPO (SE)
JEONG JAESEONG (SE)
Application Number:
PCT/EP2021/057321
Publication Date:
September 29, 2022
Filing Date:
March 22, 2021
Assignee:
ERICSSON TELEFON AB L M (SE)
International Classes:
G06N3/00; G06N3/08; G06N20/00; H04W24/02; H04W16/20
Foreign References:
US20190258938A1 (2019-08-22)
Other References:
HEUNCHUL LEE ET AL: "Deep reinforcement learning approach to MIMO precoding problem: Optimality and Robustness", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 30 June 2020 (2020-06-30), XP081710685
FILIPPO VANNELLA ET AL: "Off-policy Learning for Remote Electrical Tilt Optimization", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 21 May 2020 (2020-05-21), XP081676210
BIETTI, A.AGARWAL, A.LANGFORD, J.: "Practical Evaluation and Optimization of Contextual Bandit Algorithms", ARXIV, ABS/1802.04064, 2018
DUDIK, MIROSLAVLANGFORD, JOHNLI, LIHONG: "Doubly robust policy evaluation and learning", 2011, ICML
Attorney, Agent or Firm:
ERICSSON (SE)
Claims:
CLAIMS

1. A computer implemented method (100) for improving the accuracy of a reward estimator for a target policy, wherein the target policy is for managing a communication network environment that is operable to perform a task, the method, performed by a training node, comprising: obtaining a training dataset comprising records of task performance by the environment during a period of management according to a reference policy (110), wherein each record of task performance comprises an observed context for the environment, an action selected for execution in the environment by the reference policy on the basis of the observed context, and a reward value indicating an observed impact of the selected action on task performance by the environment (110a); generating, based on the training dataset, a propensity model that estimates the probability of selection by the reference policy of a particular action given a particular observed context (120); initiating the reward estimator, wherein the reward estimator comprises a Machine Learning, ML, model having a plurality of parameters, and wherein the reward estimator is operable to estimate reward value given a particular observed context and selected action (130); setting a value of a propensity impact parameter according to a feature of at least the training dataset or the reference policy (140); and using the records of task performance in the training dataset to update the values of the reward estimator parameters so as to minimize a loss function (150); wherein the loss function is based on differences between observed reward from the training dataset and reward estimated by the reward estimator for given pairs of observed context and action selected by the reference policy, each difference weighted according to a function of the output of the propensity model for the given pair of observed context and action selected by the reference policy, such that as the estimated probability output by the propensity model decreases, the contribution to the loss function of the difference between observed and estimated reward increases (150a); and wherein using the records of task performance in the training dataset to update the values of the ML model parameters so as to minimize the loss function comprises: adjusting a magnitude of the weighting of each difference according to the impact parameter (150).

2. A method as claimed in claim 1, wherein setting a value of an impact parameter comprises setting the value such that the magnitude of the weighting of each difference by the output of the propensity model increases with at least one of (340a): decreasing noise in the training dataset; decreasing balance in the action distribution of the reference policy.

3. A method as claimed in claim 2, wherein the impact parameter has a value between zero and one; and wherein adjusting a magnitude of the weighting of each difference according to the impact parameter, comprises (350a): for a given pair of observed context and action selected by the reference policy, raising the output of the propensity model to the power of the impact parameter; and dividing a function of the difference between observed and estimated reward for the given pair of observed context and action selected by the reference policy by the output of the propensity model raised to the power of the impact parameter.

4. A method as claimed in any one of claims 1 to 3, wherein using the records of task performance in the training dataset to update the values of the reward estimator parameters so as to minimize a loss function comprises: inputting observed context and selected action pairs from the training dataset to the reward estimator, wherein the reward estimator processes the observed context and selected action pairs in accordance with current values of parameters of the reward estimator and outputs an estimated reward value (350b); and updating the values of the reward estimator parameters so as to minimize the loss function (350c).

5. A method as claimed in claim 4, wherein updating the values of the reward estimator parameters so as to minimize the loss function comprises updating the values of the reward estimator parameters using backpropagation (350c).

6. A method as claimed in any one of claims 1 to 5, further comprising: repeating, until a convergence condition is satisfied, the steps of: sampling a batch of records from the training dataset (335); setting a value of the propensity impact parameter according to a feature of at least the training dataset or the initiated reward estimator (340); and using the records in the training dataset to update the values of the reward estimator parameters so as to minimize the loss function (350).

7. A method as claimed in claim 6, wherein, for each sampled batch of the training dataset, setting a value of the propensity impact parameter according to a feature of at least the training dataset or the initiated reward estimator comprises setting a value of the propensity impact parameter according to a feature of at least the sampled batch of the training dataset (340).

8. A method as claimed in claim 6 or 7, wherein, for each sampled batch of the training dataset, using the records of task performance in the training dataset to update the values of the reward estimator parameters so as to minimize a loss function comprises using records of task performance in the sampled batch of the training dataset (350).

9. A method as claimed in any one of claims 1 to 8, further comprising: updating the value of the impact parameter according to performance of the target policy (390).

10. A method as claimed in any one of claims 1 to 9, wherein the environment comprises at least one of (410a) a cell of a communication network, a cell sector of a communication network, at least a part of a core network of a communication network, or a slice of a communication network, and wherein the task that the environment is operable to perform comprises provision of communication network services.

11. A method as claimed in any one of claims 1 to 10, wherein an observed environment context in the training dataset comprises at least one of: a value of a network coverage parameter (401a); a value of a network capacity parameter (401b); a value of a network congestion parameter (401c); a value of a network quality parameter; a current network resource allocation (401d); a current network resource configuration (401e); a current network usage parameter (401f); a current network parameter of a neighbour communication network cell (401g); a value of a network signal quality parameter (401h); a value of a network signal interference parameter (401i); a value of a Reference Signal Received Power, RSRP, parameter; a value of a Reference Signal Received Quality, RSRQ, parameter; a value of a network signal to interference plus noise ratio, SINR, parameter; a value of a network power parameter (401j); a current network frequency band (401k); a current network antenna down-tilt angle (401l); a current network antenna vertical beamwidth (401m); a current network antenna horizontal beamwidth (401n); a current network antenna height (401o); a current network geolocation (401p); a current network inter-site distance (401q).

12. A method as claimed in any one of claims 1 to 11, wherein the reward value indicating an observed impact of the selected action on task performance by the environment comprises a function of at least one performance parameter for the communication network (430a).

13. A method as claimed in any one of claims 1 to 12, wherein an action for execution in the environment comprises at least one of (420a): an allocation decision for a communication network resource; a configuration for a communication network node; a configuration for communication network equipment; a configuration for a communication network operation; a decision relating to provision of communication network services for a wireless device; a configuration for an operation performed by a wireless device in relation to the communication network.

14. A method as claimed in any one of claims 1 to 13, wherein the environment comprises a sector of a cell of a communication network and wherein the task that the environment is operable to perform comprises provision of radio access network services; wherein an observed environment context in the training dataset comprises at least one of: a coverage parameter for the sector; a capacity parameter for the sector; a signal quality parameter for the sector; a down tilt angle of the antenna serving the sector; and wherein an action for execution in the environment comprises a down tilt adjustment value for an antenna serving the sector.

15. A computer implemented method (200) for using a target policy to manage a communication network environment that is operable to perform a task, the method, performed by a management node, comprising: obtaining a reward estimator from a training node, wherein the reward estimator comprises a Machine Learning, ML, model that is operable to estimate a reward value given a particular observed context and selected action, and has been trained using a method according to any one of claims 1 to 14 (210); receiving an observed environment context from a communication network node (220); using the target policy to select, based on the received observed context and from a set of possible actions for the environment, an action for execution in the environment (230); and causing the selected action to be executed in the environment (240); wherein the target policy evaluates possible actions for the environment using the obtained reward estimator (230a).

16. A method as claimed in claim 15, wherein using the target policy to select, based on the received observed context and from a set of possible actions for the environment, an action for execution in the environment comprises: evaluating possible actions for the environment using the obtained reward estimator (530a); and using a selection function to select an action for execution in the environment based on the evaluation (530b).

17. A method as claimed in claim 15 or 16, wherein using the target policy to select, based on the received observed context and from a set of possible actions for the environment, an action for execution in the environment comprises: using the reward estimator to estimate a reward from taking each possible action given the received context (530aa); and selecting for execution in the environment the action having the highest estimated reward (530bb).

18. A method as claimed in claim 15 or 16, wherein using the target policy to select, based on the received observed context and from a set of possible actions for the environment, an action for execution in the environment comprises: using a prediction function to predict the probability that each of the possible actions will result in the greatest reward (530aa); and selecting for execution in the environment the action having the highest probability (530bb).

19. A method as claimed in any one of claims 15 to 18, wherein the environment comprises at least one of (620a) a cell of a communication network, a cell sector of a communication network, at least a part of a core network of a communication network, or a slice of a communication network, and wherein the task that the environment is operable to perform comprises provision of communication network services.

20. A method as claimed in any one of claims 15 to 19, wherein the observed environment context received from the communication network node comprises at least one of: a value of a network coverage parameter (401a); a value of a network capacity parameter (401b); a value of a network congestion parameter (401c); a value of a network quality parameter; a current network resource allocation (401d); a current network resource configuration (401e); a current network usage parameter (401f); a current network parameter of a neighbour communication network cell (401g); a value of a network signal quality parameter (401h); a value of a network signal interference parameter (401i); a value of a Reference Signal Received Power, RSRP, parameter; a value of a Reference Signal Received Quality, RSRQ, parameter; a value of a network signal to interference plus noise ratio, SINR, parameter; a value of a network power parameter (401j); a current network frequency band (401k); a current network antenna down-tilt angle (401l); a current network antenna vertical beamwidth (401m); a current network antenna horizontal beamwidth (401n); a current network antenna height (401o); a current network geolocation (401p); a current network inter-site distance (401q).

21. A method as claimed in any one of claims 15 to 20, wherein the reward value comprises a function of at least one performance parameter for the communication network (610a).

22. A method as claimed in any one of claims 15 to 21, wherein an action for execution in the environment comprises at least one of (630a): an allocation decision for a communication network resource; a configuration for a communication network node; a configuration for communication network equipment; a configuration for a communication network operation; a decision relating to provision of communication network services for a wireless device; a configuration for an operation performed by a wireless device in relation to the communication network.

23. A method as claimed in any one of claims 15 to 22, wherein the environment comprises a sector of a cell of a communication network and wherein the task that the environment is operable to perform comprises provision of radio access network services; wherein the observed environment context received from the communication network node comprises at least one of: a coverage parameter for the sector; a capacity parameter for the sector; a signal quality parameter for the sector; a down tilt angle of the antenna serving the sector; and wherein an action for execution in the environment comprises a down tilt adjustment value for an antenna serving the sector.

24. A computer program product comprising a computer readable medium, the computer readable medium having computer readable code embodied therein, the computer readable code being configured such that, on execution by a suitable computer or processor, the computer or processor is caused to perform a method as claimed in any one of claims 1 to 23.

25. A training node (900) for improving the accuracy of a reward estimator for a target policy, wherein the target policy is for managing a communication network environment that is operable to perform a task, the training node comprising processing circuitry (902) configured to cause the training node to: obtain a training dataset comprising records of task performance by the environment during a period of management according to a reference policy, wherein each record of task performance comprises an observed context for the environment, an action selected for execution in the environment by the reference policy on the basis of the observed context, and a reward value indicating an observed impact of the selected action on task performance by the environment; generate, based on the training dataset, a propensity model that estimates the probability of selection by the reference policy of a particular action given a particular observed context; initiate the reward estimator, wherein the reward estimator comprises a Machine Learning, ML, model having a plurality of parameters, and wherein the reward estimator is operable to estimate reward value given a particular observed context and selected action; set a value of a propensity impact parameter according to a feature of at least the training dataset or the reference policy; and use the records of task performance in the training dataset to update the values of the reward estimator parameters so as to minimize a loss function; wherein the loss function is based on differences between observed reward from the training dataset and reward estimated by the reward estimator for given pairs of observed context and action selected by the reference policy, each difference weighted according to a function of the output of the propensity model for the given pair of observed context and action selected by the reference policy, such that as the estimated probability output by the propensity model decreases, the contribution to the loss function of the difference between observed and estimated reward increases; and wherein using the records of task performance in the training dataset to update the values of the ML model parameters so as to minimize the loss function comprises: adjusting a magnitude of the weighting of each difference according to the impact parameter.

26. A training node as claimed in claim 25, wherein the processing circuitry (902) is further configured to cause the training node to perform the steps of any one of claims 2 to 14.

27. A management node (1100) for using a target policy to manage a communication network environment that is operable to perform a task, the management node comprising processing circuitry (1102) configured to cause the management node to: obtain a reward estimator from a training node, wherein the reward estimator comprises a Machine Learning, ML, model that is operable to estimate a reward value given a particular observed context and selected action, and has been trained using a method according to any one of claims 1 to 13; receive an observed environment context from a communication network node; use the target policy to select, based on the received observed context and from a set of possible actions for the environment, an action for execution in the environment; and cause the selected action to be executed in the environment; wherein the target policy evaluates possible actions for the environment using the obtained reward estimator.

28. A management node as claimed in claim 27, wherein the processing circuitry (1102) is further configured to cause the management node to perform the steps of any one of claims 16 to 23.

Description:
Reward estimation for a target policy

Technical Field

The present disclosure relates to a method for improving the accuracy of a reward estimator for a target policy, and to a method for using a target policy to manage a communication network environment that is operable to perform a task. The methods are performed by a training node and by a management node respectively. The present disclosure also relates to a training node, a management node, and to a computer program product configured, when run on a computer, to carry out methods for improving the accuracy of a reward estimator for a target policy and/or for using a target policy to manage a communication network environment that is operable to perform a task.

Background

The Contextual Bandit (CB) setting refers to a decision-making framework in which an agent interacts with an environment by selecting actions to be executed on the environment. The agent learns an optimal policy for action selection by interacting with the environment and collecting a reward signal as a consequence of executing an action when a given context is observed in the environment. The context comprises information about the state of the environment that the agent uses to select an action in accordance with its learned policy. The CB setting is formally defined by an interaction cycle, where at each time step t = 1, 2, ..., the agent:

1) Observes a context $x_t \sim P(X)$, where $X \subseteq \mathbb{R}^d$ is the context space. The context is assumed to be continuous, and drawn independently and identically distributed (i.i.d.) from an unknown distribution $P(X)$ over the context space.

2) Chooses an action $a_t \sim \pi(\cdot \mid x_t)$, where $\pi: X \to P(A)$ is the policy, being a function from contexts to distributions over the action space $A$ and representing the behavior of the policy, and $\Pi$ is the class of considered policies.

3) Receives a reward $r(x_t, a_t) \sim D(\cdot \mid x_t, a_t)$, representing the reward experienced for executing action $a_t$ when context $x_t$ is observed.
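
As an illustration of this interaction cycle, the following sketch simulates logging under a toy stochastic policy; the environment, dimensions and reward model are hypothetical placeholders, not part of the disclosure.

```python
import numpy as np

rng = np.random.default_rng(0)
N_ACTIONS, CONTEXT_DIM, T = 5, 8, 100
W = rng.standard_normal((CONTEXT_DIM, N_ACTIONS))  # weights of a toy logging policy

def logging_policy(context):
    """Hypothetical stochastic logging policy: softmax over linear scores."""
    scores = context @ W
    probs = np.exp(scores - scores.max())
    return probs / probs.sum()

logged_data = []
for t in range(T):
    x_t = rng.standard_normal(CONTEXT_DIM)               # 1) observe context x_t ~ P(X)
    p_t = logging_policy(x_t)                            # 2) pi_0(.|x_t): distribution over actions
    a_t = int(rng.choice(N_ACTIONS, p=p_t))              #    sample action a_t ~ pi_0(.|x_t)
    r_t = float(rng.normal(loc=x_t[a_t % CONTEXT_DIM]))  # 3) receive reward r(x_t, a_t) (toy model)
    logged_data.append((x_t, a_t, r_t, p_t[a_t]))        # log (context, action, reward, propensity)
```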

While interacting with the environment, a significant amount of data is collected by the agent. This offline data can be of assistance for learning policies in data-driven techniques. In the case of an environment within a communication network, learning a policy by direct interaction with the environment carries the risk of degrading network performance to a potentially unacceptable degree during an exploration phase in which the agent tries out different actions in order to learn an optimal policy. Policy learning using offline data avoids this risk. For offline learning, data is collected by a logging policy that is different from the target policy that is to be trained using the collected data. The collected data is therefore referred to as off-policy data. Learning a new target policy from off-policy data in an offline manner can avoid exploratory actions in an online environment that are the primary cause of unsafe behaviors in online learning.

Formally, a baseline dataset $\mathcal{D}_0 = \{(x_i, a_i, r_i)\}_{i=1}^{N}$ is assumed, which has been collected using a logging policy $\pi_0$. The objective is to devise a target or learning policy $\pi \in \Pi$ in an offline manner from the off-policy dataset $\mathcal{D}_0$, with the objective of maximizing the value of the learning policy $\pi$, defined as:

$$V(\pi) = \mathbb{E}_{x \sim P(X),\, a \sim \pi(\cdot \mid x)}\left[r(x, a)\right].$$

As the off-policy data are collected by the logging policy $\pi_0$, they cannot be used directly to estimate the value of the learning policy $\pi$, as the learning policy will not always select the same action as the logging policy, given the same observed environment context. One approach to addressing this problem is to use a value estimator based on the Inverse Propensity Score (IPS) technique. The core idea of the IPS technique is to use IPS to correct for the distribution mismatch between $\pi_0$ and $\pi$. The value can be rewritten as:

$$V(\pi) = \mathbb{E}_{x \sim P(X),\, a \sim \pi_0(\cdot \mid x)}\left[\frac{\pi(a \mid x)}{\pi_0(a \mid x)}\, r(x, a)\right].$$

The estimator that results from this technique is the Monte-Carlo IPS estimator of the value:

$$\hat{V}_{IPS}(\pi) = \frac{1}{N}\sum_{i=1}^{N} \frac{\pi(a_i \mid x_i)}{\pi_0(a_i \mid x_i)}\, r_i.$$

Given a policy $\pi$ belonging to a policy class $\Pi$, the learning algorithm based on the IPS value estimator is:

$$\hat{\pi}_{IPS} = \arg\max_{\pi \in \Pi} \hat{V}_{IPS}(\pi) = \arg\max_{\pi \in \Pi} \frac{1}{N}\sum_{i=1}^{N} \frac{\pi(a_i \mid x_i)}{\pi_0(a_i \mid x_i)}\, r_i.$$
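
A minimal sketch of the Monte-Carlo IPS estimate, reusing the toy `logged_data` and `logging_policy` from the previous sketch; `target_policy` stands for any candidate policy returning a distribution over actions:

```python
def ips_value(target_policy, logged_data):
    """Monte-Carlo IPS estimate of V(pi) from data logged under the logging policy pi_0."""
    total = 0.0
    for x_i, a_i, r_i, propensity_i in logged_data:
        weight = target_policy(x_i)[a_i] / propensity_i  # pi(a_i|x_i) / pi_0(a_i|x_i)
        total += weight * r_i
    return total / len(logged_data)

# Evaluating the logging policy itself: all importance weights equal 1.
print(ips_value(logging_policy, logged_data))
```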

The quality of an estimator is usually characterized by its Bias and Variance. The IPS estimator is usually unbiased, but its variance scales quadratically with the reward variance and with the inverse of the propensity. The IPS estimator is therefore usually regarded as a low-bias and high-variance estimator.

Another approach to addressing the problem that off-policy data cannot be used directly to estimate the value of the learning policy $\pi$ is the Direct Method (DM) estimator. The DM estimator is based on the use of a model for the reward, which is learned on the available off-policy data. The learned reward model will in turn be used for the direct estimation of the risk of policy $\pi$.

Formally, let $f(\cdot): X \times A \to \mathbb{R}$ be a reward model from a function class $\mathcal{F}$. The reward model is learned with the objective of minimizing the Mean Squared Error (MSE) between reward samples in the off-policy data and the estimated reward from the model:

$$f^* = \arg\min_{f \in \mathcal{F}} \frac{1}{N}\sum_{i=1}^{N}\left[f(x_i, a_i) - r(x_i, a_i)\right]^2.$$

Based on the learned reward model $f^*$, the DM risk estimator of policy $\pi$ is defined as:

$$\hat{V}_{DM}(\pi) = \frac{1}{N}\sum_{i=1}^{N} \mathbb{E}_{a \sim \pi(\cdot \mid x_i)}\left[f^*(x_i, a)\right].$$

The optimal policy based on $\hat{V}_{DM}(\pi)$ is the deterministic policy that maximizes the estimated reward:

$$\pi^*_{DM}(x) = \arg\max_{a \in A} f^*(x, a).$$

The DM estimator has lower variance than its IPS counterpart, but suffers from high model bias owing to the choice of the reward model. The DM estimator is therefore usually regarded as a high-bias and low-variance value estimator.
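
A minimal sketch of the Direct Method, assuming a ridge-regression reward model over one-hot action encodings and reusing `logged_data` and `N_ACTIONS` from the interaction-cycle sketch; scikit-learn's Ridge is used purely for illustration, as the description does not prescribe a model class:

```python
import numpy as np
from sklearn.linear_model import Ridge

def fit_reward_model(logged_data, n_actions):
    """Fit f(x, a) to the observed rewards by MSE, with a one-hot action encoding."""
    features, targets = [], []
    for x_i, a_i, r_i, _ in logged_data:
        features.append(np.concatenate([x_i, np.eye(n_actions)[a_i]]))
        targets.append(r_i)
    return Ridge(alpha=1.0).fit(np.array(features), np.array(targets))

def dm_policy(reward_model, context, n_actions):
    """Deterministic DM policy: select the action with the highest estimated reward."""
    candidates = [np.concatenate([context, np.eye(n_actions)[a]]) for a in range(n_actions)]
    return int(np.argmax(reward_model.predict(np.array(candidates))))

reward_model = fit_reward_model(logged_data, N_ACTIONS)
print(dm_policy(reward_model, logged_data[0][0], N_ACTIONS))
```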

High bias is a significant disadvantage of DM training, and empirical observation suggests that DM suffers more from high bias (or unbalanced action samples) as the number of input features increases. For use in communication networks this is a significant problem, as increasing the number of input features is a key step towards increasing the gain offered by Machine Learning (ML) techniques in such networks. Attempts have been made to combine the DM and IPS estimators in an effort to address their dual bias-variance characteristics: IPS being a low-bias and high-variance estimator while DM is a high-bias and low-variance estimator. These attempts include the Importance Weighted Regression (IWR) disclosed by Bietti, A., Agarwal, A., & Langford, J. (2018), "Practical Evaluation and Optimization of Contextual Bandit Algorithms”, ArXiv, abs/1802.04064, and the Doubly Robust (DR) estimator disclosed by Dudik, Miroslav, Langford, John, and Li, Lihong, "Doubly robust policy evaluation and learning”, ICML, 2011. However, the proposed techniques offer only limited performance, particularly when applied to communication network data, which may be highly diverse and with evolving characteristics.

Summary

It is an aim of the present disclosure to provide methods, nodes, and a computer program product which at least partially address one or more of the challenges discussed above. It is a further aim of the present disclosure to provide methods, nodes and a computer program product that cooperate to produce a reward estimator for a target policy that offers improved accuracy, resulting in improved performance of its task by a communication network environment that is managed according to the target policy.

According to a first aspect of the present disclosure, there is provided a computer implemented method for improving the accuracy of a reward estimator for a target policy, wherein the target policy is for managing a communication network environment that is operable to perform a task. The method, performed by a training node, comprises obtaining a training dataset comprising records of task performance by the environment during a period of management according to a reference policy, wherein each record of task performance comprises an observed context for the environment, an action selected for execution in the environment by the reference policy on the basis of the observed context, and a reward value indicating an observed impact of the selected action on task performance by the environment. The method further comprises generating, based on the training dataset, a propensity model that estimates the probability of selection by the reference policy of a particular action given a particular observed context. The method further comprises initiating the reward estimator, wherein the reward estimator comprises a Machine Learning (ML) model having a plurality of parameters, and wherein the reward estimator is operable to estimate reward value given a particular observed context and selected action, and setting a value of a propensity impact parameter according to a feature of at least the training dataset or the reference policy. The method further comprises using the records of task performance in the training dataset to update the values of the reward estimator parameters so as to minimize a loss function. The loss function is based on differences between observed reward from the training dataset and reward estimated by the reward estimator for given pairs of observed context and action selected by the reference policy, each difference weighted according to a function of the output of the propensity model for the given pair of observed context and action selected by the reference policy, such that as the estimated probability output by the propensity model decreases, the contribution to the loss function of the difference between observed and estimated reward increases. Using the records of task performance in the training dataset to update the values of the ML model parameters so as to minimize the loss function comprises adjusting a magnitude of the weighting of each difference according to the impact parameter.

According to another aspect of the present disclosure, there is provided a computer implemented method for using a target policy to manage a communication network environment that is operable to perform a task. The method, performed by a management node, comprises obtaining a reward estimator from a training node, wherein the reward estimator comprises a Machine Learning (ML) model that is operable to estimate a reward value given a particular observed context and selected action, and has been trained using a method according to the previous aspect of the present disclosure. The method further comprises receiving an observed environment context from a communication network node, and using the target policy to select, based on the received observed context and from a set of possible actions for the environment, an action for execution in the environment, wherein the target policy evaluates possible actions for the environment using the obtained reward estimator. The method further comprises causing the selected action to be executed in the environment.

According to another aspect of the present disclosure, there is provided a computer program product comprising a computer readable medium, the computer readable medium having computer readable code embodied therein, the computer readable code being configured such that, on execution by a suitable computer or processor, the computer or processor is caused to perform a method according to any one or more of the aspects or examples of the present disclosure.

According to another aspect of the present disclosure, there is provided a training node for improving the accuracy of a reward estimator for a target policy, wherein the target policy is for managing a communication network environment that is operable to perform a task. The training node comprises processing circuitry configured to cause the training node to obtain a training dataset comprising records of task performance by the environment during a period of management according to a reference policy, wherein each record of task performance comprises an observed context for the environment, an action selected for execution in the environment by the reference policy on the basis of the observed context, and a reward value indicating an observed impact of the selected action on task performance by the environment. The processing circuitry is further configured to cause the training node to generate, based on the training dataset, a propensity model that estimates the probability of selection by the reference policy of a particular action given a particular observed context, and to initiate the reward estimator, wherein the reward estimator comprises an ML model having a plurality of parameters, and wherein the reward estimator is operable to estimate reward value given a particular observed context and selected action. The processing circuitry is further configured to cause the training node to set a value of a propensity impact parameter according to a feature of at least the training dataset or the reference policy, and to use the records of task performance in the training dataset to update the values of the reward estimator parameters so as to minimize a loss function. The loss function is based on differences between observed reward from the training dataset and reward estimated by the reward estimator for given pairs of observed context and action selected by the reference policy, each difference weighted according to a function of the output of the propensity model for the given pair of observed context and action selected by the reference policy, such that as the estimated probability output by the propensity model decreases, the contribution to the loss function of the difference between observed and estimated reward increases. Using the records of task performance in the training dataset to update the values of the ML model parameters so as to minimize the loss function comprises adjusting a magnitude of the weighting of each difference according to the impact parameter.

According to another aspect of the present disclosure, there is provided a management node for using a target policy to manage a communication network environment that is operable to perform a task. The management node comprises processing circuitry configured to cause the management node to obtain a reward estimator from a training node, wherein the reward estimator comprises an ML model that is operable to estimate a reward value given a particular observed context and selected action, and has been trained using a method according to the present disclosure. The processing circuitry is further configured to cause the management node to receive an observed environment context from a communication network node, to use the target policy to select, based on the received observed context and from a set of possible actions for the environment, an action for execution in the environment, and to cause the selected action to be executed in the environment. The target policy evaluates possible actions for the environment using the obtained reward estimator.

Aspects of the present disclosure thus provide methods and nodes that facilitate flexible and adaptive off-policy learning through a combination of the IPS and DM off-policy learning techniques. Methods presented herein allow for continuous adaptation between contributions from IPS and DM based solutions, leading to an improved reward estimator that results in an improved target policy, and consequently improved management of a communication network environment by that policy. Through setting of a value for the propensity impact parameter, methods of the present disclosure allow for tuning of the impact of propensities on the reward estimator, favoring contributions from either propensity based or direct method techniques according to characteristics of the offline dataset used for learning and/or a feature of the reference policy. As discussed in greater detail below, performance of the methods disclosed herein has been validated on communication network data for the task of Remote Electrical Tilt (RET) optimization in 4G LTE networks, and experimental results show a significant gain over the DM method, especially when the number of input features is large.

Brief Description of the Drawings

For a better understanding of the present disclosure, and to show more clearly how it may be carried into effect, reference will now be made, by way of example, to the following drawings in which:

Figure 1 is a flow chart illustrating process steps in a computer implemented method for improving the accuracy of a reward estimator for a target policy;

Figure 2 is a flow chart illustrating process steps in a computer implemented method for using a target policy to manage a communication network environment that is operable to perform a task;

Figures 3a and 3b show a flow chart illustrating process steps in another example of a computer implemented method for improving the accuracy of a reward estimator for a target policy;

Figures 4a and 4b show a flow chart illustrating process steps in another example of a computer implemented method for improving the accuracy of a reward estimator for a target policy;

Figure 5 is a flow chart illustrating process steps in another example of a computer implemented method for using a target policy to manage a communication network environment that is operable to perform a task;

Figure 6 is a flow chart illustrating process steps in another example of a computer implemented method for using a target policy to manage a communication network environment that is operable to perform a task;

Figure 7 is a flow chart illustrating implementation of the method of Figure 1 as a training pipeline;

Figure 8 is a flow chart illustrating implementation of the method of Figure 2 as a policy pipeline;

Figure 9 is a block diagram illustrating functional modules in a training node;

Figure 10 is a block diagram illustrating functional modules in another example of training node;

Figure 11 is a block diagram illustrating functional modules in a management node;

Figure 12 is a block diagram illustrating functional modules in another example of management node;

Figure 13 illustrates antenna downtilt angle; and

Figures 14 and 15 illustrate comparative performance of methods according to the present disclosure for a Remote Electrical Tilt use case.

Detailed Description

As discussed above, examples of the present disclosure provide methods and nodes that facilitate flexible and adaptive off-policy learning through a combination of the IPS and DM off-policy learning techniques. The resulting reward estimator is referred to as an Adaptive Propensity-Based Direct Estimator.

Figure 1 is a flow chart illustrating process steps in a computer implemented method 100 for improving the accuracy of a reward estimator for a target policy, wherein the target policy is for managing a communication network environment that is operable to perform a task. The task performed by the environment may comprise one or more aspects of provision of communication network services. For example, the environment may comprise a cell, a cell sector, or a group of cells of a communication network, and the task may comprise provision of Radio Access Network (RAN) services to wireless devices connecting to the network from within the environment. In other examples, the environment may comprise a network slice, or a part of a transport or core network, in which case the task may be to provide end to end network services, core network services such as mobility management, service management etc., network management services, backhaul services or other services to wireless devices, to other parts of the communication network, to network traffic originating from wireless devices, to application services using the network, etc.

The method 100 is performed by a training node, which may comprise a physical or virtual node, and may be implemented in a computing device or server apparatus and/or in a virtualized environment, for example in a cloud, edge cloud or fog deployment. The training node may for example be implemented in a core network of the communication network. The training node may encompass multiple logical entities, as discussed in greater detail below, and may for example comprise a Virtualised Network Function (VNF).

Referring to Figure 1, the method 100 comprises, in a first step 110, obtaining a training dataset comprising records of task performance by the environment during a period of management according to a reference policy. As illustrated at 110a, each record of task performance comprises an observed context for the environment, an action selected for execution in the environment by the reference policy on the basis of the observed context, and a reward value indicating an observed impact of the selected action on task performance by the environment.

An observed context for an environment comprises any measured, recorded or otherwise observed information about the state of the environment. As noted above, the environment is an environment within a communication network, such as a cell of a cellular network, a cell sector, a group of cells, a geographical region, transport network, core network, network slice, etc. An observed context for an environment may therefore comprise one or more Key Performance Indicators (KPIs) for the environment, information about a number of wireless devices connecting to the communication network in the environment, etc. The action selected for execution in the environment may be any configuration, management or other action which impacts performance of the environment task. This may comprise setting one or more values of controllable parameters in the environment for example. The reward value indicates an observed impact of the selected action on task performance by the environment. This may comprise a change in one or more KPI values following execution of the action, or any other value, combination of values etc. which provide an indication of how the selected action has impacted the ability of the environment to perform its task. For example, in the case of an environment comprising a cell of a RAN, the reward value may comprise a function of network coverage, quality and capacity parameters.
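
As a purely illustrative sketch, a reward of this kind could combine coverage, quality and capacity indicators; the weights and KPI names below are hypothetical and not taken from the application.

```python
def cell_reward(coverage, quality, capacity, weights=(0.4, 0.3, 0.3)):
    """Hypothetical reward: weighted combination of normalised coverage, quality and capacity KPIs."""
    w_cov, w_qual, w_cap = weights
    return w_cov * coverage + w_qual * quality + w_cap * capacity

print(cell_reward(coverage=0.9, quality=0.7, capacity=0.8))  # 0.81
```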

It will be appreciated that the records of task performance by the environment thus provide an indication of how the environment has been managed by the reference policy, illustrating, for each action executed on the environment, the information on the basis of which the reference policy selected the action (the context), the action selected, and the outcome of the selected action for task performance (the reward value). Training of the reward estimator according to the method 100 is performed using the obtained training data in subsequent method steps, and is consequently performed offline.

In step 120, the method 100 comprises generating, based on the training dataset, a propensity model that estimates the probability of selection by the reference policy of a particular action given a particular observed context. The propensity model may comprise a linear model, a logistic regression model, or any other suitable model capable of being trained using the training dataset to map from an observed context in the training dataset to a probability distribution indicating the probability of selection by the reference policy of different actions. Generating the propensity model may therefore comprise initiating the propensity model and training the model by submitting as input observed contexts from the training data, and updating model parameters so as to minimize some error function based on the difference between the probability of selecting different actions generated by the model and the action selection probability distribution for the input context from the training data.
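
A minimal sketch of such a propensity model, assuming a multinomial logistic regression fitted with scikit-learn; the array shapes and names are placeholders rather than values from the disclosure.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Placeholder training data: observed contexts and the actions selected by the
# reference policy (in practice both come from the obtained training dataset).
contexts = rng.standard_normal((500, 8))
actions = rng.integers(0, 5, size=500)

propensity_model = LogisticRegression(max_iter=1000).fit(contexts, actions)

def propensity(x, a):
    """Estimated probability that the reference policy selects action a given context x."""
    probs = propensity_model.predict_proba(x.reshape(1, -1))[0]
    return float(probs[list(propensity_model.classes_).index(a)])

print(propensity(contexts[0], actions[0]))
```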

In step 130, the method 100 comprises initiating the reward estimator, wherein the reward estimator comprises a Machine Learning (ML) model having a plurality of parameters, and wherein the reward estimator is operable to estimate reward value given a particular observed context and selected action. The method 100 then comprises, in step 140, setting a value of a propensity impact parameter according to a feature of at least the training dataset or the reference policy. Setting a value of the propensity impact parameter may comprise calculating a value based on the above mentioned feature(s), inputting a value provided by a user on the basis of the above mentioned feature(s), updating a default value of the propensity impact parameter based on previous performance of the reward estimator (which will be affected by the above mentioned features), inputting a value of the propensity impact parameter that has been learned by a model on the basis of previous experience of training datasets, reference policies and reward estimators, etc.

Finally, in step 150, the method 100 comprises using the records of task performance in the training dataset to update the values of the reward estimator parameters so as to minimize a loss function. Using the records of task performance in the training data in this manner may comprise inputting pairs of observed context and action from the training data to the reward estimator, using the reward estimator to estimate a reward that would result from execution of the input action given the input observed context, evaluating the estimated reward using the loss function, and then updating trainable parameters of the reward estimator to minimize the loss function, for example via backpropagation or any other suitable method.

As illustrated at 150a, the loss function is based on differences between observed reward from the training dataset and reward estimated by the reward estimator for given pairs of observed context and action selected by the reference policy. Each difference between observed and estimated reward is weighted according to a function of the output of the propensity model for the given pair of observed context and action selected by the reference policy, such that as the estimated probability output by the propensity model decreases, the contribution to the loss function of the difference between observed and estimated reward increases. This means that if the reference policy is highly likely to take a particular action given a particular observed context, the contribution to the loss function of the difference between observed and estimated reward for that combination of context and action is reduced. However, if the reference policy is highly unlikely to take a particular action given a particular observed context, the contribution to the loss function of the difference between observed and estimated reward for that combination of context and action is increased. The result of this weighting based on the propensity model is therefore to compensate for specific characteristics or imbalance in the probability distribution of actions selected by the reference policy.

As illustrated at 150, the step of using the records of task performance in the training dataset to update the values of the ML model parameters so as to minimize the loss function comprises adjusting a magnitude of the weighting of each difference according to the impact parameter. The impact parameter, by determining the magnitude of the weighting, thus determines the impact that the output of the propensity model has on the loss function, which impact may range from no impact at all to a maximum impact, the magnitude of which will be determined by the value of the impact parameter. As the magnitude of the weighting by the propensity model output of each difference increases, in accordance with the value of the impact parameter, the variance of the reward estimation increases, and the bias of the reward estimation decreases. It will be appreciated that the method 100 consequently results in an offline reward estimator in which, as opposed to choosing between the low bias, high variance estimation of Inverse Propensity Score, or the high bias, low variance estimation of the Direct Method, these two methods are combined in a flexible manner via the impact parameter, so achieving improved accuracy in reward estimation, and consequently in offline policy training.
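
One way to formalise this combination (a sketch consistent with the description above, not an equation reproduced from the application) is a propensity-weighted regression loss in which a propensity impact parameter $\alpha \in [0, 1]$ controls the magnitude of the weighting:

$$\mathcal{L}(\theta) = \frac{1}{N}\sum_{i=1}^{N} \frac{\left[r_i - f_\theta(x_i, a_i)\right]^2}{\hat{\pi}_0(a_i \mid x_i)^{\alpha}},$$

where $f_\theta$ is the reward estimator, $\hat{\pi}_0$ is the propensity model, and the records $(x_i, a_i, r_i)$ are drawn from the training dataset. Setting $\alpha = 0$ recovers the Direct Method regression loss, while $\alpha = 1$ applies full inverse-propensity weighting.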

The method 100 thus offers improved management of a communication network environment, through enabling a more accurate policy to be trained in an offline, and consequently safe, manner. The method 100 uses offline training data to generate a propensity model and a reward estimator. The output of the propensity model is used to weight contributions to the loss function during training of the reward estimator, and the magnitude of that weighting is determined by a propensity impact parameter. The reward estimator trained in this manner is used by a target policy to control the communication network environment. The increased accuracy offered by the reward estimator trained in the manner of the method 100 ensures improved performance in management of the communication network environment by the target policy, without incurring the risks of online target policy training.

The method 100 may be complemented by a computer implemented method 200 for using a target policy to manage a communication network environment that is operable to perform a task. The method 200 is performed by a management node, which may comprise a physical or virtual node, and may be implemented in a computing device, server apparatus and/or in a virtualized environment, for example in a cloud, edge cloud or fog deployment. The management node may comprise or be instantiated in any part of a logical core network node, network management centre, network operations centre, Radio Access Network node etc. A Radio Access Network Node may comprise a base station, eNodeB, gNodeB, or any other current or future implementation of functionality facilitating the exchange of radio network signals between nodes and/or users of a communication network. Any communication network node may itself be divided between several logical and/or physical functions, and any one or more parts of the management node may be instantiated in one or more logical or physical functions of a communication network node. The management node may therefore encompass multiple logical entities, as discussed in greater detail below.

Referring to Figure 2, the method 200 comprises, in a first step 210, obtaining a reward estimator from a training node, wherein the reward estimator comprises an ML model that is operable to estimate a reward value given a particular observed context and selected action, and has been trained using a method according to examples of the present disclosure. In step 220, the method 200 comprises receiving an observed environment context from a communication network node. The method 200 further comprises, in step 230, using the target policy to select, based on the received observed context and from a set of possible actions for the environment, an action for execution in the environment. As illustrated at 230a, the target policy evaluates possible actions for the environment using the obtained reward estimator. Finally, in step 240, the method 200 comprises causing the selected action to be executed in the environment.
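
A minimal sketch of steps 220 to 240, assuming a trained reward estimator callable and a hypothetical `execute_action` hook towards the communication network node; both names are placeholders rather than interfaces defined in the disclosure.

```python
def select_action(reward_estimator, observed_context, possible_actions):
    """Greedy target policy: evaluate each possible action with the reward estimator
    and return the action with the highest estimated reward."""
    estimates = {a: reward_estimator(observed_context, a) for a in possible_actions}
    return max(estimates, key=estimates.get)

def manage_environment(reward_estimator, observed_context, possible_actions, execute_action):
    """Receive an observed context, select an action with the target policy, and
    cause the selected action to be executed in the environment."""
    action = select_action(reward_estimator, observed_context, possible_actions)
    execute_action(action)  # e.g. send a configuration command towards the network node
    return action
```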

The method 200 may be envisaged as a policy pipeline that uses a reward estimator trained according to the method 100 (or the methods 300, 400 described below) in evaluating possible actions for execution in the environment. It will be appreciated that much of the detail described above with reference to the method 100 also applies to the method 200. For example the nature of the environment, the observed environment context, the reward value, and possible actions for execution in the environment may all be substantially as described above with reference to Figure 1. It will also be appreciated that by virtue of having been trained using a method according to the present disclosure, the reward estimator used in the method 200 offers all of the advantages discussed above relating to improved accuracy and flexibility, and consequently improved performance. In some examples of the method 200, additional pre-processing steps may be carried out, including for example normalizing features of the received observed context.

Figures 3a and 3b show flow charts illustrating process steps in a further example of a method 300 for improving the accuracy of a reward estimator for a target policy, wherein the target policy is for managing a communication network environment that is operable to perform a task. The method 300 provides various examples of how the steps of the method 100 may be implemented and supplemented to achieve the above discussed and additional functionality. As for the method 100, the method 300 is performed by a training node, which may be a physical or virtual node, and which may encompass multiple logical entities.

Referring to Figure 3a, in a first step 310 of the method 300, the training node obtains a training dataset comprising records of task performance by the environment during a period of management according to a reference policy. As illustrated at 310a, each record of task performance comprises an observed context for the environment, an action selected for execution in the environment by the reference policy on the basis of the observed context, and a reward value indicating an observed impact of the selected action on task performance by the environment. Reference is made to the description of Figure 1 above for additional detail concerning the observed environment context, reward value, etc. In some examples (not shown), the training node may perform one or more pre-processing steps such as normalizing input features from the obtained training dataset.

In step 320, the training node generates, based on the training dataset, a propensity model that estimates the probability of selection by the reference policy of a particular action given a particular observed context. As discussed above, the propensity model may in some examples comprise a logistic regression model, and may be generated as discussed above with reference to Figure 1. In step 330, the training node initiates the reward estimator, wherein the reward estimator comprises an ML model having a plurality of parameters, and wherein the reward estimator is operable to estimate reward value given a particular observed context and selected action. The reward estimator may for example comprise an Artificial Neural Network (ANN), a logistic regression model, a Gaussian process, a linear regression model, etc.

In step 335, the training node samples a batch of records from the training dataset. This may comprise random sampling, and the batch size may be defined by a user or operator in accordance with the size of the training dataset, time available for training, training model complexity, data characteristics, etc. In step 340, the training node sets a value of the propensity impact parameter according to a feature of at least the training dataset or the initiated reward estimator. As illustrated at 340, the feature of the training dataset comprises a feature of at least the sampled batch of the training dataset. In this manner, the value of the propensity impact parameter (and consequently the magnitude of the impact of the propensity model on the loss function, and therefore the trained reward estimator) can be adjusted according to the characteristics of the particular sampled batch of data. As discussed above, setting a value of the propensity impact parameter may comprise calculating a value based on the above mentioned feature(s), inputting a value provided by a user on the basis of the above mentioned feature(s), updating a default value of the propensity impact parameter based on previous performance of the reward estimator (which will be affected by the above mentioned features), inputting a value of the propensity impact parameter that has been learned by a model on the basis of previous experience of training datasets, reference policies and reward estimators, etc.

As illustrated at step 340a, setting the value of the impact parameter may comprise setting the value such that the magnitude of the weighting of each difference by the output of the propensity model increases with at least one of decreasing noise in the training dataset and/or decreasing balance in the action distribution of the reference policy. For example, if the noise level in the training dataset is high, the propensity impact parameter may be set to reduce the magnitude of the propensity weighting, and so reduce the variance in the reward estimation. In another example, if the actions selected by the reference policy in the training dataset are distributed in an unbalanced manner, or dominated by one or a small number of actions, the impact parameter may be set to increase the magnitude of the propensity weighting, and so to reduce the bias in the reward estimation.
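Purely by way of illustration, one possible heuristic for setting the impact parameter from a sampled batch uses the normalised entropy of the batch action distribution as a proxy for balance, so that a less balanced batch yields a larger propensity weighting. The specific mapping below is an assumption for the purpose of example rather than a rule prescribed by the present disclosure, and an estimate of noise in the training dataset could be incorporated in a similar manner.

```python
# Illustrative heuristic only: set the propensity impact parameter from the
# action balance of a sampled batch. Actions are assumed to be encoded as
# integer indices 0..num_actions-1.
import numpy as np

def set_impact_parameter(batch_actions: np.ndarray, num_actions: int) -> float:
    counts = np.bincount(batch_actions, minlength=num_actions).astype(float)
    probabilities = counts / counts.sum()
    probabilities = probabilities[probabilities > 0.0]
    entropy = -(probabilities * np.log(probabilities)).sum()
    balance = entropy / np.log(num_actions)   # 1.0 = perfectly balanced, 0.0 = a single dominant action
    beta = 1.0 - balance                      # less balanced batch -> larger beta -> stronger propensity weighting
    return float(np.clip(beta, 0.0, 1.0))
```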

As illustrated at 340b, the propensity impact parameter may be set to have a value between zero and one.

Referring now to Figure 3b, in step 350, the training node uses the records in the sampled batch of the training dataset to update the values of the reward estimator parameters so as to minimize a loss function. As discussed above and illustrated in step 150a of the method 100, the loss function minimized in step 350 is based on differences between observed reward from the sampled batch of the training dataset and reward estimated by the reward estimator for given pairs of observed context and action selected by the reference policy. Each difference is weighted according to a function of the output of the propensity model for the given pair of observed context and action selected by the reference policy, such that as the estimated probability output by the propensity model decreases, the contribution to the loss function of the difference between observed and estimated reward increases.

As illustrated at step 350, using the records of task performance in the sampled batch of the training dataset to update the values of the ML model parameters so as to minimize the loss function comprises adjusting a magnitude of the weighting of each difference according to the impact parameter, a value of which was set at step 340. In some examples, as discussed above and as illustrated at 350a, the propensity impact parameter may be set to have a value between zero and one, and adjusting a magnitude of the weighting of each difference according to the impact parameter may comprise, for a given pair of observed context and action selected by the reference policy, raising the output of the propensity model to the power of the impact parameter, and dividing a function of the difference between observed and estimated reward for the given pair of observed context and action selected by the reference policy by the output of the propensity model raised to the power of the impact parameter. This is one way in which the magnitude of the weighting applied via the output of the propensity model may be increased or decreased according to the value of the impact parameter, as discussed above.
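The weighting described above may be illustrated with a minimal sketch, assuming that the observed rewards, estimated rewards and propensity scores for a batch are available as arrays; the function and variable names are illustrative.

```python
# Minimal sketch of the propensity-weighted squared-error loss: each squared
# difference is divided by the propensity raised to the power of the impact
# parameter, so low-propensity (context, action) pairs contribute more as
# beta approaches one.
import numpy as np

def propensity_weighted_mse(observed_reward: np.ndarray,
                            estimated_reward: np.ndarray,
                            propensity_scores: np.ndarray,
                            beta: float) -> float:
    squared_error = (observed_reward - estimated_reward) ** 2
    weights = propensity_scores ** beta
    return float(np.mean(squared_error / weights))
```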

As illustrated in Figure 3b, the step 350 of using the records of task performance in the training dataset to update the values of the reward estimator parameters so as to minimize a loss function may comprise, in steps 350b and 350c, inputting observed context and selected action pairs from the sampled batch of the training dataset to the reward estimator, wherein the reward estimator processes the observed context and selected action pairs in accordance with current values of parameters of the reward estimator and outputs an estimated reward value, and updating the values of the reward estimator parameters so as to minimize the loss function. This updating may for example be performed via backpropagation, and the reward estimator may as discussed above comprise an ANN.

In step 360, the training node checks for fulfilment of a convergence condition. The convergence condition may for example comprise a threshold value for the output of the loss function, a threshold rate of change of the output of the loss function, a maximum number of training epochs etc. If the convergence condition is not fulfilled, the training node returns to step 335, samples a new batch of records from the training dataset and repeats steps 340 (setting a value of the propensity impact parameter), and 350 (using the records in the sampled batch of the training dataset to update values of the reward estimator parameters so as to minimize a loss function).

If the check at step 360 indicates that the convergence condition is fulfilled, the training node provides the reward estimator in step 370 to a management node for use in managing the environment of the communication network.

In step 380, the training node checks whether any feedback has been received from the management node regarding performance of the target policy when using the provided trained reward estimator. If such feedback is available, the training node may update the value of the impact parameter according to performance of the target policy. In other examples, additional tuning of the value of the impact parameter may be performed as for the initial setting step on the basis of characteristics of the training dataset. As discussed above, and for an impact parameter having a value between 0 and 1 used to adjust magnitude in the manner described with reference to step 350a, if the noise level in the training dataset is high, the value of the impact parameter may be decreased so as to reduce the variance in the reward estimation, and if the actions made by the reference policy in the training dataset are distributed in an unbalanced manner or dominated by a particular action, the impact parameter may be increased so as to reduce the bias in the reward estimation.

Figures 4a and 4b illustrate different examples of how the methods 100 and 300 may be applied to different technical domains of a communication network. A more detailed discussion of example use cases is provided below, for example with reference to Figures 13 to 15; however, Figures 4a and 4b provide an indication of example environments, contexts, actions, and rewards etc. for different technical domains. It will be appreciated that the technical domains illustrated in Figures 4a and 4b are merely for the purpose of illustration, and application of the methods 100 and 300 to other technical domains within a communication network may be envisaged.

Figures 4a and 4b illustrate steps of a method 400 for improving the accuracy of a reward estimator for a target policy, wherein the target policy is for managing a communication network environment that is operable to perform a task. The method 400 is performed by a training node, and the steps of the method 400 largely correspond to the steps of the method 100. Reference may be made to the above discussion of the method 100 for the detail of the training node and corresponding method steps.

Referring to Figure 4a, in step 410, the training node obtains a training dataset comprising records of task performance by the environment during a period of management according to a reference policy, wherein each record of task performance comprises an observed context for the environment, an action selected for execution in the environment by the reference policy on the basis of the observed context, and a reward value indicating an observed impact of the selected action on task performance by the environment. As illustrated at 410a, the environment may comprise at least one of a cell of a communication network, a cell sector of a communication network, at least a part of a core network of a communication network, or a slice of a communication network, and the task that the environment is operable to perform may comprise provision of communication network services.

Figure 4b illustrates examples of features, any one or more of which may be included in an observed context for an environment of a communication network. Referring to Figure 4b, an observed environment context in the training dataset may comprise at least one of: a value of a network coverage parameter (401a); a value of a network capacity parameter (401b); a value of a network congestion parameter (401c); a value of a network quality parameter; a current network resource allocation (401d); a current network resource configuration (401e); a current network usage parameter (401f); a current network parameter of a neighbor communication network cell (401g); a value of a network signal quality parameter (401h); a value of a network signal interference parameter (401i); a value of a Reference Signal Received Power, RSRP, parameter; a value of a Reference Signal Received Quality, RSRQ, parameter; a value of a network signal to interference plus noise ratio, SINR, parameter; a value of a network power parameter (401j); a current network frequency band (401k); a current network antenna down-tilt angle (401l); a current network antenna vertical beamwidth (401m); a current network antenna horizontal beamwidth (401n); a current network antenna height (401o); a current network geolocation (401p); a current network inter-site distance (401q).

It will be appreciated that many of the parameters listed above comprise observable or measurable parameters, including KPIs of the network, as opposed to configurable parameters that may be controlled by a network operator. In the case of an environment comprising a cell of a communication network, the observed context for the cell may include one or more of the parameters listed above as measured or observed for the cell in question and for one or more neighbour cells of the cell in question.

Referring again to Figure 4a, in step 420, the training node generates, based on the training dataset, a propensity model that estimates the probability of selection by the reference policy of a particular action given a particular observed context. As illustrated at 420a, an action for execution in the environment may comprise at least one of: an allocation decision for at least one communication network resource; a configuration for at least one communication network node; a configuration for at least one communication network equipment; a configuration for at least one communication network operation; a decision relating to provision of communication network services for at least one wireless device; a configuration for an operation performed by at least one wireless device in relation to the communication network.

In step 430, the training node initiates the reward estimator, wherein the reward estimator comprises an ML model having a plurality of parameters, and wherein the reward estimator is operable to estimate reward value given a particular observed context and selected action. As illustrated at 430a, the reward value indicating an observed impact of the selected action on task performance by the environment may comprise a function of at least one performance parameter for the communication network.

In step 440, the training node sets a value of a propensity impact parameter according to a feature of at least the training dataset or the initiated reward estimator, and in step 450, the training node uses the records of task performance in the training dataset to update the values of the reward estimator parameters so as to minimize a loss function. As illustrated at 450a, the loss function is based on differences between observed reward from the training dataset and reward estimated by the reward estimator for given pairs of observed context and action selected by the reference policy, each difference weighted according to a function of the output of the propensity model for the given pair of observed context and action selected by the reference policy, such that as the estimated probability output by the propensity model decreases, the contribution to the loss function of the difference between observed and estimated reward increases. As illustrated at 450, using the records of task performance in the training dataset to update the values of the ML model parameters so as to minimize the loss function comprises adjusting a magnitude of the weighting of each difference according to the impact parameter.

Figure 5 is a flow chart illustrating process steps in a further example of method 500 for using a target policy to manage a communication network environment that is operable to perform a task. The method 500 may complement either of the methods 100 and/or 300, and is performed by a management node. The method 500 illustrates examples of how the steps of the method 200 may be implemented and supplemented to achieve the above discussed and additional functionality. As discussed above with reference to the method 200, the management node performing the method 500 may be a physical or virtual node, and may encompass multiple logical entities.

Referring to Figure 5, in a first step 510 of the method 500, the management node obtains a reward estimator from a training node, wherein the reward estimator comprises an ML model that is operable to estimate a reward value given a particular observed context and selected action, and has been trained using the method 100 or 300. In step 520, the management node receives an observed environment context from a communication network node.

In step 530, the management node uses the target policy to select, based on the received observed context and from a set of possible actions for the environment, an action for execution in the environment. As illustrated at 530a and 530b, the step of using the target policy to select an action for execution may comprise, in step 530a, evaluating possible actions for the environment using the obtained reward estimator, and, in step 530b, using a selection function to select an action for execution in the environment based on the evaluation.

Each of steps 530a and 530b may take different forms, depending upon the nature of the target policy. As illustrated at 530aa, evaluating possible actions for the environment using the obtained reward estimator may comprise using the reward estimator to estimate a reward from taking each possible action given the received context, for example in the case of a deterministic target policy. Using a selection function to select an action for execution in the environment based on the evaluation may in such examples comprise selecting for execution in the environment the action having the highest estimated reward, as illustrated at 530bb.

In other examples, a stochastic target policy may be used by the management node, in which case, evaluating possible actions for the environment using the obtained reward estimator may comprise using a prediction function to predict the probability that each of the possible actions will result in the greatest reward, as illustrated at 530aa. In examples of the present disclosure, the greatest reward may be the greatest reward as estimated by the reward estimator. Using a selection function to select an action for execution in the environment based on the evaluation may in such examples comprise selecting for execution in the environment the action having the highest probability, as illustrated at 530bb.

In step 540, the management node causes the selected action to be executed in the environment. It will be appreciated that much of the detail described above with reference to the methods 100 and 300 also applies to the method 500. For example, the nature of the environment, the observed environment context, the reward value, and possible actions for execution in the environment may all be substantially as described above with reference to Figures 1, 3a and 3b.

Figure 6 illustrates different examples of how the methods 200 and 500 may be applied to different technical domains of a communication network. A more detailed discussion of example use cases is provided below, for example with reference to Figures 13 to 15; however, Figure 6 provides an indication of example environments, contexts, actions, and rewards etc. for different technical domains. The method 600 of Figure 6 thus corresponds to the method 400 of Figures 4a and 4b. It will be appreciated that the technical domains illustrated in Figure 6 are merely for the purpose of illustration, and application of the methods 200 and 500 to other technical domains within a communication network may be envisaged.

Figure 6 is a flow chart illustrating process steps in a computer implemented method for using a target policy to manage a communication network environment that is operable to perform a task. The method is performed by a management node, which may be a physical or virtual node, and may encompass multiple logical entities, as discussed above. Referring to Figure 6, in a first step 610, the management node obtains a reward estimator from a training node, wherein the reward estimator comprises an ML model that is operable to estimate a reward value given a particular observed context and selected action, and has been trained using the method 100, 300 or 400. As illustrated at 610a, the reward value comprises a function of at least one performance parameter for the communication network.

In step 620, the management node receives an observed environment context from a communication network node. As illustrated at 620a, the environment comprises at least one of a cell of a communication network, a cell sector of a communication network, at least a part of a core network of a communication network, or a slice of a communication network, and the task that the environment is operable to perform comprises provision of communication network services. The observed context may comprise any one or more of the parameters discussed above with reference to Figure 4b.

In step 630, the management node uses the target policy to select, based on the received observed context and from a set of possible actions for the environment, an action for execution in the environment. The target policy evaluates possible actions for the environment using the obtained reward estimator. As illustrated at 630a, an action for execution in the environment may comprise at least one of: an allocation decision for at least one communication network resource; a configuration for at least one communication network node; a configuration for at least one communication network equipment; a configuration for at least one communication network operation; a decision relating to provision of communication network services for at least one wireless device; a configuration for an operation performed by at least one wireless device in relation to the communication network.

In step 640, the management node causes the selected action to be executed in the environment, for example by sending a suitable instruction to one or more communication network nodes and/or wireless devices in the environment, or by directly executing the action.

Figures 1 to 6 discussed above provide an overview of methods which may be performed according to different examples of the present disclosure. There now follows a detailed discussion of how different process steps illustrated in Figures 1 to 6 and discussed above may be implemented according to example training and management pipelines, illustrated in Figures 7 and 8 respectively. Referring to Figure 7, the methods 100, 300, 400 may be implemented as a training pipeline comprising the following steps:

1) Having obtained a training dataset (step 110, 310, 410), perform feature normalization on the input features considered for training. The input features are normalized to have zero mean and standard deviation equal to one.

2) Generate the propensity model p_z(·,·): X × A → [0,1] that estimates the reference policy, i.e. the probability of action a given the context x for the reference policy π_0 (steps 120, 320, 420). The model consists of a logistic regression model mapping each (context, action) pair to a selection probability.

3) Sample a batch B from the training (logging or reference policy) dataset D_π0 (step 335). The training is then executed in a batch learning setting, in which a new batch is sampled at each training step.

4) Create a reward estimation model r̂_w(·,·): X × A → R, comprising an Artificial Neural Network (ANN) with initial weight vector w_0 (steps 130, 330, 430), and calculate a loss using a loss function based on the difference between observed reward from the sampled batch of the training data and reward estimated by the reward estimation model (numerator of the MSE objective in step 7).

5) Set a value of the propensity impact parameter β according to the normalized input features (steps 140, 340, 440).

6) For the sampled batch B, compute a weighting for each loss value from step 4 (denominator of the MSE objective in step 7), the weighting comprising the output of the propensity model raised to the power of β.

7) Using the calculated losses from step 4 and weights from step 6, train the created reward estimation model (updating the weights from the values in the initial weight vector w_0) to minimize the following weighted MSE objective via backpropagation (steps 150, 350, 450):

L_B(w) = (1/|B|) Σ_{(x_i, a_i, r_i) ∈ B} (r_i − r̂_w(x_i, a_i))² / p_z(a_i | x_i)^β

A code sketch of this training pipeline is provided below, after the discussion of the impact parameter β.

8) Check for convergence of the learning algorithm: if the convergence criterion (e.g. a maximum number of training epochs) is not satisfied, return to step 3 to sample a new batch; otherwise stop (step 360).

9) Deliver the trained reward estimator model to a policy pipeline for computation of optimal actions (step 370).

As discussed above, the impact parameter β determines the impact of the propensity model output on the loss function, ranging from no impact at all (β = 0) to maximum impact (β = 1). As the impact parameter β increases, the variance of the reward estimation increases accordingly, and the bias of the reward estimation decreases. When β = 0, the policy with the trained reward estimation provides the low variance and high bias of Direct Method estimation. When β = 1, the trained policy provides lower bias while introducing higher variance.
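The following is a minimal sketch of the training pipeline above, assuming a PyTorch reward estimator that takes a normalized context vector and a one-hot action encoding, pre-computed propensity scores for the logged (context, action) pairs, and a fixed number of training steps as the convergence criterion. All class, function and argument names are illustrative assumptions rather than part of the disclosure.

```python
# Illustrative sketch of the training pipeline (steps 1-8). Assumes actions
# are encoded as integer indices 0..num_actions-1 and that propensity scores
# p(a_i | x_i) have already been computed for every logged record.
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardEstimator(nn.Module):
    """Step 4: ANN reward estimation model r_hat(x, a)."""
    def __init__(self, context_dim: int, num_actions: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(context_dim + num_actions, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, context, action_one_hot):
        return self.net(torch.cat([context, action_one_hot], dim=-1)).squeeze(-1)

def train_reward_estimator(contexts, actions, rewards, propensities,
                           num_actions, beta, batch_size=32, steps=500, lr=1e-3):
    # Step 1: normalise input features to zero mean and unit standard deviation.
    contexts = (contexts - contexts.mean(axis=0)) / (contexts.std(axis=0) + 1e-8)
    X = torch.tensor(contexts, dtype=torch.float32)
    A = F.one_hot(torch.tensor(actions), num_actions).float()
    R = torch.tensor(rewards, dtype=torch.float32)
    P = torch.tensor(propensities, dtype=torch.float32)

    model = RewardEstimator(X.shape[1], num_actions)
    optimiser = torch.optim.Adam(model.parameters(), lr=lr)

    for _ in range(steps):                                      # step 8: fixed-budget convergence criterion
        idx = torch.randperm(len(X))[:batch_size]               # step 3: sample a batch
        estimated = model(X[idx], A[idx])                       # step 4: estimated reward
        weights = P[idx] ** beta                                # step 6: propensity raised to the power of beta
        loss = torch.mean((R[idx] - estimated) ** 2 / weights)  # step 7: weighted MSE objective
        optimiser.zero_grad()
        loss.backward()                                         # backpropagation
        optimiser.step()
    return model

if __name__ == "__main__":
    # Smoke test on synthetic data (shapes only; not real network KPIs).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 35))
    A = rng.integers(0, 3, size=1000)
    R = rng.normal(size=1000)
    P = rng.uniform(0.1, 1.0, size=1000)
    estimator = train_reward_estimator(X, A, R, P, num_actions=3, beta=1.0)
```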

Referring now to Figure 8, the methods 200, 500, 600 may be implemented as a policy pipeline for a deterministic or stochastic policy. Figure 8 illustrates a policy pipeline for a deterministic policy comprising the following steps:

a. Receive an observed environment context comprising input features.

b. Normalize the input features from the received environment context.

c. Estimate the reward for each available action based on the reward model delivered at step 9) of the training pipeline.

d. Greedily execute the action having the maximum estimated reward.
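For illustration, the deterministic policy pipeline a. to d. may be sketched as follows, assuming a trained reward estimator of the kind sketched above, candidate actions encoded as one-hot vectors, and the normalisation statistics computed during training; the helper and its arguments are illustrative assumptions.

```python
# Illustrative sketch of the deterministic policy pipeline: normalise the
# received context, estimate the reward of every candidate action, and
# greedily select the action with the maximum estimated reward.
import torch

def select_action(model, context, num_actions, feature_mean, feature_std):
    context = (context - feature_mean) / (feature_std + 1e-8)        # b. normalise input features
    x = torch.tensor(context, dtype=torch.float32).repeat(num_actions, 1)
    candidate_actions = torch.eye(num_actions)                       # one one-hot row per candidate action
    with torch.no_grad():
        estimated_rewards = model(x, candidate_actions)              # c. estimate reward per action
    return int(torch.argmax(estimated_rewards))                      # d. greedy selection
```

The greedy selection in step d. corresponds to the deterministic selection function discussed above with reference to step 530bb; a stochastic policy would instead sample an action from a predicted probability distribution.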

As discussed above, the methods 100, 300 and 400 may be performed by a training node, and the present disclosure provides a training node that is adapted to perform any or all of the steps of the above discussed methods. The training node may be a physical or virtual node, and may for example comprise a virtualised function that is running in a cloud, edge cloud or fog deployment. The training node may for example comprise or be instantiated in any part of a logical core network node, network management centre, network operations centre, Radio Access node etc. Any such communication network node may itself be divided between several logical and/or physical functions, and any one or more parts of the training node may be instantiated in one or more logical or physical functions of a communication network node.

Figure 9 is a block diagram illustrating an example training node 900 which may implement the method 100, 300 and/or 400, as illustrated in Figures 1, 3a, 3b, 4a and 4b, according to examples of the present disclosure, for example on receipt of suitable instructions from a computer program 950. Referring to Figure 9, the training node 900 comprises a processor or processing circuitry 902, and may comprise a memory 904 and interfaces 906. The processing circuitry 902 is operable to perform some or all of the steps of the method 100, 300 and/or 400 as discussed above with reference to Figures 1, 3a, 3b, 4a and 4b. The memory 904 may contain instructions executable by the processing circuitry 902 such that the training node 900 is operable to perform some or all of the steps of the method 100, 300 and/or 400, as illustrated in Figures 1, 3a, 3b, 4a and 4b. The instructions may also include instructions for executing one or more telecommunications and/or data communications protocols. The instructions may be stored in the form of the computer program 950. In some examples, the processor or processing circuitry 902 may include one or more microprocessors or microcontrollers, as well as other digital hardware, which may include digital signal processors (DSPs), special-purpose digital logic, etc. The processor or processing circuitry 902 may be implemented by any type of integrated circuit, such as an Application Specific Integrated Circuit (ASIC), Field Programmable Gate Array (FPGA) etc. The memory 904 may include one or several types of memory suitable for the processor, such as read-only memory (ROM), random-access memory, cache memory, flash memory devices, optical storage devices, solid state disk, hard disk drive etc.

Figure 10 illustrates functional modules in another example of training node 1000 which may execute examples of the methods 100, 300 and/or 400 of the present disclosure, for example according to computer readable instructions received from a computer program. It will be understood that the modules illustrated in Figure 10 are functional modules, and may be realised in any appropriate combination of hardware and/or software. The modules may comprise one or more processors and may be integrated to any degree.

Referring to Figure 10, the training node 1000 is for improving the accuracy of a reward estimator for a target policy, wherein the target policy is for managing a communication network environment that is operable to perform a task. The training node 1000 comprises a receiving module 1002 for obtaining a training dataset comprising records of task performance by the environment during a period of management according to a reference policy, wherein each record of task performance comprises an observed context for the environment, an action selected for execution in the environment by the reference policy on the basis of the observed context, and a reward value indicating an observed impact of the selected action on task performance by the environment. The training node 1000 further comprises a learning module 1004 for generating, based on the training dataset, a propensity model that estimates the probability of selection by the reference policy of a particular action given a particular observed context, and for initiating the reward estimator, wherein the reward estimator comprises an ML model having a plurality of parameters, and wherein the reward estimator is operable to estimate reward value given a particular observed context and selected action. The training node 1000 also comprises a weighting module 1006 for setting a value of a propensity impact parameter according to a feature of at least the training dataset or the reference policy. The learning module 1004 is also for using the records of task performance in the training dataset to update the values of the reward estimator parameters so as to minimize a loss function, wherein the loss function is based on differences between observed reward from the training dataset and reward estimated by the reward estimator for given pairs of observed context and action selected by the reference policy, each difference weighted according to a function of the output of the propensity model for the given pair of observed context and action selected by the reference policy, such that as the estimated probability output by the propensity model decreases, the contribution to the loss function of the difference between observed and estimated reward increases. The learning module is for using the records of task performance in the training dataset to update the values of the ML model parameters so as to minimize the loss function by adjusting a magnitude of the weighting of each difference according to the impact parameter. The training node 1000 may further comprise interfaces 1008 which may be operable to facilitate communication with a management node, and/or with other communication network nodes over suitable communication channels.

As discussed above, the methods 200, 500 and 600 may be performed by a management node, and the present disclosure provides a management node that is adapted to perform any or all of the steps of the above discussed methods. The management node may be a physical or virtual node, and may for example comprise a virtualised function that is running in a cloud, edge cloud or fog deployment. The management node may for example comprise or be instantiated in any part of a logical core network node, network management centre, network operations centre, Radio Access node etc. Any such communication network node may itself be divided between several logical and/or physical functions, and any one or more parts of the management node may be instantiated in one or more logical or physical functions of a communication network node.

Figure 11 is a block diagram illustrating an example management node 1100 which may implement the method 200, 500 and/or 600, as illustrated in Figures 2, 5 and 6, according to examples of the present disclosure, for example on receipt of suitable instructions from a computer program 1150. Referring to Figure 11, the management node 1100 comprises a processor or processing circuitry 1102, and may comprise a memory 1104 and interfaces 1106. The processing circuitry 1102 is operable to perform some or all of the steps of the method 200, 500 and/or 600 as discussed above with reference to Figures 2, 5 and 6. The memory 1104 may contain instructions executable by the processing circuitry 1102 such that the management node 1100 is operable to perform some or all of the steps of the method 200, 500 and/or 600, as illustrated in Figures 2, 5 and 6. The instructions may also include instructions for executing one or more telecommunications and/or data communications protocols. The instructions may be stored in the form of the computer program 1150. In some examples, the processor or processing circuitry 1102 may include one or more microprocessors or microcontrollers, as well as other digital hardware, which may include digital signal processors (DSPs), special-purpose digital logic, etc. The processor or processing circuitry 1102 may be implemented by any type of integrated circuit, such as an Application Specific Integrated Circuit (ASIC), Field Programmable Gate Array (FPGA) etc. The memory 1104 may include one or several types of memory suitable for the processor, such as read-only memory (ROM), random-access memory, cache memory, flash memory devices, optical storage devices, solid state disk, hard disk drive etc.

Figure 12 illustrates functional modules in another example of management node 1200 which may execute examples of the methods 200, 500 and/or 600 of the present disclosure, for example according to computer readable instructions received from a computer program. It will be understood that the modules illustrated in Figure 12 are functional modules, and may be realised in any appropriate combination of hardware and/or software. The modules may comprise one or more processors and may be integrated to any degree.

Referring to Figure 12, the management node 1200 is for using a target policy to manage a communication network environment that is operable to perform a task. The management node comprises a receiving module 1202 for obtaining a reward estimator from a training node, wherein the reward estimator comprises an ML model that is operable to estimate a reward value given a particular observed context and selected action, and has been trained using a method according to the present disclosure. The receiving module 1202 is also for receiving an observed environment context from a communication network node. The management node further comprises a policy module 1204 for using the target policy to select, based on the received observed context and from a set of possible actions for the environment, an action for execution in the environment. The target policy evaluates possible actions for the environment using the obtained reward estimator. The management node further comprises an execution module 1206 for causing the selected action to be executed in the environment. The management node 1200 may further comprise interfaces 1208 which may be operable to facilitate communication with a training node and/or with other communication network nodes over suitable communication channels.

There now follows a discussion of some example use cases for the methods of the present disclosure, as well as description of implementation of the methods of the present disclosure for such example use cases. It will be appreciated that the use cases presented herein are not exhaustive, but are representative of the type of problem within a communication network which may be addressed using the methods presented herein.

Use case 1: Remote Electrical Tilt optimization

Modern cellular networks are required to satisfy consumer demand that is highly variable in both the spatial and the temporal domains. In order to provide a high level of Quality of Service (QoS) to User Equipments (UEs) efficiently, networks must adjust their configuration in an automatic and timely manner. Antenna vertical tilt angle, referred to as the downtilt angle, is one of the most important variables to control for QoS management. The downtilt angle can be modified both in a mechanical and an electronic manner, but owing to the cost associated with manually adjusting the downtilt angle, Remote Electrical Tilt (RET) optimisation is used in the vast majority of modern networks.

The antenna downtilt is defined as the elevation angle of the main lobe of the antenna radiation pattern with respect to the horizontal plane, as illustrated in Figure 13. Several Key Performance Indicators (KPIs) may be taken into consideration when evaluating the performance of a RET optimization strategy, including coverage (area covered in terms of a minimum received signal strength), capacity (average total throughput in a given area of interest), and quality. There exists a trade-off between coverage and capacity when determining an increase in antenna downtilt: increasing the downtilt angle correlates with a stronger signal in a more concentrated area, as well as higher capacity and reduced interference radiation towards other cells in the network. However, excessive downtilting can result in insufficient coverage in a given area, with some UEs unable to receive a minimum signal quality.

In the following discussion, the focus is on Capacity Coverage Optimization (CCO), which seeks to maximize the network capacity while ensuring that the targeted service areas remain covered.

The solution is based on the availability of a reference dataset D_π0 generated according to a rule-based expert reference policy π_0(a_i | x_i). The reference policy is assumed to be suboptimal and consequently improvable. The goal is to devise a target policy π_w(a_i | x_i) having a higher value.

For the purposes of the present use case, the following elements may be defined:

Environment: The physical 4G or 5G mobile cellular network area considered for RET optimisation. The network area may be divided into C sectors, each served by an antenna.

Context: A set of normalized KPIs collected in the area considered for the RET optimization. In one example, the context may be described by the vector s_t = [cov(t), cap(t), ϑ(t)] ∈ [0,1] × [0,1] × [0,90], where cov(t) is the coverage network metric, cap(t) is the capacity metric and ϑ(t) is the downtilt of an antenna at time t.

Action: A discrete unitary change in the current antenna tilt angle, a_t ∈ {−1, 0, 1}.

Reward: A measure of the context variation induced by the action a_t taken given the context x_t. The reward signal or function may be defined at the level of domain knowledge. One example of reward considers c_DOF and q_DOF, the capacity and coverage Degree Of Fire (DOF), measuring the degree of alarm perceived by the policy with respect to the capacity and coverage in the cell. For example, a reward may be envisaged having the following structure: r_t = r(x_t, a_t) = max{c_DOF,t+1, q_DOF,t+1} − max{c_DOF,t, q_DOF,t}.
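As a minimal illustration of this example reward structure, and assuming the capacity and coverage DOF values before and after the action are available as scalars, the reward may be computed as follows (the function and argument names are illustrative):

```python
# Illustrative sketch of the example RET reward: the change in the maximum
# of the capacity and coverage Degrees Of Fire induced by the tilt action.
def ret_reward(c_dof_before: float, q_dof_before: float,
               c_dof_after: float, q_dof_after: float) -> float:
    return max(c_dof_after, q_dof_after) - max(c_dof_before, q_dof_before)
```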

In the present example use case, the target policy π_w(a_i | x_i) is an ANN model parametrized by weight vector w and with an output softmax layer, taking as input the 2D context vector x_t and returning a probability distribution over all actions a_t ∈ {−1, 0, 1}, resulting in a stochastic policy. The reference dataset D_π0 is split into a training set (70%) and a test set (30%).
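By way of illustration, a stochastic tilt policy of this kind could be sketched as follows, assuming a small feed-forward network with a softmax output over the three tilt actions; the layer sizes and the sampling helper are illustrative assumptions.

```python
# Illustrative sketch of a stochastic tilt policy: an ANN with a softmax
# output layer returning a probability for each tilt action in {-1, 0, +1}.
import torch
import torch.nn as nn

class TiltPolicy(nn.Module):
    def __init__(self, context_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(context_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),                    # one logit per tilt action
        )

    def forward(self, context: torch.Tensor) -> torch.Tensor:
        return torch.softmax(self.net(context), dim=-1)

def sample_tilt_action(policy: TiltPolicy, context: torch.Tensor) -> int:
    tilt_actions = (-1, 0, 1)
    with torch.no_grad():
        probabilities = policy(context)
    index = int(torch.multinomial(probabilities, num_samples=1))
    return tilt_actions[index]
```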

Practical Implementation

An example policy is considered that is global and executes tilt changes on a cell-by-cell basis, meaning that the same policy is executed independently at each cell. For the data collection step, a central node may aggregate all the data coming from the different base stations. However, for the policy training step (offline) and the policy execution step (online in the network) the present implementation envisages independent execution, and consequently no central coordination. It will be appreciated that other implementations may consider centralized coordination.

Evaluation methodology and experimental results

Empirical performance of methods according to the present disclosure was evaluated for the present use case using real-world cellular network data. Figure 14 illustrates performance of a method according to the present disclosure (agent_prop with β = 1) and performance of an agent using the Direct Method (DM) estimator (agent) on a test dataset. In the illustrated evaluation, the IPS estimated value is used as a test performance metric for the two agents that are trained with the training dataset. The ANN model of the reward estimator has the following structure: 35 input features and 3 outputs (the estimated reward for each action), with a batch size of 32. The ANN has 2 hidden layers of size [256, 256]. It can be observed from Figure 14 that the method according to the present disclosure provides better performance than the DM method, with a performance gain of up to 20%.

Figure 15 illustrates comparative performance of a method according to the present disclosure and an RL agent using a prior art method, when the agent only executes an uptilt or downtilt action. Figure 15 also illustrates performance of a Self Organizing Networks solution (SON_RET). It can be seen from Figure 15 that the best performing agent is agent_prop_(35,256,32), for which the number of input features is 35, and the policy is trained according to an example of the present disclosure. The gain over the SON_RET solution tracks at over 100% performance increase.

According to one example of the present disclosure, there is provided a computer implemented method for improving the accuracy of a reward estimator for a target policy, wherein the target policy is for managing Remote Electronic Tilt (RET) in at least a sector of a cell of a communication network, which cell sector is operable to provide Radio Access Network (RAN) services for the communication network, the method, performed by a training node, comprising: obtaining a training dataset comprising records of RAN service provision performance by the cell sector during a period of RET management according to a reference policy; wherein each record of RAN service provision performance comprises an observed context for the cell sector, an action selected for execution in the cell sector by the reference policy on the basis of the observed context, and a reward value indicating an observed impact of the selected action on RAN service provision performance by the cell sector; wherein an observed cell sector context in the training dataset comprises at least one of: a coverage parameter for the sector; a capacity parameter for the sector; a signal quality parameter for the sector; a down tilt angle of the antenna serving the sector; and wherein an action for execution in the cell sector comprises a downtilt adjustment value for an antenna serving the sector; the method further comprising: generating, based on the training dataset, a propensity model that estimates the probability of selection by the reference policy of a particular action given a particular observed context; initiating the reward estimator, wherein the reward estimator comprises a Machine Learning (ML) model having a plurality of parameters, and wherein the reward estimator is operable to estimate reward value given a particular observed context and selected action; setting a value of a propensity impact parameter according to a feature of at least the training dataset or the reference policy; and using the records of RAN service provision performance in the training dataset to update the values of the reward estimator parameters so as to minimize a loss function; wherein the loss function is based on differences between observed reward from the training dataset and reward estimated by the reward estimator for given pairs of observed context and action selected by the reference policy, each difference weighted according to a function of the output of the propensity model for the given pair of observed context and action selected by the reference policy, such that as the estimated probability output by the propensity model decreases, the contribution to the loss function of the difference between observed and estimated reward increases; and wherein using the records of RAN service provision performance in the training dataset to update the values of the ML model parameters so as to minimize the loss function comprises: adjusting a magnitude of the weighting of each difference according to the impact parameter.

According to another aspect of the present disclosure, there is provided a computer implemented method for using a target policy to manage Remote Electronic Tilt (RET) in at least a sector of a cell of a communication network, which cell sector is operable to provide Radio Access Network (RAN) services for the communication network, the method, performed by a management node, comprising: obtaining a reward estimator from a training node, wherein the reward estimator comprises a Machine Learning (ML) model that is operable to estimate a reward value given a particular observed context and selected action, and has been trained using a method according to the present disclosure; receiving an observed cell sector context from a communication network node; wherein the observed environment context received from the communication network node comprises at least one of: a coverage parameter for the sector; a capacity parameter for the sector; a signal quality parameter for the sector; a down tilt angle of the antenna serving the sector; and wherein an action for execution in the environment comprises a downtilt adjustment value for an antenna serving the sector; the method further comprising: using the target policy to select, based on the received observed context and from a set of possible actions for the cell sector, an action for execution in the cell sector; and causing the selected action to be executed in the cell sector; wherein the target policy evaluates possible actions for the cell sector using the obtained reward estimator.

Use Case 1bis: Base Station Parameter optimization

It will be appreciated that RET is merely one of many operational parameters for communication network cells. For example, a radio access node, such as a base station, serving a communication network cell may adjust its transmit power, required Uplink power, sector shape, etc, so as to optimise some measure of cell performance, which may be represented by a combination of cell KPIs. The methods and nodes of the present disclosure may be used to manage any operational parameter for a communication network cell.

Use Case 2: Dynamic Resource Allocation

In many communication networks, a plurality of services may compete over resources in a shared environment such as a Cloud. The services can have different requirements, and their performance may be indicated by their specific QoS KPIs. Additional KPIs that can be similar across services can also include time consumption, cost, carbon footprint, etc. The shared environment may also have a list of resources that can be partially or fully allocated to services. These resources can include CPU, memory, storage, network bandwidth, Virtual Machines (VMs), Virtual Network Functions (VNFs), etc.

For the purposes of the present use case, the following elements may be defined:

Environment: The cloud, edge cloud or other shared resource platform over which services are provided, and within which the performance of the various services with their current allocated resources may be monitored.

Context: A set of normalized KPIs for the services deployed on the shared resource of the environment.

Action: An allocation or change in allocation of a resource to a service.

Reward: A measure of the context variation induced by an executed action given the context. This may comprise a function or combination of KPIs for the services.

Examples of the present disclosure thus provide an off-policy value estimator family and corresponding learning method that offer increased flexibility in how reward is estimated using off-policy data. This flexibility is provided by the tunable propensity impact parameter β, which may be set according to the characteristics of the training dataset, and provides potential for improved bias-variance behavior of the reward estimator. Experimental evaluation has shown the effectiveness of methods according to the present disclosure using real-world network data for the RET optimization use case. The effectiveness was particularly marked when using a larger number of input features for the policy model. Examples of the present disclosure overcome limitations of the IPS and DM risk estimator methods by combining aspects of each method in a flexible manner. The improved accuracy with respect to standard IPS and DM off-policy learning techniques is provided by controlling the extent of the impact of the propensity model on the DM reward. The performance gain achieved by examples of the present disclosure increases with an increasing number of input features, as existing methods for training with a large number of input features suffer from bias. As off-policy training methods, examples of the present disclosure allow the development of an improved policy in an offline, and hence safe, manner.

The methods of the present disclosure may be implemented in hardware, or as software modules running on one or more processors. The methods may also be carried out according to the instructions of a computer program, and the present disclosure also provides a computer readable medium having stored thereon a program for carrying out any of the methods described herein. A computer program embodying the disclosure may be stored on a computer readable medium, or it could, for example, be in the form of a signal such as a downloadable data signal provided from an Internet website, or it could be in any other form.

It should be noted that the above-mentioned examples illustrate rather than limit the disclosure, and that those skilled in the art will be able to design many alternative embodiments without departing from the scope of the appended claims or numbered embodiments. The word "comprising” does not exclude the presence of elements or steps other than those listed in a claim or embodiment, "a” or "an” does not exclude a plurality, and a single processor or other unit may fulfil the functions of several units recited in the claims or numbered embodiments. Any reference signs in the claims or numbered embodiments shall not be construed so as to limit their scope.