

Title:
DYNAMIC REINFORCEMENT LEARNING
Document Type and Number:
WIPO Patent Application WO/2023/036455
Kind Code:
A1
Abstract:
A method (400) for dynamic RL. The method includes using an RL algorithm to select a first action and triggering performance of the selected first action. The method also includes, after the first action is performed, obtaining a first reward value (R1) associated with the first action. The method also includes using R1 and/or a performance indicator (PI) to determine whether an algorithm modification condition is satisfied. The method further includes, as a result of determining that the algorithm modification condition is satisfied, modifying the RL algorithm to produce a modified RL algorithm. In this way, the RL algorithm adapts to changes in the environment.

Inventors:
LI JINGYA (SE)
QI ZHIQIANG (CN)
LIN XINGQIN (US)
ARONSSON ANDERS (SE)
ZHANG HONGYI (SE)
BOSCH JAN (SE)
HOLMSTRÖM OLSSON HELENA (SE)
Application Number:
PCT/EP2021/086922
Publication Date:
March 16, 2023
Filing Date:
December 21, 2021
Assignee:
ERICSSON TELEFON AB L M (SE)
International Classes:
G06N3/08; G06N7/00
Other References:
DOS SANTOS MIGNON ALEXANDRE ET AL: "An Adaptive Implementation of e-Greedy in Reinforcement Learning", PROCEDIA COMPUTER SCIENCE, ELSEVIER, AMSTERDAM, NL, vol. 109, 12 June 2017 (2017-06-12), pages 1146 - 1151, XP085065595, ISSN: 1877-0509, DOI: 10.1016/J.PROCS.2017.05.431
XU ZHI-XIONG ET AL: "Reward-Based Exploration: Adaptive Control for Deep Reinforcement Learning", IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, vol. E101.D, no. 9, 9 September 2018 (2018-09-09), JP, pages 2409 - 2412, XP055919451, ISSN: 0916-8532, DOI: 10.1587/transinf.2018EDL8011
LIU XIAOMING ET AL: "Deep Reinforcement Learning via Past-Success Directed Exploration", PROCEEDINGS OF THE AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, vol. 33, 1 January 2019 (2019-01-01), pages 9979 - 9980, XP055919952, ISSN: 2159-5399, DOI: 10.1609/aaai.v33i01.33019979
SUSAN AMIN ET AL: "A Survey of Exploration Methods in Reinforcement Learning", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 1 September 2021 (2021-09-01), XP091050083
Attorney, Agent or Firm:
ERICSSON AB (SE)
Claims:
CLAIMS

1. A method (400) for dynamic reinforcement learning, RL, the method comprising: using (s402) an RL algorithm to select a first action; triggering (s404) performance of the selected first action; after the first action is performed, obtaining (s406) a first reward value, R1, associated with the first action; using (s408) R1 and/or a performance indicator, PI, to determine whether an algorithm modification condition is satisfied; as a result of determining that the algorithm modification condition is satisfied, modifying (s410) the RL algorithm to produce a modified RL algorithm.

2. The method of claim 1, wherein modifying the RL algorithm to produce the modified RL algorithm comprises modifying a parameter of the RL algorithm.

3. The method of claim 2, wherein modifying a parameter of the RL algorithm comprises modifying one or more of: an exploration probability of the RL algorithm, a learning rate of the RL algorithm, a discount factor of the RL algorithm, a replay memory capacity of the RL algorithm.

4. The method of claim 3, wherein using the RL algorithm to select the first action comprises selecting the first action based on the exploration probability, and modifying the RL algorithm to produce the modified RL algorithm comprises modifying the exploration probability.

5. The method of any one of claims 1-4, wherein using R1 and/or PI to determine whether the algorithm modification condition is satisfied comprises one or more of: comparing R1 to a first threshold, comparing ΔR to a second threshold, wherein ΔR is a difference between R1 and a reward value associated with a second action selected using the RL algorithm, or comparing the PI to a third threshold.

6. The method of any one of claims 1-4, further comprising: before using the RL algorithm to select the first action and obtaining R1, using the RL algorithm to select a second action; triggering performance of the selected second action; and after the second action is performed, obtaining a second reward value, R2, associated with the second action, wherein using R1 and/or PI to determine whether the algorithm modification condition is satisfied comprises performing a decision process comprising: calculating ΔR = R2 - R1; and determining whether ΔR is greater than a drop threshold.

7. The method of claim 6, wherein using the RL algorithm to select the first action comprises selecting the first action based on an exploration probability, ε, the algorithm modification condition is satisfied when ΔR is greater than the drop threshold, and modifying the RL algorithm as a result of determining that the algorithm modification condition is satisfied comprises generating a new exploration probability, εnew, for the RL algorithm, wherein εnew equals εRestart, where εRestart is a predetermined exploration probability.

8. The method of claim 6, wherein the decision process further comprises, as a result of determining that ΔR is not greater than the drop threshold, then determining whether R1 is less than a lower reward threshold.

9. The method of claim 6, wherein the decision process further comprises, as a result of determining that ΔR is not greater than the drop threshold, then determining whether R1 is greater than an upper reward threshold.

10. The method of claim 9, wherein the algorithm modification condition is satisfied when ΔR is not greater than the drop threshold and Rl is greater than the upper reward threshold, and modifying the RL algorithm as a result of determining that the algorithm modification condition is satisfied comprises generating a new exploration probability, εnew, for the RL algorithm, wherein εnew equals εend, where εend is a predetermined ending exploration probability.

11. The method of claim 9, wherein the algorithm modification condition is satisfied when ΔR is not greater than the drop threshold and R1 is not greater than the upper reward threshold, and modifying the RL algorithm as a result of determining that the algorithm modification condition is satisfied comprises generating a new exploration probability, εnew, for the RL algorithm, wherein εnew equals (ε × c), where c is a predetermined constant.

12. The method of any one of claims 1-11, further comprising, prior to using the RL algorithm to select the first action: using the RL algorithm to select K-1 actions, where K > 1; triggering the performance of each one of the K-1 actions; and for each one of the K-1 actions, obtaining a reward value associated with the action.

13. The method of claim 12, wherein using R1 and/or PI to determine whether the algorithm modification condition is satisfied comprises: using R1 and said K-1 reward values to generate a reward value that is a function of these K reward values; and comparing the generated reward value to a threshold.

14. The method of claim 12, wherein using R1 and/or PI to determine whether the algorithm modification condition is satisfied comprises: using R1 and said K-1 reward values to generate a reward value that is a function of these K reward values; and comparing ΔR to a threshold, wherein ΔR is a difference between the generated reward value and a previously generated reward value.

15. The method of claim 13 or 14, wherein the generated reward value is: a sum of said K reward values, a weighted sum of said K reward values, a weighted sum of a subset of said K reward values, a mean of said K reward values, a mean of a subset of said K reward values, a median of said K reward values, or a median of a subset of said K reward values.

16. The method of any one of claims 12-15, where the value of K is determined based on a correlation time of the environment and/or application requirements.

17. The method of any one of claims 12-15, where the value of K is determined based on a maximum allowed service interruption time.

18. The method of any one of claims 1-17, wherein one or more of the recited thresholds is dynamically changed based on environment changes and/or service requirement changes.

19. The method of any one of claims 1-18, further comprising: using the modified RL algorithm to select another action; and triggering performance of the another action.

20. A computer program (543) comprising instructions (544) which when executed by processing circuitry (502) of an agent (201) causes the agent (201) to perform the method of any one of claims 1-19.

21. A carrier containing the computer program of claim 20, wherein the carrier is one of an electronic signal, an optical signal, a radio signal, and a computer readable storage medium (542).

22. A reinforcement learning, RL, agent (201), the RL agent being configured to: use (s402) an RL algorithm to select a first action; trigger (s404) performance of the selected first action; after the first action is performed, obtain (s406) a first reward value, R1, associated with the first action; use (s408) R1 and/or a performance indicator, PI, to determine whether an algorithm modification condition is satisfied; as a result of determining that the algorithm modification condition is satisfied, modify (s410) the RL algorithm to produce a modified RL algorithm.

23. A reinforcement learning, RL, agent (201), the RL agent comprising: processing circuitry (502); and a memory (542), the memory containing instructions (544) executable by the processing circuitry, wherein the RL agent is configured to perform a process comprising: using (s402) an RL algorithm to select a first action; triggering (s404) performance of the selected first action; after the first action is performed, obtaining (s406) a first reward value, R1, associated with the first action; using (s408) R1 and/or a performance indicator, PI, to determine whether an algorithm modification condition is satisfied; as a result of determining that the algorithm modification condition is satisfied, modifying (s410) the RL algorithm to produce a modified RL algorithm.

24. The RL agent of claim 22 or 23, wherein the RL agent is further configured to perform the method of any one of claims 2-19.

Description:
DYNAMIC REINFORCEMENT LEARNING

TECHNICAL FIELD

[001] This disclosure relates to reinforcement learning.

BACKGROUND

[002] Reinforcement Learning (RL) is a type of machine learning (ML) that enables an agent to learn by trial and error using feedback based on the actions that the agent triggers. RL has made remarkable progress in recent years and is now used in many applications, including real-time network management, simulations, games, etc. RL differs from the commonly used supervised and unsupervised ML approaches. Supervised ML requires a training data set with annotations provided by an external supervisor, and unsupervised ML is typically a process of determining an implicit structure in a data set without annotations.

[003] The concept of RL is straightforward: an RL agent is reinforced to make better decisions based on its past learning experience. This is similar to the performance-based rewards that we encounter in everyday life. Typically, the RL agent implements an algorithm that obtains information about the current state of a system (a.k.a. the "environment"), selects an action, triggers performance of the action, and then receives a "reward," the value of which depends on the extent to which the action produced a desired outcome. This process repeats continually, and eventually the RL agent learns, based on the reward feedback, the best action to select given the current state of the environment.

[004] Although a designer sets the reward policy, that is, the rules of the game, the designer typically gives the RL agent no hints or suggestions as to which actions are best for any given state of the environment. It is up to the RL agent to figure out which action is best to maximize the reward, starting from totally random trials and finishing with sophisticated tactics. By leveraging the power of search and many trials, RL is an effective way to accomplish a task. In contrast to human beings, an RL agent can gather experience from thousands of parallel gameplays if a reinforcement learning algorithm is run on a sufficiently powerful computer infrastructure.

[005] Q-Learning:

[006] Q-learning is a reinforcement learning algorithm that learns the value of an action in a particular state (see, e.g., reference [2]). Q-learning does not require a model of the environment, and theoretically, it can find an optimal policy that maximizes the expected value of the total reward for any given finite Markov decision process. The Q-algorithm is used to find the optimal action-selection policy: Q: S × A → ℝ (Eq. 1).

[007] FIG. 1 illustrates the basic flow of the Q-learning algorithm. Before learning begins, Q is initialized to a possibly arbitrary value. Then, at each time t, the agent selects an action a_t, observes a reward r_t, and enters a new state s_{t+1} (which may depend on both the previous state s_t and the selected action), and Q is updated using the following equation:

Q(s_t, a_t) ← Q(s_t, a_t) + α · (r_t + γ · max_a Q(s_{t+1}, a) - Q(s_t, a_t)),

where α is the learning rate, with 0 < α < 1, which determines to what extent newly acquired information overrides the old information, and γ is a discount factor, with 0 < γ < 1, which determines the importance of future rewards.
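For illustration, a minimal Python sketch of this tabular update is given below; the state and action indices and the numeric values are assumptions chosen only to show the arithmetic.

import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    # One tabular Q-learning step: move Q(s, a) toward r + gamma * max_a' Q(s_next, a').
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Toy usage: 3 states, 2 actions, Q initialized to zero (an arbitrary initial value).
Q = np.zeros((3, 2))
q_update(Q, s=0, a=1, r=1.0, s_next=2)  # Q[0, 1] becomes 0.1 * (1.0 + 0.9 * 0 - 0) = 0.1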

[008] Deep Q-Learning:

[009] A simple way of implementing the Q-learning algorithm is to store the Q matrix in tables. However, this can be infeasible or inefficient when the number of states or actions becomes large. In this case, function approximation can be used to represent Q, which makes Q-learning applicable to large problems. One solution is to use deep learning for function approximation. Deep learning models consist of several layers of neural networks, which are in principle responsible for performing more sophisticated tasks like nonlinear function approximation of Q.

[0010] Deep Q-learning is a combination of convolutional neural networks with the Q-learning algorithm. It uses a deep neural network with weights θ to achieve an approximated representation of Q. In addition, to improve the stability of the deep Q-learning algorithm, a method called experience replay was proposed to remove correlations between samples by using a random sample from prior actions instead of the most recent action to proceed (see, e.g., reference [3]). The deep Q-learning algorithm with experience replay proposed in reference [3] is shown in the table below. After performing experience replay, the agent selects and executes an action according to an ε-greedy policy, where ε defines the probability that the agent performs a random action.
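The full algorithm of reference [3] is not reproduced here; the following Python sketch only illustrates the replay buffer and the target computation that experience replay relies on. The Q-network is left abstract (any callable mapping a state to a vector of action values), and all names are illustrative assumptions.

import random
from collections import deque
import numpy as np

class ReplayBuffer:
    # Fixed-capacity memory D of (s, a, r, s_next, done) transitions.
    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)

    def store(self, transition):
        self.memory.append(transition)

    def sample(self, batch_size):
        # Uniformly random mini-batch, which de-correlates consecutive samples.
        return random.sample(list(self.memory), min(batch_size, len(self.memory)))

def td_targets(batch, q_target, gamma=0.99):
    # y_j = r_j if s_{j+1} is terminal, else r_j + gamma * max_a Q_target(s_{j+1}, a).
    return np.array([r if done else r + gamma * np.max(q_target(s_next))
                     for (s, a, r, s_next, done) in batch])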

SUMMARY

[0011] As noted above, reinforcement learning has been successfully used in many use cases (e.g., cart-pole problem solving, robot locomotion, Atari games, Go games, etc.) where the RL agent is dealing with a relatively static environment (the set of states does not change), and it is possible to obtain all possible environment states, which is known as "full observability."

[0012] Theoretically, RL algorithms can also cope with a dynamically changing environment if sufficient data can be collected to abstract the changing environment and there is sufficient time for training and trials. These requirements, however, can be difficult to meet in practice because large-scale data collection can be complex, costly, and time consuming, or even infeasible. In many cases, it is not possible to have full observability of the dynamic environment, e.g., when quick decisions need to be taken, or when it is difficult or infeasible to collect data for some features. One example is a public safety scenario, where an unmanned aerial vehicle (UAV) (a.k.a. drone) carrying a base station ("UAV-BS") needs to be deployed quickly in a disaster area to provide wireless connectivity for mission-critical users. It is important to adapt the UAV-BS's configuration and location to the real-time mission-critical traffic situation. For instance, when the mission-critical users move on the ground and/or when more first responders join the mission-critical operation in the disaster area, the UAV-BS should quickly adapt its location and configuration to maintain service continuity in this changing environment.

[0013] This disclosure aims at mitigating the above problem. Accordingly, in one aspect there is provided a method for dynamic RL. The method includes using an RL algorithm to select a first action and triggering performance of the selected first action. The method also includes, after the first action is performed, obtaining a first reward value (R1) associated with the first action. The method also includes using R1 and/or a performance indicator (PI) to determine whether an algorithm modification condition is satisfied. The method further includes, as a result of determining that the algorithm modification condition is satisfied, modifying the RL algorithm to produce a modified RL algorithm. In this way, the RL algorithm adapts to changes in the environment.

[0014] In another aspect there is provided a computer program comprising instructions which, when executed by processing circuitry of an RL agent, causes the RL agent to perform any of the methods disclosed herein. In one embodiment, there is provided a carrier containing the computer program, wherein the carrier is one of an electronic signal, an optical signal, a radio signal, and a computer readable storage medium.

[0015] In another aspect there is provided an RL agent node that is configured to use an RL algorithm to select a first action and trigger performance of the selected first action. The RL agent is also configured to, after the first action is performed, obtain a first reward value (Rl) associated with the first action. The RL agent is also configured to use Rl and/or a performance indicator (PI) to determine whether an algorithm modification condition is satisfied. The RL agent is also configured to, as a result of determining that the algorithm modification condition is satisfied, modify the RL algorithm to produce a modified RL algorithm. In some embodiments, the RL agent comprises memory and processing circuitry coupled to the memory, wherein the memory contains instructions executable by the processing circuitry to configure the RL agent to perform the methods/processes disclosed herein.

[0016] An advantage of the embodiments disclosed herein is that they provide an adaptive RL agent that is able to operate well in a dynamic environment with limited observability of the environment and/or state sets that change over time. That is, embodiments can handle complex system optimization and decision-making problems in a dynamic environment with limited environment observability and a dynamic state space. Compared to a conventional non-adaptive RL agent, an RL agent according to the embodiments disclosed herein can respond to changes in the environment and update its RL algorithm to achieve an acceptable level of service quality. In addition, conventional RL agents need to retrain their RL algorithm completely from scratch when entering a different environment, whereas the embodiments can reuse part of the previously learned experience, with adjusted algorithm parameters, to provide proper and timely decisions in subsequent changing environments.

BRIEF DESCRIPTION OF THE DRAWINGS

[0017] The accompanying drawings, which are incorporated herein and form part of the specification, illustrate various embodiments.

[0018] FIG. 1 illustrates the basic flow of the Q-learning algorithm.

[0019] FIG. 2 illustrates an RL system according to some embodiments.

[0020] FIG. 3 illustrates a UAV-BS serving a mission-critical UE.

[0021] FIG. 4 is a flowchart illustrating a process according to some embodiments.

[0022] FIG. 5 is a block diagram of an RL agent according to some embodiments.

DETAILED DESCRIPTION

[0023] FIG. 2 illustrates an RL system 200 according to some embodiments. The RL system consists of an RL agent 201 (or "agent" for short), a set of environment states S, and a set of actions A per state. By performing an action in the environment 202, the environment 202 may transition from one state to another state, and the agent 201 receives an immediate "reward" after taking this action. More formally, at a given time t, the agent 201 obtains the current state s_t of the environment 202 and then selects an action a_t from the set of available actions associated with the current state, which action is subsequently performed. After the action is performed, the environment 202 moves to a new state s_{t+1} and the agent receives a reward r_t associated with the transition (s_t, a_t, s_{t+1}). The goal of the agent 201 is to learn a policy that maximizes the expected cumulative reward. The policy may be a map or a table that gives the probability of taking an action a when in a state s.
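A minimal Python sketch of this interaction loop is shown below; the environment object and its reset()/step() methods are assumed interfaces used only for illustration, and the policy is the state-to-action-probability table mentioned above.

import random

def run_episode(env, policy, num_steps=100):
    # policy[s] maps each available action a to the probability of taking a in state s.
    s = env.reset()
    cumulative_reward = 0.0
    for _ in range(num_steps):
        actions = list(policy[s].keys())
        probs = list(policy[s].values())
        a = random.choices(actions, weights=probs)[0]  # the agent selects a_t for state s_t
        s, r = env.step(a)                             # the environment moves to s_{t+1} and returns r_t
        cumulative_reward += r
    return cumulative_reward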

[0024] Agent 201 is configured to adapt the RL algorithm that it employs to select the actions. This enables, among other things, fast decision making in a dynamic environment with limited and/or changing state sets over time. The agent 201, in one embodiment, performs the following steps: 1) the agent 201 monitors a first set of one or more parameters; 2) based on the monitored parameter(s), the agent 201 adjusts the RL algorithm (e.g., adjusts a second set of one or more parameters) to adapt the RL algorithm to the new environment; and 3) the agent 201 selects an action using the modified RL algorithm, as sketched below.
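For illustration only, these three steps could be organized as in the following Python sketch; the concrete condition and modification rules are supplied by the caller, since the specific monitored parameters and triggers are described in the embodiments below.

def dynamic_rl_loop(env, select_action, modification_condition, modify_algorithm, num_steps=1000):
    # select_action, modification_condition and modify_algorithm are caller-supplied callables (assumptions).
    rewards = []                                # step 1: the monitored first set of parameters
    state = env.reset()
    for _ in range(num_steps):
        action = select_action(state)           # step 3: select using the (possibly modified) RL algorithm
        state, reward = env.step(action)        # trigger performance, observe the new state and reward
        rewards.append(reward)
        if modification_condition(rewards):     # step 2: adjust the RL algorithm when triggered
            modify_algorithm()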

[0025] In one embodiment, the first set of parameters includes at least one or a combination of the following: 1) the received immediate reward r_t at a given time t; 2) an accumulated reward during a time window, i.e., from time i to time j; and 3) a performance indicator (e.g., a key performance indicator (KPI)). For the UAV-BS in the public safety scenario described above, examples of KPIs include: the drop rate of mission-critical users; the worst mission-critical user throughput; the wireless backhaul link quality; etc.

[0026] With respect to the accumulated reward, in some embodiments the time window is decided based on: i) the correlation time (changing scale) of the environment and/or ii) application requirements, e.g., the maximum allowed service interruption time. In other embodiments, the time window is the time duration from the beginning until now.
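As a small illustration, the accumulated reward over a window and one possible way to derive the window length are sketched below in Python; the window-length rule is an assumption, since the disclosure only states that the correlation time and/or application requirements are taken into account.

def accumulated_reward(rewards, i, j):
    # Accumulated reward from time i to time j (inclusive).
    return sum(rewards[i:j + 1])

def window_length(correlation_time_s, max_interruption_s, decision_period_s):
    # Illustrative choice: keep the window no longer than either limit (assumption, not from the source).
    return max(1, int(min(correlation_time_s, max_interruption_s) / decision_period_s))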

[0027] Dynamic changes in the environment (e.g., user equipment (UE) movements, UAV movements, and/or changes in the backhaul connection link of the UAV-BS in the public safety scenario) can result in a change of the value(s) of one or a combination of the first set of parameters. By detecting/observing such changes, the agent 201 can automatically adapt the RL algorithm to fit the new environment.

[0028] The triggering event for adjusting the second set of parameters at a given time t can be at least one or a combination of the following:

• The immediate reward r t is less than a lower-bound threshold.

• The immediate reward r t is greater than an upper-bound threshold.

• The difference between the immediate reward at time t, r_t, and the immediate reward at the previous time instance, r_{t-1}, is larger than a pre-defined threshold.

• The accumulated reward is less than a lower-bound threshold.

• The accumulated reward is greater than an upper-bound threshold.

• The difference between the accumulated reward in the current time window [i, j] and the accumulated reward in the previous time window is larger than a pre-defined threshold.

• A key performance parameter is less than a lower-bound threshold.

• A key performance parameter is greater than an upper-bound threshold.

• In all of the above events, the thresholds can either be pre-defined or dynamically changed based on changing service requirements and/or a changing environment. A minimal sketch of such a trigger check is given below.
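The following Python sketch covers a subset of the listed triggering events; the threshold names and the dict-based interface are illustrative assumptions.

def modification_triggered(rewards, window, kpi, thresholds):
    # rewards: list of immediate rewards so far; thresholds: plain dict holding the example thresholds.
    r_t = rewards[-1]
    acc = sum(rewards[-window:])                              # accumulated reward in the current window
    drop = (rewards[-2] - r_t) if len(rewards) >= 2 else 0.0  # drop relative to the previous time instance
    return (r_t < thresholds["reward_low"]
            or r_t > thresholds["reward_high"]
            or drop > thresholds["drop"]
            or acc < thresholds["acc_low"]
            or kpi < thresholds["kpi_low"])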

[0029] In one embodiment, the second set of parameters consists of algorithm-related parameters. The second set of parameters can include at least one or a combination of the following: i) the exploration probability ε; ii) the learning rate α; iii) the discount factor γ; and iv) the replay memory capacity N.

[0030] In one example, the exploration probability ε can be increased to a certain value when an event (e.g., the immediate reward has dropped below a threshold) has triggered the update of the algorithm. In another example, the exploration probability ε can be reduced to a certain value when another event (e.g., the accumulated reward has reached an upper-bound threshold) has triggered the update of the algorithm.

[0031] In one example, the learning rate α can be increased to a certain value when an event (e.g., the immediate reward has dropped below a threshold) has triggered the update of the algorithm. In another example, the learning rate α can be reduced to a certain value when another event (e.g., the accumulated reward has reached an upper-bound threshold) has triggered the update of the algorithm.

[0032] In one example, the discount factor γ can be increased to a certain value when an event (e.g., the immediate reward has dropped below a threshold) has triggered the update of the algorithm. In another example, the discount factor γ can be reduced to a certain value when another event (e.g., the accumulated reward has reached an upper-bound threshold) has triggered the update of the algorithm.

[0033] In one example, the replay memory capacity N can be increased to a certain value when an event (e.g., the immediate reward has dropped below a threshold) has triggered the update of the algorithm. In another example, the replay memory capacity N can be reduced to a certain value when another event (e.g., the accumulated reward has reached an upper-bound threshold) has triggered the update of the algorithm.
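The example adjustments in paragraphs [0031]-[0033] might look as follows in Python; the concrete scaling factors and bounds are assumptions for illustration only.

def adjust_algorithm_parameters(params, reward_dropped, reached_upper_bound):
    # params holds part of the second set of parameters: learning rate, discount factor, replay capacity.
    if reward_dropped:                 # e.g., the immediate reward dropped below a threshold
        params["alpha"] = min(1.0, params["alpha"] * 2.0)            # learn faster in the new environment
        params["gamma"] = min(0.99, params["gamma"] + 0.05)          # weigh future rewards more
        params["replay_capacity"] = params["replay_capacity"] * 2    # remember more transitions
    elif reached_upper_bound:          # e.g., the accumulated reward reached an upper-bound threshold
        params["alpha"] = max(1e-4, params["alpha"] * 0.5)           # exploit what has been learned
        params["gamma"] = max(0.5, params["gamma"] - 0.05)
        params["replay_capacity"] = max(1, params["replay_capacity"] // 2)
    return params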

[0034] The table below shows pseudo-code for a dynamic reinforcement learning process that is performed by agent 201 in one embodiment.

Algorithm 2: Dynamic Reinforcement Learning with reward value monitoring and adaptive exploration probability

Initialize replay memory D to capacity N
Initialize action-value function Q with two random sets of weights θ, θ'
Initialize exploration probability ε to 1
Initialize restarting exploration probability εRestart to 0.1
Initialize ending exploration probability εEnd to 0.001
Initialize r_previous equal to 0
for Iteration = 1, M do
    for t = 1, K do
        Select a random action a_t with probability ε
        Otherwise, select a_t = argmax_a Q(s_t, a; θ)
        Execute action a_t, collect reward r_t
        Observe next state s_{t+1}
        Store the transition (s_t, a_t, r_t, s_{t+1}) in D
        Sample a mini-batch of transitions (s_j, a_j, r_j, s_{j+1}) from D
        if s_{j+1} is terminal then
            Set y_j = r_j
        else
            Set y_j = r_j + γ · max_a' Q(s_{j+1}, a'; θ')
        end if
        Perform a gradient descent step using targets y_j with respect to the online parameters θ
        Set θ' ← θ
    end for
    if r_previous - r_K > Drop threshold then
        ε = εRestart
    else if r_K > Upper reward threshold then
        ε = εEnd
    else
        ε = ε × 0.995
    end if
    Set r_previous = r_K
end for

[0035] As seen from the above pseudo-code, the exploration probability ε is adjusted when there is a reward value drop greater than a threshold (a.k.a. the "Drop threshold"). Following the completion of each learning iteration, the last reward value r_K is checked and compared to a pre-defined performance-drop-tolerance threshold and an upper reward threshold. The adjustment to the exploration probability ε is then made based on the reward value r_K and the two thresholds.

[0036] In this example, the first set of parameters includes the immediate reward r_K, and the second set of parameters consists of the exploration probability ε. There are two triggering events for updating this algorithm-related parameter:

[0037] 1) when the immediate reward r_K is greater than an upper-bound reward threshold, then the exploration probability ε is reduced to a certain value (e.g., ε = εEnd);

[0038] 2) when the difference between the immediate reward r_K and a previous reward r_previous is larger than a pre-defined drop threshold, then the exploration probability ε is increased from the ending probability (ε = 0.001) to ε = εRestart.
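In Python, the exploration-probability update corresponding to these two triggering events (and the default decay of Algorithm 2) could be sketched as follows; the decay factor 0.995 and the example ε values are taken from Algorithm 2 above, while the function interface itself is an illustrative assumption.

def adjust_exploration(eps, r_K, r_previous, drop_threshold, upper_threshold,
                       eps_restart=0.1, eps_end=0.001, decay=0.995):
    if r_previous - r_K > drop_threshold:   # event 2: reward dropped by more than the drop threshold
        return eps_restart                  # restart exploration in the changed environment
    if r_K > upper_threshold:               # event 1: reward exceeded the upper-bound threshold
        return eps_end                      # settle on the ending exploration probability
    return eps * decay                      # otherwise keep decaying epsilon gradually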

[0039] Use Case Example

[0040] Stable connectivity is crucial for improving the situational awareness and operational efficiency in various mission-critical situations. In a catastrophe or emergency scenario, the existing cellular network coverage and capacity in the emergency area may not be available or sufficient to support mission-critical communication needs. In these scenarios, deployable-network technologies like portable base stations (BSs) on UAVs or trucks can be used to quickly provide mission-critical users with dependable connectivity.

[0041] FIG. 3 illustrates a mission-critical scenario, where a macro BS 301 is damaged due to a natural disaster and a UAV-BS 302 is set up to provide a temporary wireless access connection to mission-critical users (exemplified by UE 310) that are performing search and rescue missions in the disaster area. The UAV-BS 302 is integrated into the cellular network by logically connecting itself to an on-ground donor BS 304 (e.g., a macro-BS) using a wireless backhaul link 305. The same UAV-BS antenna 306 may be used for both the access and the backhaul links.

[0042] In order to best serve the on-ground mission-critical users and, at the same time, maintain a good backhaul connection, agent 201 can be employed to autonomously configure the location of the UAV-BS and the electrical tilt of the access and backhaul antenna of the UAV-BS. By employing the RL algorithm adaptation processes disclosed herein, agent 201 is able to adapt its RL algorithm to the real-time changing environment (e.g., when mission-critical traffic moves on the ground), where traditional reinforcement learning algorithms are not applicable and would result in inappropriate UAV-BS configuration decisions. That is, the agent 201 can be used to automatically control the location of the UAV-BS 302 and the antenna configuration of the UAV-BS in a dynamically changing environment, in order to best serve the on-ground mission-critical users and, at the same time, maintain a good backhaul connection between the UAV-BS and an on-ground donor base station.

[0043] FIG. 4 is a flowchart illustrating a process 400 for dynamic RL. Process 400 may be performed by agent 201 and may begin in step s402. Step s402 comprises using an RL algorithm to select a first action. Step s404 comprises triggering performance of the selected first action. Step s406 comprises, after the first action is performed, obtaining a first reward value (R1) (e.g., the r_K value shown above) associated with the first action. Step s408 comprises using R1 and/or a performance indicator (PI) to determine whether an algorithm modification condition is satisfied. Step s410 comprises, as a result of determining that the algorithm modification condition is satisfied, modifying the RL algorithm to produce a modified RL algorithm.

[0044] In some embodiments, modifying the RL algorithm to produce the modified RL algorithm comprises modifying a parameter of the RL algorithm.

[0045] In some embodiments, modifying a parameter of the RL algorithm comprises modifying one or more of: an exploration probability of the RL algorithm, a learning rate of the RL algorithm, a discount factor of the RL algorithm, or a replay memory capacity of the RL algorithm. In some embodiments, using the RL algorithm to select the first action comprises selecting the first action based on the exploration probability, and modifying the RL algorithm to produce the modified RL algorithm comprises modifying the exploration probability.

[0046] In some embodiments, using R1 and/or PI to determine whether the algorithm modification condition is satisfied comprises one or more of: i) comparing R1 to a first threshold, ii) comparing ΔR to a second threshold, wherein ΔR is a difference between R1 and a reward value associated with a second action selected using the RL algorithm, or iii) comparing the PI to a third threshold.

[0047] In some embodiments, process 400 also includes: i) before using the RL algorithm to select the first action and obtaining R1, using the RL algorithm to select a second action; ii) triggering performance of the selected second action; and iii) after the second action is performed, obtaining a second reward value, R2 (e.g., r_previous), associated with the second action, wherein using R1 and/or PI to determine whether the algorithm modification condition is satisfied comprises performing a decision process comprising: calculating ΔR = R2 - R1 and determining whether ΔR is greater than a drop threshold.

[0048] In some embodiments, using the RL algorithm to select the first action comprises selecting the first action based on an exploration probability (ε). The exploration probability specifies the likelihood that the agent will randomly select an action, as opposed to selecting an action that is determined to yield the highest expected reward. For example, if ε is 0.1, then the agent is configured such that, when the agent goes to select an action, there is a 10% chance that the agent will randomly select an action and a 90% chance that the agent will select an action that is determined to yield the highest expected reward.
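A short Python sketch of such an ε-greedy selection is given below; q_values stands for the expected rewards of the available actions and is an assumed input.

import random

def epsilon_greedy(q_values, epsilon):
    # With probability epsilon pick a random action; otherwise pick the action with the highest expected reward.
    if random.random() < epsilon:                    # e.g., epsilon = 0.1 gives a 10% chance of a random action
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])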

[0049] In some embodiments, the algorithm modification condition is satisfied when ΔR is greater than the drop threshold, and modifying the RL algorithm as a result of determining that the algorithm modification condition is satisfied comprises generating a new exploration probability, εnew, for the RL algorithm, wherein εnew equals εRestart, where εRestart is a predetermined exploration probability (e.g., εRestart = 0.1).

[0050] In some embodiments, the decision process further comprises, as a result of determining that ΔR is not greater than the drop threshold, then determining whether R1 is less than a lower reward threshold.

[0051] In some embodiments, the decision process further comprises, as a result of determining that ΔR is not greater than the drop threshold, then determining whether R1 is greater than an upper reward threshold. In some embodiments, the algorithm modification condition is satisfied when ΔR is not greater than the drop threshold and R1 is greater than the upper reward threshold, and modifying the RL algorithm as a result of determining that the algorithm modification condition is satisfied comprises generating a new exploration probability, εnew, for the RL algorithm, wherein εnew equals εEnd, where εEnd is a predetermined ending exploration probability (e.g., εEnd = 0.001).

[0052] In some embodiments, the algorithm modification condition is satisfied when ΔR is not greater than the drop threshold and R1 is not greater than the upper reward threshold, and modifying the RL algorithm as a result of determining that the algorithm modification condition is satisfied comprises generating a new exploration probability, εnew, for the RL algorithm, wherein εnew equals (ε × c), where c is a predetermined constant.

[0053] In some embodiments, process 400 further includes, prior to using the RL algorithm to select the first action: i) using the RL algorithm to select K-1 actions, where K > 1; ii) triggering the performance of each one of the K-1 actions; and iii) for each one of the K-1 actions, obtaining a reward value associated with the action. In some embodiments, using R1 and/or PI to determine whether the algorithm modification condition is satisfied comprises: using R1 and said K-1 reward values to generate a reward value that is a function of these K reward values; and comparing the generated reward value to a threshold. In some embodiments, using R1 and/or PI to determine whether the algorithm modification condition is satisfied comprises: using R1 and said K-1 reward values to generate a reward value that is a function of these K reward values; and comparing ΔR to a threshold, wherein ΔR is a difference between the generated reward value and a previously generated reward value. In some embodiments, the generated reward value is: a sum of said K reward values, a weighted sum of said K reward values, a weighted sum of a subset of said K reward values, a mean of said K reward values, a mean of a subset of said K reward values, a median of said K reward values, or a median of a subset of said K reward values.
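A Python sketch of generating one reward value from the K most recent reward values is given below; the mode-based interface is an illustrative assumption.

from statistics import mean, median

def aggregate_rewards(rewards, mode="mean", weights=None):
    # rewards: the K reward values (R1 plus the K-1 earlier ones); weights only used for "weighted_sum".
    if mode == "sum":
        return sum(rewards)
    if mode == "weighted_sum":
        return sum(w * r for w, r in zip(weights, rewards))
    if mode == "median":
        return median(rewards)
    return mean(rewards)                              # default: mean of the K reward values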

[0054] In some embodiments, the value of K is determined based on a correlation time of the environment and/or application requirements (e.g., the maximum allowed service interruption time). In some embodiments, the value of K is determined based on a maximum allowed service interruption time.

[0055] In some embodiments, one or more of the recited thresholds is dynamically changed based on environment changes and/or service requirement changes.

[0056] In some embodiments, process 400 further includes using the modified RL algorithm to select another action and triggering performance of the another action.

[0057] FIG. 5 is a block diagram of RL agent 201, according to some embodiments. As shown in FIG. 5, RL agent 201 may comprise: processing circuitry (PC) 502, which may include one or more processors (P) 555 (e.g., a general purpose microprocessor and/or one or more other processors, such as an application specific integrated circuit (ASIC), field-programmable gate arrays (FPGAs), and the like), which processors may be co-located in a single housing or in a single data center or may be geographically distributed (i.e., RL agent 201 may be a distributed computing apparatus); at least one network interface 548 comprising a transmitter (Tx) 545 and a receiver (Rx) 547 for enabling RL agent 201 to transmit data to and receive data from other nodes connected to a network 110 (e.g., an Internet Protocol (IP) network) to which network interface 548 is connected (directly or indirectly) (e.g., network interface 548 may be wirelessly connected to the network 110, in which case network interface 548 is connected to an antenna arrangement); and a storage unit (a.k.a. "data storage system") 508, which may include one or more non-volatile storage devices and/or one or more volatile storage devices. In an alternative embodiment the network interface 548 may be connected to the network 110 over a wired connection, for example over an optical fiber or a copper cable. In embodiments where PC 502 includes a programmable processor, a computer program product (CPP) 541 may be provided. CPP 541 includes a computer readable medium (CRM) 542 storing a computer program (CP) 543 comprising computer readable instructions (CRI) 544. CRM 542 may be a non-transitory computer readable medium, such as magnetic media (e.g., a hard disk), optical media, memory devices (e.g., random access memory, flash memory), and the like. In some embodiments, the CRI 544 of computer program 543 is configured such that when executed by PC 502, the CRI causes RL agent 201 to perform steps of the methods described herein (e.g., steps described herein with reference to one or more of the flow charts). In other embodiments, RL agent 201 may be configured to perform steps of the methods described herein without the need for code. That is, for example, PC 502 may consist merely of one or more ASICs. Hence, the features of the embodiments described herein may be implemented in hardware and/or software.

[0058] While various embodiments are described herein, it should be understood that they have been presented by way of example only, and not limitation. Thus, the breadth and scope of this disclosure should not be limited by any of the above-described exemplary embodiments. Moreover, any combination of the above-described elements in all possible variations thereof is encompassed by the disclosure unless otherwise indicated herein or otherwise clearly contradicted by context.

[0059] Additionally, while the processes described above and illustrated in the drawings are shown as a sequence of steps, this was done solely for the sake of illustration. Accordingly, it is contemplated that some steps may be added, some steps may be omitted, the order of the steps may be re-arranged, and some steps may be performed in parallel.