

Title:
LEARNING DIVERSE SKILLS FOR TASKS USING SEQUENTIAL LATENT VARIABLES FOR ENVIRONMENT DYNAMICS
Document Type and Number:
WIPO Patent Application WO/2022/248718
Kind Code:
A1
Abstract:
This specification relates to methods for controlling agents to perform actions according to a goal (or option) comprising a sequence of local goals (or local options) and corresponding methods for training. As discussed herein, environment dynamics may be modelled sequentially by sampling latent variables, each latent variable relating to a local goal and being dependent on a previous latent variable. These latent variables are used to condition an action-selection policy neural network to select actions according to the local goal. This allows the agents to reach more diverse states than would be possible through a fixed latent variable or goal, thereby encouraging exploratory behavior. In addition, specific methods described herein model the sequence of latent variables through a simple linear and recurrent relationship that allows the system to be trained more efficiently. This avoids the need to learn a state-dependent higher level policy for selecting the latent variables which can be difficult to train in practice.

Inventors:
HANSEN STEVEN STENBERG (GB)
Application Number:
PCT/EP2022/064491
Publication Date:
December 01, 2022
Filing Date:
May 27, 2022
Assignee:
DEEPMIND TECH LTD (GB)
International Classes:
G06N3/08; G06N3/04; G06N7/00
Foreign References:
US20210158162A1 (2021-05-27)
Attorney, Agent or Firm:
FISH & RICHARDSON P.C. (DE)
Claims:
CLAIMS

1. A computer-implemented method for controlling an agent while interacting with an environment, wherein the agent is controlled to execute a sequence of actions defining a state-action trajectory for achieving a goal, the goal comprising a sequence of local goals, each local goal being characterized by a corresponding latent variable, wherein each state-action trajectory comprises a set of sub-trajectories and each sub-trajectory is generated according to a corresponding local goal, the method comprising: for each of a plurality of iterations, wherein each iteration corresponds to one of the sub-trajectories: determining the latent variable for the corresponding local goal using a predictor neural network configured to predict the latent variable in accordance with predictor parameter values based on a preceding latent variable; and for each iteration, obtaining the corresponding sub-trajectory by, for each of a plurality of time steps: obtaining an observation characterizing a current state of the environment; processing the observation using the action selection policy neural network to generate a policy output, wherein the action selection policy neural network is conditioned on the latent variable and generates the policy output in accordance with policy parameter values of the action selection policy neural network; and selecting an action to be performed by the agent in response to the observation using the policy output.

2. The method of claim 1 wherein determining the latent variable comprises determining the previous latent variable and adding a perturbation to the previous latent variable to generate the latent variable.

3. The method of claim 2 wherein the perturbation is determined from a perturbation distribution, the perturbation distribution being one of a uniform distribution or a Gaussian distribution.

4. The method of any preceding claim wherein the previous latent variable is determined by determining the previous latent variable that is most likely to have conditioned the action-selection neural network to generate a previous sub-trajectory for the previous iteration according to the predictor neural network.

5. The method of claim 4 wherein the previous latent variable is determined based on a last observation from the previous sub-trajectory.

6. The method of claim 4 or claim 5 wherein determining the previous latent variable comprises inputting the last observation from the previous sub-trajectory into the predictor neural network to predict the previous latent variable.

7. The method of any preceding claim wherein each latent variable is linearly related to the preceding latent variable.

8. The method of any preceding claim wherein the method is initialized by determining a first latent variable based on a first observation characterizing a first state of the environment.

9. The method of any preceding claim, wherein latent variables over the iterations form a Markov chain.

10. The method of any preceding claim further comprising updating one or both of the policy parameter values and the predictor parameter values based at least on one or more of the sub-trajectories.

11. The method of any preceding claim further comprising one or both of: updating the policy parameter values based on a first objective that aims to maximize an entropy of the policy; and updating the predictor parameter values based on a second objective that aims to minimize a cross-entropy between consecutive latent variables.

12. The method of claim 10 or claim 11 wherein updating the policy parameter values comprises a reinforcement learning update based on a reward r_i for each iteration, where r_i = log q_w(z_i | s_{T,i}), and z_i is the latent variable for the ith iteration, s_{T,i} is a last observation of T observations for the ith iteration, and q is a probability distribution describing a probability of z_i given s_{T,i} according to the predictor parameter values w.

13. The method of any of claims 10-12 wherein updating the predictor parameter values comprises an update that attempts to minimize a difference between consecutive latent variables.

14. The method of claim 13 wherein a probability distribution q describing a probability of a latent variable z given an observation s according to predictor parameter values w is q_w(z | s) = N(f_w(s), 1) for a function f_w(s), where N(·,·) represents a Gaussian distribution, and the update aims to minimize || f_w(s_{(i+1)K}) − (f_w(s_{iK}) + Δ_i) ||², where s_{(i+1)K} is a last observation of K observations for the ith iteration; s_{iK} is a last observation of K observations for the (i−1)th iteration; and Δ_i is a perturbation added to an output from the predictor neural network at the ith iteration to determine the ith latent variable.

15. The method of any preceding claim wherein the observations relate to a real-world environment and wherein the selected action relates to an action to be performed by a mechanical agent.

16. The method of claim 15 wherein the method controls the agent to perform a task while executing the options, the method further comprising using the predictor neural network and the action selection policy neural network to control the mechanical agent to perform the task while interacting with the real-world environment by obtaining the observations from one or more sensors sensing the real-world environment and using the policy output to select actions to control the mechanical agent to perform the task.

17. A computer-implemented action-selection neural network trained according to the method of any of claims 10-16.

18. A system comprising one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to perform the operations of the respective method of any one of claims 1-16.

19. One or more computer storage media storing instructions that when executed by one or more computers cause the one or more computers to perform the operations of the respective method of any one of claims 1-16.

20. An agent configured to select actions to perform tasks in an environment, wherein the agent is controlled to execute a sequence of actions defining a state-action trajectory for achieving a goal, the goal comprising a sequence of local goals, each local goal being characterized by a corresponding latent variable, wherein each state-action trajectory comprises a set of sub-trajectories and each sub-trajectory is generated according to a corresponding local goal, the agent comprising: a predictor neural network parameterized by predictor parameter values and configured to determine, for each of a plurality of iterations, wherein each iteration corresponds to one of the sub-trajectories, a latent variable for the corresponding local goal based on a preceding latent variable; and an action selection neural network configured to, for each of a plurality of time steps for each iteration, process an observation characterizing a current state of the environment to generate a policy output, wherein the action selection policy neural network is conditioned on the latent variable for the iteration and generates the policy output in accordance with policy parameter values of the action selection policy neural network; and an agent control system configured to select an action to be performed by the agent in response to the observation using the policy output.

21. The agent of claim 20 wherein the agent is a mechanical agent, the environment is a real-world environment, the mechanical agent comprises one or more electromechanical devices to control movement or locomotion of the mechanical agent in the real-world environment, and the agent control system is coupled to the policy output to provide control signals in accordance with selected actions to control the one or more electromechanical devices for the agent to perform the task.

Description:
LEARNING DIVERSE SKILLS FOR TASKS USING SEQUENTIAL LATENT VARIABLES FOR ENVIRONMENT DYNAMICS

CROSS-REFERENCE TO RELATED APPLICATIONS [0001] This application claims the benefit of the filing date of U.S. Provisional Patent Application Serial No. 63/194,892 for “LEARNING DIVERSE SKILLS FOR TASKS USING SEQUENTIAL LATENT VARIABLES FOR ENVIRONMENT DYNAMICS,” which was filed on May 28, 2021, and which is incorporated herein by reference in its entirety.

BACKGROUND

[0002] This specification relates to reinforcement learning with neural network systems.

[0003] Neural networks are machine learning models that employ one or more layers of nonlinear units to predict an output for a received input. Some neural networks include one or more hidden layers in addition to an output layer. The output of each hidden layer is used as input to the next layer in the network, i.e., the next hidden layer or the output layer. Each layer of the network generates an output from a received input in accordance with current values of a respective set of parameters.

[0004] The techniques described herein have applications in the field of reinforcement learning. In a reinforcement learning system, an agent interacts with an environment by performing actions that are selected by the reinforcement learning system in response to receiving observations that characterize the current state of the environment.

[0005] Some reinforcement learning systems select the action to be performed by the agent in response to receiving a given observation in accordance with an output of a neural network.

SUMMARY

[0006] This specification relates to methods for controlling agents to perform actions according to a goal (or option) comprising a sequence of local goals (or local options) and corresponding methods for training.

[0007] According to a first aspect there is provided a computer-implemented method for controlling an agent while interacting with an environment, wherein the agent is controlled to execute a sequence of actions defining a state-action trajectory for achieving a goal. The goal may comprise (or may be characterized by) a sequence of local goals. Each local goal may be characterized by a corresponding latent variable. Each state-action trajectory may comprise a set of sub-trajectories and each sub-trajectory may be generated according to a corresponding local goal. The method may comprise, for each of a plurality of iterations, wherein each iteration corresponds to one of the sub-trajectories, determining the latent variable for the corresponding local goal using a predictor neural network configured to predict the latent variable in accordance with predictor parameter values based on a preceding latent variable. The method may further comprise, for each iteration, obtaining the corresponding sub-trajectory by, for each of a plurality of time steps: obtaining an observation characterizing a current state of the environment; processing the observation using the action selection policy neural network to generate a policy output; and selecting an action to be performed by the agent in response to the observation using the policy output. The action selection policy neural network may be conditioned on the latent variable and may generate the policy output in accordance with policy parameter values of the action selection policy neural network.

[0008] According to a second aspect there is provided a system comprising one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to perform a method for controlling an agent while interacting with an environment, wherein the agent is controlled to execute a sequence of actions defining a state-action trajectory for achieving a goal. The goal may comprise (or may be characterized by) a sequence of local goals. Each local goal may be characterized by a corresponding latent variable. Each state-action trajectory may comprise a set of sub-trajectories and each sub-trajectory may be generated according to a corresponding local goal. The method may comprise, for each of a plurality of iterations, wherein each iteration corresponds to one of the sub-trajectories, determining the latent variable for the corresponding local goal using a predictor neural network configured to predict the latent variable in accordance with predictor parameter values based on a preceding latent variable. The method may further comprise, for each iteration, obtaining the corresponding sub-trajectory by, for each of a plurality of time steps: obtaining an observation characterizing a current state of the environment; processing the observation using the action selection policy neural network to generate a policy output; and selecting an action to be performed by the agent in response to the observation using the policy output. The action selection policy neural network may be conditioned on the latent variable and may generate the policy output in accordance with policy parameter values of the action selection policy neural network.
[0009] According to a third aspect there is provided one or more computer storage media storing instructions that when executed by one or more computers cause the one or more computers to perform a method for controlling an agent while interacting with an environment, wherein the agent is controlled to execute a sequence of actions defining a state-action trajectory for achieving a goal. The goal may comprise (or may be characterized by) a sequence of local goals. Each local goal may be characterized by a corresponding latent variable. Each state-action trajectory may comprise a set of sub-trajectories and each sub-trajectory may be generated according to a corresponding local goal. The method may comprise, for each of a plurality of iterations, wherein each iteration corresponds to one of the sub-trajectories, determining the latent variable for the corresponding local goal using a predictor neural network configured to predict the latent variable in accordance with predictor parameter values based on a preceding latent variable. The method may further comprise, for each iteration, obtaining the corresponding sub-trajectory by, for each of a plurality of time steps: obtaining an observation characterizing a current state of the environment; processing the observation using the action selection policy neural network to generate a policy output; and selecting an action to be performed by the agent in response to the observation using the policy output. The action selection policy neural network may be conditioned on the latent variable and may generate the policy output in accordance with policy parameter values of the action selection policy neural network.

[0010] According to a fourth aspect there is provided an agent configured to select actions to perform tasks in an environment, wherein the agent is controlled to execute a sequence of actions defining a state-action trajectory for achieving a goal, the goal comprising a sequence of local goals, each local goal being characterized by a corresponding latent variable, wherein each state-action trajectory comprises a set of sub-trajectories and each sub-trajectory is generated according to a corresponding local goal. The agent may comprise a predictor neural network, an action selection neural network and an agent control system. The predictor neural network may be parameterized by predictor parameter values and configured to determine, for each of a plurality of iterations, wherein each iteration corresponds to one of the sub-trajectories, a latent variable for the corresponding local goal based on a preceding latent variable. The action selection neural network may be configured to, for each of a plurality of time steps for each iteration, process an observation characterizing a current state of the environment to generate a policy output, wherein the action selection policy neural network is conditioned on the latent variable for the iteration and generates the policy output in accordance with policy parameter values of the action selection policy neural network.

The agent control system may be configured to select an action to be performed by the agent in response to the observation using the policy output.

[0011] As discussed herein, environment dynamics may be modelled sequentially by sampling latent variables, each latent variable relating to a local goal and being dependent on a previous latent variable. These latent variables are used to condition an action-selection policy neural network to select actions according to the local goal. This allows the agents to reach more diverse states than would be possible through a fixed latent variable or goal, thereby encouraging exploratory behavior. In addition, specific methods described herein model the sequence of latent variables through a simple linear and recurrent relationship that allows the system to be trained more efficiently. This avoids the need to learn a state-dependent higher level policy for selecting the latent variables which can be difficult to train in practice.

[0012] In certain implementations, each latent variable is determined based on a previous latent variable. This introduces temporal dependence which allows the action selection policy neural network to reach more distant states. In addition, this allows local updates without sacrificing long term influence. Furthermore, certain implementations determine the previous latent variable using a predictor neural network that predicts the previous latent variable based on the previous sub-trajectory. For instance, the previous latent variable may be a latent variable that is most likely to have been selected based on a last (or final) observation in the previous trajectory. This applies hindsight correction to take into account the resulting sub-trajectories. This ensures that predicted latent variables are consistent according to the latent state dynamics.

[0013] In certain implementations, training methods are used to train the action selection neural network to select actions for exploratory behavior based on intrinsic rewards. An intrinsic reward can be a reward that is not received from the environment (e.g. that is not an extrinsic reward). This can be based on a distribution or information content of actions selected and/or states reached by the agent. For instance, an intrinsic reward may reward actions that increase the entropy of the states or actions and/or increase the mutual information between various parameters (such as between the latent variables and the states or actions). By training a system intrinsically, the system can learn effective behaviors (in this case, effective goals) before application to a specific task. An intrinsically trained system can therefore be more easily trained to perform a particular task based on extrinsic rewards.

[0014] The systems and methods described herein can therefore train the action selection neural network even in the absence of “task rewards,” e.g., extrinsic rewards that characterize a progress of an agent towards accomplishing a task in the environment.

[0015] For example, the method can initially train the action selection neural network using reinforcement learning techniques based on only intrinsic (unsupervised) rewards. After pre-training the action selection neural network using unsupervised rewards, the system can train the action selection neural network based on task rewards, or based on a combination of task rewards and unsupervised rewards.

[0016] The subject matter described in this specification can be implemented in particular embodiments so as to realize one or more of the following advantages. As discussed herein, environment dynamics may be modelled sequentially by sampling latent variables, each latent variable relating to a local goal and being dependent on a previous latent variable. These latent variables can be used to condition an action-selection policy neural network to select actions according to the local goal. This allows agents to reach more diverse states than would be possible through a fixed latent variable or goal, thereby encouraging exploratory behavior. In addition, specific methods described herein model the sequence of latent variables through a simple linear and recurrent relationship that allows the system to be trained more efficiently. This can avoid the need to learn a state-dependent higher level policy for selecting the latent variables, which can be difficult to train in practice.

[0017] In certain implementations, each latent variable is determined based on a previous latent variable. This introduces temporal dependence which allows the action selection policy neural network to reach more distant states. In addition, this allows local updates without sacrificing long term influence. Furthermore, certain implementations determine the previous latent variable using a predictor neural network that predicts the previous latent variable based on the previous sub-trajectory. For instance, the previous latent variable may be a latent variable that is most likely to have been selected based on a last (or final) observation in the previous sub-trajectory. This applies hindsight correction to take into account the resulting sub-trajectories. This ensures that predicted latent variables are consistent according to the latent state dynamics.

[0018] The details of one or more embodiments of the subject matter of this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.

BRIEF DESCRIPTION OF THE FIGURES

[0019] FIG. 1 shows an example action selection system that is configured to control an agent interacting with an environment.

[0020] FIG. 2 shows graphical models for the determination of states with a single latent variable (top) and for a sequence of latent variables (bottom).

[0021] FIG. 3 shows a stochastic computation graph representing a process for implementing and training the latent predictor neural network and the action selection neural network.

[0022] FIG. 4 is a flow diagram showing a method of implementing (and optionally training) an action selection system (e.g. the action selection system 100 in FIG. 1).

[0023] FIG. 5 shows 2D latent codes shaded based on the ground truth x (left) and y (right) coordinates for an agent controlling a point-mass within an environment that includes a U-shaped wall.

[0024] FIG. 6 shows the estimated marginal code entropy H[z_i] (solid) and conditional entropy H[z_i | s_{(i+1)K}] (dashed) for an implementation (EDDICT) and an alternative system (EDDICT-D).

DETAILED DESCRIPTION

[0025] FIG. 1 shows an example action selection system 100 that is configured to control an agent 104 interacting with an environment 106. The action selection system 100 is an example of a system implemented as one or more computer programs on one or more computers in one or more locations in which the systems, components, and techniques described below are implemented.

[0026] The system 100 is configured to select actions 102 to be performed by the agent 104 to interact with the environment 106 over a sequence of time steps to accomplish a goal (or task). At each time step, the system 100 can receive data characterizing the current state of the environment 106 and select an action 102 to be performed by the agent 104 in response to the received data. Data characterizing a state of the environment 106 will be referred to in this specification as an observation 110 and can include, e.g., an image, or any other appropriate data. In some cases, the agent 104 can be, e.g., a robot, and the observation 110 can include, e.g., joint positions, velocities and torques, or any other appropriate data, as described in more detail below. At each time step, the state of the environment 106 at the time step (as characterized by the observation 110) depends on the state of the environment 106 at the previous time step and the action 102 performed by the agent 104 at the previous time step.

[0027] The system 100 is configured to select actions 102 to be performed by the agent 104 to interact with the environment 106 over a sequence of time steps using an action selection neural network 120 (otherwise known as an “action selection policy neural network”) conditioned on a latent variable 118. The system 100 is configured to determine the latent variable 118 from a set of possible latent variables using a latent predictor neural network 150. The latent predictor neural network 150 may be conditioned on one or more previous observations (e.g. a final observation in a sub-trajectory, as discussed below). The latent variable 118 may be determined by predicting a preceding latent variable based on the one or more previous observations and adding a perturbation 152 to the predicted previous latent variable.

[0028] Any parameters input into the latent predictor neural network 150 (e.g. the one or more previous observations and the perturbation 152) may be stored in replay memory 114 for use in training. In one implementation, the perturbation 152 and the final observation from the most recent sub-trajectory may be stored in replay memory 114 to allow the latent variable to be recalculated during training (e.g. using the most up-to-date version of the latent predictor neural network 150). Further details of the determination of the latent variable 118 are provided below.
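A minimal sketch of the replay bookkeeping described in paragraph [0028], assuming a simple in-memory buffer: only the final observation, the perturbation and the sub-trajectory are stored, so the latent variable can be recomputed later with the current predictor parameters. The names (ReplayEntry, recompute_latent) and the callable stand-in for the predictor are illustrative, not taken from the specification.

```python
from dataclasses import dataclass
from typing import List, Tuple

import numpy as np


@dataclass
class ReplayEntry:
    """One stored sub-trajectory, kept so the latent code can be recomputed later."""
    final_observation: np.ndarray                  # last observation of the sub-trajectory
    perturbation: np.ndarray                       # delta added to the predicted previous latent
    sub_trajectory: List[Tuple[np.ndarray, int]]   # (observation, action) pairs


replay_memory: List[ReplayEntry] = []


def recompute_latent(entry: ReplayEntry, predictor) -> np.ndarray:
    """Recompute z = predictor(final observation) + delta with the current predictor."""
    return predictor(entry.final_observation) + entry.perturbation
```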

[0029] The system 100 can then use the action selection neural network 120, conditioned on the latent variable 118, to select actions 102 to be performed by the agent 104 over a sequence of time steps. Generally, “conditioning” the action selection neural network 120 on the latent variable 118 can refer to providing the latent variable 118 to the action selection neural network 120 as an input. In some implementations, the set of possible latent variables can include only finitely many possible latent variables, e.g., 10, 100, 1000, 10,000, or any other appropriate number of latent variables. In some implementations, the set of possible latent variables can include infinitely many possible latent variables, e.g., the set of possible latent variables can be the continuous range [0, 1].

[0030] Each latent variable can represent a “skill,” e.g., a set of actions that characterize a behavior of the agent 104 over a sequence of time steps. That is, each latent variable can prompt the agent 104 to interact with the environment 106 in a manner that causes the state of the environment 106 to be altered over a sequence of time steps in a consistent and recognizable way. In other words, a “skill” can refer to an action selection policy defined by conditioning the action selection neural network 120 on the latent variable 118. For example, a skill can refer to an action selection policy π_θ(a | s, z) that maps from states s and latent variables z to distributions over actions a, where θ is a set of policy parameters 124 of the action selection neural network 120. As described in more detail below, by training the action selection neural network 120 on unsupervised rewards, the system 100 can enable the action selection neural network 120 to learn distinct and recognizable skills.

[0031] In certain implementations, each latent variable is determined by inputting one or more preceding observations 110 into the latent predictor neural network 150. The latent predictor neural network may attempt to predict the previous latent variable based on the one or more preceding observations 110. A new latent variable 118 may then be determined based on the predicted previous latent variable. This may be determined through a linear auto-regressive process. For instance, this may be determined by adding a perturbation (e.g. a linear perturbation) to the predicted previous latent variable.
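The following sketch illustrates one way the conditioning described above could look in practice, with the latent code simply concatenated with the observation before being processed. The small two-layer network is an assumed stand-in for the action selection neural network 120; all shapes, names and initialisation choices are illustrative, not from the specification.

```python
import numpy as np

rng = np.random.default_rng(0)


def init_policy(obs_dim: int, latent_dim: int, num_actions: int, hidden: int = 64) -> dict:
    """Randomly initialise a small stand-in for the action selection network."""
    return {
        "w1": rng.normal(0.0, 0.1, (obs_dim + latent_dim, hidden)),
        "b1": np.zeros(hidden),
        "w2": rng.normal(0.0, 0.1, (hidden, num_actions)),
        "b2": np.zeros(num_actions),
    }


def policy_scores(params: dict, observation: np.ndarray, latent: np.ndarray) -> np.ndarray:
    """One score per action for a latent-conditioned policy pi_theta(a | s, z)."""
    x = np.concatenate([observation, latent])      # conditioning: z is appended to s
    h = np.tanh(x @ params["w1"] + params["b1"])
    return h @ params["w2"] + params["b2"]
```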

[0032] By implementing sequences of latent variables (latent codes) the system is able to reach states much further in the time horizon than previous methods. By utilising a linear autoregressive methodology, the system is able to efficiently implement these sequences of latent codes without having to learn a high-level parametric policy over the latent variables, which can be difficult and which has been shown to lead to fewer learned codes/goal states. Furthermore, hard coding the dynamics to be linear naturally imposes an interpretable Euclidean topology in latent space.

[0033] As a particular example, if the agent 104 is a physical robot exploring the environment, a first latent variable can define a first set of possible actions for the robot (e.g. that include the actions of moving forward, backward, to the left, and to the right) for exploring a certain region of the environment. A second latent variable can define a second set of possible actions of the robot for exploring a different region of the environment. In this case, the first latent variable and the second latent variable can each represent a distinct and recognizable skill of the robot, characterizing different forms of behavior. By implementing a sequence of latent variables (latent codes), the robot can explore further within the environment, reaching sections of the environment that may have previously been unobtainable using only a single set of behaviors.

[0034] After conditioning the action selection neural network 120 on the latent variable 118, the system 100 can use the action selection neural network 120 to select actions 102 to be performed by the agent 104 to interact with the environment 106 over a sequence of time steps. For example, at each time step, the system 100 can process the observation 110 characterizing the current state of the environment 106 at the time step, and the latent variable 118, using the action selection neural network 120, to generate an action selection output 122. In some implementations, in addition to processing the observation 110 characterizing the current state of the environment 106 at the time step, the system 100 can also process one or more observations that each characterize the state of the environment at a respective previous time step.

[0035] The action selection output 122 can include a respective score for each action in a set of possible actions that can be performed by the agent 104. In some implementations, the system 100 can select the action having the highest score, according to the action selection output 122, as the action to be performed by the agent 104 at the time step. In some implementations, the system 100 selects the action 102 to be performed by the agent 104 in accordance with an exploration strategy. For example, the system 100 can use an ε-greedy exploration strategy. In this example, the system 100 can select the action having a highest score (according to the action selection output 122) with probability 1 − ε, and select an action randomly with probability ε, where ε is a number between 0 and 1.
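As a concrete illustration of the ε-greedy strategy described above, a minimal selection routine might look as follows; the helper is hypothetical and not part of the specification.

```python
import numpy as np


def select_action(scores: np.ndarray, eps: float, rng: np.random.Generator) -> int:
    """Epsilon-greedy selection over the per-action scores of the policy output."""
    if rng.random() < eps:
        return int(rng.integers(len(scores)))      # explore: uniformly random action
    return int(np.argmax(scores))                  # exploit: highest-scoring action
```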

[0036] A trajectory may be accumulated by implementing a series of actions. A trajectory τ = [s_0, a_0, s_1, a_1, ..., s_{T−1}, a_{T−1}, s_T] may include a series of observations characterizing states s_t and actions a_t over a series of T time steps t. An overall trajectory τ may comprise a series of sub-trajectories τ_i. Each sub-trajectory may be of length K, e.g. τ_i = [a_{iK}, s_{iK+1}, ..., a_{(i+1)K−1}, s_{(i+1)K}]. Note that the sub-trajectory here is defined to include s_{(i+1)K} but not the state s_{iK} from which a_{iK} was sampled.
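A short sketch of the sub-trajectory indexing under the convention stated above (sub-trajectory i contains s_{(i+1)K} but not s_{iK}); the function name and list representation are assumptions for illustration only.

```python
from typing import List, Tuple


def split_into_sub_trajectories(states: list, actions: list, K: int) -> List[List[Tuple]]:
    """Split a trajectory into sub-trajectories of length K.

    `states` holds s_0 .. s_T and `actions` holds a_0 .. a_{T-1}; sub-trajectory i
    contains the pairs (a_t, s_{t+1}) for t = iK .. (i+1)K - 1, so it includes
    s_{(i+1)K} but not the state s_{iK} from which a_{iK} was sampled.
    """
    subs = []
    for i in range(len(actions) // K):
        subs.append([(actions[t], states[t + 1]) for t in range(i * K, (i + 1) * K)])
    return subs
```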

[0037] Each sub-trajectory may be determined by conditioning the action selection neural network 120 using a corresponding latent variable 118. Each latent variable may be determined based on a previous sub-trajectory. A series of latent variables may therefore be determined which condition the action selection neural network 120 to adapt its behavior over time.

[0038] Accordingly, in certain implementations, an action selection neural network 120 is trained to select actions based on a goal (otherwise known as an option or skill) characterized by a sequence of local goals (e.g. local options or local skills). Each local goal may be considered to define a learned behavior of the agent. This allows the action selection neural network 120 to be conditioned to select actions according to the local goal to form a sub-trajectory. The sequence of local goals may form an overall goal. The sequence of sub-trajectories may form an overall trajectory.

[0039] For each iteration, a sub-trajectory may be obtained including, for each time step, an observation and an action. This may be obtained by, over the multiple time-steps, selecting an action, instructing the agent to perform the selected action, and receiving an updated observation characterizing an updated state of the environment after the performance of the action.

[0040] After the system 100 selects the action 102 to be performed by the agent 104 at the time step, the agent 104 interacts with the environment 106 by performing that action 102, and the system 100 can receive a reward based on the interaction, e.g., a task reward 108, an unsupervised reward, or both. The training engine 112 may be configured to determine the unsupervised reward either immediately after the interaction, or after a number of actions have been performed (e.g. after a batch of actions have been implemented, e.g. to form a sub-trajectory, which may be stored in the replay memory 114). The task reward 108 is optional, as the system 100 may train the action selection neural network 120 and/or the latent predictor neural network 150 based solely on unsupervised training (e.g. unsupervised rewards). When training the latent predictor neural network 150, predictor parameters 152 for the latent predictor neural network 150 may be updated. Similarly, when training the action selection neural network 120, policy parameters 124 for the action selection neural network 120 may be updated.

[0041] As mentioned above, the implementations described herein implement a sequence of latent variables z_i to condition an action selection neural network to adapt its behavior over time to reach a wider variety of states.

[0042] FIG. 2 shows graphical models for the determination of states with a single latent variable (top) and for a sequence of latent variables (bottom). Where a single latent variable is utilized, this single latent variable z is used to condition the action selection neural network, which results in a trajectory τ, formed of sub-trajectories τ_i. Each sub-trajectory τ_i is determined using the same latent variable z. In contrast, in implementations described herein, the latent variable is updated over time, and a new latent variable z_i is used to determine each sub-trajectory τ_i. Each subsequent latent variable z_i is determined from the previous (the immediately preceding) latent variable z_{i−1} (e.g. a latent variable for the iteration immediately preceding the current iteration).

[0043] That is, the determination of the (current) latent variable z may be an autoregressive process. In other words, each latent variable may be dependent on the immediately preceding latent variable in the sequence. In one implementation the determination of the latent variable may be an autoregressive process of order one (e.g. an AR(1) process), such that only the previous latent variable from the sequence of latent variables is used to infer the current latent variable (although one or more perturbations and/or constants may be added). The latent variables z over the iterations may form a Markov chain. That is, the sequence of latent variables z corresponding to the sequence of local goals may form a Markov chain.

[0044] Specifically, the determination of each latent code may be a linear autoregressive process. This provides linear dynamics across the sequence of latent variables. This makes the system easier and more computationally efficient to train. A further advantage of ensuring linear dynamics, versus learning a parametric policy over latent variables, is that it naturally imposes an interpretable Euclidean topology in latent variable space. This can provide meaningful latent state representations (e.g. that are descriptive of the environment being explored).

[0045] Having said the above, if each latent variable z_i is based on the preceding latent variable z_{i−1} that was actually used (rather than a predicted version), then the generation of the latent variables ignores the underlying states in which the code is sampled. This can make training the system brittle in practice. Accordingly, when sampling the latent variables z_i ~ p(z_i | z_{i−1}), some implementations condition on the code most likely to have yielded the final state of the preceding trajectory. This code (latent variable) may be determined using the latent predictor neural network 150. The use of the latent predictor neural network 150 (otherwise known as a reverse predictor) adds hindsight correction to the methodology. Importantly, as shall be discussed below, this objective induces a cross-entropy term between the target distribution and q_w. This ensures that predictions made from s_{(i+1)K} are consistent with those from s_{iK}, under the current latent state dynamics.

[0046] According to one implementation, each latent variable is determined by predicting the previous latent variable z̃_{i−1} from the preceding sub-trajectory using the latent predictor neural network 150 and adding a perturbation Δ_i to the predicted (inferred) previous latent variable z̃_{i−1} to generate the latent variable z_i: z_i = z̃_{i−1} + Δ_i.

[0047] Specifically, the latent predictor neural network 150 may be configured to predict the latent variable based on the final state s_{iK} that is reached in the preceding sub-trajectory (note that in the present definition of a sub-trajectory, the preceding sub-trajectory includes s_{iK} as the final state): z̃_{i−1} = f_w(s_{iK}), where f_w(s) represents a parametric function according to the latent code predictor neural network 150 parameterized by predictor parameters w.

[0048] The perturbation Δ_i may be determined from a perturbation distribution. The perturbation distribution may be one of a uniform distribution (e.g. a uniform distribution on the disc) or a Gaussian distribution. A uniform distribution may be used where the action space is discrete. A Gaussian may be used where the action space is continuous. The Gaussian may be an isotropic Gaussian. The Gaussian may be a truncated Gaussian.
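The latent update of paragraphs [0046]-[0048] could be sketched as follows, assuming a callable predictor standing in for f_w and a uniform-on-the-disc perturbation; this is an illustrative sketch under those assumptions, not the specification's implementation.

```python
import numpy as np


def sample_perturbation_on_disc(dim: int, rng: np.random.Generator) -> np.ndarray:
    """Uniform sample from the unit ball in `dim` dimensions."""
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    radius = rng.random() ** (1.0 / dim)           # radius law for a uniform-in-ball sample
    return radius * direction


def next_latent(predictor, final_observation: np.ndarray,
                rng: np.random.Generator) -> tuple:
    """Hindsight-corrected update z_i = f_w(s_iK) + delta_i.

    `predictor` stands in for f_w, the latent predictor neural network; the
    perturbation is returned as well so it can be stored for later training.
    """
    z_tilde = predictor(final_observation)         # code most likely to have yielded s_iK
    delta = sample_perturbation_on_disc(len(z_tilde), rng)
    return z_tilde + delta, delta
```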

[0049] Given the above, the code predictor may be q_w(z | s) = N(f_w(s), 1).

[0050] The perturbation may be a noise signal that encourages exploration. By adding a perturbation at each iteration, the method can ensure that the marginal latent variable entropy increases monotonically with each sub-trajectory (e.g. more states being visited) while the conditional entropy remains constant (e.g. the same number of states being reachable from any given state).

[0051] In an implementation, the method is initialized in a first iteration by determining a first latent variable based on a first observation characterizing a first state s_0 of the environment. The first latent variable z_0 may be predicted through inputting the first state s_0 into the latent predictor neural network. A perturbation may be added to this first latent variable. Following this, a first sub-trajectory may be obtained through implementing the action selection policy neural network when conditioned on the first latent variable, and subsequent iterations may be based on the preceding latent variable.

[0052] FIG. 3 shows a stochastic computation graph representing a process for implementing and training the latent predictor neural network and the action selection neural network.

[0053] As discussed herein, the most recent sub-trajectory τ is input into the latent predictor neural network to predict the most recent latent variable z. This is added to a perturbation Δ, which may be sampled from a perturbation distribution (e.g. a Gaussian or uniform distribution) to determine an updated latent variable z'. The updated latent variable z' and the most recent sub-trajectory τ are then input into the action selection neural network (represented in FIG. 3 by π) to determine the next sub-trajectory τ'. The new sub-trajectory is used to determine a prediction z̃' of the most recent latent variable (note that the most recent latent variable is now z'). The ground truth latent variable z' and the predicted latent variable z̃' are then used to train the latent predictor neural network using a loss function L.

It should be noted that, in addition to training the latent predictor neural network, the action selection neural network may be trained based on the observed states and actions within the sub-trajectories.

[0054] FIG. 4 is a flow diagram showing a method of implementing (and optionally training) an action selection system (e.g. the action selection system 100 in FIG. 1). The process 200 may be performed by a system of one or more computers located in one or more locations. For example, an action selection system, e.g., the action selection system 100 in FIG. 1, appropriately programmed in accordance with this specification, can perform the process 200.

[0055] The method starts by initializing the first latent variable for a first sub-trajectory (step 210). As described above, this may be determined by inputting a first observation representing a first state of the environment into a latent predictor neural network and adding a perturbation. The method then repeats steps 220, 230, (optionally) 235 and 240 for a number of iterations.

[0056] In step 220, the preceding latent variable (the latent variable from the preceding iteration) is used to determine the latent variable for the next sub-trajectory (the latent variable for the current iteration). This is achieved using the latent predictor neural network in accordance with predictor parameter values. The conditioning on the preceding latent variable may include hindsight correction. In this case, instead of inputting the ground truth preceding latent variable, a prediction of the preceding latent variable is determined by inputting at least a portion of the preceding sub-trajectory (e.g. the last observation) into the latent predictor neural network and adding a perturbation.

[0057] In step 230, the next sub-trajectory (the sub-trajectory for the current iteration) is determined using the action-selection neural network conditioned on the (current) latent variable. This may be achieved over a number of time steps. In each time step, the system may: obtain an observation characterizing a current state of the environment; process the observation using the action selection policy neural network to generate a policy output, wherein the action selection policy neural network is conditioned on the latent variable and generates the policy output in accordance with policy parameter values of the action selection policy neural network; and select an action to be performed by the agent in response to the observation using the policy output. The action may be selected using an ε-greedy policy (e.g. where a greedy policy that selects an action with the maximum expected return is implemented with a probability of 1 − ε and a random action is selected with a probability of ε).
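A minimal sketch of step 230, assuming a hypothetical environment interface (env.observe, env.step) and reusing the illustrative policy_scores and select_action helpers from the earlier sketches; it is not the claimed implementation.

```python
def rollout_sub_trajectory(env, policy_params: dict, latent, K: int,
                           eps: float, rng) -> list:
    """Collect one sub-trajectory of K steps with the policy conditioned on `latent`."""
    sub_trajectory = []
    for _ in range(K):
        observation = env.observe()                           # current state of the environment
        scores = policy_scores(policy_params, observation, latent)
        action = select_action(scores, eps, rng)              # epsilon-greedy choice
        env.step(action)                                      # agent performs the action
        sub_trajectory.append((observation, action))
    return sub_trajectory
```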

[0058] In step 235, the parameters of the action selection neural network (policy parameter values) and the parameters of the latent predictor neural network (the predictor parameter values) may, optionally, be updated. This comprises updating one or both of the policy parameter values and the predictor parameter values based at least on one or more of the sub-trajectories. That is, one or both of the action selection policy neural network and the predictor neural network may be trained over one or more of the iterations.

[0059] Step 235 is optional, in that it need not be implemented when a trained system is operating, but may be implemented when training the system. The update step 235 may be implemented during each iteration, or may be implemented during a selection of the iterations (e.g. after a number of iterations).

[0060] An augmented trajectory may be formed, including a starting observation of a starting state of the environment, each sub-trajectory and each latent variable (or parameters for recalculating each latent variable, such as the perturbations). Training may be performed based on the augmented trajectory. The update step 235 will be described in more detail below.

[0061] After each iteration, the method determines whether an end criterion is reached (step 240), e.g. a maximum number of iterations have been implemented, or an evaluation or validation criterion has been met. If so, then the method ends (step 250). If not, then the method implements another iteration (steps 220, 230, optionally 235, and 240).

[0062] The training of the system will now be described. The update step 235 may include an optimization step that updates the parameters of the action selection neural network according to an objective to maximize the entropy of the policy (as defined by the policy parameters). This can be determined through greedy optimization. The maximization of the entropy may be a maximization of effective entropy. That is, one or both of the policy parameter values and the predictor parameter values may be updated based on an objective that aims to maximize the effective entropy.

[0063] The effective entropy H_effective may be the difference in log probabilities given by the predictor neural network and the probability of each latent variable given the previous latent variable. In other words, the effective entropy may be defined as the difference in log-probabilities given the latent predictor neural network and the high-level policy over options. The effective entropy may be H_effective = E[log q_w(z_i | τ_i) − log p(z_i | z_{i−1})], where q_w(z_i | τ_i) describes a probability of the latent variable z_i given the sub-trajectory τ_i for an ith iteration, and p(z_i | z_{i−1}) is the probability of a latent variable z_i given a previous latent variable z_{i−1} for the preceding iteration. As described above, z_{i−1} may be determined from a previous sub-trajectory. In addition, as described above, the conditioning on τ or z may be based on the final state reached by the respective sub-trajectory.

[0064] The first half of the above equation is used to train the action selection neural network, whereas the second half is used to train the latent predictor neural network. The first half of the effective entropy equation may relate to a first objective that aims to maximize the entropy of the policy, and may be maximized through reinforcement learning. An intrinsic reward r_i may be provided at the end of a given sub-trajectory.

[0065] According to one implementation, the reward is based on the last state s_{(i+1)K} that is achieved in the sub-trajectory τ_i: r_i = log q_w(z_i | s_{(i+1)K}), where q is a probability distribution describing a probability of z_i given s_{(i+1)K} according to the predictor parameter values w. For all time steps before the final time step within the sub-trajectory, the reward may be set to zero.

[0066] The rewards may be recorded in the augmented trajectory, and the action selection neural network may be updated using any reinforcement learning algorithm based on the augmented trajectory. This may be based on either the most recent sub-trajectory and rewards, or based on a number of (e.g. all) previous sub-trajectories and rewards. Reinforcement learning may be implemented through an ε-greedy approach. According to an implementation, both the policy parameter values and the predictor parameter values are updated using reinforcement learning based on r_i.

[0067] When updating the predictor parameter values, the update may be based on a second objective that aims to minimize a cross-entropy between consecutive latent variables. This may be the second part of the above effective entropy equation: −log p(z_i | z_{i−1}). This term results in an update that attempts to minimize a difference between consecutive latent variables.

[0068] As described above, hindsight correction may be applied to predict the previous latent variable and then the next latent variable may be determined by adding a perturbation (e.g. as sampled from a Gaussian distribution) to the predicted previous latent variable. In this case, the probability distribution q describing a probability of a latent variable z given an observation s according to predictor parameter values w may be q_w(z | s) = N(f_w(s), 1), for a function f_w(s) modelled by the latent predictor neural network (e.g. where f_w(s) is defined by the parameter values w). In this case, the update aims to minimize || f_w(s_{(i+1)K}) − (f_w(s_{iK}) + Δ_i) ||², where s_{(i+1)K} is a last observation of K observations for the ith iteration; s_{iK} is a last observation of K observations for the (i−1)th iteration; and Δ_i is a perturbation added to an output from the predictor neural network at the ith iteration to determine the ith latent variable z_i. In the above, f_w(s) may represent an output of the predictor neural network based on observation s. The above minimization may be implemented through gradient descent.

[0069] The above loss is intuitive: the reverse predictor is trained such that the inferred latent state from s_{iK} matches the inferred state from s_{(i+1)K} under the present latent dynamics (i.e. to minimize the difference between the inferred latent state from s_{iK} and the inferred latent state from s_{(i+1)K}). It has been found that an uninformative prior performed best in practice (despite the choice of isotropic Gaussian for the predictor), and thus in one implementation Δ_i is sampled from a uniform distribution on the disc.
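The two training signals described in paragraphs [0064]-[0068] could be sketched as follows, assuming the Gaussian code predictor q_w(z | s) = N(f_w(s), 1); the reward drops the constant normaliser and both functions are illustrative rather than the claimed implementation.

```python
import numpy as np


def intrinsic_reward(predictor, latent: np.ndarray, final_observation: np.ndarray) -> float:
    """r_i = log q_w(z_i | s_{(i+1)K}) up to an additive constant, for a unit-variance Gaussian q_w."""
    diff = latent - predictor(final_observation)
    return float(-0.5 * np.dot(diff, diff))


def predictor_loss(predictor, prev_final_obs: np.ndarray, final_obs: np.ndarray,
                   perturbation: np.ndarray) -> float:
    """|| f_w(s_{(i+1)K}) - (f_w(s_{iK}) + delta_i) ||^2, to be minimised by gradient descent."""
    target = predictor(prev_final_obs) + perturbation
    diff = predictor(final_obs) - target
    return float(np.dot(diff, diff))
```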
[0070] In addition to training based on an intrinsic objective (e.g. based on intrinsic rewards), the method may also train using extrinsic rewards received from the environment. This may occur after a number of iterations of intrinsic training (i.e. without extrinsic rewards). The extrinsic training may be implemented in conjunction with intrinsic training, or based solely on the extrinsic rewards. In either case, the method of FIG. 4 may be implemented, where step 235 includes updating the parameters of one or both of the predictor neural network and the action selection policy neural network based at least partially on extrinsic rewards (one or more rewards received from the environment in response to one or more selected actions). For instance, an extrinsic reward may be received for each selected action (during step 230). The update 235 may then be based on the extrinsic rewards and, optionally, on the intrinsic rewards. For instance, the extrinsic reward for each time step may be added to the intrinsic reward for each time step (potentially with a scaling factor applied to one of the intrinsic and extrinsic rewards).

[0071] FIG. 5 shows 2D latent codes shaded based on the ground truth x (left) and y (right) coordinates for an agent controlling a point-mass within an environment that includes a U-shaped wall. The environment including the U-shaped wall is shown in the center of the figure. The results on the top relate to a version of the present methodology that does not include autorecurrent latent dynamics (i.e. where each code is simply a sampled perturbation, z_{i+1} := Δ_{i+1}, with no dependence on the previous code), termed EDDICT-D. The results on the bottom relate to implementations that make use of the latent dynamics described herein, termed EDDICT (Entropic Desired Dynamics for Intrinsic Control).

[0072] As shown in FIG. 5, EDDICT successfully recovers the ground truth coordinates of the point mass position (under the agent’s control), but not the target position (randomly set per episode and not under the agent’s control). The topological structure of the environment is shown in the latent variables. This can be seen through the correlation between shading (representing the ground truth coordinates) and plot position (representing the latent variable). This is caused by the additive autorecurrent latent dynamics, which encode a notion of code proximity. There is clear discontinuity in the codes determined by EDDICT, showing where the agent is unable to pass through the wall. This shows that the present implementations yield latent state representations that are meaningful. In contrast, EDDICT-D, which does not implement the desired latent dynamics described herein, does not yield latent state representations that correspond to the topology of the environment.

[0073] The methodology described herein was also tested using the Atari™ game Montezuma’s Revenge, which is known to be a difficult game for testing exploration.

[0074] FIG. 6 shows the estimated marginal code entropy H[z_i] (solid) and conditional entropy H[z_i | s_{(i+1)K}] (dashed) for EDDICT and EDDICT-D. It can be seen that EDDICT allows the marginal code entropy to increase over time, whereas EDDICT-D shows a fixed marginal code entropy over time. Whilst the conditional entropy (how predictable the code is from the final state) is lower for EDDICT, EDDICT achieves higher mutual information I(z_i; s_{(i+1)K}) (the difference between the marginal code entropy and the conditional entropy). This higher mutual information shows the increased dependence between the latent variables and the final state, meaning that the implementations described herein achieve greater control over the environment than fixed code alternatives.

[0075] In addition to the above, EDDICT has been found to outperform other intrinsic control methods with regard to exploration, especially when applied to Montezuma’s Revenge, which is known to be one of the hardest exploration games.

[0076] In light of the above, unsupervised reinforcement learning may be implemented with improved exploration by chaining together latent variables through fixed additive latent dynamics. By implementing unsupervised reinforcement learning before supervised learning (e.g. based on one or more specific tasks), the system may leverage its learned understanding of the environment to make the subsequent training based on tasks more efficient. Notably, the methods described herein result in latent dynamics that accurately represent the dynamics of the environment and indicate greater control of the environment.

[0077] The above described methods and systems may be incorporated into a reinforcement learning system that trains an action-selection neural network (and potentially a predictor neural network) through reinforcement learning for use in controlling an agent to perform a reinforcement learning task while interacting with an environment. The methods described above may train the agent based on intrinsic rewards (e.g. rewards that are not received from the environment, such as entropy-based rewards). Subsequent to this, the system may be trained based on extrinsic rewards to perform a particular task.

[0078] In particular, at each time step during the training, the reinforcement learning system receives data characterizing the current state of the environment. Data characterizing the state of the environment will be referred to as an observation. In response to the observation, the system selects an action to be performed by the agent and causes the agent to perform the selected action. Once the agent has performed the selected action, the environment transitions into a new state and the system receives a reward. This may be an intrinsic reward or an extrinsic reward, i.e. an external reward as contrasted with the previously described entropy-based reward.

[0079] In general, the reward is a numerical value. The reward may indicate whether the agent has accomplished the task, or the progress of the agent towards accomplishing the task. For example, if the task specifies that the agent should navigate through the environment to a goal location, then the reward at each time step may have a positive value once the agent reaches the goal location, and a zero value otherwise. As another example, if the task specifies that the agent should explore the environment, then the reward at a time step may have a positive value if the agent navigates to a previously unexplored location at the time step, and a zero value otherwise.
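
Purely as an illustration of the sparse goal-reaching reward just described, a hypothetical reward function may be written as follows; the tolerance value and the function name are illustrative assumptions.

```python
import numpy as np

def goal_reaching_reward(agent_position, goal_position, tolerance=0.05):
    """Sparse task reward: 1.0 once the agent is within `tolerance` of the goal, else 0.0."""
    distance = np.linalg.norm(np.asarray(agent_position) - np.asarray(goal_position))
    return 1.0 if distance <= tolerance else 0.0

# Example: the reward is zero until the agent reaches the goal location.
print(goal_reaching_reward([0.0, 0.0], [1.0, 1.0]))   # 0.0
print(goal_reaching_reward([0.99, 1.0], [1.0, 1.0]))  # 1.0
```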

[0080] The reinforcement learning system may use any model-based or model-free reinforcement learning method, for example a policy gradient technique such as an actor-critic (A-C) method, a Trust Region Policy Optimization (TRPO) method, or a Deep Deterministic Policy Gradient (DDPG) method; or a function approximation technique such as a Deep Q-Network (DQN) method.
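
By way of illustration only, the following sketch shows epsilon-greedy action selection over estimated Q-values and a one-step temporal-difference target of the kind used in DQN-style function approximation methods; the interfaces are assumptions and the Q-network itself is left abstract.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Select a random action with probability epsilon, otherwise the greedy action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def td_target(reward, next_q_values, discount, done):
    """One-step TD target: r + gamma * max_a' Q(s', a'), with bootstrapping cut at episode end."""
    return reward if done else reward + discount * float(np.max(next_q_values))

rng = np.random.default_rng(0)
q_values = np.array([0.1, 0.5, 0.2])
action = epsilon_greedy(q_values, epsilon=0.1, rng=rng)
target = td_target(reward=0.0, next_q_values=q_values, discount=0.99, done=False)
```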

[0081] For example, in an actor-critic based implementation, the action-selection neural network may have a value head to generate a value estimate and a policy head to provide an action selection output. The value estimate may represent a value, for successfully performing the task, of the environment being in the current state. For example, it may comprise an estimate of the return for the task resulting from the environment being in a current state characterized by the observation.
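
By way of illustration only, the following sketch (using PyTorch as an assumed framework) shows an action-selection neural network with a shared trunk, a policy head producing action logits, and a value head producing a scalar value estimate; the layer sizes are illustrative assumptions.

```python
import torch
from torch import nn

class ActorCriticNetwork(nn.Module):
    """Shared trunk with a policy head (action logits) and a value head (scalar value estimate)."""

    def __init__(self, observation_dim, num_actions, hidden_dim=128):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(observation_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.policy_head = nn.Linear(hidden_dim, num_actions)  # action selection output
        self.value_head = nn.Linear(hidden_dim, 1)              # value estimate for the current state

    def forward(self, observation):
        features = self.trunk(observation)
        return self.policy_head(features), self.value_head(features).squeeze(-1)

# Example: a batch of two observations yields logits over actions and one value per observation.
network = ActorCriticNetwork(observation_dim=8, num_actions=4)
logits, values = network(torch.randn(2, 8))
```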

[0082] Generally, an action-selection neural network receives a network input including an observation and generates a network output that defines an action selection policy for selecting an action to be performed by the agent in response to the observation.

[0083] In some implementations, the network output defines a likelihood distribution over actions in a set of possible actions. For example, the network output may include a respective numerical likelihood value for each action in the set of possible actions. As another example, the network output may include respective numerical values defining the parameters of a parametric probability distribution (e.g., the mean and standard deviation of a Normal distribution). In this example, the set of possible actions may be a continuous set (e.g., a continuous range of real numbers). In some of these implementations, the system selects the action to be performed by the agent by sampling an action from the set of possible actions based on the likelihood distribution.
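
By way of illustration only, the following sketch samples a discrete action from a likelihood distribution over a set of possible actions and a continuous action from a Normal distribution parameterized by a mean and standard deviation; all numerical values are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Discrete case: the network output defines likelihood values (here via a softmax over logits).
logits = np.array([2.0, 0.5, -1.0])
probabilities = np.exp(logits - logits.max())
probabilities /= probabilities.sum()
discrete_action = rng.choice(len(probabilities), p=probabilities)

# Continuous case: the network output parameterizes a Normal distribution per action dimension.
mean = np.array([0.1, -0.3])
standard_deviation = np.array([0.2, 0.05])
continuous_action = rng.normal(loc=mean, scale=standard_deviation)
```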

[0084] In some implementations, the network output identifies an action from the set of possible actions. For example, if the agent is a robotic agent, the network output may identify the torques to be applied to the joints of the agent. In some of these implementations, the system selects the action identified by the network output as the action to be performed by the agent or adds noise to the identified action and selects the noisy action as the action to be performed.
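
By way of illustration only, where the network output directly identifies an action such as joint torques, exploration noise may be added and the result clipped to the allowed action range, for example as follows; the noise scale and action bounds are illustrative assumptions.

```python
import numpy as np

def add_exploration_noise(action, noise_scale, action_low, action_high, rng):
    """Add Gaussian noise to a deterministic action and clip it to the allowed range."""
    noisy_action = action + rng.normal(scale=noise_scale, size=np.shape(action))
    return np.clip(noisy_action, action_low, action_high)

rng = np.random.default_rng(0)
torques = np.array([0.4, -0.2, 0.0])  # action identified by the network output
noisy_torques = add_exploration_noise(torques, noise_scale=0.05,
                                       action_low=-1.0, action_high=1.0, rng=rng)
```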

[0085] In some implementations, the network input includes both the observation and a given action from the set of possible actions, and the network output is an estimate of a return that will be received by the system if the agent performs the given action in response to the observation. A return refers to a cumulative measure of reward received by the system as the agent interacts with the environment over multiple time steps. For example, a return may refer to a long-term time-discounted reward received by the system. In some of these implementations, the system can select the action that has the highest return as the action to be performed or can apply an epsilon-greedy action selection policy.

[0086] In some implementations, the environment is a real-world environment, the agent is a mechanical agent interacting with the real-world environment, e.g., a robot or an autonomous or semi-autonomous land, air, or sea vehicle operating in or navigating through the environment, and the actions are actions taken by the mechanical agent in the real-world environment to perform the task. For example, the agent may be a robot interacting with the environment to accomplish a specific task, e.g., to locate an object of interest in the environment or to move an object of interest to a specified location in the environment or to navigate to a specified destination in the environment. Intrinsic (unsupervised) training, as discussed herein, can be implemented initially to train the agent to understand its environment. Following this, the agent may be trained for the specific task.

[0087] In these implementations, the observations may include, e.g., one or more of: images, object position data, and sensor data to capture observations as the agent interacts with the environment, for example sensor data from an image, distance, or position sensor or from an actuator. For example in the case of a robot, the observations may include data characterizing the current state of the robot, e.g., one or more of: joint position, joint velocity, joint force, torque or acceleration, e.g., gravity-compensated torque feedback, and global or relative pose of an item held by the robot. In the case of a robot or other mechanical agent or vehicle the observations may similarly include one or more of the position, linear or angular velocity, force, torque or acceleration, and global or relative pose of one or more parts of the agent. The observations may be defined in 1, 2 or 3 dimensions, and may be absolute and/or relative observations. The observations may also include, for example, sensed electronic signals such as motor current or a temperature signal; and/or image or video data for example from a camera or a LIDAR sensor, e.g., data from sensors of the agent or data from sensors that are located separately from the agent in the environment.
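
Purely by way of illustration, such observations might be assembled into a structure of the following kind; the keys, array shapes, and units are hypothetical and not a required format.

```python
import numpy as np

# Hypothetical robot observation combining proprioceptive sensor data and a camera image.
observation = {
    "joint_positions": np.zeros(7),          # radians, one entry per joint
    "joint_velocities": np.zeros(7),         # radians per second
    "gripper_force": np.array([0.0]),        # Newtons
    "end_effector_pose": np.zeros(6),        # position (m) and orientation (e.g. Euler angles)
    "camera_image": np.zeros((64, 64, 3)),   # RGB image from a camera in the environment
}
```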

[0088] In these implementations, the actions may be control signals to control the robot or other mechanical agent, e.g., torques for the joints of the robot or higher-level control commands, or the autonomous or semi-autonomous land, air, sea vehicle, e.g., torques to the control surface or other control elements e.g. steering control elements of the vehicle, or higher-level control commands. The control signals can include for example, position, velocity, or force/torque/acceleration data for one or more joints of a robot or parts of another mechanical agent. The control signals may also or instead include electronic control data such as motor control data, or more generally data for controlling one or more electronic devices within the environment the control of which has an effect on the observed state of the environment. For example in the case of an autonomous or semi-autonomous land or air or sea vehicle the control signals may define actions to control navigation e.g. steering, and movement e.g., braking and/or acceleration of the vehicle.

[0089] In some implementations the environment is a simulation of the above-described real-world environment, and the agent is implemented as one or more computers interacting with the simulated environment. For example the simulated environment may be a simulation of a robot or vehicle and the reinforcement learning system may be trained on the simulation and then, once trained, used in the real world.

[0090] In some implementations the environment is a real-world manufacturing environment for manufacturing a product, such as a chemical, biological, or mechanical product, or a food product. As used herein, “manufacturing” a product also includes refining a starting material to create a product, or treating a starting material e.g. to remove pollutants, to generate a cleaned or recycled product. The manufacturing plant may comprise a plurality of manufacturing units such as vessels for chemical or biological substances, or machines, e.g. robots, for processing solid or other materials. The manufacturing units are configured such that an intermediate version or component of the product is moveable between the manufacturing units during manufacture of the product, e.g. via pipes or mechanical conveyance. As used herein manufacture of a product also includes manufacture of a food product by a kitchen robot.

[0091] The agent may comprise an electronic agent configured to control a manufacturing unit, or a machine such as a robot, that operates to manufacture the product. That is, the agent may comprise a control system configured to control the manufacture of the chemical, biological, or mechanical product. For example the control system may be configured to control one or more of the manufacturing units or machines or to control movement of an intermediate version or component of the product between the manufacturing units or machines.

[0092] As one example, a task performed by the agent may comprise a task to manufacture the product or an intermediate version or component thereof. As another example, a task performed by the agent may comprise a task to control, e.g. minimize, use of a resource such as a task to control electrical power consumption, or water consumption, or the consumption of any material or consumable used in the manufacturing process.

[0093] The actions may comprise control actions to control the use of a machine or a manufacturing unit for processing a solid or liquid material to manufacture the product, or an intermediate or component thereof, or to control movement of an intermediate version or component of the product within the manufacturing environment e.g. between the manufacturing units or machines. In general the actions may be any actions that have an effect on the observed state of the environment, e.g. actions configured to adjust any of the sensed parameters described below. These may include actions to adjust the physical or chemical conditions of a manufacturing unit, or actions to control the movement of mechanical parts of a machine or joints of a robot. The actions may include actions imposing operating conditions on a manufacturing unit or machine, or actions that result in changes to settings to adjust, control, or switch on or off the operation of a manufacturing unit or machine.

[0094] The (extrinsic) rewards or return may relate to a metric of performance of the task. For example, in the case of a task that is to manufacture a product, the metric may comprise a metric of a quantity of the product that is manufactured, a quality of the product, a speed of production of the product, or a physical cost of performing the manufacturing task, e.g. a metric of a quantity of energy, materials, or other resources used to perform the task. In the case of a task that is to control use of a resource, the metric may comprise any metric of usage of the resource.

[0095] In general observations of a state of the environment may comprise any electronic signals representing the functioning of electronic and/or mechanical items of equipment. For example a representation of the state of the environment may be derived from observations made by sensors sensing a state of the manufacturing environment, e.g. sensors sensing a state or configuration of the manufacturing units or machines, or sensors sensing movement of material between the manufacturing units or machines. As some examples such sensors may be configured to sense mechanical movement or force, pressure, temperature; electrical conditions such as current, voltage, frequency, impedance; quantity, level, flow/movement rate or flow/movement path of one or more materials; physical or chemical conditions e.g. a physical state, shape or configuration or a chemical state such as pH; configurations of the units or machines such as the mechanical configuration of a unit or machine, or valve configurations; image or video sensors to capture image or video observations of the manufacturing units or of the machines or movement; or any other appropriate type of sensor. In the case of a machine such as a robot the observations from the sensors may include observations of position, linear or angular velocity, force, torque or acceleration, or pose of one or more parts of the machine, e.g. data characterizing the current state of the machine or robot or of an item held or processed by the machine or robot. The observations may also include, for example, sensed electronic signals such as motor current or a temperature signal, or image or video data for example from a camera or a LIDAR sensor. Sensors such as these may be part of or located separately from the agent in the environment.

[0096] In some implementations the environment is the real-world environment of a service facility comprising a plurality of items of electronic equipment, such as a server farm or data center, for example a telecommunications data center, or a computer data center for storing or processing data, or any service facility. The service facility may also include ancillary control equipment that controls an operating environment of the items of equipment, for example environmental control equipment such as temperature control e.g. cooling equipment, or air flow control or air conditioning equipment. The task may comprise a task to control, e.g. minimize, use of a resource, such as a task to control electrical power consumption, or water consumption. The agent may comprise an electronic agent configured to control operation of the items of equipment, or to control operation of the ancillary, e.g. environmental, control equipment.

[0097] In general the actions may be any actions that have an effect on the observed state of the environment, e.g. actions configured to adjust any of the sensed parameters described below. These may include actions to control, or to impose operating conditions on, the items of equipment or the ancillary control equipment, e.g. actions that result in changes to settings to adjust, control, or switch on or off the operation of an item of equipment or an item of ancillary control equipment.

[0098] In general observations of a state of the environment may comprise any electronic signals representing the functioning of the facility or of equipment in the facility. For example a representation of the state of the environment may be derived from observations made by any sensors sensing a state of a physical environment of the facility or observations made by any sensors sensing a state of one or more items of equipment or one or more items of ancillary control equipment. These include sensors configured to sense electrical conditions such as current, voltage, power or energy; a temperature of the facility; fluid flow, temperature or pressure within the facility or within a cooling system of the facility; or a physical facility configuration such as whether or not a vent is open.

[0099] The (extrinsic) rewards or return may relate to a metric of performance of the task. For example in the case of a task to control, e.g. minimize, use of a resource, such as a task to control use of electrical power or water, the metric may comprise any metric of use of the resource.

[0100] In some implementations the environment is the real-world environment of a power generation facility e.g. a renewable power generation facility such as a solar farm or wind farm. The task may comprise a control task to control power generated by the facility, e.g. to control the delivery of electrical power to a power distribution grid, e.g. to meet demand or to reduce the risk of a mismatch between elements of the grid, or to maximize power generated by the facility. The agent may comprise an electronic agent configured to control the generation of electrical power by the facility or the coupling of generated electrical power into the grid. The actions may comprise actions to control an electrical or mechanical configuration of an electrical power generator such as the electrical or mechanical configuration of one or more renewable power generating elements e.g. to control a configuration of a wind turbine or of a solar panel or panels or mirror, or the electrical or mechanical configuration of a rotating electrical power generation machine. Mechanical control actions may, for example, comprise actions that control the conversion of an energy input to an electrical energy output, e.g. an efficiency of the conversion or a degree of coupling of the energy input to the electrical energy output. Electrical control actions may, for example, comprise actions that control one or more of a voltage, current, frequency or phase of electrical power generated.

[0101] The (extrinsic) rewards or return may relate to a metric of performance of the task. For example in the case of a task to control the delivery of electrical power to the power distribution grid the metric may relate to a measure of power transferred, or to a measure of an electrical mismatch between the power generation facility and the grid such as a voltage, current, frequency or phase mismatch, or to a measure of electrical power or energy loss in the power generation facility. In the case of a task to maximize the delivery of electrical power to the power distribution grid the metric may relate to a measure of electrical power or energy transferred to the grid, or to a measure of electrical power or energy loss in the power generation facility.

[0102] In general observations of a state of the environment may comprise any electronic signals representing the electrical or mechanical functioning of power generation equipment in the power generation facility. For example a representation of the state of the environment may be derived from observations made by any sensors sensing a physical or electrical state of equipment in the power generation facility that is generating electrical power, or the physical environment of such equipment, or a condition of ancillary equipment supporting power generation equipment. Such sensors may include sensors configured to sense electrical conditions of the equipment such as current, voltage, power or energy; temperature or cooling of the physical environment; fluid flow; or a physical configuration of the equipment; and observations of an electrical condition of the grid e.g. from local or remote sensors. Observations of a state of the environment may also comprise one or more predictions regarding future conditions of operation of the power generation equipment such as predictions of future wind levels or solar irradiance or predictions of a future electrical condition of the grid.

[0103] As another example, the environment may be a chemical synthesis or protein folding environment such that each state is a respective state of a protein chain or of one or more intermediates or precursor chemicals and the agent is a computer system for determining how to fold the protein chain or synthesize the chemical. In this example, the actions are possible folding actions for folding the protein chain or actions for assembling precursor chemicals/intermediates and the result to be achieved may include, e.g., folding the protein so that the protein is stable and so that it achieves a particular biological function or providing a valid synthetic route for the chemical. As another example, the agent may be a mechanical agent that performs or controls the protein folding actions or chemical synthesis steps selected by the system automatically without human interaction. The observations may comprise direct or indirect observations of a state of the protein or chemical/intermediates/precursors and/or may be derived from simulation.

[0104] In a similar way the environment may be a drug design environment such that each state is a respective state of a potential pharmachemical drug and the agent is a computer system for determining elements of the pharmachemical drug and/or a synthetic pathway for the pharmachemical drug. The drug/synthesis may be designed based on a reward derived from a target for the drug, for example in simulation. As another example, the agent may be a mechanical agent that performs or controls synthesis of the drug.

[0105] In some further applications, the environment is a real-world environment and the agent manages distribution of tasks across computing resources e.g. on a mobile device and/or in a data center. In these implementations, the actions may include assigning tasks to particular computing resources.

[0106] As a further example, the actions may include presenting advertisements, the observations may include advertisement impressions or a click-through count or rate, and the reward may characterize previous selections of items or content taken by one or more users.

[0107] In some cases, the observations may include textual or spoken instructions provided to the agent by a third-party (e.g., an operator of the agent). For example, the agent may be an autonomous vehicle, and a user of the autonomous vehicle may provide textual or spoken instructions to the agent (e.g., to navigate to a particular location).

[0108] As another example, the environment may be an electrical, mechanical or electro-mechanical design environment, e.g. an environment in which the design of an electrical, mechanical or electro-mechanical entity is simulated. The simulated environment may be a simulation of a real-world environment in which the entity is intended to work. The task may be to design the entity. The observations may comprise observations that characterize the entity, i.e. observations of a mechanical shape or of an electrical, mechanical, or electro-mechanical configuration of the entity, or observations of parameters or properties of the entity. The actions may comprise actions that modify the entity, e.g. that modify one or more of the observations. The (extrinsic) rewards or return may comprise one or more metrics of performance of the design of the entity. For example, rewards or return may relate to one or more physical characteristics of the entity such as weight or strength, or to one or more electrical characteristics of the entity such as a measure of efficiency at performing a particular function for which the entity is designed. The design process may include outputting the design for manufacture, e.g. in the form of computer executable instructions for manufacturing the entity. The process may include making the entity according to the design. Thus a design of an entity may be optimized, e.g. by reinforcement learning, and then the optimized design output for manufacturing the entity, e.g. as computer executable instructions; an entity with the optimized design may then be manufactured.

[0109] As previously described the environment may be a simulated environment. Generally in the case of a simulated environment the observations may include simulated versions of one or more of the previously described observations or types of observations and the actions may include simulated versions of one or more of the previously described actions or types of actions. For example the simulated environment may be a motion simulation environment, e.g., a driving simulation or a flight simulation, and the agent may be a simulated vehicle navigating through the motion simulation. In these implementations, the actions may be control inputs to control the simulated user or simulated vehicle. Generally the agent may be implemented as one or more computers interacting with the simulated environment.

[0110] The simulated environment may be a simulation of a particular real-world environment and agent. For example, the system may be used to select actions in the simulated environment during training or evaluation of the system and, after training, or evaluation, or both, are complete, may be deployed for controlling a real-world agent in the particular real-world environment that was the subject of the simulation. This can avoid unnecessary wear and tear on and damage to the real-world environment or real-world agent and can allow the control neural network to be trained and evaluated on situations that occur rarely or are difficult or unsafe to re-create in the real-world environment. For example the system may be partly trained using a simulation of a mechanical agent in a simulation of a particular real-world environment, and afterwards deployed to control the real mechanical agent in the particular real-world environment. Thus in such cases the observations of the simulated environment relate to the real-world environment, and the selected actions in the simulated environment relate to actions to be performed by the mechanical agent in the real-world environment.

[0111] Optionally, in any of the above implementations, the observation at any given time step may include data from a previous time step that may be beneficial in characterizing the environment, e.g., the action performed at the previous time step, the reward received at the previous time step, or both.
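
By way of illustration only, augmenting the current observation with the action and reward from the previous time step might be implemented as follows; the flat-vector representation and helper name are assumptions.

```python
import numpy as np

def augment_observation(observation, previous_action, previous_reward):
    """Concatenate the current observation with the previous action and reward."""
    return np.concatenate([np.asarray(observation, dtype=np.float32).ravel(),
                           np.asarray(previous_action, dtype=np.float32).ravel(),
                           np.asarray([previous_reward], dtype=np.float32)])

augmented = augment_observation(observation=[0.2, -0.1],
                                previous_action=[1.0, 0.0],
                                previous_reward=0.5)  # -> length-5 vector
```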

[0112] Once trained the system may be used to perform the task for which it was trained, optionally with training continuing during such use. The task may be, e.g., any of the tasks described above. In general the trained system may be used to control the agent to achieve rewards or minimize costs as described above. Merely by way of example, once trained the system may be used to control a robot or vehicle to perform a task such as manipulating, assembling, treating or moving one or more objects; or to control equipment e.g. to minimize energy use; or in healthcare, to suggest medical treatments.

[0113] For a system of one or more computers to be configured to perform particular operations or actions means that the system has installed on it software, firmware, hardware, or a combination of them that in operation cause the system to perform the operations or actions. For one or more computer programs to be configured to perform particular operations or actions means that the one or more programs include instructions that, when executed by data processing apparatus, cause the apparatus to perform the operations or actions.

[0114] Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly-embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non-transitory program carrier for execution by, or to control the operation of, data processing apparatus. Alternatively or in addition, the program instructions can be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus. The computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them. The computer storage medium is not, however, a propagated signal.

[0115] The term “data processing apparatus” encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can include special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit). The apparatus can also include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.

[0116] A computer program (which may also be referred to or described as a program, software, a software application, a module, a software module, a script, or code) can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, e.g., files that store one or more modules, sub-programs, or portions of code. A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.

[0117] As used in this specification, an “engine,” or “software engine,” refers to a software implemented input/output system that provides an output that is different from the input. An engine can be an encoded block of functionality, such as a library, a platform, a software development kit (“SDK”), or an object. Each engine can be implemented on any appropriate type of computing device, e.g., servers, mobile phones, tablet computers, notebook computers, music players, e-book readers, laptop or desktop computers, PDAs, smart phones, or other stationary or portable devices, that includes one or more processors and computer readable media. Additionally, two or more of the engines may be implemented on the same computing device, or on different computing devices.

[0118] The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit). For example, the processes and logic flows can be performed by and apparatus can also be implemented as a graphics processing unit (GPU).

[0119] Computers suitable for the execution of a computer program can be based on general or special purpose microprocessors or both, or any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read only memory or a random access memory or both. The essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device, e.g., a universal serial bus (USB) flash drive, to name just a few.

[0120] Computer readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.

[0121] To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user's client device in response to requests received from the web browser.

[0122] Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front end component, e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back end, middleware, or front end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (“LAN”) and a wide area network (“WAN”), e.g., the Internet.

[0123] The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

[0124] While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.

[0125] Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

[0126] Particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In certain implementations, multitasking and parallel processing may be advantageous.