Title:
METHOD, MACHINE-READABLE MEDIUM AND CONTROL SYSTEM FOR CONTROLLING AT LEAST ONE HEATING, VENTILATION AND AIR CONDITIONING (HVAC) DEVICE
Document Type and Number:
WIPO Patent Application WO/2023/073336
Kind Code:
A1
Abstract:
A method is described for controlling at least one HVAC device by means of a neural network, the neural network being trained using the following approach: providing an input dataset; providing an environment associated with one or more HVAC devices; providing a baseline simulator arranged to simulate the operation of the HVAC devices based on the input dataset so as to obtain the value of a target metric; providing a learning system comprising the neural network; and training the neural network in dependence on the input dataset, the environment and the value of the target metric.

Inventors:
SHALUNOV SERGEY (GB)
ZUBOV MAKSIM (AE)
BELIAEV ALEKSANDR (RU)
PUSHMIN VLADIMIR (SG)
Application Number:
PCT/GB2022/051855
Publication Date:
May 04, 2023
Filing Date:
July 18, 2022
Assignee:
ARLOID AUTOMATION LTD (GB)
International Classes:
G05B13/02; G05B15/02
Foreign References:
US20030074338A1 (2003-04-17)
US20060120596A1 (2006-06-08)
RU2686030C1 (2019-04-23)
Claims:

1. A control system for controlling at least one heating, ventilation and air conditioning (HVAC) device, comprising:

- a neural network of the control system for the at least one HVAC device, wherein the neural network is configured to be trained by a training system;

- a training system configured to train the specified neural network, comprising a global system and one or more child systems, wherein the training system comprises: a global actor-critic system comprising a global actor neural network and a global critic neural network, and one or more child actor-critic systems each comprising a child actor neural network and a child critic neural network;

- a baseline simulator;

- a virtual environment module;

- a controller; and

- a memory storing instructions prompting the specified training system to train the specified neural network in accordance with steps including: a) providing an input dataset; b) generating, by the virtual environment module, a virtual environment associated with one or more HVAC devices, wherein the specified generation is based on the provided input dataset, and wherein the specified virtual environment comprises at least one virtual model of an HVAC device and at least one virtual model of a room that contains the at least one virtual model of the HVAC device; c) executing, by the specified baseline simulator, modeling of the operating mode of the specified at least one virtual HVAC device in the specified virtual model of the room, wherein the specified modeling is based on the provided input dataset; d) obtaining a target metric based on the performed modeling; and e) training the specified neural network in accordance with the obtained target metric and in accordance with the input dataset;

- wherein the specified controller is configured to generate control instructions and transmit the specified instructions to the at least one HVAC device, wherein the specified control instructions are generated based on the target metric obtained in accordance with step d);

- wherein the neural network is trained using a method comprising: providing gradients from at least one child training system to the global training system and updating the parameters of the global training system based on the gradients received from the at least one child training system; copying the parameters from the global training system to at least one of the child training systems; and repeating the steps of providing gradients and copying parameters until each of the neural networks of the child systems and the global system has converged.

2. The control system of claim 1, wherein the target metric relates to one or more of: an electricity and/or power usage; an electricity and/or power cost; a time of device operation; a number of times the operation of a device is altered; and a deviation from desired conditions.

3. The control system of claim 1, wherein the training system comprises an actor-critic system comprising an actor neural network and a critic neural network.

4. The control system of claim 1, wherein training the specified neural network includes determining that two or more of the neural networks have converged to provide a predetermined mode of operation of at least one virtual HVAC device in accordance with the specified virtual environment.

5. The control system of claim 1, wherein the neural network is trained using a method comprising: training each child system in dependence on a separate copy of the virtual environment, wherein each of the specified copies of the virtual environment has different initialization properties, wherein the initialization properties comprise at least one of: different population densities and/or different configurations of at least one virtual HVAC device and/or different configurations of the virtual model of the room or their combination.


6. The control system of claim 1, wherein the baseline simulator is arranged to provide a separate target metric to at least one of the child systems, wherein the specified separate target metrics are determined in dependence on the respective separate copy of the virtual environment.

7. The control system of claim 1, wherein the neural network is trained in dependence on a set of target conditions provided by the baseline simulator, wherein the set of target conditions comprises one or more of: a temperature in the room and/or a temperature outside the room, a humidity in the room and/or a humidity outside the room.

8. The control system of claim 1, wherein the neural network is trained in dependence on a plurality of target metrics and/or sets of target conditions provided by the baseline simulator, wherein each target metric is used by the training system for a single training step.

9. The control system of claim 1, wherein the neural network is trained using a method comprising providing a reinforcement training agent arranged to operate in dependence on the training system, wherein the specified agent is arranged to interact with one or more HVAC devices in the specified virtual environment.

10. The control system of claim 9, wherein the baseline simulator is arranged to provide a plurality of target metrics and/or target conditions in dependence on a maximum number of actions that can be taken by the specified agent over a predetermined training period.

11. The control system of claim 1, wherein the baseline simulator is arranged to provide the target metric in dependence on an input dataset comprising a time-limited dataset, wherein the neural network is trained in dependence on the same time-limited dataset.

12. The control system of claim 1, wherein the baseline simulator is arranged to provide the target metric without using a neural network and/or wherein the baseline simulator is arranged to provide the target metric by using one or more of: decision tree methods; linear transformation of data; methods based on connection analysis; algorithms using gradient boosting.


13. The control system of claim 1, wherein the baseline simulator is arranged to provide the target metric in dependence on: a user input to control the operation of at least one HVAC device in the specified virtual environment; and/or historic data relating to the operation of the HVAC devices.

14. The control system of claim 1, wherein the input dataset comprises one or more of: description data and/or configuration data of the HVAC devices in the specified virtual environment; a building layout data and/or zonal plan data; and weather conditions data.

15. The control system of claim 1, wherein the neural network is trained using a method comprising providing a plurality of the specified agents, wherein each of the specified agents is configured to perform a plurality of actions and/or communicate with at least one HVAC device, wherein each HVAC device in the environment is associated with a separate agent and/or wherein separate groups of similar HVAC devices in the environment are associated with separate agents.

16. The control system of claim 1, wherein the neural network is trained using a method comprising training a plurality of training systems, preferably training a separate training system for each virtual HVAC device in the environment and/or for each of a plurality of groups of similar virtual HVAC devices in the virtual environment.

17. The control system of claim 1, wherein the neural network is trained using a method comprising training a plurality of training systems associated with a plurality of the specified agents, wherein each training system is arranged to interact with a respective agent.

18. The control system of claim 1, wherein the neural network is trained using a method comprising:

- training a first child system of a first training system and a first child system of a second training system in dependence on a first copy of the environment; - training a second child system of the first training system and a second child system of the second training system in dependence on a second copy of the environment; wherein the first child system of the first training system and the first child system of the second training system are arranged to interact with the first copy of the environment substantially simultaneously and/or wherein the first child system of the first training system and the first child system of the second training system are arranged to interact with the first copy of the environment according to an order, preferably a predetermined order.

19. A method for controlling at least one heating, ventilation and air conditioning (HVAC) device, including the steps of: a) providing an input dataset; b) generating, by the virtual environment module, a virtual environment associated with one or more HVAC devices, wherein the specified generation is based on the received input dataset, and wherein the specified virtual environment comprises at least one virtual model of an HVAC device and at least one virtual model of a room that contains the at least one virtual model of the HVAC device; c) executing, by the specified baseline simulator, modeling of the operating mode of the specified at least one virtual HVAC device in the specified virtual model of the room, wherein the specified modeling is based on the provided input dataset; d) obtaining a target metric based on the performed modeling; and e) training the specified neural network in accordance with the obtained target metric and in accordance with the input dataset, wherein training of the specified neural network is executed by a training system comprising a global actor-critic system and one or more child actor-critic systems;

- generating control instructions by the controller; and

- transmitting the specified instructions to the at least one HVAC device, wherein the specified control instructions are generated based on the target metric obtained in accordance with step d); - wherein the neural network is trained using a method comprising: providing gradients from at least one child training system to the global training system and updating the parameters of the global training system based on the gradients received from the at least one child training system; copying the parameters from the global training system to at least one of the child training systems; and repeating the steps of providing gradients and copying parameters until each of the neural networks of the child systems and the global system has converged.

20. A computer-readable medium storing a computer program product containing instructions that, when executed by a processor, cause the processor to perform the method of claim 19.

21. A method for training a neural network of a control system for controlling at least one heating, ventilation and air conditioning (HVAC) device, including the steps of: a) providing an input dataset; b) generating, by the virtual environment module, a virtual environment associated with one or more HVAC devices, wherein the specified generation is based on the provided input dataset, and wherein the specified virtual environment comprises at least one virtual model of an HVAC device and at least one virtual model of a room that contains the at least one virtual model of the HVAC device; c) executing, by the specified baseline simulator, modeling of the operating mode of the specified at least one virtual HVAC device in the specified virtual model of the room, wherein the specified modeling is based on the provided input dataset; d) obtaining a target metric based on the performed modeling; and e) training the specified neural network in accordance with the obtained target metric and in accordance with the input dataset, wherein training of the specified neural network is executed by a training system comprising a global actor-critic system and one or more child actor-critic systems;

- wherein the neural network is trained using a method comprising: providing gradients from at least one child training system to the global training system and updating the parameters of the global training system based on the gradients received from the at least one child training system; copying the parameters from the global training system to at least one of the child training systems; and repeating the steps of providing gradients and copying parameters until each of the neural networks of the child systems and the global system has converged.


Description:
METHOD, MACHINE-READABLE MEDIUM AND CONTROL SYSTEM FOR CONTROLLING AT LEAST ONE HEATING, VENTILATION AND AIR CONDITIONING (HVAC) DEVICE

TECHNICAL FIELD:

[0001] The present invention relates to a control system for the control of at least one heating, ventilation and air conditioning (HVAC) device, as well as for training a training system to work with an HVAC device.

BACKGROUND ART:

[0002] Neural networks are machine learning models that use one or more layers of blocks called neurons, which non-linearly transform data, to predict an output for a given input. Some of these networks are deep networks, with one or more hidden layers in addition to the output layer. The output of each layer is used as the input for the next neural network layer, e.g. the next hidden layer or the output layer. Each layer generates its output from the received input in accordance with the current values of a set of parameters.
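
As a minimal illustration of the layer-by-layer transformation described above, the following Python sketch passes an input through one hidden layer and an output layer; the layer sizes, random weights and tanh non-linearity are illustrative assumptions rather than details of the application:

    import numpy as np

    def layer(x, W, b, activation=np.tanh):
        # Each layer transforms its received input using the current values of its
        # parameters (W, b) and a non-linearity; its output feeds the next layer.
        return activation(W @ x + b)

    rng = np.random.default_rng(0)
    x = rng.normal(size=3)                                  # input (an observation)
    W_hidden, b_hidden = rng.normal(size=(5, 3)), np.zeros(5)
    W_out, b_out = rng.normal(size=(2, 5)), np.zeros(2)

    hidden = layer(x, W_hidden, b_hidden)                   # hidden layer output
    output = layer(hidden, W_out, b_out, activation=lambda v: v)  # output layer (linear here)
    print(output)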

[0003] There are many known methods for training agents which control complex systems. These methods attempt to train a system that can choose optimal solutions from a fixed set of actions. Such systems usually require a large amount of training data and are often unable to adapt to situations that have not been included in the training dataset. When training several agents, the computational complexity can grow exponentially leading to excessive training times.

[0004] One example of such systems is the method of continuous control using deep reinforcement learning described in RU 2686030 C1. The known method includes obtaining a mini-batch of experience tuples and updating the current values of the parameters of the actor neural network, comprising, for each experience tuple in the mini-batch: processing the training observation and training action in an experimental network for the experience tuple and determining the predictive output of the neural network for the experience tuple. The current values of the parameters of the critic neural network are updated using errors between the predictive outputs of the neural network and the outputs of the neural network, and the current values of the parameters of the actor neural network are updated using the critic neural network.

[0005] However, the known invention has disadvantages. When the known method is applied to the control of an HVAC device, its disadvantages include the low control accuracy of the at least one HVAC device in ensuring optimal heating, ventilation and air conditioning in the room, as well as the low accuracy of maintaining a constant internal microclimate. These disadvantages arise because the use, in the known method, of a predictive actor neural network and a predictive critic neural network to determine updates for the current values of the parameters of the actor neural network leads to a significant increase in computational complexity and, consequently, to an increase in the time required for training the neural network. The increase in computational complexity leads to a decrease in the control accuracy of the HVAC device, since it gives rise to computational errors.

[0006] When training multiple agents, computational complexity can grow exponentially, resulting in excessive training time. It is desirable to develop new methods for training neural networks that ensure good performance of neural networks while minimizing the time required to train such neural networks.

SUMMARY:

[0007] The problem solved by the claimed invention is to eliminate at least one of the above disadvantages.

[0008] The technical result of the present invention is to improve the control accuracy of at least one heating, ventilation and air conditioning (HVAC) device, thereby improving the accuracy of maintaining a constant internal microclimate and, as a consequence, reducing energy costs.

[0009] An additional technical result is to minimize the time and computational resources required to train the neural network of the control system with at least one HVAC device.

[0010] The first possible embodiment of the present invention provides a control system for controlling at least one heating, ventilation and air conditioning (HVAC) device, comprising: a neural network of the control system for the at least one HVAC device, wherein the neural network is configured to be trained by a training system; a training system configured to train the specified neural network, comprising a global system and one or more child systems, wherein the training system comprises: a global actor-critic system comprising a global actor neural network and a global critic neural network, and one or more child actor-critic systems each comprising a child actor neural network and a child critic neural network; a baseline simulator; a virtual environment module; a controller; and a memory storing instructions prompting the specified training system to train the specified neural network in accordance with steps including: a) providing an input dataset; b) generating, by the virtual environment module, a virtual environment associated with one or more HVAC devices, wherein the specified generation is based on the provided input dataset, and wherein the specified virtual environment comprises at least one virtual model of an HVAC device and at least one virtual model of a room that contains the at least one virtual model of the HVAC device; c) executing, by the specified baseline simulator, modeling of the operating mode of the specified at least one virtual HVAC device in the specified virtual model of the room, wherein the specified modeling is based on the provided input dataset; d) obtaining a target metric based on the performed modeling; and e) training the specified neural network in accordance with the obtained target metric and in accordance with the input dataset; wherein the specified controller is configured to generate control instructions and transmit the specified instructions to the at least one HVAC device, wherein the specified control instructions are generated based on the target metric obtained in accordance with step d); wherein the neural network is trained using a method comprising: providing gradients from at least one child training system to the global training system and updating the parameters of the global training system based on the gradients received from the at least one child training system; copying the parameters from the global training system to at least one of the child training systems; and repeating the steps of providing gradients and copying parameters until each of the neural networks of the child systems and the global system has converged.
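
The gradient-sharing and parameter-copying loop described above resembles asynchronous actor-critic training with a shared global network. The following Python sketch shows that exchange in its simplest form; the quadratic per-child loss, learning rate and convergence test are illustrative assumptions, not details taken from the application:

    import numpy as np

    class ChildSystem:
        """Illustrative stand-in for a child actor-critic system: gradient of a quadratic loss."""
        def __init__(self, target):
            self.target = target
            self.params = None

        def compute_gradients(self, params):
            # Gradient of 0.5 * ||params - target||^2 evaluated on this child's data.
            return params - self.target

    def train(global_params, children, lr=0.1, tol=1e-6, max_rounds=10_000):
        for _ in range(max_rounds):
            grads = [child.compute_gradients(global_params) for child in children]
            step = lr * np.mean(grads, axis=0)          # update global parameters from child gradients
            global_params = global_params - step
            for child in children:
                child.params = global_params.copy()     # copy global parameters back to the children
            if np.linalg.norm(step) < tol:              # repeat until the networks have converged
                break
        return global_params

    children = [ChildSystem(np.array([1.0, 2.0])), ChildSystem(np.array([3.0, 0.0]))]
    print(train(np.zeros(2), children))                 # converges to the mean target [2.0, 1.0]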

[0011] Additionally the target metric relates to one or more of: an electricity and/or power usage; an electricity and/or power cost; a time of device operation; a number of times the operation of a device is altered; and a deviation from desired conditions.

[0012] Additionally the training system comprises an actor-critic system comprising an actor neural network and a critic neural network.

[0013] Additionally training the specified neural network includes determining that two or more of the neural networks have converged to provide a predetermined mode of operation of at least one virtual HVAC device in accordance with the specified virtual environment.

[0014] Additionally the neural network is trained using a method comprising: training each child system in dependence on a separate copy of the virtual environment, wherein each of the specified copies of the virtual environment has different initialization properties, wherein the initialization properties comprise at least one of: different population densities and/or different configurations of at least one virtual HVAC device and/or different configurations of the virtual model of the room or their combination.

[0015] Additionally the baseline simulator is arranged to provide a separate target metric to at least one of the child systems, wherein the specified separate target metrics are determined in dependence on the respective separate copy of the virtual environment.

[0016] Additionally the neural network is trained in dependence on a set of target conditions provided by the baseline simulator, wherein the set of target conditions comprises one or more of: a temperature in the room and/or a temperature outside the room, a humidity in the room and/or a humidity outside the room.

[0017] Additionally the neural network is trained in dependence on a plurality of target metrics and/or sets of target conditions provided by the baseline simulator, wherein each target metric is used by the training system for a single training step.

[0018] Additionally the neural network is trained using a method comprising providing a reinforcement training agent arranged to operate in dependence on the training system, wherein the specified agent is arranged to interact with one or more HVAC devices in the specified virtual environment.

[0019] Additionally the baseline simulator is arranged to provide a plurality of target metrics and/or target conditions in dependence on a maximum number of actions that can be taken by the specified agent over a predetermined training period.

[0020] Additionally the baseline simulator is arranged to provide the target metric in dependence on an input dataset comprising a time-limited dataset, wherein the neural network is trained in dependence on the same time-limited dataset.

[0021] Additionally the baseline simulator is arranged to provide the target metric without using a neural network and/or wherein the baseline simulator is arranged to provide the target metric by using one or more of: decision tree methods; linear transformation of data; methods based on connection analysis; algorithms using gradient boosting.

[0022] Additionally the baseline simulator is arranged to provide the target metric in dependence on: a user input to control the operation of at least one HVAC device in the specified virtual environment; and/or historic data relating to the operation of the HVAC devices.

[0023] Additionally the input dataset comprises one or more of: description data and/or configuration data of the HVAC devices in the specified virtual environment; a building layout data and/or zonal plan data; and weather conditions data.

[0024] Additionally the neural network is trained using a method comprising providing a plurality of the specified agents, wherein each of the specified agents is configured to perform a plurality of actions and/or communicate with at least one HVAC device, wherein each HVAC device in the environment is associated with a separate agent and/or wherein separate groups of similar HVAC devices in the environment are associated with separate agents.

[0025] Additionally the neural network is trained using a method comprising training a plurality of training systems, preferably training a separate training system for each virtual HVAC device in the environment and/or for each of a plurality of groups of similar virtual HVAC devices in the virtual environment.

[0026] Additionally the neural network is trained using a method comprising training a plurality of training systems associated with a plurality of the specified agents, wherein each training system is arranged to interact with a respective agent.

[0027] Additionally the neural network is trained using a method comprising: training a first child system of a first training system and a first child system of a second training system in dependence on a first copy of the environment; training a second child system of the first training system and a second child system of the second training system in dependence on a second copy of the environment; wherein the first child system of the first training system and the first child system of the second training system are arranged to interact with the first copy of the environment substantially simultaneously and/or wherein the first child system of the first training system and the first child system of the second training system are arranged to interact with the first copy of the environment according to an order, preferably a predetermined order.

[0028] The second possible embodiment of the present invention provides a method for controlling at least one heating, ventilation and air conditioning (HVAC) device, including the steps of: a) providing an input dataset; b) generating, by the virtual environment module, a virtual environment associated with one or more HVAC devices, wherein the specified generation is based on the received input dataset, and wherein the specified virtual environment comprises at least one virtual model of an HVAC device and at least one virtual model of a room that contains the at least one virtual model of the HVAC device; c) executing, by the specified baseline simulator, modeling of the operating mode of the specified at least one virtual HVAC device in the specified virtual model of the room, wherein the specified modeling is based on the provided input dataset; d) obtaining a target metric based on the performed modeling; and e) training the specified neural network in accordance with the obtained target metric and in accordance with the input dataset, wherein training of the specified neural network is executed by a training system comprising a global actor-critic system and one or more child actor-critic systems; generating control instructions by the controller; and transmitting the specified instructions to the at least one HVAC device, wherein the specified control instructions are generated based on the target metric obtained in accordance with step d); wherein the neural network is trained using a method comprising: providing gradients from at least one child training system to the global training system and updating the parameters of the global training system based on the gradients received from the at least one child training system; copying the parameters from the global training system to at least one of the child training systems; and repeating the steps of providing gradients and copying parameters until each of the neural networks of the child systems and the global system has converged.

[0029] The third possible embodiment of the present invention provides a computer-readable medium storing a computer program product containing instructions that, when executed by a processor, cause the processor to perform the method for controlling at least one heating, ventilation and air conditioning (HVAC) device.

[0030] The fourth possible embodiment of the present invention provides a method for training a neural network of a control system for controlling at least one heating, ventilation and air conditioning (HVAC) device, including the steps of: a) providing an input dataset; b) generating, by the virtual environment module, a virtual environment associated with one or more HVAC devices, wherein the specified generation is based on the provided input dataset, and wherein the specified virtual environment comprises at least one virtual model of an HVAC device and at least one virtual model of a room that contains the at least one virtual model of the HVAC device; c) executing, by the specified baseline simulator, modeling of the operating mode of the specified at least one virtual HVAC device in the specified virtual model of the room, wherein the specified modeling is based on the provided input dataset; d) obtaining a target metric based on the performed modeling; and e) training the specified neural network in accordance with the obtained target metric and in accordance with the input dataset, wherein training of the specified neural network is executed by a training system comprising a global actor-critic system and one or more child actor-critic systems; wherein the neural network is trained using a method comprising: providing gradients from at least one child training system to the global training system and updating the parameters of the global training system based on the gradients received from the at least one child training system; copying the parameters from the global training system to at least one of the child training systems; and repeating the steps of providing gradients and copying parameters until each of the neural networks of the child systems and the global system has converged.

[0031] It is obvious that both the previous general description and the following detailed description are given by way of example and explanation only and are not limitations of the present invention.

BRIEF DESCRIPTION OF THE FIGURES:

[0032] FIG. 1 shows an exemplary view of a virtual environment containing an HVAC device.

[0033] FIG. 2 shows a system incorporating a reinforcement learning system.

[0034] FIG. 3 shows a computing device on which the system of FIG. 2 may be implemented.

[0035] FIG. 4 shows a neural network of a control system for at least one HVAC device.

[0036] FIG. 5 shows the actor-critic system.

[0037] FIG. 6 shows a detailed embodiment of a reinforcement learning system for the system of FIG. 2.

[0038] FIG. 7 shows the interaction of child and global neural networks in accordance with FIG. 6.

[0039] FIG. 8 shows a method for updating the weights of child neural networks and the global neural network in accordance with FIG. 6.

[0040] FIG. 9 illustrates the interactions between agents, reinforcement learning systems, and virtual environment replicas that may be present in the system in accordance with FIG. 2.

[0041] FIG. 10 shows a method for providing a result based on an action suggested by a reinforcement learning system.

DETAILED DESCRIPTION:

[0042] Referring to FIG. 1, the description discloses an exemplary view of a virtual environment containing an HVAC device. In particular, FIG. 1 shows a room 10, which includes two radiators 12-1, 12-2 (special cases of HVAC devices) and an air conditioning unit 14 (a special case of an HVAC device). Each of these HVAC devices can be used to change indoor conditions. For example, the radiators can be used to heat the room, and the air conditioner can be used to cool it.

[0043] The use of these devices incurs certain costs, for example the cost of electricity and the cost of maintaining the device. Therefore, it is desirable that these devices operate efficiently so as to maintain indoor conditions that closely match the desired or targeted indoor conditions.

[0044] In this case, a neural network for controlling an HVAC device is disclosed, as well as a method for training this neural network and a control system for at least one HVAC device containing such a neural network. It should be understood that the methods and systems disclosed herein can also be used in other situations (not related to HVAC).

[0045] The neural network may be present on the HVAC device itself. Equally, the neural network may be present on a control system that controls the device. In particular, there may be provided a controller comprising one or more neural networks, wherein the controller may control an HVAC system, wherein this HVAC system comprises a plurality of HVAC devices.

[0046] Referring to FIG. 2, a system is described in which a neural network training system 110 of a reinforcement control system (hereinafter, a reinforcement learning (training) system 110) is trained depending on: one or more input datasets 102, 103, 104, a virtual environment module 100, and a baseline simulator 400. The virtual environment module 100 can be implemented by a separate computing device in conjunction with software and is configured to create a virtual environment associated with one or more HVAC devices, including creating, based on the received input dataset, at least one virtual model of the HVAC device, and at least one virtual model of the room in which the at least one virtual model of the HVAC device is located.
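
By way of illustration only, the virtual environment module might assemble virtual models of a room and its HVAC devices from the input datasets along the following lines; the dataclass names and fields in this Python sketch are assumptions made for the example, not structures defined in the application:

    from dataclasses import dataclass, field

    @dataclass
    class VirtualHVACDevice:
        name: str
        states: tuple = ("off", "on")   # possible control states of the virtual device
        state: str = "off"

    @dataclass
    class VirtualRoom:
        dimensions: tuple               # (width, depth, height) in metres
        devices: list = field(default_factory=list)

    def build_environment(device_dataset, building_dataset):
        """Create virtual models of the room and its HVAC devices from the input datasets."""
        room = VirtualRoom(dimensions=building_dataset["dimensions"])
        for entry in device_dataset:
            room.devices.append(VirtualHVACDevice(name=entry["name"]))
        return room

    env = build_environment(
        device_dataset=[{"name": "radiator-1"}, {"name": "radiator-2"}, {"name": "ac-unit"}],
        building_dataset={"dimensions": (5.0, 4.0, 2.5)},
    )
    print([d.name for d in env.devices])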

[0047] With the example of Figure 2, the input datasets comprise a device descriptions and configurations dataset 102, a building description dataset 103, and a weather conditions dataset 104. It will be appreciated that more generally the input datasets may comprise a range of data inputs and that in practice different datasets may be used.

[0048] The exemplary datasets of Figure 2 are of particular use where the training system is applied to control a heating, ventilation, and air conditioning (HVAC) system. It will be appreciated that other datasets may be used in particular where the training system is applied to other systems.

[0049] The device descriptions and configurations dataset may, for example, relate to the available configurations of air conditioners, thermostats, humidifiers, heat exchangers, etc. More generally, one of the input datasets is typically a dataset that defines an action space and/or a set of available agent actions (e.g. turning on or off an air conditioner). Suitable actions for the agent to take in practice are then identified based on the reinforcement learning system 110 (once the reinforcement learning system has been trained). For example, a set of weather conditions may be used as an input for the reinforcement learning system, and the agent may turn an air conditioner either on or off based on an output of the reinforcement learning system. In a basic example, the reinforcement learning system may identify that an air conditioning unit should be turned on when there is a weather condition with high temperature. In the context of the present invention, an agent is understood as an executor of control actions. It has one or more service capabilities that form a single and complex execution model, which can include access to external software, users, communications, HVAC devices, etc. In other words, an agent is a module for controlling external devices. The agent is implemented in the form of software and hardware and can be embodied, for example, in the form of a computing device for controlling HVAC devices, a controller for controlling an HVAC device located in and / or outside the HVAC device, etc.

[0050] The input datasets may include a building description 103. The building description may specify dimensions, plans and/or zonal divisions of a building. The reinforcement learning system 110 is then trained to control the available devices to achieve desirable conditions within the building.

[0051] The input datasets may include a weather conditions dataset 104 (which may, more generally, be a dataset of environmental conditions). This dataset can be used alongside the building description for the determination of suitable agent actions, which actions relate to the operation of HVAC components. In particular, a change in the weather conditions (e.g. a change in the temperature) may necessitate a change in the operation of a device if desirable conditions are to be maintained in the environment.

[0052] Each input dataset 102, 103, 104 is arranged to feed into an environment module 100. The environment module is arranged to feed into the reinforcement learning system 110. In turn, the reinforcement learning system is arranged to feed into an agent 200, which agent is arranged to control an aspect of the environment (e.g. to control one or more devices in a building). In this way, the reinforcement learning system is arranged to be trained in dependence on the environment module and therefore the input datasets.

[0053] The environment module 100 comprises information relating to an environment; for example, the environment module may comprise a database of devices alongside their possible control signals and their locations (which information can be obtained from the input datasets). In this way, the agent 200 is able to interact with the environment module in order to control the devices present in the environment.

[0054] With the exemplary environment of the room 10 of Figure 1, the environment module may comprise the dimensions of the room as well as information (e.g. the possible states) for each of the HVAC devices 12-1, 12-2, 14 in the room. The environment module may further comprise information about the configuration of the room (e.g. whether there is furniture in the room and/or whether the door is open) and information about the occupation of the room (e.g. whether the room is crowded or empty). Typically, the environment relates to a larger environment than the room (e.g. a building, or a plurality of buildings).

[0055] The baseline simulator 400 is arranged to provide a baseline simulation relating to the control of the devices of the environment. Typically, the baseline simulator is determined using a method other than reinforcement learning, such as a statistical model.

[0056] Typically, the baseline simulator is arranged to receive inputs from at least one of the input datasets. In particular, the baseline simulator is typically arranged to receive the device descriptions and configurations dataset 102 and/or the building description dataset 103. Typically, the baseline simulator is arranged to receive each of the input datasets so as to provide a baseline set of outputs (given these datasets).

[0057] Typically, the baseline simulator 400 is arranged to provide an achievable output relating to the controlling of the environment using conventional modelling methods. For example, the baseline simulator may simulate a method of controlling the devices in the environment based on equations or on conditional logic (e.g. “if temperature rises above 27°C turn on an air conditioner”). In this way, for any input dataset the baseline simulator is able to simulate output conditions (humidities and temperatures) that can be achieved using known means and/or existing models.
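
For instance, a conditional-logic baseline of the kind mentioned above could be simulated with a few lines of Python; the first-order thermal model, cooling rate and setpoint used here are illustrative assumptions:

    def baseline_simulation(hourly_outdoor_temps, setpoint_c=27.0, cooling_rate=1.5, drift=0.5):
        """Rule-based baseline: turn the air conditioner on whenever the indoor
        temperature exceeds the setpoint. Returns the simulated indoor temperatures
        and the hours of air-conditioner use, which can later serve as a target metric."""
        indoor = hourly_outdoor_temps[0]
        ac_hours = 0
        history = []
        for outdoor in hourly_outdoor_temps:
            ac_on = indoor > setpoint_c
            ac_hours += ac_on
            # Illustrative model: drift toward the outdoor temperature, cool while the AC runs.
            indoor += drift * (outdoor - indoor) - (cooling_rate if ac_on else 0.0)
            history.append(round(indoor, 2))
        return history, ac_hours

    temps, usage = baseline_simulation([24, 26, 29, 31, 30, 27])
    print(temps, usage)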

[0058] In some embodiments, the baseline simulator 400 is used to determine and provide desired conditions for the environment. For example, the baseline simulator may provide a desired temperature and/or humidity. These desired conditions may be based on one or more of: a user input, historic data, and/or the simulation (e.g. a simulation that determines achievable values given a desired user input). The reinforcement learning system may then be arranged to also achieve these target conditions.

[0059] Additionally, or alternatively, the baseline simulator is typically arranged to provide a key metric relating to the provision of certain conditions. For example, the baseline simulator may provide an electricity cost, device usage statistics, and/or device wear statistics that are required to obtain a set of conditions. This metric can be used as a target by the reinforcement learning system 110, where the reinforcement learning system is arranged to provide the same or similar conditions with an improved metric (e.g. a reduced electricity cost).

[0060] In various embodiments, the metric relates to one or more of: an electricity and/or power usage; an electricity and/or power cost (electricity is typically cheaper during certain periods of the day, so minimizing usage and minimizing cost may require different actions); a time of device operation; a number of times the operation of a device is altered (this may affect the lifetime of a device); and a deviation from desired conditions (e.g. a maximum deviation, an average deviation, and/or a sum of deviations).
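
A minimal sketch of how such metrics could be computed for one day of simulated operation follows; the tariff values and the particular deviation statistics are illustrative assumptions:

    def target_metrics(hourly_kwh, hourly_price, indoor_temps, desired_temp_c):
        """Illustrative calculation of the target metrics named above."""
        deviations = [abs(t - desired_temp_c) for t in indoor_temps]
        return {
            "energy_kwh": sum(hourly_kwh),
            # Cost and usage can diverge because electricity is cheaper in some periods.
            "energy_cost": sum(kwh * price for kwh, price in zip(hourly_kwh, hourly_price)),
            "max_deviation_c": max(deviations),
            "mean_deviation_c": sum(deviations) / len(deviations),
        }

    print(target_metrics(
        hourly_kwh=[0.0, 1.2, 1.2, 0.0],
        hourly_price=[0.10, 0.25, 0.25, 0.10],   # assumed day/night tariff values
        indoor_temps=[22.5, 23.0, 23.5, 24.0],
        desired_temp_c=23.0,
    ))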

[0061] In some embodiments, the baseline simulator provides a plurality of metrics. In some embodiments, the provided metric may relate to a plurality of constituent metrics. In each of these situations, the baseline simulator may indicate (e.g. based on a user input) a priority order for the metrics. As an example, the baseline simulator may indicate that the reinforcement learning system should prioritise meeting a target set of conditions and then seek to minimize cost. Alternatively, the baseline simulator may indicate that a small deviation from the target set of conditions is acceptable if this deviation enables a substantial reduction in electricity usage. Moreover, training the neural network of the control system in accordance with these stages and, ultimately, obtaining a target metric for further control of at least one HVAC device ensures high accuracy in maintaining a constant indoor microclimate and thereby comfortable indoor climate conditions. This is because continuous training on data characterizing the particular variants of both the HVAC equipment and the possible variants of the premises ensures that comfortable climatic conditions are maintained in the room.

[0062] In some embodiments, the baseline simulator 400 is arranged to use historic data. For example, the operation of an existing HVAC system over a period of months may be used to form the baseline simulator. This historic data is useable to determine the conditions that are desired by the occupants of that building and to determine the operation of the devices that was required to obtain these conditions. This historic data is also useable to form a target for the reinforcement learning system 110 to match and/or beat.

[0063] In some embodiments, the method of configuring the baseline simulator 400 comprises monitoring the environment and/or the devices in order to determine a baseline operation, where this monitoring preferably takes place over a period of at least a week, at least two weeks, and/or at least a month. The baseline simulator can then be arranged to provide an indication of the operation of an existing system in the environment.

[0064] The target metric may depend on historical data and/or monitoring of the environment. Equally, the target metric may be based on a predicted improvement. For example, a user may be able to predict that an efficiency improvement of 20% over existing systems will be achievable. The target metric may then be based on a 20% reduction in a historically determined energy usage.

[0065] Simulating the baseline of the environment may include one or more of the following steps:

1. Generating, and/or receiving, physical parameters necessary for simulation, for example: i. user inputs (e.g. user input temperatures, humidities, etc.); ii. the building information 103, and the device description and configuration 102; iii. information about weather conditions 104 during the period corresponding to the period that is to be simulated. For example, for the month of August in any given year. It is also possible to use multiple time periods (over several years) to improve the reliability of the simulation.

2. Conducting simulation for a selected period of time, typically by methods other than reinforcement learning. These methods may include, but are not limited to: i. User inputs (e.g. user input equations); ii. Decision tree methods; iii. Linear transformation of data; iv. Methods based on connection analysis (regression, etc.); v. Algorithms using gradient boosting; vi. Deterministic models; vii. Stochastic models.

[0066] Typically, the baseline simulator 400 uses models trained with available, e.g. time-limited, datasets. The models may use methods based on decision trees or methods using gradient boosting or similar. The simulation obtained is subsequently transmitted as part or all of the data characterizing the state of the environment to the reinforcement learning system 110. As an example, the baseline simulator may be given the weather conditions for a historic month (e.g. September 2000) and may thereafter provide a simulation output relating to the actions needed to maintain desired conditions throughout this time period given these weather conditions. This output may contain performance metrics such as an electricity usage, where the reinforcement learning system is then able to use these performance metrics as a target.
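
As one possible illustration of such a model, a gradient-boosting regressor could be fitted to a time-limited historical dataset and queried for the usage figure that then serves as the target metric; the scikit-learn estimator, the synthetic data and the feature choice below are assumptions of the sketch rather than details of the application:

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    # Illustrative time-limited dataset: one row per day of a historic month.
    # Features: [outdoor temperature, outdoor humidity]; target: daily electricity usage (kWh).
    rng = np.random.default_rng(0)
    X = rng.uniform([5, 30], [30, 90], size=(30, 2))
    y = 0.8 * np.maximum(X[:, 0] - 20, 0) + 0.05 * X[:, 1] + rng.normal(0, 0.2, 30)

    baseline_model = GradientBoostingRegressor().fit(X, y)

    # The predicted usage for a forecast day becomes the target metric handed to
    # the reinforcement learning system for the corresponding training step.
    print(baseline_model.predict([[27.0, 60.0]]))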

[0067] Therefore, using an input dataset the baseline simulator 400 is able to provide a target performance metric and/or a target set of conditions for the reinforcement learning system 110. The reinforcement learning system 110 comprises at least one neural network, which neural network is trained using the same input dataset, and maps inputs from this dataset onto a set of actions to be performed by the agent 200 in order to achieve or improve the target conditions and/or the metric. In such a manner, the reinforcement learning system is useable to achieve desired/target building conditions given a certain environment and certain inputs (e.g. the reinforcement learning system may be able to maintain a certain desirable temperature range in a building). The reinforcement learning system is typically able to achieve these conditions while outperforming the baseline simulator in relation to the target metric (e.g. the reinforcement learning system may reduce electricity usage).

[0068] The number of outputs from the baseline simulator 400 typically relates to the number of steps being used for the training of the reinforcement learning system 110; this number of steps corresponds to a maximum number of actions that can be performed by the agent 200 before the end of the training. Thus, if the time interval selected for training the reinforcement learning system is one month, and the agent is arranged to perform actions (e.g. turn on or off HVAC devices) once a day, then the baseline simulator may be arranged to provide a number of output conditions corresponding to the number of days in this month, so that at each step completed by the agent during the time interval selected for training there is a known set of target output conditions provided by the baseline simulator 400. The baseline simulator is typically arranged to provide a metric for each step (e.g. a daily electricity cost); the baseline simulator may additionally or alternatively provide a metric for the situation as a whole (e.g. a total electricity cost).
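
The correspondence between training steps and baseline outputs amounts to simple arithmetic, sketched below with assumed numbers (one action per day over a 30-day month; the per-step cost value is illustrative):

    def baseline_output_count(training_days, actions_per_day):
        """Number of per-step target outputs the baseline simulator provides (one per agent action)."""
        return training_days * actions_per_day

    steps = baseline_output_count(training_days=30, actions_per_day=1)
    daily_costs = [2.4] * steps            # illustrative per-step metric (daily electricity cost)
    total_cost = sum(daily_costs)          # optional metric for the period as a whole
    print(steps, total_cost)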

[0069] The baseline simulator (and the reinforcement learning system 110) are typically trained using data for a substantial time period, e.g. a number of months. Within this period the weather conditions and/or desired environmental conditions may vary. For example, during winter the weather conditions may be colder than average and the electricity usage output by the baseline simulator 400 may be correspondingly higher than average. By considering an extended period of time, the reinforcement learning system can be trained to perform at a high level under a range of differing conditions.

[0070] While the baseline simulator 400 may perform a complex simulation in order to obtain target conditions/metrics, these conditions/metrics may also be obtained in a simpler way, e.g. a simple linear or non-linear transformation of the physical parameters that were present before the simulation, as well as by methods that use the analysis of relationships between these physical parameters.

[0071] An example of a simulation that may be provided by the baseline simulator 400 is the method described in the E+ (EnergyPlus) specification (“Engineering Reference, EnergyPlusTM Version 9.4.0 Documentation (September 2020); U.S. Department of Energy https://energyplus.net/sites/all/modules/custom/nrel_custom/pdfs/pdfs_v9.4.0/EngineeringReference.pdf”). This open-source software requires as an input a set of historical data on weather conditions during a corresponding time period or periods. For example, if the simulation is to be performed in the month of September, data for the month of September in the previous year or years is required. As an output, it provides the usage information for the HVAC devices that is required in order to achieve the desired conditions.

[0072] The agent 200 is arranged to interact with the environment module 100 based on an output of the reinforcement learning system 110. In particular, the agent is arranged to control the devices relevant to the environment in order to achieve desired environmental conditions.

[0073] The agent 200 operates based on outputs from the reinforcement learning system 110, where the reinforcement learning system receives a set of input values and thereafter provides a set of actions (e.g. the outputs) to the agent, where these actions relate to the operation of the devices in the environment. The agent then implements the actions. For example, a set of weather conditions may be used as the inputs for the reinforcement learning system; the reinforcement learning system then determines (using a neural network as described below) appropriate actions to take to obtain a desired set of conditions given these weather conditions. These appropriate actions are fed into the agent, which interacts with the devices in the environment to implement the actions (e.g. to turn on or off an air conditioner).

[0074] The agent 200 may be combined with the reinforcement learning system 110, where the reinforcement learning system comprises the agent and is arranged to interact directly with the devices in the environment.
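
A schematic control step of this kind could look as follows in Python; the stand-in policy, device names and observation fields are illustrative assumptions rather than elements of the application:

    class Agent:
        """Executes the actions chosen by the reinforcement learning system on the HVAC devices."""
        def __init__(self, devices):
            self.devices = devices                        # mapping from device name to its state

        def act(self, actions):
            for device_name, command in actions.items():
                self.devices[device_name] = command       # e.g. switch an air conditioner on or off

    def control_step(rl_system, agent, observation):
        actions = rl_system(observation)   # the RL system maps inputs to device actions
        agent.act(actions)                 # the agent implements them in the environment
        return actions

    # Illustrative stand-in for a trained reinforcement learning system.
    rl_system = lambda obs: {"ac-unit": "on" if obs["outdoor_temp_c"] > 28 else "off"}
    agent = Agent(devices={"ac-unit": "off"})
    print(control_step(rl_system, agent, {"outdoor_temp_c": 31.0}), agent.devices)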

[0075] There may be a separate agent 200 and/or reinforcement learning system 110 provided for each of the devices in the environment. Equally, there may be a separate agent 200 and/or reinforcement learning system 110 provided for each group of the devices in the environment (e.g. a group of air conditioners and a group of radiators may use two separate agents). Equally, there may be a single agent 200 and/or reinforcement learning system 110 that is arranged to control all of the devices in the environment.

[0076] The present disclosure relates to a reinforcement learning process, in which a reinforcement learning system 110 is trained so that it is suitable for controlling the agent 200. To interact with the environment module 100, the agent 200 typically receives data characterizing its state and takes an action from an action space as a reaction to this data, this action being determined by the reinforcement learning system. Data characterizing the state of the environment can be termed an observation or an input. The goal of the system as a whole is to maximize the ‘reward’ received for performing actions in the described environment, where the reward relates to the meeting/exceeding of one or more metrics (e.g. as provided by the baseline simulator 400). The present disclosure also relates to a method of controlling one or more devices based on a trained reinforcement learning system and/or agent 200.
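
One way such a reward could be constructed is to compare each step of operation against the baseline metric and to penalise deviation from the desired conditions; the weighting and the numbers below are illustrative assumptions:

    def reward(step_usage_kwh, baseline_usage_kwh, deviation_c, comfort_weight=1.0):
        """Illustrative per-step reward: positive when the agent beats the baseline metric,
        reduced by the deviation from the desired conditions."""
        return (baseline_usage_kwh - step_usage_kwh) - comfort_weight * deviation_c

    print(reward(step_usage_kwh=1.8, baseline_usage_kwh=2.4, deviation_c=0.3))   # 0.3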

[0077] Typically, the reinforcement learning system selects actions to be performed by the agent 200 by interacting with the environment module 100. That is, the reinforcement learning system receives data relating to the environment and environmental conditions and selects an action, from an action space, to be performed by the agent 200. Typically, the action is selected from a continuous action space. Equally, the action may be selected from a discrete action space (e.g. a limited number of possible actions).

[0078] Referring to Figure 3, each of the components of the described system is typically implemented using a computer device 1000. These components may each be implemented using the same computer device, or the components may be implemented using a plurality of computer devices.

[0079] The computer device 1000 comprises a processor in the form of a CPU 1002, a communication interface 1004, a memory 1006, storage 1008, removable storage 1010 and a user interface 1012 coupled to one another by a bus 1014. The user interface 1012 comprises a display 1016 and an input/output device, which in this embodiment is a keyboard 1018 and a mouse 1020.

[0080] The CPU 1002 executes instructions, including instructions stored in the memory 1006, the storage 1008 and/or removable storage 1010.

[0081] The communication interface 1004 is typically an Ethernet network adaptor coupling the bus 1014 to an Ethernet socket. The Ethernet socket is coupled to a network. The Ethernet socket is usually coupled to the network via a wired connection, but the connection could alternatively be wireless. It will be appreciated that a variety of other communications mediums may be used (e.g. Bluetooth®, Infrared, etc.).

[0082] The memory 1006 stores instructions and other information for use by the CPU 1002. The memory is the main memory of the computer device 1000. It usually comprises both Random Access Memory (RAM) and Read Only Memory (ROM).

[0083] The storage 1008 provides mass storage for the computer device 1000. In different implementations, the storage is an integral storage device in the form of a hard disk device, a flash memory or some other similar solid state memory device, or an array of such devices.

[0084] The removable storage 1010 provides auxiliary storage for the computer device 1000. In different implementations, the removable storage is a storage medium for a removable storage device, such as an optical disk, for example a Digital Versatile Disk (DVD), a portable flash drive or some other similar portable solid state memory device, or an array of such devices. In other embodiments, the removable storage is remote from the computer device 1000, and comprises a network storage device or a cloud-based storage device.

[0085] The computer device 1000 may comprise one or more graphical processing units (GPUs), application specific integrated circuits (ASICs) and/or one or more field programmable gate arrays (FPGAs).

[0086] A computer program product is provided that includes instructions for carrying out aspects of the method(s) described below. The computer program product is stored, at different stages, in any one of the memory 1006, storage device 1008 and removable storage 1010. The storage of the computer program product is non-transitory, except when instructions included in the computer program product are being executed by the CPU 1002, in which case the instructions are sometimes stored temporarily in the CPU or memory. It should also be noted that the removable storage is removable from the computer device 1000, such that the computer program product is held separately from the computer device from time to time.

[0087] Typically, the reinforcement learning system 110 is trained using a first computer device. Once the reinforcement learning system has been trained on this first computer device, it can be used to control a system (e.g. an HVAC system). To do this, the trained reinforcement learning system and/or the agent may be output and/or transmitted to another computer device.

[0088] Typically, where the reinforcement learning system 110 is used to control a system, the computer device 1000 is arranged to receive an input, either via a sensor or via the communication input 1004. The input may comprise one or more of: environmental conditions, weather conditions (e.g. a weather forecast), desired conditions (e.g. a user input desired temperature), and/or an environmental configuration (e.g. an indication of whether each door in a building is closed or open).

[0089] Referring to Figure 4, there is shown an exemplary neural network 10 that may form part of the reinforcement learning system 110.

[0090] The neural network 10 of Figure 4 is a deep neural network that comprises an input layer 12, one or more hidden layers 14, and an output layer 16. It will be appreciated that the example of Figure 4 is only a simple example and in practice a neural network may comprise further layers. Furthermore, while this exemplary neural network is a deep neural network comprising a hidden layer 14, more generally the neural network may simply comprise any layer that maps an input layer to an output layer.

[0091] The neural network is based on parameters that map the inputs (e.g. observations) to the outputs (e.g. actions) in order to achieve a desired output given an input. These parameters typically comprise weightings. In order to determine parameters that achieve advantageous operation, the neural network is trained using training sets of input data. In the present system, the inputs to the neural network may be termed as observations, with each observation characterizing a state of the environment (e.g. the present conditions).
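
Purely by way of illustration, the following Python sketch (layer sizes, parameter names and input values are hypothetical and do not form part of the disclosed system) shows how such parameters can map an observation vector to an action vector:

import numpy as np

def init_params(n_in=4, n_hidden=16, n_out=2, seed=0):
    # Hypothetical sizes: 4 observation features, 16 hidden units, 2 actions.
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0, 0.1, (n_in, n_hidden)), "b1": np.zeros(n_hidden),
        "W2": rng.normal(0, 0.1, (n_hidden, n_out)), "b2": np.zeros(n_out),
    }

def forward(params, observation):
    hidden = np.tanh(observation @ params["W1"] + params["b1"])   # hidden layer 14
    return np.tanh(hidden @ params["W2"] + params["b2"])          # output layer 16 (actions)

params = init_params()
action = forward(params, np.array([25.0, 0.40, 1.0, 0.0]))  # e.g. temperature, humidity, door flags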

[0092] Typically, the training of the neural network comprises a number of steps, where the parameters of the neural network are updated following each training step based on the performance of the neural network during that step. For example, if a change in parameters is determined to improve the performance of the neural network during a first training step, a similar change may be implemented before a second training step. Typically, the training of the neural network comprises a number of steps to ensure that the parameters provide a desired output for a range of input datasets.

[0093] According to the present disclosure, a neural network is trained based on the baseline simulator 400, where the metric provided by the baseline simulator enables the neural network to be trained more quickly than with conventional training methods.

[0094] Referring to Figure 5, the reinforcement learning system 110 typically comprises an actor-critic system 20.

[0095] The actor-critic system 20 comprises two separate neural networks, one of which is an actor neural network 22 and one of which is a critic neural network 24.

[0096] The actor neural network 22 is arranged to receive a state from the environment 100 and to provide, in response, an action that the agent can perform. Using the example of an HVAC system, the actor neural network may receive a temperature at a first time as an input from the environment and then determine that the agent should turn on an air conditioning unit (it will be appreciated that the actor neural network typically maps input tuples to output tuples without knowledge of the meaning of these tuples, so the actor neural network is not aware that a given output relates to an air conditioner being activated).

[0097] The critic neural network 24 is arranged to receive the state from the environment module 100 and the action from the actor neural network 22 and to determine a value relating to this action and state. Therefore, the critic neural network determines the outcome of the action suggested by the actor neural network.

[0098] In a (basic) practical example of the actor-critic system 20, the baseline simulator 400 may indicate that a temperature of 23 °C is desired. The environment 100 may then provide a state value that indicates the temperature at a first time is 25 °C. Given this input state, the actor neural network 22 may return an action to turn on a radiator. The critic neural network 24 determines the outcome of this action - e.g. this action will likely result in the temperature increasing (so the temperature at a second time may be 27 °C). The critic neural network 24 then receives the state values at the first time and the second time and identifies that the temperature has risen and that the suggested action has had a negative effect. This is fed back into the actor neural network, and in response the parameters of the actor neural network are modified so that given the same situation, a similar action is not suggested in the future. Therefore, the parameters of the actor neural network are altered based on feedback from the critic neural network.

[0099] The information provided to the actor neural network 22 by the critic neural network 24 comprises a temporal difference (TD) error. This TD error takes into account the passage of time (so that it can be taken into account, for example, that the radiator being turned on might only have an effect after a certain waiting period).
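
By way of a simple, non-limiting illustration (the function below is a generic one-step TD error, not necessarily the exact formulation used by the critic neural network 24):

def td_error(reward, value_s, value_s_next, gamma=0.99):
    # gamma discounts future value, so delayed effects (e.g. a radiator only
    # warming the room after a waiting period) are still taken into account.
    return reward + gamma * value_s_next - value_s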

[0100] The feedback from the critic neural network 24 may comprise an indication of an error between an achieved condition and a desired condition, where the parameters of the actor neural network 22 are updated so as to minimize this error.

[0101] The critic neural network 24 provides a subjective assessment of the benefit of the present state of the actor neural network 22. In order to improve this assessment, particularly during the early stages of training, the baseline simulator 400 may be used to determine appropriate parameters for the critic neural network. In particular, a reward used to train one of the neural networks may be based on a difference between the target metric and a metric associated with the current parameters of this neural network. In practice, the action suggested by the actor neural network is typically associated with a metric (e.g. an electricity usage). The modifications made to the parameters of the actor neural network (and/or the critic neural network) in a training stage may be dependent on the difference between this metric and the target metric provided by the baseline simulator 400.

[0102] In this regard, the parameters of the critic neural network 24 and thus the determined value may be dependent on a condition and/or metric received from the baseline simulator 400. In a conventional system a critic neural network may identify that a set of actions suggested by the actor neural network 22 relates to an electricity cost metric of 100 kWh; however, without context the critic neural network is not able to determine whether this cost is good or bad (and the critic neural network typically needs to learn whether a certain output is good or bad over the course of numerous training steps). Therefore, at early stages of training the critic neural network is only able to provide limited feedback to the actor neural network and the parameters of the actor neural network can only be changed slowly. By providing a metric from the baseline simulator, the critic is able to learn much more quickly. For example, if the baseline simulator has achieved an electricity cost metric of 20 kWh for the same situation, the critic neural network is able to rapidly identify that the current parameters of the actor neural network that have led to an electricity cost metric of 100 kWh are substantially sub-optimal.
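
A minimal sketch of such a baseline-relative reward follows (the function name, the kWh figures and the sign convention are illustrative assumptions only):

def baseline_relative_reward(agent_metric_kwh, baseline_metric_kwh):
    # Positive when the agent uses less energy than the baseline simulator,
    # negative when it uses more.
    return baseline_metric_kwh - agent_metric_kwh

reward = baseline_relative_reward(agent_metric_kwh=100.0, baseline_metric_kwh=20.0)  # -80.0: far below the baseline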

[0103] In such a way, by training the critic neural network 24 and/or the actor neural network 22 in dependence on a condition and/or metric provided by the baseline simulator 400 the training time of the reinforcement learning system 110 can be reduced.

[0104] The baseline simulator 400 is useable to rapidly train the actor neural network 22 and the critic neural network 24 so as to provide an agent that achieves at least a similar performance to the baseline simulator. The actor neural network and the critic neural network can then continue to be trained so that the reinforcement learning system begins to outperform the baseline simulator. The use of the baseline simulator reduces the total training time necessary for the reinforcement learning system 110.

[0105] The training of the actor-critic system may involve:

a. Receiving a minibatch of experience tuples from the environment. Each experience tuple contains a training observation characterizing the training state of the environment, a training action from the space of possible actions for the agent, a training reward associated with the agent for performing a training action, and a next training observation characterizing the next training state of the environment;

b. Updating the current values of the parameters of the actor neural network using the experience tuple minibatch. For each experience tuple in the minibatch, the update typically comprises the following stages:

i. Processing the training observation and the training action in the experience tuple using the critic neural network to determine the neural network output for the experience tuple in accordance with the current values of the critic neural network parameters, and updating the current values of the parameters of the actor neural network using the critic neural network;

ii. Simultaneously, generating a new experience tuple. This generation typically comprises the following stages:

a. Obtaining a new training observation;
b. Processing this new training observation using the actor neural network to select a new training action to be performed by the agent, in accordance with the current values of the parameters of the actor neural network;
c. Obtaining a new training reward in response to the agent performing this new training action;
d. Obtaining a new next training observation;
e. Generating a new experience tuple, which includes, as described above, the new training observation, the new training action, the new training reward and the new next training observation.
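
A compressed, illustrative Python version of this minibatch procedure is given below; the environment is a trivial stub and all names (step_environment, actor_action, critic_value) are hypothetical stand-ins rather than the actual training code:

import random
from collections import namedtuple

Experience = namedtuple("Experience", "obs action reward next_obs")

def step_environment(obs, action):
    # Stub environment: the action nudges the observed temperature; the reward
    # penalises the distance from a 23 °C target.
    next_obs = obs + 0.5 * action + random.uniform(-0.1, 0.1)
    return -abs(next_obs - 23.0), next_obs

def actor_action(actor_params, obs):
    # Stage ii.b: the actor selects an action from the current observation.
    return -actor_params["gain"] * (obs - 23.0)

def critic_value(critic_params, obs, action):
    # Stage b.i: the critic scores the (observation, action) pair.
    return critic_params["w"] * (obs + action)

actor_params, critic_params = {"gain": 0.1}, {"w": 0.05}
minibatch = [Experience(25.0, -0.2, -1.8, 24.8), Experience(24.8, -0.18, -1.6, 24.6)]

new_experiences = []
for exp in minibatch:                                                  # stage b
    value = critic_value(critic_params, exp.obs, exp.action)           # stage b.i
    actor_params["gain"] += 0.01 * (exp.reward - value)                # crude actor update
    new_obs = exp.next_obs                                             # stage ii.a
    new_action = actor_action(actor_params, new_obs)                   # stage ii.b
    new_reward, new_next_obs = step_environment(new_obs, new_action)   # stages ii.c and ii.d
    new_experiences.append(Experience(new_obs, new_action, new_reward, new_next_obs))  # stage ii.e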

[0106] An example of an existing system that uses an actor-critic arrangement is the A3C system (asynchronous advantage actor-critic). The methods disclosed herein may be implemented using such a system. Equally, the methods disclosed herein may be implemented using a Mixture-Density Recurrent Network (MDN-RNN).

[0107] In some embodiments, the reinforcement learning system 110 does not comprise an actor-critic system. The reinforcement learning system will still comprise at least one neural network. In such systems, the baseline simulator 400 typically provides feedback to whichever reward model is used to evaluate the performance of the neural network.

[0108] Referring to Figure 6, an embodiment of the reinforcement learning system 110 is shown in further detail.

[0109] In this embodiment, the reinforcement learning system 110 is arranged to select actions (outputs) using a global neural network 300. The global neural network is a neural network that is configured to receive a set of input data and to process the input data to associate this input data with an action (e.g. if the temperature goes up, turn on the air conditioning). Typically, this comprises the global neural network selecting a point in a continuous action space that defines an action to be performed by the agent 200 in response to an input.

[0110] This global neural network 300 typically learns by means of reinforcement learning. Equally the global neural network could learn by means of supervised or unsupervised learning (so that the disclosure of the reinforcement learning system 110 is more generally disclosure of a learning system that comprises at least one neural network).

[0111] The global neural network 300 typically includes a global actor neural network 301, which global actor network provides the function of mapping inputs to outputs (e.g. actions), and a global critic neural network 302, which global critic neural network is arranged to take actions and input data (e.g. states) as an input and to process actions and input data to create a neural network output. During training, the reinforcement learning system 110 controls the parameters of the global critic neural network and the global actor neural network.

[0112] The reinforcement learning system disclosed herein differs in several respects from contemporary architectural solutions.

[0113] To train the global actor neural network 301 and the global critic neural network 302, the reinforcement learning system comprises at least one child neural network 310-1, 310-2, 310-N; each child neural network comprises a child actor neural network 311-1, 311-2, 311-N and a child critic neural network 312-1, 312-2, 312-N. These child neural networks interact with the environment 100 simultaneously (as shown by Figure 6). This is achieved by providing respective copies of the environment 101-1, 101-2, 101-N, with a separate copy of the environment being created and provided for each of the child neural networks.

[0114] It will be appreciated that any number of child neural networks 310 may be used, where the number used may depend on the implementation.

[0115] Each copy of the environment has different properties, which may be set via different initialization conditions. Typically, these properties relate to different configurations of the target environment (e.g. different configurations of a building). For example, each copy of the environment may relate to a different density of occupants of the environment or a different configuration of objects in the environment (e.g. different sets of doors being open or closed). The copies of the environment may be initialized based on a number of parameters, each of which may be randomly selected or chosen by a user. These parameters typically comprise one or more of: an initial condition in the environment (e.g. an initial temperature or humidity); an initial condition outside of the environment; an initial occupancy and/or population; and a characteristic of the devices (e.g. a type of coolant used).
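
A minimal sketch of such initialization is shown below (the parameter names and value ranges are purely hypothetical examples of the properties listed above):

import random

def make_environment_copy(seed):
    rng = random.Random(seed)
    return {
        "initial_temperature_c": rng.uniform(15.0, 30.0),     # initial condition in the environment
        "outside_temperature_c": rng.uniform(-5.0, 35.0),     # initial condition outside the environment
        "occupancy": rng.randint(0, 200),                      # initial occupancy/population
        "doors_open": [rng.random() < 0.5 for _ in range(6)],  # configuration of objects
        "coolant_type": rng.choice(["R32", "R410A"]),          # characteristic of the devices
    }

environment_copies = [make_environment_copy(seed) for seed in range(8)]  # e.g. N = 8 copies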

[0116] Therefore, each child neural network 310 is trained based on a different environment with different properties. The child neural networks may thus be structurally identical and/or may be initialized with the same parameters, which parameters will diverge due to the child neural networks interacting with the different versions of the environment. Equally, the child neural networks may be initialized with different parameters. There may also be provided a plurality of similar copies of the environment, where these similar copies are linked to differently initialized child neural networks.

[0117] Typically, the baseline simulator 400 is arranged to provide a metric and/or a set of target conditions for each of the copies of the environment 101. For example, the baseline simulation for a crowded environment may lead to a higher energy usage (and a different energy usage metric) than a sparse environment.

[0118] For each input tuple, at each step, the reinforcement learning system 110 is arranged to use the child critic neural networks 312 to determine updates to the current parameters of the child actor neural networks 311 and the weighting values of the child critic neural networks. Following each step, the parameters of the child actor neural networks 311 and the child critic neural networks 312 are modified in dependence on the determined updates.

[0119] The child actor neural networks 311 and child critic neural networks 312 send, as per the process 35 depicted in Figures 6 and 7, the results of their work (e.g. gradients that provide an indication of the performance of these child neural networks 310) to the global neural network 300. The global neural network then accumulates these training results and uses them to alter the parameters of the global actor neural network 301 and the global critic neural network 302. Specifically, the global neural network is able to identify parameters and/or modifications to parameters that result in improvements for the various child neural networks 310, so that the parameters of the global neural network can be updated accordingly.

[0120] The structure of each neural network is typically as follows: the global neural network 300 is trained by the child neural networks 310. Specifically, the child neural networks are arranged to periodically transmit 35 gradients and/or parameters to the global neural network. The parameters of the global neural network are determined based on the gradients and/or parameters received from the child neural networks.

[0121] Typically, the parameters of the global neural network 300 are determined in dependence on an average of the gradients provided by the child neural networks 310. These gradients can be applied to existing parameters of the global neural network (and more specifically to existing parameters of the global actor neural network 301 and the global critic neural network 302) in order to obtain updated parameters.
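
By way of a non-limiting sketch (the data structures and the learning rate are assumptions), the averaging and application of child gradients to the global parameters might take the following form:

import numpy as np

def apply_child_gradients(global_params, child_gradients, learning_rate=0.01):
    # child_gradients: one dict per child neural network, keyed like global_params.
    for name in global_params:
        mean_grad = np.mean([grads[name] for grads in child_gradients], axis=0)
        global_params[name] -= learning_rate * mean_grad   # gradient-descent-style update
    return global_params

global_params = {"W": np.zeros((4, 2))}
child_gradients = [{"W": np.random.randn(4, 2)} for _ in range(3)]   # e.g. three child networks
global_params = apply_child_gradients(global_params, child_gradients)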

[0122] In turn, the child neural networks 310 are arranged to receive parameter updates from the global neural network 300 periodically, when a condition is met (for example, when a certain number of training steps have been performed), and/or on the execution of a given set of actions. Typically, the global neural network is arranged to periodically send copies of its parameters to the child neural networks as shown in Figure 7 where these parameters may replace the existing parameters of each child neural network.
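
An illustrative sketch of this reverse synchronization is given below (the chosen period of 50 steps is an assumption within the ranges discussed later, and the data structures are hypothetical):

import numpy as np

SYNC_EVERY_N_STEPS = 50

def maybe_sync_children(step, global_params, child_params_list):
    if step % SYNC_EVERY_N_STEPS == 0:
        for child_params in child_params_list:
            for name, value in global_params.items():
                child_params[name] = value.copy()   # replace the existing child parameters

global_params = {"W": np.ones((4, 2))}
children = [{"W": np.zeros((4, 2))} for _ in range(3)]
maybe_sync_children(step=50, global_params=global_params, child_params_list=children)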

[0123] The child neural networks 310 are trained based on different copies of the environment 101 having different properties; therefore, they are specialized for different situations. For example, the first child neural network 310-1 may give optimal operation when the environment is crowded, while the second child neural network 310-2 may give optimal operation when the environment is sparsely populated (e.g. sparsely populated premises and/or several premises).

[0124] By periodically transmitting their gradients and/or parameters to the global neural network 300, the child neural networks 310 are able to train the global neural network to provide good performance in a range of environments (e.g. for environments with a range of differing properties). Periodically, the parameters of the global neural network are transmitted to the child neural networks; therefore, the child neural networks indirectly share their parameters. For example, the first child neural network 310-1 may be trained based on a first copy of the environment 101-1 that is crowded and the second child neural network 310-2 may be trained based on a second copy of the environment 101-2 that is sparsely populated. The first child neural network may be trained indirectly for use in a sparse environment by receiving parameters from the global neural network, which global neural network has previously received gradients and/or parameters from the second child neural network. Eventually, the neural networks will converge to provide a series of similar and/or identical child neural networks (and a global neural network that is the same) that provide optimal operation for a wide range of environments.

[0125] Once convergence has occurred, the reinforcement learning system 110 can be used to provide suitable outputs given a set of inputs. This convergence (or near-convergence) may be indicated, for example, by outputting a notification to a user or by providing an output based on an input to the reinforcement learning system.

[0126] The periodicity of data transmission (e.g. the transmission of gradients/parameters from the child neural networks 310 to the global neural network 300 and the transmission of parameters from the global neural network to the child neural networks) may be based on a number of steps, a user input, or a rate of convergence.

[0127] If this periodicity is too low, e.g. if parameters are transmitted too frequently, then the child neural networks are not able to sufficiently diverge to provide optimal operation for their copy of the environment. If the periodicity is too high, the time required for training the reinforcement learning system can become excessive. Therefore, in various embodiments, the global neural network is arranged to provide parameters for the child neural networks no more than once every 20 steps, no more than once every 50 steps, and/or no more than once every 100 steps. Equally, in various embodiments, the global neural network is arranged to provide parameters for the child neural networks no less than once every 200 steps, no less than once every 100 steps, and/or no less than once every 50 steps.

[0128] Typically, one or more of the following features is implemented by the reinforcement learning system 110 as described by Figure 8:

1) At each training step, for each copy of the environment 101 and agent 200, each child actor neural network 311 receives 31 a set of tuples from the corresponding copy of the environment 101. Each tuple contains data characterizing the state of the environment, an action from a given action space performed by the agent 200 in response to the data, the reward for the action taken by the agent, and the next set of input data characterizing the next training state of the environment.

2) Each child actor neural network 311 then updates 34 the current values of its parameters using the data obtained as a result of the following sequence of actions:

a) Each data input and action in the resulting tuple is processed 32 using each child critic neural network 312 to determine the output of the corresponding child actor neural network 311 for the tuple in accordance with the current parameters of the child actor neural network and the child critic neural network. This determines 32 the predictive output, or expected reward (e.g. the expected performance), of the child neural network 310 for the received tuple from the training reward and the next training input dataset in the same tuple. The current values of the parameters of the child actor neural network 311 and the child critic neural network 312 are then updated 33 using the assessment of the advantage of the predictive output of the child neural network 310 and the reward from the environment, as well as based on the condition(s) and metric(s) obtained from the baseline simulator 400. For example, if the baseline simulator indicates that the child neural network is operating far below the best known possible performance, the parameters may be substantially altered; if the baseline simulator indicates that the child neural network is operating close to the best known possible performance, the parameters may be altered by only a small amount. The reward is typically based on a difference between the output of the child actor neural network and the metric provided by the baseline simulator.

b) Therefore, the current parameters of each child actor neural network 311 are updated 33 using the respective child critic neural network 312 and, e.g., an entropy loss function.

3) Periodically, after a predetermined condition is met (for example, after a certain number of actions have been processed by the child neural network 310) or, where applicable, a given training episode has been completed, the current values of the parameters of the global actor neural network 301 and the global critic neural network 302 are updated 35 based on gradients and/or parameters sent by the child neural networks. The child neural networks may each send these gradients/parameters at the same time, or the time of transmission of gradients/parameters may differ for the various child neural networks.

4) Periodically, e.g. in accordance with a predetermined condition, the parameters of the child actor neural network 311 and the child critic neural network 312 are updated 36 using the parameters of the global neural network 300.

[0129] The updating 36 of the current parameters of each child critic neural network 312 is performed using the error between the reward given by the environment for performing the chosen action and the estimate of the expected reward generated by the child critic neural network from the output of the child actor neural network 311 (e.g. the difference between a desired value and an achieved value). Optionally, an error may be used between the output of the child actor neural network 311 for the current observation and the results of the baseline simulator 400. Typically, the system determines an update for the current parameters that reduces this error using traditional machine learning and optimization techniques, such as by performing a backpropagation gradient descent iteration.
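
The following sketch illustrates a single such gradient descent iteration for a deliberately simplified linear critic (the mixing of the baseline result into the target is an illustrative assumption, not the claimed method):

def critic_update(critic_weight, obs, actual_reward, baseline_reward=None, learning_rate=0.05):
    estimate = critic_weight * obs                       # linear critic, purely for illustration
    target = actual_reward if baseline_reward is None else 0.5 * (actual_reward + baseline_reward)
    error = target - estimate                            # error between desired and achieved value
    gradient = -error * obs                              # d(0.5 * error**2) / d(critic_weight)
    return critic_weight - learning_rate * gradient      # one backpropagation-style descent step

new_weight = critic_update(critic_weight=0.1, obs=25.0, actual_reward=-2.0, baseline_reward=-1.0)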

[0130] To update the parameters of the child actor neural network 311, the reinforcement learning system 110, having processed the input data using the child actor neural network, selects the next action and updates these parameters using one or more of:

• A gradient provided by the child critic neural network 312 (gradient 1) in relation to the next action taken when processing a related input and output in accordance with the current parameters of the child critic neural network (e.g. an improvement and/or reduction in performance that the child critic neural network determines would occur if an action were suggested based on the current parameters of the child actor neural network);

• A gradient provided by the child actor neural network 311 (gradient 2) in relation to the parameters of the child actor neural network, taken during the training observation, and in accordance with the current values of the parameters of the child actor neural network.

[0131] Effectively, if an alteration to the parameters of the child actor neural network has had a positive effect on the child neural network (e.g. if this alteration results in the same conditions being achieved for a lower electricity cost), then a similar alteration in parameters may be implemented again. Furthermore, if the prior alteration had a large positive effect, then a large similar alteration may be implemented. On the other hand, if an alteration to the parameters of the child actor neural network has had a negative effect, an alternate alteration may be implemented.

[0132] The reinforcement learning system 110 may compute gradient 1 and gradient 2 by means of backpropagation of the corresponding gradients via the corresponding neural networks.

[0133] In general, the reinforcement learning system 110 performs this process for every input tuple, after every update of the parameters of the child critic neural network 312 or of the child actor neural network 311 (e.g. after each training step). As soon as the updates for each tuple are calculated, the reinforcement learning system updates the current parameters of each child actor neural network and each child critic neural network using the tuple updates. In this way, the reinforcement learning system iteratively improves the parameters of each of the neural networks.

[0134] Periodically, using the gradients of the child actor neural networks 311 and the child critic neural networks 312, the reinforcement learning system 110 updates 35 the parameters of the global actor neural network 301 and the global critic neural network 302.

[0135] Reverse synchronization of parameters from the global neural network 300 occurs in accordance with a condition applicable to the child neural networks 310 (e.g. this occurs when a certain number of actions have been performed by the child neural networks), and typically occurs by copying the parameters of the global neural network 300 after the stage of updating the parameters of the global neural network.

[0136] After processing a certain number of minibatches of data, the global neural network 300 resets the gradients and the process is repeated until the required quality of decision making by the agent is achieved.

[0137] Thus, by iteratively updating the current parameter values, the reinforcement learning system 110 trains the global neural network 300 and the child neural networks 310 to generate neural network outputs that represent time-discounted cumulative future rewards that may be received in response to the agent 200 performing a given action in response to a given input dataset.

[0138] Considering this in practical terms, the first child neural network 310-1 may interact with a copy of the environment 101-1 that relates to a crowded environment. After a number of training stages, the parameters of the first child neural network will reach values that provide good operation of the agent 200 for this type of environment. Equally, the second child neural network 310-2 may interact with a copy of the environment 101-2 that relates to a sparse environment and may end up with parameters that provide good operation of the agent for this type of environment. The global neural network 300 receives gradients from each of these child neural networks and uses these gradients to form a neural network that provides good operation in each environment. The parameters of this global neural network are then periodically copied to the child neural networks. The parameters may provide sub-optimal operation for one or more of the types of environments, so the child neural networks are re-trained based on the respective copies of the environment (and then the parameters of the child neural networks are again provided to the global neural network). Over time, all of the neural networks converge to provide a neural network that provides optimal operation for each type of environment.

[0139] It is also possible to apply these methods using a plurality of agents 200, where each agent may have a different set of action spaces, each of which sequentially or simultaneously interacts with the copies of the environments 101. For example, a first agent 200-1 may be arranged to control air conditioners, while a second agent 200-2 may be arranged to control radiators (and each of these agents will have different possible actions and different action spaces). For each of these agents, there may be created a separate global actor neural network, global critic neural network, child actor neural networks, and child critic neural networks. This enables rapid training of neural networks based on differing action spaces.

[0140] An illustration of a disclosed interaction scheme for a system comprising a plurality of agents is shown in Figure 9. With this scheme each of the agents 200-1, 200-2, 200-K interacts with a corresponding reinforcement learning system 110-1, 110-2, 110-K. Each of these reinforcement learning systems comprises a plurality of child neural networks. Each of these reinforcement learning systems is arranged to interact with a shared set of copies of the environment 101-1, 101-2, 101-N, where for each of the copies of the environment, one child neural network from each reinforcement learning system interacts with this copy.

[0141] It will be appreciated that the reinforcement learning systems could also be trained entirely independently with entirely separate copies of the environment.

[0142] With the scheme of Figure 9, backpropagation and parameter updates for these child neural networks and global neural networks may occur independently of each other, e.g. each reinforcement learning system 110-1, 110-2, 110-K may be trained separately, and they may have neither common parameters nor common gradients.

[0143] In some embodiments, the copies of environment 101-1, 101-2, 101-N are forced to accept actions from all agents 200 at the same time (in one tuple). More specifically, all agents perform actions and add their actions to the combined tuple, which is then passed to the environment to obtain an updated state. In practice, this might mean that for each copy of the environment, actions relating to the first agent 200-1 are performed first (e.g. turning on or off air conditioning units). Then actions relating to the second agent 200-2 are performed (e.g. turning on or off radiators).
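
A sketch of such a combined tuple (with a stub environment and hypothetical agent policies) is shown below:

def apply_actions(state, actions):
    # Stub environment: each action nudges a scalar temperature state.
    return state + sum(actions)

def combined_step(environment_state, agents):
    # All agents act at once; their actions are combined into one tuple before
    # being passed to the copy of the environment.
    combined_actions = tuple(agent["policy"](environment_state) for agent in agents)
    return apply_actions(environment_state, combined_actions)

agents = [
    {"name": "air_conditioning", "policy": lambda s: -0.5 if s > 23.0 else 0.0},
    {"name": "radiators",        "policy": lambda s: 0.5 if s < 23.0 else 0.0},
]
next_state = combined_step(environment_state=25.0, agents=agents)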

[0144] In cases where the interaction with the copy of the environment 101 does not occur simultaneously, firstly the order of interaction with the environment 101 is established, after which the agents 200, in the prescribed order, perform actions in turn. By performing actions in this way, each agent in turn influences the environment 101, so that each agent receives, in the training tuple, the state of the environment 101. This state will have been influenced by agents earlier in the order.

[0145] Agents 200 having an identical set of available actions are typically combined into a united agent having one set of global actor neural networks, global critic neural networks, child actor neural networks, and child critic neural networks. This avoids redundant training being performed (which wastes available computing power). For example, there may exist a plurality of agents that each control similar HVAC devices; these agents may have the same action space and so may be combined and trained simultaneously. The action space may be determined based on the input device configurations and descriptions dataset 102.

[0146] Typically, at each step of training, the child actor neural networks 311 and the child critic neural networks 312 train a policy, calculate the effect of a change in the parameters of these neural networks and update the gradients in the global neural network 300 using not only the environment and information about the received or potential actions, but also the results of the baseline simulator 400 in addition to an entropy estimation function to improve the quality of learning. Key performance metrics or desirable conditions provided by the baseline simulator are also used by the child critic neural network to assess the quality of the chosen policy. For example, the child critic neural networks may determine whether the desirable conditions have been achieved and may then evaluate the performance metric to determine whether the neural network is performing well.

[0147] Typically, the baseline simulator provides a separate metric for each of the child networks and/or for each of the copies of the environment.

[0148] The output of the baseline simulator 400 typically contains a number of metrics alongside rewards/priorities for these metrics. The rewards accumulated by the baseline simulator 400 are then compared to the rewards accumulated by each agent 200.

[0149] The performance of each agent 200 may then be compared to the baseline simulator 400. For example, the metric provided by the baseline simulator may relate to the performance of the baseline simulator (e.g. how close the baseline simulator has come to achieving a certain temperature and/or humidity, and how much electricity was required to achieve these conditions). This baseline simulator metric can then be compared to each agent to determine whether any of the agents has outperformed the baseline simulator.

[0150] The process of determining an update for the current parameter values of the global neural network 300 or the child neural networks 310 can be performed by a system of one or more computers located at one or more locations.

[0151] Additionally, any neural network may include one or more batch normalization layers.

[0152] By repeating the above processes many times, using many different subsets of experience tuples, the reinforcement learning system 110 is arranged to train the global neural network 300 to have optimal parameters to control the devices in the environment in order to achieve a desired set of conditions (which desired set of conditions may be based on a user selection, or may be based on the output of a neural network). The actions selected by the global neural network given a particular set of inputs are provided to the agent 200, which agent is arranged to interact with the environment (e.g. with devices within the environment) in accordance with the provided actions.

[0153] In contrast to existing methods, additional sources of simulated reinforcement are used to update all parameters in the neural networks. Thus, in comparison with asynchronous advantage actor-critic methods, training for evaluating policies is oriented not only towards randomly selected or learned values of the value function as the basic indicators, but also towards simulated key indicators or baselines, which reduces the time needed for optimal policies to converge.

[0154] An example of a use of the disclosed methods is to maintain desired conditions in the interior of a building. With this use, the agent may be arranged to alter the operation of radiators, air conditioners, dehumidifiers etc. in order to maintain a desired temperature range, humidity range, etc.

[0155] Referring to Figure 10, once the reinforcement learning system 110 has been trained, the reinforcement learning system and/or one of the constituent neural networks of the reinforcement learning system may be used to control devices in an environment, such as the room 10. Figure 10 describes an embodiment in which the reinforcement learning system is used alongside the agent 200 to control an HVAC device or an HVAC system. An HVAC system is a system comprising a plurality of devices, which may comprise a central control system or controller (e.g. a computer device arranged to control the plurality of devices).

[0156] In a first step 41, a (trained) neural network receives an input relating to a change in the environment. This change may relate, for example, to: a change in the weather; a change in the population inside the environment; and/or the opening and closing of doors. The input may be received from a sensor of an HVAC device.

[0157] Typically, a plurality of HVAC devices are connected to form an HVAC system that can be controlled by a central controller. The controller may then send inputs to each of the HVAC devices, for example a desired change in conditions may be transmitted to the devices, or a change in the external conditions may be transmitted. Equally, the controller may comprise the neural network, where the central controller determines an appropriate action for each of the HVAC devices and transmits these actions to the devices. The controller may comprise a plurality of neural networks, which neural networks relate to various devices and/or groups of devices.

[0158] Equally, the neural network may be present on a computer device of a single HVAC device.

[0159] In a second step 42, the neural network provides (e.g. recommends) an action based on the input. The action is determined based on the parameters of the neural network, which parameters have been determined using the above-described learning method so that the neural network provides a suitable action given an input.

[0160] In a third step 43, the neural network implements the action via the agent 200. For example, the agent may alter the operation of a device.

[0161] In a fourth step 44, the HVAC device and/or the HVAC system provides an output relating to the actions. For example, the device may display an energy usage to a user and/or sound an alarm that indicates deviation from a set of target conditions. Typically, each HVAC device and/or the HVAC system is arranged to periodically output usage information relating to the usage of the devices in the environment.
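
Purely as an illustration of steps 41 to 44 (all function and device names are hypothetical stand-ins, not a specific device API):

def control_step(trained_network, read_sensor, apply_action, report_usage):
    observation = read_sensor()                 # step 41: receive an input, e.g. from a sensor
    action = trained_network(observation)       # step 42: the network recommends an action
    energy_used = apply_action(action)          # step 43: the agent alters device operation
    report_usage(energy_used)                   # step 44: the device outputs usage information

control_step(
    trained_network=lambda obs: -0.5 if obs > 23.0 else 0.0,
    read_sensor=lambda: 25.0,
    apply_action=lambda action: abs(action) * 1.2,   # stub: energy used for the action, in kWh
    report_usage=print,
)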

[0162] In practice, using the reinforcement learning system 110 to control an HVAC device typically comprises using one of the constituent neural networks (e.g. the global actor neural network 301) to control the HVAC device.

[0163] This neural network (e.g. the global actor neural network 301) may be trained using the methods disclosed herein and then may be transferred to an HVAC device or to a controller for an HVAC system. This may comprise transferring the parameters of the trained global actor neural network to the HVAC device or the controller. This trained neural network can then control the HVAC device in dependence on received inputs (e.g. measured changes in the external conditions and/or signals received from an HVAC control system).

[0164] Therefore, typically, the reinforcement learning system 110 as a whole is only present during training, with the global neural network 300 (and/or the global actor neural network 301) being transferred from a training device to an HVAC device once the neural networks have converged and the training of the reinforcement learning system has been completed.

ALTERNATIVES AND MODIFICATIONS:

[0165] It will be understood that the present invention has been described above purely by way of example, and modifications of detail can be made within the scope of the invention.

[0166] For example, the training of the reinforcement learning system 110 may comprise and/or involve one or more of: unsupervised deep learning and/or non-deep learning; generative adversarial networks (GANs); variational autoencoders (VAEs). Such methods may enable a once-trained model for a device to be re-used for similar devices with non-similar action spaces. While the reinforcement learning system 110 typically uses reinforcement learning, other machine learning techniques may be used instead of, or in addition to, reinforcement learning.

[0167] The detailed description has primarily considered the use of actor-critic systems. More generally, the reinforcement learning system 110 may use any type of learning system, where the reinforcement learning system typically still comprises an arrangement with a global learning system and a plurality of child learning systems.

[0168] The type of machine learning systems used may be determined based on the environment and/or the input datasets.

[0169] In some embodiments, the reward generation function used for the training of an actor-critic system is dynamic and/or is selected from a plurality of possible reward generation functions. For example, the reward generation function may depend on a convergence rate and/or a difference in performance relating to the key metrics (e.g. if the actor-critic system is near optimal for a first of the key metrics but performs poorly for a second key metric, the reward generation function may be altered to prioritise the second key metric).
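
A hedged sketch of such a selection between reward generation functions follows (the metric names, weights and selection rule are illustrative assumptions only):

def select_reward_function(metric_errors):
    # metric_errors: hypothetical mapping of key metric name -> normalised error.
    worst_metric = max(metric_errors, key=metric_errors.get)
    if worst_metric == "comfort":
        return lambda m: -2.0 * m["comfort"] - 0.5 * m["energy"]   # prioritise comfort
    return lambda m: -0.5 * m["comfort"] - 2.0 * m["energy"]       # prioritise energy

reward_fn = select_reward_function({"comfort": 0.05, "energy": 0.40})
reward = reward_fn({"comfort": 0.05, "energy": 0.40})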

[0170] In some embodiments, training the reinforcement learning system 110 may comprise auto-correction of hyperparameters.

[0171] In some embodiments, a plurality of agents are trained concurrently, with these agents having differing reward generation functions. This may involve a plurality of agents being trained for a single device or for a group of devices. These agents may then be activated and operated in dependence on a desired set of conditions. For example, a first group of agents may be trained to achieve optimally efficient operation for a first set of target conditions and a second group of agents may be trained to achieve optimally efficient operation for a second set of target conditions.

[0172] The environment typically relates to a building; in some embodiments the environment relates to a plurality of related buildings (for example a plurality of neighboring buildings). The reinforcement learning system 110 and agent 200 may then be trained to obtain optimal joint operation (e.g. one building may be affected by the air conditioning being turned on for another building).

[0173] In some embodiments, game theory is used to determine a rational allocation of resources.

[0174] Although this invention has been shown and described with reference to certain embodiments thereof, those skilled in the art will appreciate that various changes and modifications may be made therein without departing from the actual scope of the invention.