

Title:
TRAINED OPTIMIZATION AGENT FOR RENEWABLE ENERGY TIME SHIFTING
Document Type and Number:
WIPO Patent Application WO/2024/084125
Kind Code:
A1
Abstract:
According to an example embodiment, a device is configured to embody a bidding optimizer 200. The bidding optimizer 200 functions as an intermediate agent between a renewable energy production site system 299 and an electricity exchange 208. The bidding optimizer may comprise one or more processors 101 and one or more memories 102 that comprise computer program code 103. The device 100 may also comprise at least one communication interface 104, as well as other elements, such as an input/output module. The functionality of the bidding optimizer 200 is achieved with a reinforcement learning agent (RLA) algorithm, by configuring the bidding optimizer 200 to reinforce the learning cycle of the RLA by using external data 218 from the electricity exchange 208, such as market rate, market demand, weather forecast and electricity exchange forecast, and internal data using local variables stored in the one or more memories 103. Finally, the RLA algorithm is configured to produce an optimized bid for the electricity exchange 208, depending on the external data and on the operation during the operation interval.

Inventors:
SIERLA SEPPO (FI)
HÖLTTÄ RIKU (FI)
Application Number:
PCT/FI2022/050693
Publication Date:
April 25, 2024
Filing Date:
October 19, 2022
Assignee:
AALTO UNIV FOUNDATION SR (FI)
International Classes:
G06Q40/04; G06Q50/06
Attorney, Agent or Firm:
PAPULA OY (FI)
Claims:
CLAIMS

1. A device (100) for an electricity power bid, comprising: at least one processor (101); and at least one memory (102) including programmable logic, where the at least one memory and the programmable logic are configured, with the at least one processor, to cause the device (100) to operate a bidding optimizer (200) that is configured to: control an electricity energy storage (204) to compensate for electricity power generation forecasts used at an electricity power bidding time; control renewable electricity power generation (202) so that a maximum power flow limit at a grid coupling (206) is not exceeded; apply the electricity energy storage to shift providing of the generated renewable electricity power to intervals with a higher demand; optimize a reward function to avoid an unnecessary penalty and energy storage aging; and reinforce a learning cycle of a reinforcement learning agent (103) algorithm that is used to optimize the operation of the bidding optimizer as per the configuration.

2. The device (100) according to claim 1, wherein the device is configured to execute the revenue optimization function as follows: where r is a reward for an operation interval T, the revenue() function is a multiplication of an amount of electricity to be delivered e_exchange with an accepted rate rate_accept by an exchange, the penalty() function is the penalty associated with rules of the electrical power exchange and the age() function is the aging of the energy storage (204), which is an approximation of equivalent full cycles of storage aging during the operation interval, multiplied by the penalty of one cycle for the energy storage (204).

3. The device (100) according to claim 2, wherein the bidding optimizer is further configured to maximize a value of the reward r function by maximizing the revenue() function and minimizing the penalty() function and the age() function.

4.
The device (100) according to any preceding claim, wherein the bidding optimizer is further configured to use a battery storage as the electricity energy storage to manage errors in the electricity power generation forecasts used at the electricity power bidding time.

5. The device (100) according to any preceding claim, wherein the bidding optimizer is further configured to ensure that the maximum power flow limits at the grid coupling are never exceeded, despite fluctuations in the renewable electricity power generation.

6. The device (100) according to any preceding claim, wherein the bidding optimizer is further configured to shift the providing of the generated renewable electricity power based on only uncertain demand forecasts available at the time of placing the bid.

7. The device (100) according to any preceding claim, wherein the bidding optimizer is further configured to avoid the energy storage discharging and charging operations if a benefit of the operations does not exceed the energy storage aging.

8. The device (100) according to any preceding claim, wherein the bidding optimizer is further configured to control both the electricity energy storage and renewable energy generator to meet an accepted demand of electricity from the electricity exchange and to avoid a penalty of exceeded delivery of electricity.

9.
The device (100) according to any preceding claim, wherein the device (100) comprises an interface (104) comprising: i) a data input that is configured to receive any data related to training the reinforcement learning agent such as weather forecast, electricity exchange forecast and/or State of Charge (SoC) of the energy storage (204) associated with a system (299); ii) a data output that is configured to output an action for the electricity exchange, wherein the action includes the exchange data for an upcoming operation interval T of the system, where the operation interval T is divided into N integer timesteps; iii) output messages to control a renewable energy controller (201) and an energy storage manager (203); and wherein the memory (102) is further configured for storing required variables used by the reinforcement learning agent (103).

10. The device (100) according to any preceding claim, that is further configured for a renewable energy site comprising at least one generator (202) and at least one energy storage (204), where the at least one generator (202) comprises at least one associated renewable energy controller (201) and the at least one energy storage (204) comprises at least one energy storage manager (203), where the renewable energy controller (201) and the energy storage manager (203) are configured to receive control messages (210) and (230) from the bidding optimizer (200).

11. The device (100) according to any preceding claim, wherein the device (100) is capable of operating in either a virtual or physical environment.

12.
The device (100) according to any preceding claim, wherein the device (100) can be either a separate physical device where the physical device can be: a Field-programmable Gate Array (FPGA), Application-specific Integrated Circuits (ASIC), Application-specific Standard Products (ASSP), System-on-a-chip systems (SOC), Complex Programmable Logic Device (CPLD), or a Graphics Processing Unit (GPU), or wherein the device (100) can be a virtual device, programmed with any general computer architecture, to host any function of any preceding claim.

13. A method for an electricity power bid, comprising: controlling an electricity energy storage (204) to compensate for electricity power generation forecasts used at an electricity power bidding time; controlling renewable electricity power generation (202) so that a maximum power flow limit at a grid coupling (206) is not exceeded; applying the electricity energy storage to shift providing of the generated renewable electricity power to intervals with a higher demand; optimizing a reward function to avoid an unnecessary penalty and energy storage aging; and reinforcing a learning cycle of a reinforcement learning agent (103) algorithm that is used to optimize the operation of the bidding optimizer as per the configuration.

14. A computer program product comprising program code configured to perform the method according to claim 13 when the computer program product is executed on a computer.
Description:
Trained optimization agent for renewable energy time shifting

TECHNICAL FIELD

The present disclosure generally relates to the field of renewable electrical energy production. In particular, the present disclosure relates to a device, a method, a system and a computer program to operate productivity for producing controlled renewable electrical energy.

BACKGROUND

The background art considers a site with renewable generation and an energy storage such as a battery or a pumped waterpower plant or the like having a capability to store large amounts of energy and convert it to electricity, producing electricity to an electrical power market with demands that may vary from one market interval to the next. A battery may be used to shift the electricity supply to market intervals with higher demands. A maximum power limit constraint at the site's grid coupling, which must be considered as a constraint for managing power flows, places a considerable risk for the producer.

Since renewable energy sales are made using possibly erroneous generation forecasts, the battery can also be used to cope with differences between actual and forecasted generation. If the battery becomes full or empty during this process, it may be impossible to deliver the committed energy to the grid without violating the power limit constraint, so the system may incur a penalty from the market. Further, the battery may be used to shift sales of electricity from low demand market intervals to high demand market intervals, which is done by bidding a smaller than forecasted sales volume at low demand intervals and a larger than forecasted sales volume at high demand intervals.

SUMMARY

The scope of protection sought for various example embodiments of the disclosure is set out by the independent claims.
The example embodiments and features, if any, described in this specification that do not fall under the scope of the independent claims are to be interpreted as examples useful for understanding various example embodiments of the disclosure.

An example embodiment of a device for an electricity power bid comprises: at least one processor; and at least one memory including programmable logic, where the at least one memory and the programmable logic are configured, with the at least one processor, to cause the device to operate a bidding optimizer. The bidding optimizer is configured to control an electricity energy storage to compensate for electricity power generation forecasts used at an electricity power bidding time. Furthermore, it is configured to control renewable electricity power generation so that a maximum power flow limit at a grid coupling is not exceeded. Furthermore, it is configured to apply the electricity energy storage to shift providing of the generated renewable electricity power to intervals with a higher demand. Furthermore, it is configured to optimize a reward function to avoid an unnecessary penalty and energy storage aging. Even furthermore, it is configured to reinforce a learning cycle of a reinforcement learning agent algorithm that is used to optimize the operation of the bidding optimizer as per the configurations. For example, the device may optimize the bids to the electrical power market, using market demand and renewable generation forecasts, so that the utilization of the power market and the wear of the energy storage are optimized.

In an example embodiment, alternatively or in addition to the above-described example embodiments, the device is configured to execute a revenue optimization function as follows: In the function r is a reward for an operation interval T.
The revenue() function is a multiplication of an amount of electricity to be delivered e_exchange with an accepted rate rate_accept by an exchange. The penalty() function is the penalty associated with rules of the electrical power exchange. The age() function is the aging of the energy storage, which is an approximation of equivalent full cycles of storage aging during the operation interval, multiplied by the penalty of one cycle for the energy storage. For example, the device may optimize the bids to an electricity exchange, using market demand and renewable generation forecasts, so that the utilization of the electricity exchange and wear of the energy storage is optimized.

In an example embodiment, alternatively or in addition to the above-described example embodiments, the device is configured to use a battery storage as the electricity energy storage to manage errors in the electricity power generation forecasts used at the electricity power bidding time. For example, the battery may be a Lithium-Ion (Li) type battery that provides a longer lifespan, higher efficiency and higher depth of discharge than conventional Lead-Acid type batteries.

In an example embodiment, alternatively or in addition to the above-described example embodiments, the device is configured to ensure that the maximum power flow limits at the grid coupling are never exceeded, despite fluctuations in the renewable electricity power generation. For example, violations of the grid code or failure to deliver the accepted amount of energy may cause a huge penalty on the operator of the renewable energy production site, and thus should be avoided.

In an example embodiment, alternatively or in addition to the above-described example embodiments, the device is configured to shift the providing of the generated renewable electricity power based on only uncertain demand forecasts available at the time of placing the bid.
For example, the device may be able to configure the power distribution based solely on simulation data.

In an example embodiment, alternatively or in addition to the above-described example embodiments, the bidding optimizer is further configured to control both the electricity energy storage and renewable energy generator to meet an accepted demand of electricity from the electricity exchange and to avoid a penalty of exceeded delivery of electricity. For example, the device may be able to utilize various electricity energy sources concurrently and make efforts to avoid the penalty for undesired delivery of electricity to the grid.

In an example embodiment, alternatively or in addition to the above-described example embodiments, the device comprises an interface, wherein the interface may comprise a data input for receiving data related to a weather forecast, electricity exchange forecast, electricity exchange rate, electricity exchange demand and state of charge (SoC), and an output configured to provide an action for the electricity exchange, wherein the action comprises the exchange data for an upcoming operation interval T of the system. Additionally, the interface may comprise outputs to control a renewable energy controller and an energy storage manager and store required variables used by the reinforcement learning agent in the memory of said device. For example, the reinforcement learning agent may use the data received by the input to continue the learning cycle while in operation or control the shifting of the energy with the control messages to avoid either the penalty of the grid violation or optimize the bid for the exchange.
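The interface data described above could be represented, for illustration only, by a few plain data containers. The following is a minimal sketch under stated assumptions: the class and field names (`InterfaceInput`, `BidAction`, `bid_energy`, `bid_rate`) are hypothetical and do not appear in the disclosure, and the SoC is assumed to be normalized to the range 0 to 1.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class InterfaceInput:
    """Data received at the data input (all names are hypothetical)."""
    weather_forecast: List[float]   # one value per timestep of the interval T
    exchange_forecast: List[float]  # forecast demand or rate per timestep
    exchange_rate: float            # current electricity exchange rate
    exchange_demand: float          # current electricity exchange demand
    soc: float                      # state of charge, assumed normalized to 0..1

    def is_valid(self, n_timesteps: int) -> bool:
        # Both forecasts must cover the N timesteps of the operation interval,
        # and the state of charge must lie in the assumed normalized range.
        return (
            len(self.weather_forecast) == n_timesteps
            and len(self.exchange_forecast) == n_timesteps
            and 0.0 <= self.soc <= 1.0
        )


@dataclass
class BidAction:
    """Action output for the upcoming operation interval T (hypothetical)."""
    bid_energy: float  # amount of electricity offered to the exchange
    bid_rate: float    # rate asked for that amount


example_input = InterfaceInput(
    weather_forecast=[0.2, 0.4, 0.6, 0.8],
    exchange_forecast=[30.0, 35.0, 50.0, 45.0],
    exchange_rate=40.0,
    exchange_demand=10.0,
    soc=0.5,
)
example_bid = BidAction(bid_energy=4.0, bid_rate=42.0)
```

Keeping the input and the action in separate containers mirrors the separation between the data input and the data output of the interface.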
In an example embodiment, alternatively or in addition to the above-described example embodiments, the device is configured for a renewable energy site comprising at least one generator and at least one energy storage, where the at least one generator comprises at least one associated renewable energy controller and the at least one energy storage comprises at least one energy storage manager, where the renewable energy controller and the energy storage manager are configured to receive control messages from the device. For example, the device may use the control messages to direct excess power generated by the generator to charge the energy storage or to direct more power from the storage to the grid should the generator not provide the required energy for the grid due to fluctuations in the generation.

In an example embodiment, alternatively or in addition to the above-described example embodiments, the device is capable of operating in either a virtual or physical environment. For example, the device and its reinforcement learning agent algorithm may be trained in a completely virtual setting to provide the algorithm means to predictively operate in an actual physical renewable energy site.

In an example embodiment, alternatively or in addition to the above-described example embodiments, the device may either be a separate physical device, where the physical device can be: a Field-programmable Gate Array (FPGA), Application Specific Integrated Circuit (ASIC), Application Specific Standard Product (ASSP), System-on-a-chip system (SoC), Complex Programmable Logic Device (CPLD) or a Graphics Processing Unit (GPU), or wherein the device may be a virtual device, programmed within any general computer architecture. For example, a renewable energy site may use the device and its input-output data with a cloud service to operate the pre-existing generator and energy storage of the site.
An example embodiment of a method for an electricity power bid comprises means to control at least one electricity energy storage to compensate for electricity power generation forecasts used at an electricity power bidding time to predict the behaviour of the energy storage and optimize its aging penalty. Furthermore, the method comprises means to control at least one renewable energy power generator so that maximum power flow limits at energy grid coupling are not exceeded, avoiding penalties of grid code violations and failed delivery forced upon the operator of the method. Additionally, the method comprises optimization of a reward function and reinforcement of the learning cycle of a reinforcement learning agent algorithm that is used to optimize the operation of the method as per the described configuration of this method and previous embodiments.

An example embodiment of a computer program product comprises program code configured to perform the method according to any of the above example embodiments, when the computer program product is executed on a computer.

DESCRIPTION OF THE DRAWINGS

The accompanying drawings, which are included to provide a further understanding of the example embodiments and constitute a part of this specification, illustrate example embodiments and together with the description help to explain the principles of the example embodiments. In the drawings:

Fig. 1 is a block diagram of an embodiment disclosed herein, illustrating a device (100) configured to function as an intermediate agent between a renewable energy production site and an electricity exchange. The device may be configured as a bidding optimizer.

Fig. 2 is an illustration of an example embodiment of the subject matter described herein illustrating a bidding optimizer (200) configured to operate as an intermediate between a renewable energy production site system (299) and an electricity exchange (208).

Fig.
3 is a block diagram of an example embodiment of the subject matter described herein illustrating a bidding optimizer (200) configured to operate as an intermediate between a renewable energy production site system (299) and an electricity exchange (208), to send and receive certain data, such as forecasts and exchange data, to optimize the behaviour of said renewable energy production site during an operation cycle.

Fig. 4 is a flowchart for the logic and training method of an example embodiment. The flowchart comprises different conditions that affect the behaviour of the generator through the controller and the energy storage through the energy storage manager.

DETAILED DESCRIPTION

Reference will now be made in detail to example embodiments, examples of which are illustrated in the accompanying drawings. The detailed description provided below in connection with the appended drawings is intended as a description of the present examples and is not intended to represent the only forms in which the present disclosure may be constructed or utilized. The description sets forth the functions of the example and the sequence of steps for constructing and operating the example. However, the same or equivalent functions and sequences may be accomplished by different example embodiments.

Fig. 1 is a block diagram of a device 100 configured as an embodiment disclosed herein, where the embodiment may be related to a bidding optimizer 200. The bidding optimizer 200 functions as an intermediate agent between a renewable energy production site system 299 and an electricity exchange 208 (shown for example in Fig. 2 and 3). The bidding optimizer may comprise one or more processors 101 and one or more memories 102 that comprise computer program code 103. The device 100 may also comprise at least one communication interface 104, as well as other elements, such as an input/output module (not shown in Fig. 1).
The functionality of the bidding optimizer 200 is achieved with a reinforcement learning agent (RLA) algorithm, by configuring the bidding optimizer 200 to reinforce the learning cycle of the RLA by using external data 218 from the electricity exchange 208, such as market rate and demand (not shown in Fig. 1; shown in Fig. 2 and 3), and internal data using local variables stored in the one or more memories 103. Finally, the RLA algorithm is configured to produce an optimized bid for the electricity exchange 208, depending on external data and operation during the operation interval.

Some terminology used herein may follow the naming scheme of reinforcement learning in its current form. However, this terminology should not be considered limiting, and the terminology may change over time. Thus, the following discussion regarding any example embodiment may also apply to other technologies. Reinforcement learning, RL, is a subfield of machine learning that considers the problem of a computational agent learning to make decisions by trial and error. RL may incorporate deep learning into the solution, allowing agents to make decisions from unstructured input data without manual engineering of the state space. Such RL algorithms are able to take in very large inputs, for example every pixel rendered to the screen in a video game, and decide what actions to perform to optimize an objective, for example maximizing the game score. RL has been used for a diverse set of applications including but not limited to simulations, design, etc.

Furthermore, the processor 101 may be capable of executing the stored instructions. In an example embodiment, the processor 101 may be embodied as a multi-core processor, a single core processor, or a combination of one or more multi-core processors and one or more single core processors.
For example, the processor 101 may be embodied as one or more of various processing devices, such as a coprocessor, a microprocessor, a controller, a digital signal processor (DSP), a processing circuitry with or without an accompanying DSP, or various other processing devices including integrated circuits such as, for example, an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a microcontroller unit (MCU), a hardware accelerator, a special-purpose computer chip, or the like. In an example embodiment, the processor 101 may be configured to execute a hard-coded functionality. In an example embodiment, the processor 101 is embodied as an executor of software instructions, wherein the instructions may specifically configure the processor 101 to perform the algorithms and/or operations described herein when the instructions are executed.

The memory 102 may be embodied as one or more volatile memory devices, one or more non-volatile memory devices, and/or a combination of one or more volatile memory devices and non-volatile memory devices. For example, the memory 102 may be embodied as semiconductor memories (such as mask ROM, PROM (programmable ROM), EPROM (erasable PROM), flash ROM, RAM (random access memory), etc.).

The device 100 may be embodied in, for example, a computer. Alternatively, or in addition, the computer may be a cloud computer system having various distributed units. Although the device 100 may be depicted to comprise only one processor 101, the device 100 may comprise more processors. In an example embodiment, the memory 102 is capable of storing instructions, such as an operating system and/or various applications.

When the device 100 is configured to implement some functionality, some component and/or components of the device 100, such as the at least one processor 101 and/or the memory 102, may be configured to implement this functionality.
Furthermore, when the at least one processor 101 is configured to implement some functionality, this functionality may be implemented using program code 103 comprised, for example, in the memory 102. For example, if the device 100 is configured to perform an operation, the at least one memory 102 and the computer program code 103 can be configured, with the at least one processor 101, to cause the bidding optimizer 200 to perform that operation.

Fig. 2 is an illustration of an example embodiment of the subject matter described herein illustrating a bidding optimizer 200 configured to operate as an intermediate between a renewable energy production plant system 299 and an electricity exchange 208. In the example configuration a renewable energy generator 202, for example a photovoltaic array or a wind turbine cluster, is configured with an energy storage 204 to provide energy for a power grid 207, where the power flow of the generator to either the grid tie inverter 205 (214) or energy storage 204 (213) is controlled by a respective renewable energy controller 201 and the power of the energy storage 204 is controlled by a respective energy storage manager 203. The bidding optimizer gathers data along its operation interval T to optimize a bid 217 for the electricity exchange 208. After the bid 217, the received input data 218 from the electricity exchange, which contains information such as demand and rate, is used to optimize the next and subsequent operation intervals.

Fig. 3 shows a block diagram of an example embodiment configured as an intermediate between a renewable energy production plant system 299 and an electricity exchange 208.
The goal of the bidding optimizer 200 is to operate an electricity energy storage 204 with control messages 212 through an energy storage manager (ESM) 203 to compensate for electricity power generation forecasts used at an electricity power bidding time, and to control renewable electricity power generation 202 via control messages 211 through a renewable energy controller (REC) 201 so that the maximum power flow 216 limits at a grid tie inverter 205 and grid coupler 206 are not exceeded. Additionally, the bidding optimizer 200 should control the energy storage 204 to shift providing of the generated renewable electricity power to intervals with a higher demand and to optimize the reward, which includes avoiding an unnecessary cost penalty and energy storage aging.

The renewable energy plant 299 generally has power flows from the generator to the grid tie inverter 214, from the energy storage to the grid tie inverter 215, and from the generator to the energy storage 213. These power flows are typically not optimized; therefore one of the goals of the bidding optimizer 200 is to optimize these power flows. All of the previously mentioned functions and goals of the bidding optimizer 200 are achieved with a reinforcement learning agent (RLA) algorithm, by configuring the bidding optimizer 200 to reinforce the learning cycle of the RLA by using external data 218 from the electricity exchange 208, such as market rate and demand, and internal data using local variables explained in more detail in relation to Fig. 3. Finally, the RLA algorithm is configured to produce an optimized bid for the electricity exchange 208, depending on external data and internal data of operation during the operation interval T.

The operation interval T, during which control actions by the REC 201 and the ESM 203 may be taken, is divided into N timesteps. A generalized Eq.
1 describes the power 216 delivered to the energy grid 207 through the grid coupler 206 at a constant rate P_grid as follows:

$$ P_{grid} = \frac{e_{exchange}}{T} \tag{1} $$

$$ P_{grid} = P_{bat\_grid} + P_{gen\_grid}, \tag{2} $$

where e_exchange is the amount of electricity to be delivered for the exchange 208 during the operation interval T, P_bat_grid is the power that is directed from the energy storage 204 to the grid 207 and P_gen_grid is the power directed from the generator 202 to the grid 207. The momentary power generation by the generation unit 202 is either directed to the grid 207 (P_gen_grid) or the energy storage 204 (P_gen_bat) or both:

$$ P_{gen} = P_{gen\_grid} + P_{gen\_bat} \tag{3} $$

and the energy delivered so far during the operation interval T at time t is:

$$ e_{delivered} = \sum_{0}^{t} P_{grid} \tag{4} $$

The energy that remains to be delivered during the operation interval is:

$$ e_{remain} = e_{exchange} - e_{delivered} \tag{5} $$

The grid coupler 206 has a maximum power flow P_max, which is specified in the contract with the utility. Exceeding this maximum power, even for a short time, can involve a large penalty and thus must be avoided. At this maximum power, the energy that could be delivered to the energy grid during the remainder of the operation interval is:

$$ e_{remain\_max} = P_{max} \cdot (T - t) \tag{6} $$

A reinforcement learning agent (RLA) is trained to form the bids within the bidding optimizer 200. An example embodiment of a method for this purpose is disclosed in Fig. 3. The input to the agent is a state data structure s. The state s includes the generation forecast and the exchange forecast for the upcoming operation interval(s) Tm. The state s also includes the State of Charge (SoC) of the energy storage 204 at the time of bidding. The output of the bidding optimizer agent 200 is an action a 217, which specifies the amount of available electricity to be bid and the rate rate_bid for the electricity exchange 208.
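The energy bookkeeping of Eqs. (1) to (5) can be illustrated with a short sketch. It is not the disclosed implementation: the class `IntervalEnergy` and its method names are hypothetical, Eq. (1) is read here as the committed energy divided by the interval length, the sum of Eq. (4) is weighted by an explicit timestep length so that units stay consistent, and the energy deliverable at P_max is taken as P_max times the remaining time.

```python
from dataclasses import dataclass


@dataclass
class IntervalEnergy:
    """Energy bookkeeping for one operation interval T (hypothetical helper)."""
    e_exchange: float         # energy committed to the exchange for the interval [MWh]
    interval_h: float         # length of the operation interval T [h]
    p_max: float              # maximum power flow at the grid coupler [MW]
    e_delivered: float = 0.0  # energy delivered so far [MWh]
    elapsed_h: float = 0.0    # time elapsed within the interval [h]

    def p_grid_constant(self) -> float:
        # Eq. (1), as read here: constant power that delivers e_exchange over T.
        return self.e_exchange / self.interval_h

    def step(self, p_bat_grid: float, p_gen_grid: float, dt_h: float) -> float:
        # Eq. (2): grid power is the sum of storage and generator contributions.
        p_grid = p_bat_grid + p_gen_grid
        if p_grid > self.p_max:
            raise ValueError("maximum power flow at the grid coupler exceeded")
        # Eq. (4): accumulate the delivered energy, weighted by the timestep length.
        self.e_delivered += p_grid * dt_h
        self.elapsed_h += dt_h
        return p_grid

    def e_remain(self) -> float:
        # Eq. (5): energy that remains to be delivered.
        return self.e_exchange - self.e_delivered

    def e_remain_max(self) -> float:
        # Assumed form: energy deliverable at P_max for the rest of the interval.
        return self.p_max * (self.interval_h - self.elapsed_h)


# Usage: a one-hour interval with four 15-minute timesteps, first step taken.
acct = IntervalEnergy(e_exchange=4.0, interval_h=1.0, p_max=5.0)
acct.step(p_bat_grid=1.0, p_gen_grid=3.0, dt_h=0.25)
print(acct.e_remain(), acct.e_remain_max())
```

Comparing `e_remain()` with `e_remain_max()` at each timestep indicates whether the commitment can still be met without exceeding the grid coupler limit.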
After taking the action a, the bidding optimizer 200 is configured to receive 218 the amount of electricity to be delivered e_exchange and the accepted exchange rate rate_accept 218 for the next operation interval T. Based on e_exchange, the bidding optimizer 200 is configured to form the control signals 211 and 212 that specify the setpoints of the power flows in Fig. 2. The embodied logic for determining these setpoints is disclosed in Fig. 3.

The inner loop in Fig. 3 runs for the duration of the operation interval T, after which it is time to compute a reward r, which specifies how good the action a was with respect to the multi-objective optimization targets stated under the purpose of this embodiment.

The independent variables, expressed as arrays of N elements, form the required reward function r as such:

$$ r = revenue() + c_1 \cdot penalty() + c_2 \cdot age(), \tag{7} $$

where c_1 and c_2 are adjustable negative weights to adjust the importance of each target in the multi-objective optimization and where functions revenue(), penalty() and age() are dependent on the variables as follows:

$$ revenue() = e_{exchange} \cdot rate_{accept}, \tag{8} $$

$$ penalty() = penalty(P_{grid}, e_{delivered}), \tag{9} $$

The formula for the function revenue() is the multiplication of the amount of electricity to be delivered with the accepted rate by the exchange. The penalty() function depends on the exchange rules of the specific electrical power exchange 208 and the age() function is the aging penalty of the energy storage 204, which is an approximation of equivalent full cycles of, for example, a battery aging during the operation interval, multiplied by the penalty of one cycle for the energy storage 204. For batteries, the details of this formula are dependent on the specific chemistry of the energy storage. For a pumped waterpower plant, the details may be dependent on the mechanical wear of the components.
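A worked sketch of the reward of Eq. (7) may help to show how the three terms interact. Only the overall form of Eqs. (7) and (8) is taken from the text; the bodies of `penalty()` and `age()` below are placeholder assumptions, since the penalty depends on the rules of the specific exchange and the aging formula on the storage technology. The shortfall-based penalty, the SoC-throughput approximation of equivalent full cycles, and all parameter names and default values are hypothetical.

```python
from typing import Sequence


def revenue(e_exchange: float, rate_accept: float) -> float:
    # Eq. (8): delivered amount multiplied by the accepted rate.
    return e_exchange * rate_accept


def penalty(e_exchange: float, e_delivered: float, penalty_rate: float) -> float:
    # Placeholder exchange rule (assumption): a fixed charge per unit of
    # energy delivered above or below the committed amount.
    return penalty_rate * abs(e_exchange - e_delivered)


def age(soc_trace: Sequence[float], cycle_cost: float) -> float:
    # Placeholder aging model (assumption): equivalent full cycles are
    # approximated as half of the total SoC throughput over the interval,
    # multiplied by the cost of one full cycle.
    throughput = sum(abs(b - a) for a, b in zip(soc_trace, soc_trace[1:]))
    return (throughput / 2.0) * cycle_cost


def reward(e_exchange: float, rate_accept: float, e_delivered: float,
           soc_trace: Sequence[float], c1: float = -1.0, c2: float = -1.0,
           penalty_rate: float = 100.0, cycle_cost: float = 20.0) -> float:
    # Eq. (7): the weights c1 and c2 are negative, so penalty and aging reduce r.
    return (revenue(e_exchange, rate_accept)
            + c1 * penalty(e_exchange, e_delivered, penalty_rate)
            + c2 * age(soc_trace, cycle_cost))


# Usage: 4 MWh accepted at a rate of 50, 3.5 MWh delivered, SoC sampled per timestep.
r = reward(e_exchange=4.0, rate_accept=50.0, e_delivered=3.5,
           soc_trace=[0.5, 0.75, 0.25, 0.5])
print(r)  # 140.0 = 200 revenue - 50 penalty - 10 aging
```

With these placeholder models, a larger magnitude of c_2 discourages storage cycling whose benefit does not exceed the aging cost, while a larger magnitude of c_1 prioritizes meeting the commitment.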
The bidding optimizer 200 may require several iterations before it is able to bid successfully; it can therefore be advantageous to perform the training with a market simulator, developed from historical market data, rather than within the actual physical environment. A properly trained reinforcement learning agent will be able to generalize to perform well even if the simulation is not a high-fidelity model of the physical environment. The bidding optimizer 200 can continue to learn as it executes the training method during the physical operation interval.

The independent variables a, r, s are stored in a general memory 103 and are used to optimize the training method described in Fig. 3. The input interface 300 communicates through the I/O bus that sends the action a and receives the state variable s. One full operation interval T lasts for t = N cycles, so the inner feedback loop 320 runs for the full cycle until the interface resets with the outer feedback loop 321. Based on the received $e_{exchange}$, the training method comprises several Boolean logic checks (true or false), depending on the required and generated power, the state of charge (SoC) and the remaining electricity to be delivered $e_{remain}$. For the Boolean checks in Fig. 3, T denotes a true clause and F denotes a false clause.

The first Boolean check 301 checks whether the complete system has delivered the required electricity, leading to either Boolean check 302, where the SoC is polled, or Boolean check 303, where the available power $P_{gen}$ from the generator 202 and the energy storage 204 is polled. If during the operation interval T the demanded electricity has already been delivered, the check 302 checks the SoC and either curtails the generation to zero (no charging of the energy storage 204) or directs the generation of the generator 202 to the energy storage 204 (curtail power to grid $P_{gen}$ to zero).
Respectively, a true clause from 301 leads to the logic check 303, which polls the power generation state of the generator 202. Should the generator 202 have a momentary power generation higher than the maximum power requirement of the grid coupler 206, the SoC is polled with a Boolean check 307 to decide whether the excess power is directed towards the battery 310 (false of 307) or whether the renewable energy controller 201 is told to curtail the generator power to the required maximum limit 311 (true of 307).

In case the momentary generation of the generator 202 fails to reach the maximum power requirement, a false statement from 303 polls the state of the remaining electricity $e_{remain}$ with the Boolean check 306. Should $e_{remain}$ be lower than the maximum electricity that can be delivered, $e_{remain\_max}$ (false of 306), the power towards the grid $P_{gen\_grid}$ is directed from the generator 202 to save the energy storage. In case of a true clause from 306, the SoC is again polled with the check 309, and should the energy storage 204 be empty at the current time t, the system is unable to meet the demanded power $P_{demand}$ (313). However, should the energy storage have energy left, the grid generation $P_{gen\_grid}$ is expanded to the maximum power with the generator and the excess available power is directed to charge the energy storage 204 (312).

Once all of the previously mentioned logic flow has been completed, the timestep is incremented to check whether the full operation interval has been completed (Boolean check 315). Should the cycle still have remaining steps, the inner feedback loop 320 is executed to improve the training for the cycle (true from 315); otherwise (false from 315) the variables s, a and r are calculated and stored in the buffer. The final Boolean check 317 verifies whether the agent should be trained with the new variables, namely when the condition $e_{exchange} = e_{delivered}$ is fulfilled, said condition signifying that the agent was successful in its task.
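The chain of Boolean checks 301 to 309 can be summarized as a small decision function. The Python sketch below is one possible reading of the described flow, not the logic of Fig. 3 itself: the names `Branch` and `select_branch` are hypothetical, the exact comparisons (for example, treating check 301 as true while $e_{remain}$ is positive, and the SoC thresholds for a full or empty storage) are assumptions, and the function only reports which branch is reached without computing the power setpoints.

```python
from enum import Enum


class Branch(Enum):
    CURTAIL_ALL = "302: storage full, curtail generation to zero"
    CHARGE_ONLY = "302: direct generation to storage, no power to grid"
    EXCESS_TO_STORAGE = "310: excess power directed to storage"
    CURTAIL_TO_LIMIT = "311: curtail generator to the maximum limit"
    GENERATOR_ONLY = "306 false: grid supplied by the generator alone"
    MAX_POWER = "312: grid power raised to the maximum"
    CANNOT_MEET_DEMAND = "313: demanded power cannot be met"


def select_branch(
    e_remain: float,
    e_remain_max: float,
    p_gen: float,
    p_max: float,
    soc: float,
    soc_full: float = 1.0,
    soc_empty: float = 0.0,
) -> Branch:
    """Return the branch reached for one timestep (assumed comparisons)."""
    storage_full = soc >= soc_full
    storage_empty = soc <= soc_empty

    # Check 301 (assumed): is there still electricity left to deliver?
    if e_remain <= 0:
        # Check 302: the SoC decides between curtailing and charging.
        return Branch.CURTAIL_ALL if storage_full else Branch.CHARGE_ONLY

    # Check 303: does momentary generation exceed the grid coupler limit?
    if p_gen > p_max:
        # Check 307: true means the storage cannot absorb the excess.
        return Branch.CURTAIL_TO_LIMIT if storage_full else Branch.EXCESS_TO_STORAGE

    # Check 306 (false): the remaining energy is below what full power could deliver.
    if e_remain < e_remain_max:
        return Branch.GENERATOR_ONLY

    # Check 309: an empty storage means the demand cannot be met.
    return Branch.CANNOT_MEET_DEMAND if storage_empty else Branch.MAX_POWER
```

Returning a labelled branch instead of numeric setpoints keeps the sketch limited to what the description states; the setpoint values for each branch would still have to be taken from the control logic of the embodiment.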
The outer feedback loop resets the operation cycle to t=0.

A device 100 may comprise means for performing any aspect of the method(s) described herein. According to an example embodiment, the means comprises at least one processor 101, and memory 102 comprising program code 103, the at least one processor 101 and program code 103 configured, when executed by the at least one processor 101, to cause performance of any aspect of the method.

The functionality described herein can be performed, at least in part, by one or more computer program product 103 components such as software components. According to an example embodiment, the device 100 comprises a processor 101 configured by the program code when executed to execute the example embodiments of the operations and functionality described. Alternatively, or in addition, the functionality described herein can be performed, at least in part, by one or more hardware logic components. For example, and without limitation, illustrative types of hardware logic components that can be used include Field-programmable Gate Arrays (FPGAs), Application-specific Integrated Circuits (ASICs), Application-specific Standard Products (ASSPs), System-on-a-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), and Graphics Processing Units (GPUs).

Any range or device value given herein may be extended or altered without losing the effect sought. Also any example embodiment may be combined with another example embodiment unless explicitly disallowed.

Although the subject matter has been described in language specific to structural features and/or acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as examples of implementing the claims and other equivalent features and acts are intended to be within the scope of the claims.
It will be understood that the benefits and advantages described above may relate to one example embodiment or may relate to several example embodiments. The example embodiments are not limited to those that solve any or all of the stated problems or those that have any or all of the stated benefits and advantages. It will further be understood that reference to ‘an’ item may refer to one or more of those items.

The operations of the methods described herein may be carried out in any suitable order, or simultaneously where appropriate. Additionally, individual blocks may be deleted from any of the methods without departing from the spirit and scope of the subject matter described herein. Aspects of any of the example embodiments described above may be combined with aspects of any of the other example embodiments described to form further example embodiments without losing the effect sought.

The term ‘comprising’ is used herein to mean including the method, blocks or elements identified, but that such blocks or elements do not comprise an exclusive list and a method or apparatus may contain additional blocks or elements.

It will be understood that the above description is given by way of example only and that various modifications may be made by those skilled in the art. The above specification, examples and data provide a complete description of the structure and use of exemplary embodiments. Although various example embodiments have been described above with a certain degree of particularity, or with reference to one or more individual example embodiments, those skilled in the art could make numerous alterations to the disclosed example embodiments without departing from the spirit or scope of this specification.