


Title:
MODEL-FREE REINFORCEMENT LEARNING SYSTEM AND METHOD
Document Type and Number:
WIPO Patent Application WO/2023/233222
Kind Code:
A1
Abstract:
A reinforcement learning, RL, control block (160) is configured to control a controller (140) of a non-linear system (120). The RL control block (160) includes an interface (162) configured to receive an amplitude p̂_ref of a state variable p_ref, wherein the state variable p_ref characterizes the non-linear system (120), and a processor (164) connected to the interface (162). The processor is configured to determine (714), in an RL policy block (110), an RL mapping π_θ based on a first neural network (112), using the amplitude p̂_ref as input, wherein the RL mapping π_θ results in an estimated control parameter θ̂_t; calculate (716), in a safety block (130), based on a second neural network (132), that a control parameter θ_t of the controller (140) is within a given range of the estimated control parameter θ̂_t; and apply (718) the control parameter θ_t to the controller (140) for maintaining a parameter of the non-linear system (120) within a desired range.

Inventors:
ALHAZMI KHALID MOHAMMED (SA)
SARATHY SUBRAM MANIAM (SA)
Application Number:
PCT/IB2023/054665
Publication Date:
December 07, 2023
Filing Date:
May 04, 2023
Assignee:
UNIV KING ABDULLAH SCI & TECH (SA)
International Classes:
G05B13/02
Other References:
H. Yu et al., "Towards Safe Reinforcement Learning with a Safety Editor Policy," arXiv.org, Cornell University Library, 20 May 2022, XP091217229
S. Alighanbari et al., "Safe Adaptive Deep Reinforcement Learning for Autonomous Driving in Urban Environments. Additional Filter? How and Where?," IEEE Access, vol. 9, 14 October 2021, pages 141347-141359, XP011885491, DOI: 10.1109/ACCESS.2021.3119915
D. C. McFarlane, K. Glover, "Robust controller design using normalized coprime factor plant descriptions," Springer, 1990
J. Li, A. S. Morgans, "Feedback control of combustion instabilities from within limit cycle oscillations using H-inf loop-shaping and the v-gap metric," Proc. R. Soc. A 472, 2016
S. Evesque, A. P. Dowling, A. M. Annaswamy, "Self-tuning regulators for combustion oscillations," Proc. R. Soc. London, Ser. A, vol. 459, 2003, pages 1709-1749
J. Moeck et al., "Two-parameter extremum seeking for control of thermoacoustic instabilities and characterization of linear growth," 45th AIAA Aerospace Sciences Meeting and Exhibit, 2007, page 1416
M. Krstic, H.-H. Wang, "Stability of extremum seeking feedback for general nonlinear dynamic systems," Automatica, vol. 36, no. 4, 2000, pages 595-601, XP027213870
R. S. Sutton, A. G. Barto, R. J. Williams, "Reinforcement learning is direct adaptive optimal control," IEEE Control Systems Magazine, vol. 12, no. 2, 1992, pages 19-22
T. P. Lillicrap, J. J. Hunt, A. Pritzel et al., "Continuous control with deep reinforcement learning," arXiv:1509.02971, 2015
V. Mnih, K. Kavukcuoglu, D. Silver et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, 2015, pages 529-533
B. Recht, "A tour of reinforcement learning: The view from continuous control," Annual Review of Control, Robotics, and Autonomous Systems, vol. 2, 2019, pages 253-279
M. G. Bellemare, S. Candido, P. S. Castro et al., "Autonomous navigation of stratospheric balloons using reinforcement learning," Nature, vol. 588, no. 7836, 2020, pages 77-82
E. Bohn, S. Gros, S. Moe, T. A. Johansen, "Reinforcement learning of the prediction horizon in model predictive control," IFAC-PapersOnLine, vol. 54, no. 6, 2021, pages 314-320
M. Sedighizadeh, A. Rezazadeh, "Adaptive PID controller based on reinforcement learning for wind turbine control," Proceedings of World Academy of Science, Engineering and Technology, vol. 27, 2008, pages 257-262
G. Dalal, K. Dvijotham, M. Vecerik, T. Hester, C. Paduraru, Y. Tassa, "Safe exploration in continuous action spaces," arXiv:1801.08757, 2018
T. Haarnoja, A. Zhou, K. Hartikainen et al., "Soft actor-critic algorithms and applications," arXiv:1812.05905, 2018
J. Li, A. S. Morgans, "Time domain simulations of nonlinear thermoacoustic behaviour in a simple combustor using a wave-based approach," J. Sound Vib., vol. 346, 2015, pages 345-360, XP029149749, DOI: 10.1016/j.jsv.2015.01.032
Claims:
WHAT IS CLAIMED IS:

1. A reinforcement learning, RL, control block (160) configured to control a controller (140) of a non-linear system (120), the RL control block (160) comprising: an interface (162) configured to receive an amplitude p̂_ref of a state variable p_ref, wherein the state variable p_ref characterizes the non-linear system (120); a processor (164) connected to the interface (162) and configured to, determine (714), in an RL policy block (110), an RL mapping π_θ based on a first neural network (112), using the amplitude p̂_ref as input, wherein the RL mapping π_θ results in an estimated control parameter θ̂_t, calculate (716), in a safety block (130), based on a second neural network (132), that a control parameter θ_t of the controller (140) is within a given range of the estimated control parameter θ̂_t, and apply (718) the control parameter θ_t to the controller (140) for maintaining a parameter of the non-linear system (120) within a desired range.

2. The RL control block of Claim 1, wherein the first neural network is different from the second neural network.

3. The RL control block of Claim 1, wherein the first neural network uses a reward function r, which depends on the state variable p_ref and the estimated control parameter θ̂_t, to estimate the estimated control parameter θ̂_t.

4. The RL control block of Claim 1, wherein the first neural network maps the state variable p_ref of the non-linear system to the estimated control parameter θ̂_t.

5. The RL control block of Claim 1, wherein the processor is further configured to: receive (704) a data set D that characterizes an operation of the non-linear system; and train (708) the second neural network with the data set D.

6. The RL control block of Claim 1, wherein the state variable is a pressure oscillation in a combustor of the non-linear system, which is an engine.

7. The RL control block of Claim 6, wherein the calculated parameter includes a gain and a time delay of the controller, and the controller is a phase shift controller of the engine.

8. The RL control block of Claim 7, wherein the parameter of the non-linear system is a thermoacoustic stability of a flame.

9. The RL control block of Claim 1, wherein the safety block is configured to apply a constraint to the state variable p_ref through the second neural network.

10. A method for reinforcement learning, RL, control of a controller (140) of a non-linear system (120), the method comprising: receiving (712) an amplitude p̂_ref of a state variable p_ref, wherein the state variable p_ref characterizes the non-linear system (120); determining (714), in an RL policy block (110), an RL mapping π_θ based on a first neural network (112), using the amplitude p̂_ref as input, wherein the RL mapping π_θ results in an estimated control parameter θ̂_t; calculating (716), in a safety block (130), based on a second neural network (132), that a control parameter θ_t of the controller (140) is within a given range of the estimated control parameter θ̂_t; and applying (718) the control parameter θ_t to the controller (140) for maintaining a parameter of the non-linear system (120) within a desired range.

11. The method of Claim 10, wherein the first neural network is different from the second neural network.

12. The method of Claim 10, wherein the first neural network uses a reward function r, which depends on the state variable p_ref and the estimated control parameter θ̂_t.

13. The method of Claim 10, further comprising: mapping, with the first neural network, the state variable p_ref of the non-linear system to the estimated control parameter θ̂_t.

14. The method of Claim 10, further comprising: receiving (704) a data set D that characterizes an operation of the non-linear system; and training (708) the second neural network with the data set D.

15. The method of Claim 10, wherein the state variable is a pressure oscillation in a combustor of the non-linear system, which is an engine.

16. The method of Claim 15, wherein the calculated parameter includes a gain and a time delay of a controller, and the controller is a phase shift controller of the engine.

17. The method of Claim 16, wherein the parameter of the non-linear system is a thermoacoustic stability of a flame.

18. The method of Claim 10, wherein the safety block is configured to apply a constraint to the state variable p_ref through the second neural network.

19. A non-transitory computer readable medium including computer executable instructions, wherein the instructions, when executed by a processor, implement a method for reinforcement learning, RL, control of a controller (140) of a non-linear system (120), the medium comprising instructions for: receiving (712) an amplitude p̂_ref of a state variable p_ref, wherein the state variable p_ref characterizes the non-linear system (120); determining (714), in an RL policy block (110), an RL mapping π_θ based on a first neural network (112), using the amplitude p̂_ref as input, wherein the RL mapping π_θ results in an estimated control parameter θ̂_t; calculating (716), in a safety block (130), based on a second neural network (132), that a control parameter θ_t of the controller (140) is within a given range from the estimated control parameter θ̂_t; and applying (718) the control parameter θ_t to the controller (140) for maintaining a parameter of the non-linear system (120) within a desired range.

20. The medium of Claim 19, wherein the state variable is a pressure oscillation in a combustor of the non-linear system, which is an engine, the calculated parameter includes a gain and a time delay of a controller of the engine, the controller is a phase shift controller, and the parameter of the non-linear system is a thermoacoustic stability of a flame.

Description:
MODEL-FREE REINFORCEMENT LEARNING SYSTEM AND METHOD

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] This application claims priority to U.S. Provisional Patent Application No. 63/347,587, filed on June 1, 2022, entitled "MODEL-FREE REINFORCEMENT LEARNING FOR CONTROLLING THERMOACOUSTIC COMBUSTION INSTABILITIES," the disclosure of which is incorporated herein by reference in its entirety.

BACKGROUND OF THE INVENTION

TECHNICAL FIELD

[0002] Embodiments of the subject matter disclosed herein generally relate to a learning-based adaptive mechanism for controlling parameters of a nonlinear system and an associated method for controlling, and more particularly, to a model-free reinforcement learning method that learns an effective parameter adaptation law of the system while maintaining a safe operation of the system.

DISCUSSION OF THE BACKGROUND

[0003] Many industrial processes and energy systems are operated at sub-optimal conditions. The highly nonlinear nature of certain phenomena (e.g., chemical reactions coupled to fluid mechanics) makes process control and optimization challenging. Modeling the dynamics in these systems is often inaccurate and incomplete, resulting in process disturbances. In addition, it is common for the system dynamics and control objectives to change with time. Thus, there is a need for a class of controllers that can learn online to control changing systems using data measured in real-time.

[0004] One approach to address this need is to estimate the system's parameters through system identification and then derive a control law, an approach known as indirect adaptive control. However, such an approach requires recomputing controls from identified systems at each step, which is inherently complex. Also, such an approach needs to know, ab initio, the structure of the models that describe such a system. Those models, even if known, are a source of inaccuracies. An alternative approach is known as direct adaptive control, in which the controller parameters are estimated directly; this approach is more attractive in some applications.

[0005] Of particular interest to the inventors are nonlinear problems with two characteristics: (1) a time-varying reference trajectory or set point, and (2) a lack of sufficiently accurate reduced-order models, which are required for the design of well-known direct adaptive controllers, such as H-infinity and self-tuning regulators. An instance of a problem with these features is the case of finding the optimum fuel-air ratio in gas turbine combustion systems susceptible to thermoacoustic instabilities. For these systems, proper control is required for ensuring stable, efficient, and low-emission performance. Such a task is challenging due to variations in fuel (or air) temperature and quality over time, for instance. While combustor models exist, no reduced-order model is sufficiently general, as each applies only to a narrow window of operating conditions.

[0006] A successful adaptive control method for the above-mentioned problems is extremum seeking control (ESC) [1]. ESC aims to find and maintain the optimum operating point by perturbing the plant/system and using the response to estimate the gradient of an objective function. ESC can be used as a controller or to tune the parameters of a working controller. Despite its success, ESC has several limitations. First, it requires continuous perturbation of the system. Second, careful selection of several ESC tuning parameters is required to achieve satisfactory performance. Such parameters include, but are not limited to, the cut-off frequencies of the high-pass and low-pass filters, the frequency and amplitude of the perturbation signal, an integrator gain, and a good initial guess. While ESC is theoretically model-free, a model is needed to select proper tuning parameters in practice.

[0007] An alternative method to ESC is reinforcement learning (RL), which can be seen as direct adaptive control, as discussed in [2]. RL has been extensively used in continuous control tasks, but most applications use RL as the controller itself [3]-[6], which is application specific.

[0008] Thus, there is a need for a new approach that does not rely on an existing model of the system to be controlled and that can be applied to general working controllers, regardless of their specific configuration. The new approach also needs to guarantee the required safety of the controlled system, as some systems cannot afford to operate under unsafe conditions.

SUMMARY OF THE INVENTION

[0009] According to an embodiment, there is a reinforcement learning, RL, control block configured to control a controller of a non-linear system. The RL control block includes an interface configured to receive an amplitude p̂_ref of a state variable p_ref, wherein the state variable p_ref characterizes the non-linear system, and a processor connected to the interface. The processor is configured to determine, in an RL policy block, an RL mapping π_θ based on a first neural network, using the amplitude p̂_ref as input, wherein the RL mapping π_θ results in an estimated control parameter θ̂_t; calculate, in a safety block, based on a second neural network, that a control parameter θ_t of the controller is within a given range of the estimated control parameter θ̂_t; and apply the control parameter θ_t to the controller for maintaining a parameter of the non-linear system within a desired range.

[0010] According to another embodiment, there is a method for reinforcement learning, RL, control of a controller of a non-linear system. The method includes receiving an amplitude p̂_ref of a state variable p_ref, wherein the state variable p_ref characterizes the non-linear system; determining, in an RL policy block, an RL mapping π_θ based on a first neural network, using the amplitude p̂_ref as input, wherein the RL mapping π_θ results in an estimated control parameter θ̂_t; calculating, in a safety block, based on a second neural network, that a control parameter θ_t of the controller is within a given range of the estimated control parameter θ̂_t; and applying the control parameter θ_t to the controller for maintaining a parameter of the non-linear system within a desired range.

[0011] According to yet another embodiment, there is a non-transitory computer readable medium including computer executable instructions, wherein the instructions, when executed by a processor, implement the method discussed above.

BRIEF DESCRIPTION OF THE DRAWINGS

[0012] For a more complete understanding of the present invention, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:

[0013] FIG. 1 is a schematic diagram of a safe reinforcement learning scheme for adjusting controller parameters associated with a non-linear system/plant;

[0014] FIG. 2 is a high-level diagram of a process for determining the control parameters and their values associated with a system;

[0015] FIG. 3 schematically illustrates an engine having a combustor and a controller;

[0016] FIG. 4 illustrates the model parameters that describe the combustor of FIG. 3;

[0017] FIG. 5 is a schematic diagram of another safe reinforcement learning scheme for adjusting controller parameters associated with a non-linear turbine;

[0018] FIG. 6 illustrates the RL problem components for the engine of FIG. 3;

[0019] FIG. 7 is a flow chart of a method for determining the control parameters of the controller of the engine of FIG. 3 and also for determining their values;

[0020] FIG. 8 illustrates the hyperparameters that are used to train an RL policy block;

[0021] FIG. 9 illustrates a learning curve for training a controller parameter adjusting policy with reinforcement learning;

[0022] FIG. 10 illustrates, in the top panel, unrestricted exploration for an RL control block of the non-linear system and, in the bottom panel, restricted exploration for the RL control block;

[0023] FIG. 11 illustrates the instability attenuation performance of an RL-based adaptation mechanism, showing the pressure oscillations with the adaptive scheme turned on and off, the controller parameters (K and τ) for both scenarios, and the change of the system parameter;

[0024] FIGs. 12A and 12B show the stabilization from within a limit cycle for a time delay of 1e-4 sec and 5e-3 sec, respectively;

[0025] FIG. 13 illustrates a comparison of the controller performance for different values of the tuning weight e;

[0026] FIG. 14 illustrates the performance of different controllers in attenuating pressure oscillations; and

[0027] FIG. 15 schematically shows a computing system in which the controller and/or methods discussed herein can be implemented.

DETAILED DESCRIPTION OF THE INVENTION

[0028] The following description of the embodiments refers to the accompanying drawings. The same reference numbers in different drawings identify the same or similar elements. The following detailed description does not limit the invention. Instead, the scope of the invention is defined by the appended claims. The following embodiments are discussed, for simplicity, with regard to an RL agent used as an adaptation mechanism for control of thermoacoustic combustion instabilities in a combustor. However, the embodiments to be discussed next are not limited to such a combustor, but may be applied to other nonlinear systems.

[0029] Reference throughout the specification to “one embodiment” or “an embodiment” means that a particular feature, structure or characteristic described in connection with an embodiment is included in at least one embodiment of the subject matter disclosed. Thus, the appearance of the phrases “in one embodiment” or “in an embodiment” in various places throughout the specification is not necessarily referring to the same embodiment. Further, the particular features, structures or characteristics may be combined in any suitable manner in one or more embodiments.

[0030] According to an embodiment, a novel approach uses RL as an adaptation mechanism for general working controllers. RL is not used herein as the controller, but to adapt the parameters of the controller. In one application, a safety layer that filters out unsafe actions by the RL policy is added to ensure that system constraints are not violated during learning. The RL is applied herein to a time-varying dynamic system controlled by an adjustable controller. In a time-varying dynamic system, the controller parameters need to be dynamically adjusted to achieve the desired performance.

[0031] To demonstrate the utility and effectiveness of the novel approach, the inventors have applied the RL to the control of thermoacoustic combustion instabilities. An aspect of great interest in gas turbine combustors is dynamic flame stability. The susceptibility of flames to become unstable due to the coupling of the unsteady heat release and the acoustic waves inside the combustor has been one of the challenges in developing modern high-efficiency low-emission gas turbine combustors in recent decades. Combustion instabilities are generally considered one of the highest risk items in new engine development. Introducing fuel variability with novel zero-carbon fuels (e.g., ammonia, hydrogen, etc.) amplifies the uncertainty in the combustor-acoustic interactions; therefore, flame dynamic stability is a hot issue in developing carbon-free gas turbine technologies.

[0032] Reinforcement learning has been used before in tuning the parameters of specific working controllers, such as proportional-integral-derivative (PID) and model predictive controllers (MPC) [7], [8]. As discussed later, a novel framework is introduced that applies to a broad class of adjustable controllers while also considering the safety implications of online system exploration by RL. Approaches that rely on recorded input and output data also exist. Further, other publications either do not consider the safety of the RL actions or include a penalty in the reward function, which means that applying RL for parameter tuning must be done on a case-by-case basis. All these approaches are undesirable.

[0033] On the other hand, the novel approach discussed next can be applied regardless of the controller of the system, with only weak assumptions. This novel framework is applied herein to the problem of attenuating thermoacoustic flame instabilities. Experiments performed with the new approach show that this novel framework performs as well as or better than model-free and model-based methods, such as extremum seeking controllers, self-tuning regulators, and H-infinity robust controllers.

[0034] The novel approach is now discussed in more detail. The problem statement for a controller is as follows. A general nonlinear model, which can be used to describe any nonlinear system, is given by equation (1):

x_{t+1} = f(x_t, u_t, d_t), y_t = h(x_t), (1)

where x_t is the state of the system, u_t is the control input, d_t is an unknown disturbance, y_t is the output, and f and h are smooth, unknown functions that describe the system. Consider the following real situation, which is described by a smooth control law u_t = κ(y_t; θ_t), parametrized by a parameter θ_t, such that the closed-loop system has equilibria corresponding to each θ_t. For this scenario, the problem is to select the θ_t that optimizes the control objective. For this scenario, it is assumed that a set of controller parameters exists under which the closed-loop system is stable and its performance is optimal or near-optimal for a performance measure. An objective of the novel approach discussed herein is to learn a controller's parameter adaptation law (policy) π such that θ_t = π(x_t).

This means that the policy π, when applied to the state x_t of the system 120, generates the control parameters θ_t of the controller 140. This also means that the method can work for any system, as long as the state x_t of the system is known. In other words, the policy π, which is represented in this embodiment by a neural network, can map the state of a system to the control parameters of the system's controller, even though the function f discussed above is unknown.

[0035] According to an embodiment, the novel controller parameter adaptation scheme/mechanism 100, see FIG. 1, is made of two parts: (1) a reinforcement learning policy block 110, which maps the state of the plant/system 120 to the controller 140's parameters, and (2) a safety layer block 130, which ensures that the controller parameters used by the controller 140, as updated by the RL policy block 110, do not lead to instabilities. The two blocks 110 and 130 are part of an RL control block 160, which may be implemented in a processor, as discussed later. The RL control block 160 may have an interface 162, which is configured to receive the state of the plant/system 120, and a processor 164, which hosts the blocks 110 and 130. A memory 166 may also be present in the RL control block 160 for storing instructions.
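
For illustration only, the block structure described in this paragraph can be sketched in Python as follows; the class and method names are hypothetical, and the patent does not prescribe any particular implementation:

# Minimal structural sketch of the scheme in FIG. 1 (hypothetical names).
class RLControlBlock:
    """Adapts the parameters of an existing controller; it is not the controller itself."""

    def __init__(self, policy_block, safety_block, controller):
        self.policy_block = policy_block   # RL policy block 110 (first neural network 112)
        self.safety_block = safety_block   # safety layer block 130 (second neural network 132)
        self.controller = controller       # working controller 140 of the plant/system 120

    def step(self, state):
        # Interface 162: the measured state of the plant/system 120 arrives here.
        theta_hat = self.policy_block.propose(state)          # estimated control parameters
        theta = self.safety_block.project(state, theta_hat)   # keep them within the safe range
        self.controller.set_parameters(theta)                 # apply to the controller 140
        return theta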

[0036] In one implementation of the scheme/mechanism 100, the reinforcement learning problem may be formulated as a constrained Markov decision process (CMDP). Those skilled in the art would understand that other decision processes may be used instead of the CMDP. For practical reasons, the CMDP is used herein. While a Markov decision process (MDP) introduces a single utility (reward or cost function) consisting of different objectives to be optimized, the CMDP considers a situation where one type of cost is to be optimized while keeping the other types of costs below some given bounds. In this embodiment, a tuple (X, U, f, r, {c_i : i ∈ [K]}, γ) defines the CMDP, where x, u, and f are defined as noted above with regard to equation (1), r : X × U → ℝ is a reward function that depends on x and u (the sign "×" is a generic operation), c_i : X × U → ℝ is an immediate-constraint function, where [K] is defined as {1, ..., K} and represents each constraint, and γ ∈ (0,1) is a discount factor. Also defined herein is c̄_i(x) as the immediate-constraint value per state.

[0037] An objective of this embodiment is to find a parametrized policy π_Φ that maps the plant 120's output/state to the controller 140's parameter θ by solving the following policy optimization problem:

max_Φ E[ Σ_t γ^t r(x_t, π_Φ(x_t)) ], (6a)

s.t. c̄_i(x_t) ≤ C_i for all i ∈ [K], (6b)

where E is the expected value, s.t. stands for "subject to," all constrained states are upper bounded by a constant C_i, and π_Φ is a policy represented by a first neural network 112 and parametrized by Φ, which represents the weights and biases of the neural network. Note that the operations of equations (6a) and (6b) are performed in the RL policy block 110, where the first neural network 112 might reside.

[0038] The optimization problem presented in equations (6a)-(6b) presents a significant challenge, as the RL agent requires exploration to acquire a satisfactory policy. The use of a penalty in the reward function r for being in an unsafe state is, by itself, inadequate for guaranteeing safety, because the RL agent needs to have sufficient experience in such states to recognize them. Hence, incorporating prior knowledge of the system dynamics before training is a solution when implementing model-free reinforcement learning algorithms.

[0039] Following the work in [9], a constraint function c_i(x_t, θ_t) is first linearized as follows:

c̄_i(x_{t+1}) ≈ c̄_i(x_t) + g(x_t; w_i)^T θ_t, (7)

where g(x_t; w_i) is a second neural network 132 parametrized by w_i. The second neural network 132 resides in the safety layer block 130. The second neural network 132 is different from the first neural network 112, as the weights used in these networks describe different parameters. Training of the second neural network 132 may be carried out by solving the following problem:

argmin over w_i of Σ over (x_t, θ_t, x_{t+1}) in D of [ c̄_i(x_{t+1}) − ( c̄_i(x_t) + g(x_t; w_i)^T θ_t ) ]^2. (8)

[0040] The data needed for training is described as a set of tuples D = {(x_t, θ_t, x_{t+1})} and can be obtained from simulations or experiments. The function g(x_t; w_i) represents the sensitivity of changes in the controlled states to the controls, using knowledge of the dynamics learned from data.
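
As an illustration of how the sensitivity model g(x_t; w_i) of equations (7)-(8) could be fitted to the data set D, a minimal PyTorch sketch is given below; the network size, optimizer settings, and the assumption that states and parameters are passed as fixed-length vectors are illustrative choices, not requirements of the method:

import torch
import torch.nn as nn

def train_sensitivity_model(D, c_bar, state_dim, param_dim, epochs=200, lr=1e-3):
    """Fit g(x; w) so that c_bar(x_t) + g(x_t; w)^T theta_t predicts c_bar(x_t+1).
    D is a list of (x_t, theta_t, x_t1) tuples of fixed-length vectors."""
    g = nn.Sequential(nn.Linear(state_dim, 100), nn.ReLU(),
                      nn.Linear(100, 100), nn.ReLU(),
                      nn.Linear(100, param_dim))
    opt = torch.optim.Adam(g.parameters(), lr=lr)
    x = torch.tensor([d[0] for d in D], dtype=torch.float32)
    th = torch.tensor([d[1] for d in D], dtype=torch.float32)
    x1 = torch.tensor([d[2] for d in D], dtype=torch.float32)
    c_now, c_next = c_bar(x), c_bar(x1)          # per-state constraint values
    for _ in range(epochs):
        pred = c_now + (g(x) * th).sum(dim=-1)   # linearized prediction of equation (7)
        loss = ((c_next - pred) ** 2).mean()     # least-squares objective of equation (8)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return g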

[0041] The action taken by the RL policy block 110 is considered to be θ̂_t = π_Φ(x_t). The safety layer block 130 is configured to solve the following problem at each state of the plant 120:

θ_t = argmin over θ of (1/2) ||θ − π_Φ(x_t)||^2, s.t. c̄_i(x_t) + g(x_t; w_i)^T θ ≤ C_i for all i ∈ [K], (9)

where the constraint determined earlier has been substituted. This safety layer aims to output a controller parameter θ_t that is as close as possible to the θ̂_t generated by the RL policy block 110, which is the original parameter determined by the RL algorithm. The previous optimization problem has a quadratic objective and a linear constraint, for which the global closed-form solution is readily obtainable and can be found in [9].
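
For the common case of a single constraint, the closed-form solution referred to above (see [9]) can be sketched as follows; the variable names are illustrative:

import numpy as np

def safety_layer(theta_hat, x, g_fn, c_bar, C):
    """Project theta_hat onto the linearized constraint c_bar(x) + g(x)^T theta <= C."""
    g = g_fn(x)                              # sensitivity g(x; w_i), a vector
    slack = c_bar(x) + g @ theta_hat - C     # predicted constraint violation for theta_hat
    lam = max(0.0, slack / (g @ g + 1e-12))  # optimal Lagrange multiplier (zero if already safe)
    return theta_hat - lam * g               # safe parameters theta_t, closest to theta_hat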

[0042] With these equations and conditions in place, a solution algorithm is selected for solving equations (6a) and (6b). In this embodiment, the RL algorithm that solves these two equations is selected to be the soft actor-critic (SAC), which is described in [10]. However, other RL algorithms may be used instead of the SAC. The SAC algorithm integrates three desired elements: (1) an actor-critic architecture with separate policy and value function networks, (2) an off-policy formulation that enables the reuse of previously collected data for efficiency, and (3) entropy maximization to encourage stability and exploration. The authors in [10] found SAC to be more stable and scalable than other common RL algorithms, such as the deep deterministic policy gradient (DDPG) [3]. SAC modifies the objective of the CMDP to include an entropy term, so the optimal policy is defined as follows:

π* = argmax over π of E[ Σ_t ( r(x_t, θ_t) + β H(π(·|x_t)) ) ], (10)

where β is a temperature parameter (not necessarily a temperature or related to a temperature) that determines the relative importance of the entropy term versus the reward, and the entropy of the policy, H(π(·|x_t)), is given by:

H(π(·|x_t)) = E[ −log π(θ_t | x_t) ], (11)

where the expectation is taken over θ_t drawn from the policy π.
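
A minimal sketch of evaluating the entropy-regularized objective of equation (10) for a diagonal Gaussian policy is given below; the temperature β and the batch contents are placeholders:

import torch
from torch.distributions import Normal

def soft_objective(rewards, means, stds, beta=0.2):
    """rewards: (T,) tensor of r(x_t, theta_t); means/stds parametrize pi(. | x_t)."""
    policy = Normal(means, stds)              # Gaussian policy over theta at each step
    entropy = policy.entropy().sum(dim=-1)    # H(pi(. | x_t)) per time step
    return (rewards + beta * entropy).mean()  # reward plus entropy bonus, as in equation (10)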

[0043] The entropy term is utilized as a regularizer to strike a balance between exploration and exploitation in the agent’s policy. A higher entropy implies a more exploratory policy, while a lower entropy indicates a more deterministic and exploitative policy. By incorporating entropy in the objective function, the algorithm encourages exploration in situations where the agent’s policy is uncertain about the optimal action to take. Furthermore, maximizing entropy helps to prevent the agent from prematurely converging to a suboptimal policy.

[0044] The selection of the reward function r(x_t, θ_t) depends on the control objective; an example of such a selection is discussed later when the control of thermoacoustic instabilities is addressed. A high-level Algorithm 1, illustrated in FIG. 2, schematically shows how the proposed scheme/mechanism 100 may be implemented in a computing device; in this algorithm, the parameter t_final is a preset simulation time. Unlike conventional adaptive controllers that require tuning for different dynamic systems, an RL algorithm, such as SAC, can learn an adaptation mechanism for different systems without modifying the hyperparameters.
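
A minimal sketch of the loop of Algorithm 1 is given below, assuming hypothetical helper objects for the environment, the SAC agent, and the safety layer; the individual steps are detailed in the next paragraph:

def run_adaptation(env, controller, agent, safety, reward_fn, episodes, t_final, dt):
    for _ in range(episodes):                      # steps 4-13: loop over episodes
        env.reset(random_parameters=True)          # step 5: random system parameters
        t = 0.0
        while t < t_final:                         # steps 6-12: loop over time
            x = env.observe()                      # step 7: plant output, e.g. the amplitude of p_ref
            theta_hat = agent.act(x)               # step 8: proposed controller parameters
            theta = safety.project(x, theta_hat)   # step 9: keep the parameters safe
            controller.set_parameters(theta)       # step 10: apply to the controller 140
            r = reward_fn(env.observe(), theta)    # step 11: reward fed back to the RL agent
            agent.update(x, theta, r)
            t += dt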

[0045] The method behind the algorithm in FIG. 2 starts in step 1 by selecting a controller 140 for the system to be controlled. Note that in this method, the controller 140 is different from the RL algorithm. In step 2, the tuning parameters θ_t of the selected controller that need to be controlled are selected. In step 3, the set of data D is collected, either from simulations or from actual tests of the system. The set of data D = {(x_t, θ_t, x_{t+1})} is used in the safety filter block 130 for learning a constraint function c_i(x_t, θ_t). This set of data is used with the second neural network 132. Next, the method performs a loop starting at step 4 and ending at step 13. This loop includes a step 5 of initiating the training environment with randomly generated parameters. In steps 6 to 12, a loop is performed at different times, from an initial zero time to a final time, for observing the plant 120 output in step 7, estimating in step 8 a set of estimated control parameters π_θ(x_t), which are mapped to the controller's parameters (for example, based on equation (10)), determining in step 9 whether the set of tuning parameters π_θ(x_t) is safe, based on equation (9), and if the set is safe, outputting the control parameters θ_t, applying in step 10 the control parameters θ_t to the controller 140, and calculating in step 11 a reward (based on equations (6a) and (6b)), which is then fed to the RL agent block 110. Note that step 11 takes place in the RL policy block 110 and uses the first neural network 112 for calculating the reward function r.

[0046] The above-discussed method is now applied for controlling the thermoacoustic combustion instabilities in an engine or turbine. Combustion instabilities arise due to the coupling of unsteady heat release and acoustic waves. The mechanism through which combustion instabilities take place is summarized as follows. Acoustic waves that are produced during the combustion process propagate through the combustion chamber and reflect from the chamber's boundaries with a time delay. These reflected pressure oscillations lead to perturbations in the flow field. If the acoustic disturbances gain more energy from combustion than they lose across boundaries, their amplitude increases until constrained by nonlinear effects. When the nonlinearity mainly affects the heat-release rate while acoustic waves remain linear, heat-release saturation or phase-change effects may cause the rate of energy gain and loss to become equal at a specific pressure amplitude. This amplitude corresponds to the occurrence of limit cycle oscillations.

[0047] Modeling the complex dynamics of a combustion system, including fluid dynamics, transport processes, chemical kinetics, flame kinematics, and heat transfer, can be challenging. In this embodiment, a combined numerical and analytical approach, developed by [11], is used to model the behavior of a laminar conical premixed flame. This nonlinear model can accurately predict a nonlinear flame response, making it well-suited for evaluating nonlinear control strategies.

[0048] To develop the model, a low Mach number flow is assumed, which allows the acoustics to be treated as linear. For low Mach number flows, the fractional velocity perturbation is significantly larger than the fractional pressure variation, as observed and experimentally confirmed in previous studies. This indicates that the primary source of nonlinearity in the system is the velocity field, which affects the heat release rate. Therefore, the acoustic waves are treated as linear, as their nonlinearity is negligible compared to the nonlinearity caused by changes in velocity. The model makes several other assumptions. It assumes that thermoacoustic oscillations are low-frequency and only involve longitudinal waves, that the fluids in the combustion chamber are perfect gases, that there is no noise produced by entropy waves formed during unstable combustion, and that the flame remains attached to the burner outlet.

[0049] The combustion chamber, which is schematically illustrated in FIG. 3 as element 310 and is part of an engine or turbine 300, is a cylindrical tube in the model, with x representing the distance along the tube's longitudinal axis X. The tube starts at x = x_0 = 0 and ends at x = x_2 = l. A laminar premixed ethylene-air flame 320 is located at x = x_1 = x_f, and a pressure sensor 330 is positioned at x = x_ref. FIG. 3 shows the air supply 340, the fuel supply 342, and the premixed fuel-air mixture 344. The controller 140 adjusts the amount of air and fuel in the premixed mixture 344. The continuity equations of mass, momentum, and energy across the flame zone result in the governing equations (12a) and (12b), in which Q̇ denotes the heat release rate per surface area, u is the flow velocity, p is the pressure, ρ is the density, and γ is the ratio of specific heat capacities. The upstream and downstream flow conditions are indicated by subscripts 1 and 2, respectively. To solve the governing equations, a flame model that relates the heat release rate Q̇ to the acoustic perturbations is needed. This relation is described by a nonlinear flame describing function, which assumes that the dependence on amplitude and frequency can be decoupled as the product of a linear flame transfer function F(s), where s is the complex frequency variable, and a nonlinear function that describes the saturation of the heat release rate. A steady uniform flow is denoted by an overbar, a small perturbation is denoted by a prime, a signal amplitude is denoted by | · |, and a hat indicates the Laplace transform. The most common linear flame model for unsteady combustion is the n-τ model, which is a model based on a time delay between acoustic perturbations and unsteady combustion. In this embodiment, the used flame model is an n-τ model with a first-order low-pass filter that empirically captures the flame response shape for weak velocity disturbances.

[0050] The linear flame transfer function is selected to be

F(s) = a_f e^(−s τ_f) / (1 + s/ω_c), (14)

where a_f is a flame interaction index that characterizes the strength of interaction between the heat release rate and velocity amplitudes, ω_c is the cut-off frequency of the filter, and τ_f is the time delay. The form of the saturation function is selected so that two coefficients, α and β, determine the shape of the nonlinear model (ξ is a dummy variable). Qualitatively, the relation between the heat release rate and u is linear for weak velocity perturbations. On the other hand, for stronger velocity perturbations, the gain decreases and the heat release rate begins to saturate. The time delay of the nonlinear model is described as the sum of the time delay for weak perturbations and a term that describes how the time delay changes as the perturbation amplitude changes.
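
Assuming the n-τ low-pass form of equation (14) as reconstructed above, the linear flame transfer function can be evaluated numerically as in the following sketch; the parameter values are illustrative only and are not taken from Table 1:

import numpy as np

def flame_transfer_function(omega, a_f=1.0, tau_f=2.0e-3, omega_c=2 * np.pi * 200.0):
    s = 1j * omega                                        # evaluate F(s) on the imaginary axis
    return a_f * np.exp(-s * tau_f) / (1.0 + s / omega_c)

omega = 2 * np.pi * np.linspace(140.0, 180.0, 5)          # thermoacoustic band cited in the text
F = flame_transfer_function(omega)
print(np.abs(F), np.angle(F))                             # gain and phase across the band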

[0051] The current embodiment employs an open-source tool for simulating the above low-order thermoacoustic model. Table 1 in FIG. 4 lists the model's parameters. These parameters are one set of parameters that may be used with this method. Those skilled in the art would understand that these parameters may be modified as fit, depending on the particular engine that is controlled. Also, the list of these parameters may be changed as the particular engine may be replaced with another engine. The solver time step is set to 1e-5 sec. Ethylene fuel is used in this embodiment, with a heating value of 47.2 MJ/kg, a mean equivalence ratio of 0.7, and a stoichiometric fuel-to-air ratio of 0.0678. One skilled in the art would understand that the method discussed herein works for any type of fuel. The frequency of thermoacoustic oscillations varies from 140 Hz to 180 Hz, depending on the flame time delay. The model's predictive capability is assessed by comparing it to experimental results reported in the field.

[0052] The engine or turbine 300 that is tested with the method illustrated in FIG. 2 has, in this embodiment, a controller that is configured to control a phase shift. Phase shift control refers to a control strategy that involves the manipulation of the phase relationship between the fuel injection and the pressure oscillation in the combustion chamber. The idea behind phase shift control is to introduce a controlled time delay in the fuel injection timing, such that the fuel injection occurs at a different phase angle relative to the pressure oscillation. By adjusting the phase relationship in this way, it is possible to influence the feedback mechanism that drives the combustion instability, and ultimately suppress or mitigate the instability. Equation (17) presents the explicit form of the phase-shift controller:

u(iω) = K e^(−iωτ) p(iω), (17)

where p is the pressure oscillation at p_ref, K is the gain, i is the unit imaginary number, ω is the angular frequency, and τ is the time delay. Generally, the parameters of a phase-shift controller, such as the gain and time delay, can be determined through either analytical methods (i.e., by mathematical analysis) or empirical approaches (i.e., by trial-and-error) for a specific operating condition. This embodiment assumes the presence of an actuator that relates the pressure perturbations to perturbations in the fuel injection. The actuator may be a valve with the characteristics of a mass-spring-damper system described by a second-order transfer function of the canonical form ω_n^2 / (s^2 + 2ζω_n s + ω_n^2), given in equation (18).

[0053] Experiments conducted by others on a full-scale engine fuel nozzle at realistic operating conditions demonstrate that an adaptive gain is required for the phase shift controller to work at both low- and high-power conditions. However, the control scheme discussed herein does not restrict the number of control parameters that can be adapted. Thus, this embodiment uses both the gain and the time delay as tunable controller parameters. More or fewer control parameters may be used.
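
A time-domain sketch of such a phase-shift controller, implemented as a delayed gain u(t) = K * p_ref(t − τ) acting on the measured pressure, is given below; the discretization of the delay and the online re-tuning method are implementation choices, not part of the disclosure:

from collections import deque

class PhaseShiftController:
    """u(t) = K * p_ref(t - tau), with K and tau adjustable online."""

    def __init__(self, K, tau, dt):
        self.dt = dt
        self.K = K
        self.delay_steps = max(1, int(round(tau / dt)))
        self.buffer = deque([0.0] * self.delay_steps, maxlen=self.delay_steps)

    def set_parameters(self, K, tau):
        self.K = K                                     # re-tuned gain
        steps = max(1, int(round(tau / self.dt)))      # re-tuned time delay, in samples
        if steps != self.delay_steps:
            self.delay_steps = steps
            self.buffer = deque([0.0] * steps, maxlen=steps)

    def update(self, p_ref):
        delayed = self.buffer[0]                       # pressure sample from tau seconds ago
        self.buffer.append(p_ref)
        return self.K * delayed                        # command sent to the fuel-injection actuator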

[0054] The method illustrated in FIG. 2 uses reinforcement learning to learn a policy/mapping that selects the controller parameters that minimize combustion instability. To apply this method to the engine 300, the parameter vector θ is denoted as θ = [K, τ]. With these notations, the control parameters in FIG. 1 become K, τ, the control signal becomes u_t, the output/state becomes p_ref, and an amplitude detection block 150 is added for measuring the amplitude of the output/state p_ref, which is p̂_ref. Thus, the state x in FIG. 1 now becomes p̂_ref, as illustrated in FIG. 5, and all the equations used by the method of FIG. 1 need to be modified to replace x with p̂_ref. The scheme/mechanism 100 illustrated in FIG. 5, when applied to the engine 300, has the components illustrated in Table 2 in FIG. 6, i.e., the state of the system is the magnitude of the pressure oscillations, which is measured in block 150, for example, with a pressure sensor; the control parameters are the gain K and the time delay τ; the transition function f is unknown and thus data about this function needs to be obtained from simulations or experiments (the data D discussed above); the reward function r is selected by the user (an example of such a function is discussed later); and the constraint is selected to be the maximum allowed pressure in the combustor. Using equations (6a) and (6b), the method finds a parametrized policy that maps the system state to the controller parameter θ.

[0055] When the method schematically illustrated in FIG. 2 is applied to the engine 300 illustrated in FIG. 3, the following steps take place. In step 700 (see FIG. 7), the type of controller is selected, and this controller 140 is used to regulate the time-varying combustion system. The controller is responsible for maintaining stable combustion conditions. For example, a phase shift controller 540 (see FIG. 5) is selected in this step. In step 702, the control parameters of the controller are identified; the values of these parameters will be adjusted by the RL algorithm. For the phase shift controller 540, this includes the gain K and the time delay τ. Adjusting the values of these control parameters allows the controller to adapt to changes in the combustion system 300.

[0056] In step 704, the method gathers/receives a dataset D consisting of past engine states and controller parameters. This dataset is used to train a constraint function (second neural network 132), which ensures the safety of the engine when exploring new parameter settings. In step 706, the method iterates over a set number of episodes for the reinforcement learning process. Each episode is an independent run of the RL agent block 110 in the simulated environment, allowing it to learn and improve its policy. At the beginning of each episode, in step 708, the method initializes the training environment with a new set of random system parameters. In this embodiment, the time delay τ_f is selected from [1e-4, 5e-3] seconds. This ensures that the RL agent block 110 is exposed to diverse scenarios and learns to adapt to different combustion conditions.

[0057] In step 710, the method iterates over a fixed number of time steps within each episode. At each time step, the RL agent 110 observes the current state of the engine 300, decides on an action (i.e., determines the control parameters), and receives feedback on its performance (through the reward function r). In step 712, the method monitors the current state of the combustion system at time step t. In other words, the method receives status information at the interface 162, from the plant/system 120. This information will be used by the RL agent 110 to determine the best action to take. In step 714, the RL agent's policy proposes a new set of controller parameters based on the current plant state. The policy is essentially a mechanism that maps the observed state to an action, which in this case is the selection of tuning parameters for the controller 540. In step 716, the safety layer block 130 evaluates the proposed control parameters for safety using the constraint function learned in step 704. If the proposed control parameters are deemed safe, they are returned as the updated control parameters θ_t. In step 718, the controller's tuning parameters are updated with the safe values determined by the safety layer. This allows the controller to optimally react to the current state of the combustion system. In step 720, the method calculates a reward, based on the performance of the updated controller, and provides this feedback to the RL agent block 110. The RL agent block 110 uses this reward signal to update its policy and improve its decision-making over time. In step 722, the training is stopped if a termination criterion is met.

[0058] The setup of the RL policy block 110 is now discussed. The dynamic system 300 is reset with a randomly generated system parameter (τ_f in equation (14)) at the beginning of each episode to simulate varying dynamics. An episode is a term used in reinforcement learning to refer to a single run of the environment with the agent. During an episode, the agent receives observations from the environment, takes actions based on its policy, and receives rewards or penalties in return. In this embodiment, τ_f is randomly selected from [1e-4, 5e-3] seconds. The input to the first neural network 112, which represents the RL policy block 110, is the magnitude of the pressure oscillations p̂_ref, while the outputs are the calculated (estimated) controller parameters θ̂_t. The calculated parameters are then used in the safety layer block 130 to determine the actual control parameters θ_t = [K, τ]. The RL policy neural network architecture includes, in this embodiment, two hidden layers for each actor and critic network; one layer has 400 units, and the other has 300 units. A different number of layers and/or units may be used. Table 3 in FIG. 8 presents the SAC algorithm hyperparameters. The values of these hyperparameters may be changed depending on the practical situation to which the method is applied.
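
For illustration, an actor network with the two hidden layers (400 and 300 units) described above, taking the pressure-oscillation amplitude as input and producing the two calculated parameters, could be sketched as follows; the tanh squashing and output scaling are assumptions, not part of the disclosed method:

import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, theta_low, theta_high):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, 400), nn.ReLU(),
                                 nn.Linear(400, 300), nn.ReLU(),
                                 nn.Linear(300, 2), nn.Tanh())          # raw output in [-1, 1]
        self.low = torch.tensor(theta_low, dtype=torch.float32)         # e.g. [K_min, tau_min]
        self.high = torch.tensor(theta_high, dtype=torch.float32)       # e.g. [K_max, tau_max]

    def forward(self, p_amplitude):
        u = self.net(p_amplitude)                                       # input: |p_ref|, shape (batch, 1)
        return self.low + 0.5 * (u + 1.0) * (self.high - self.low)      # scale to the parameter ranges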

[0059] The reward function r is selected to minimize the amplitude of the pressure fluctuations with minimum controller gain (to minimize actuation effort). In this embodiment, the reward defined by equation (19) is increased by ten if the magnitude of the pressure oscillations at time t is less than a predefined value δ, where e is a tuning weight for the controller action with a nominal value of 0.1, and δ is set to 2.5e-4.
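
One reward function consistent with this description is sketched below; the exact form of equation (19) is not reproduced in this text, so the penalty terms shown here are an assumption:

def reward(p_amplitude, K, e=0.1, delta=2.5e-4):
    r = -(abs(p_amplitude) + e * abs(K))   # minimize oscillations with minimum actuation effort
    if abs(p_amplitude) < delta:
        r += 10.0                          # bonus of ten when the amplitude drops below delta
    return r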

[0060] In this embodiment, the input to the constraint (safety layer block 130) neural network 132 is the magnitude of the pressure oscillations and the controller's parameters at time t, while the output is the magnitude of the pressure oscillations at time t + 1. This second neural network 132 may include three hidden layers with 100 units each and a Rectified Linear Unit (ReLU) layer. A ReLU layer is a type of activation function commonly used in artificial neural networks, which introduces non-linearity to the network by setting negative input values to zero and leaving positive input values unchanged. Other configurations of the second neural network 132 may be used. To learn the one-step predictor for the safety layer, the training data D is generated by randomly varying the tuning parameters, K = [3, 7], and the delay time τ_f = [1e-4, 5e-3] sec.
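
A sketch of the one-step predictor described in this paragraph, with three hidden layers of 100 ReLU units, an input of the pressure amplitude plus the two controller parameters, and a single output, is given below:

import torch.nn as nn

one_step_predictor = nn.Sequential(
    nn.Linear(3, 100), nn.ReLU(),   # input: [|p_ref| at t, K at t, tau at t]
    nn.Linear(100, 100), nn.ReLU(),
    nn.Linear(100, 100), nn.ReLU(),
    nn.Linear(100, 1),              # output: predicted |p_ref| at t + 1
)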

[0061] To minimize pressure oscillations, in one application, the pressure amplitude is observed by the RL agent block 110. The amplitude is determined by combining multiple Simulink blocks. A detect positive rise block is used to detect the rising edges of the pressure signal, while a maximum running resettable block is used to track its maximum value. A triggered subsystem block is used to output the maximum value of the pressure signal. By combining these blocks, it is possible to detect the amplitude of the pressure signal and adjust the system parameters to minimize the pressure oscillations.

[0062] Simulations have been performed on the engine 300 with the method of FIG. 7 and the following results have been obtained. FIG. 9 shows the learning curve of the RL agent block 110 with the SAC algorithm. The average reward at each episode increases as the training progresses. It is also observed that introducing a safety layer 130 does not negatively impact the learning performance of the RL agent block 110, which is consistent with the findings in [9]. Training is terminated when the average reward for the previous 50 episodes becomes equal to the current episode reward. This indicates that the reinforcement learning agent has reached a stable performance level, as there is no significant improvement in the reward over a considerable number of episodes.
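
A Python sketch of the amplitude-detection logic of paragraph [0061] (implemented in the embodiment with Simulink blocks) is given below; it latches the running maximum of the pressure signal at each detected positive rise, which is one way to realize the described block combination:

def detect_amplitude(pressure_samples):
    amplitude, running_max, previous = 0.0, 0.0, 0.0
    for p in pressure_samples:
        running_max = max(running_max, abs(p))   # running resettable maximum of the signal
        if previous <= 0.0 < p:                  # positive rise (rising zero crossing) detected
            amplitude = running_max              # latch the amplitude of the completed cycle
            running_max = 0.0                    # reset for the next cycle
        previous = p
    return amplitude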

[0063] FIG. 10 demonstrates the effectiveness of the safety layer block 130 during an early episode of training. The safety layer block 130 bounds the magnitude of the pressure oscillations to be within a given range, for example, [-0.06, 0.06], meaning that the safety layer will not allow the selection of controller parameters that lead to pressure oscillations outside of these bounds during the learning process. It can be seen that unrestricted exploration during the training of an RL policy leads to multiple instances of safety bound violations, while the safety layer ensures that exploration remains within the safety bounds.

[0064] To assess the effectiveness of the RL-based adaptation mechanism in maintaining thermoacoustic stability across a range of operating conditions, the inventors conducted several simulations. These simulations investigated the impact of the controller parameters (gain and time delay) on the thermoacoustic stability of the system at flame time delays of 0.8e-4 sec, 5.8e-4 sec, and 10e-4 sec, generated from 273 simulations. The thermoacoustic stability is, for this case, the parameter of the non-linear system (engine) that is desired to be maintained in a desired range. For a different non-linear system, another parameter is considered. It was found that different combinations of controller gain and time delays are required to stabilize the system at different flame time delays. Next, the inventors verified the effectiveness of the RL-based adaptation mechanism in mitigating thermoacoustic instability as the system parameters change. The system's time delay was varied from 1e-4 sec to 5e-3 sec, as shown in FIG. 11. The novel approach successfully maintained flame stability throughout this range. However, stabilizing a flame from within a limit cycle proved to be more challenging than maintaining a stable flame. The inventors observed that once the controller stabilizes the flame at the initial time delay, no further adjustments are needed to maintain stability as the time delay changes. In contrast, the same controller parameters are not effective in stabilizing the flame from within a limit cycle. To demonstrate this, the inventors considered two additional cases. FIGs. 12A and 12B show the stabilization from within a limit cycle at two widely different time delays. The first figure corresponds to a time delay of 1e-4 sec, while the second corresponds to a time delay of 5e-3 sec.

[0065] The inventors also investigated the influence of the reward function on the performance of the resulting controller by presenting two cases with different tuning weights for the controller action, e, as shown in equation (19). A larger value of e imposes a greater penalty on more substantial controller actions. As depicted in FIG. 13, with a higher e, thermoacoustic stability is achieved with lower gain. It is worth emphasizing that the relative magnitude of the terms within the reward function plays a critical role, rather than their absolute values. This underscores the need to carefully balance the contributions of different terms to achieve the desired controller performance.

[0066] To evaluate the robustness of the RL-based controller in the presence of noise, the inventors introduced white noise with a zero mean, a variance of 1e-5, and a sample time of 1e-5 to the magnitude of the pressure oscillations. The obtained results (not shown) indicate that the controller was able to stabilize the system.

[0067] Further, the inventors compared the RL-based scheme presented above to three other established controllers. The first is a model-based robust controller designed using the H-∞ loop-shaping Glover-McFarlane method and the v-gap metric (see D. C. McFarlane and K. Glover, "Robust controller design using normalized coprime factor plant descriptions," Springer, 1990, and J. Li and A. S. Morgans, "Feedback control of combustion instabilities from within limit cycle oscillations using H-inf loop-shaping and the v-gap metric," Proc. R. Soc. A 472 (2016) 20150821). The second controller is a model-based self-tuning regulator (STR), as described in S. Evesque, A. P. Dowling, and A. M. Annaswamy, "Self-tuning regulators for combustion oscillations," Proc. R. Soc. London, Ser. A 459 (2003) 1709-1749. The third controller is a model-free extremum seeking controller (ESC), as described in J. Moeck et al., "Two-parameter extremum seeking for control of thermoacoustic instabilities and characterization of linear growth," 45th AIAA Aerospace Sciences Meeting and Exhibit (2007) 1416. These strategies aim to design a controller that is stable and performs well in the presence of uncertainty or variations in system dynamics. However, as shown next, their performance deteriorates when there are significant system variations.

[0068] To compare the four control strategies (i.e., the novel RL-based system discussed with regard to FIG. 7, and the known STR, ESC, and H-∞ controllers), the system time delay was varied from 8x10^-5 to 1x10^-3. The flame model's time delay depends on operating conditions such as the equivalence ratio and preheat temperature. As a result, these simulations offer a meaningful comparison of the control strategies under varying conditions.

[0069] FIG. 14 demonstrates that the RL-based controller, derived from Algorithm 1 in FIG. 2, maintains a stable flame across a wide operating range. This indicates that an effective controller parameter adaptation policy was learned, leading to attenuated pressure oscillations at various time delays. As also shown in FIG. 14, the adaptive controller STR outperforms the robust controller H-∞. This highlights the fact that robust design methods provide fixed compensators that perform well within a specified range of system parameter variations. In contrast, adaptive design methods continuously learn and update the plant parameters online, adjusting the control law accordingly. The H-∞ and STR controllers are model-based, designed using specific knowledge of the flame dynamics. In contrast, the RL-based controller is data-driven.

[0070] Although the ESC controller is model-free, tuning its parameters can be challenging and requires trial-and-error. When attempting to adjust two parameters (gain and time delay), it was difficult to achieve satisfactory performance. However, focusing on adjusting only the time delay led to significantly improved performance, albeit at the expense of maintaining the maximum gain (K = 10), which may be undesirable.

[0071] The proposed RL scheme has a constant number of hyperparameters, regardless of the number of control parameters. In comparison, the ESC controller requires at least four additional tuning parameters for each extra controller parameter (integrator gain, frequency and amplitude of the modulating signals, and the cut-off frequency of the low-pass filter). Finally, in contrast to ESC, the RL-based strategy does not require persistent excitation of the system, offering an additional advantage in its implementation.

[0072] These results show that the RL based method can be a highly effective tool for mitigating combustion instabilities in time-varying systems. By using reinforcement learning to adjust the parameters of a phase-shift controller, it is possible to attenuate combustion instabilities in the presence of process noise and outperform other model-free and model-based methods. These results demonstrate the potential of reinforcement learning as a powerful approach for enabling the development of safer and more efficient carbon-free gas turbine technologies. The same method may be applied, as already discussed above, to other systems that have nonlinearities, not only in the engine field.

[0073] The method discussed above with regard to FIG. 7 may be implemented as now discussed. The method is for reinforcement learning, RL, control of a controller 140 of a non-linear system 120. The method includes a step 712 of receiving an amplitude p̂_ref of a state variable p_ref, wherein the state variable p_ref characterizes the non-linear system; a step 714 of determining, in an RL policy block, an RL mapping π_θ based on a first neural network, using the amplitude as input, wherein the RL mapping π_θ results in an estimated control parameter; a step 716 of calculating, in a safety block, based on a second neural network 132, that a control parameter θ_t of the controller is within a given range of the estimated control parameter; and a step 718 of applying the control parameter θ_t to the controller for maintaining a parameter of the non-linear system within a desired range.
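A minimal sketch of one control cycle covering steps 712-718 is given below. It assumes generic callables for the first and second neural networks and a controller object with a set_parameters method; these names, the stand-in lambdas, and the clipping-based safety projection are illustrative assumptions rather than the patent's implementation.

```python
import numpy as np

def rl_control_step(amplitude, policy_net, safety_net, controller, max_deviation):
    # Step 714: the RL mapping (first neural network) takes the received
    # amplitude as input and returns an estimated control parameter.
    theta_hat = policy_net(np.atleast_1d(amplitude))
    # Step 716: the safety block (second neural network) proposes a parameter,
    # and the result is clipped so it stays within the given range of the estimate.
    theta_t = np.clip(safety_net(theta_hat),
                      theta_hat - max_deviation,
                      theta_hat + max_deviation)
    # Step 718: the safe parameter is applied to the controller.
    controller.set_parameters(theta_t)
    return theta_t

# Example with stand-in components (for illustration only).
class _DummyController:
    def set_parameters(self, theta):
        self.theta = theta

theta_applied = rl_control_step(
    amplitude=0.02,
    policy_net=lambda a: 0.5 * a,   # stand-in for the first neural network
    safety_net=lambda th: th,       # stand-in for the second neural network
    controller=_DummyController(),
    max_deviation=0.1,
)
```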

[0074] The above-discussed procedures and methods may be implemented in a computing device as illustrated in FIG. 15. Hardware, firmware, software or a combination thereof may be used to perform the various steps and operations described herein. The computing device 1500, which may correspond to the RL control block 160, is suitable for performing the activities described in the above embodiments and may include a server 1501. Such a server 1501 may include a central processor (CPU) and/or graphics processor (GPU) 1502 (corresponding to the processor 164 in FIGs. 1 and 5) coupled to a random access memory (RAM) 1504 and to a read-only memory (ROM) 1506. ROM 1506 may also be other types of storage media that store programs, such as programmable ROM (PROM), erasable PROM (EPROM), etc. Processor 1502 may communicate with other internal and external components through input/output (I/O) circuitry 1508 and bussing 1510 to provide control signals and the like. Processor 1502 carries out a variety of functions as are known in the art, as dictated by software and/or firmware instructions. Interface 1508, which corresponds to the interface 162, may be connected to the output of the system 120 (see FIG. 1) or to the amplitude detection block 150 (see FIG. 5).

[0075] Server 1501 may also include one or more data storage devices, including hard drives 1512, CD-ROM drives 1514 and other hardware capable of reading and/or storing information, such as DVD, etc. In one embodiment, software for carrying out the above-discussed steps may be stored and distributed on a CD-ROM or DVD 1516, a USB storage device 1518 or other form of media capable of portably storing information. These storage media may be inserted into, and read by, devices such as CD-ROM drive 1514, disk drive 1512, etc. Server 1501 may be coupled to a display 1520, which may be any type of known display or presentation screen, such as LCD, plasma display, cathode ray tube (CRT), etc. A user input interface 1522 is provided, including one or more user interface mechanisms such as a mouse, keyboard, microphone, touchpad, touch screen, voice-recognition system, etc.

[0076] Server 1501 may be coupled to other devices, such as sources, detectors, sensors, engines, etc. The server may be part of a larger network configuration as in a global area network (GAN) such as the Internet 1528, which allows ultimate connection to various landline and/or mobile computing devices.

[0077] As described above, the apparatus 1500 may be embodied by a computing device. However, in some embodiments, the apparatus may be embodied as a chip or chip set. In other words, the apparatus may comprise one or more physical packages (e.g., chips) including materials, components and/or wires on a structural assembly (e.g., a baseboard). The structural assembly may provide physical strength, conservation of size, and/or limitation of electrical interaction for component circuitry included thereon. The apparatus may therefore, in some cases, be configured to implement an embodiment of the present invention on a single chip or as a single "system on a chip." As such, in some cases, a chip or chipset may constitute means for performing one or more operations for providing the functionalities described herein.

[0078] The processor 1502 may be embodied in a number of different ways. For example, the processor may be embodied as one or more of various hardware processing means such as a coprocessor, a microprocessor, a controller, a digital signal processor (DSP), a processing element with or without an accompanying DSP, or various other processing circuitry including integrated circuits such as, for example, an ASIC (application specific integrated circuit), an FPGA (field programmable gate array), a microcontroller unit (MCU), a hardware accelerator, a special-purpose computer chip, or the like. As such, in some embodiments, the processor may include one or more processing cores configured to perform independently. A multi-core processor may enable multiprocessing within a single physical package. Additionally or alternatively, the processor may include one or more processors configured in tandem via the bus to enable independent execution of instructions, pipelining and/or multithreading.

[0079] In an example embodiment, the processor 1502 may be configured to execute instructions stored in the memory device 1504 or otherwise accessible to the processor. Alternatively or additionally, the processor may be configured to execute hard coded functionality. As such, whether configured by hardware or software methods, or by a combination thereof, the processor may represent an entity (e.g., physically embodied in circuitry) capable of performing operations according to an embodiment of the present invention while configured accordingly. Thus, for example, when the processor is embodied as an ASIC, FPGA or the like, the processor may be specifically configured hardware for conducting the operations described herein. Alternatively, as another example, when the processor is embodied as an executor of software instructions, the instructions may specifically configure the processor to perform the algorithms and/or operations described herein when the instructions are executed. However, in some cases, the processor may be a processor of a specific device (e.g., a pass-through display or a mobile terminal) configured to employ an embodiment of the present invention by further configuration of the processor by instructions for performing the algorithms and/or operations described herein. The processor may include, among other things, a clock, an arithmetic logic unit (ALU) and logic gates configured to support operation of the processor.

[0080] As used herein, the term "neural network" includes the term "deep learning" and refers generally to a popular machine learning method. Three main architectures associated with deep learning are applicable to addressing at least some of the particular technical challenges associated with system environments: the feedforward neural network (FNN), the convolutional neural network (CNN), and the recurrent neural network (RNN). In some instances, these deep learning architectures have proven effective in addressing technical challenges associated with image processing.

[0081] Unlike a pure classifier that depends on manually designed features, such as an SVM, a CNN is considered to be an end-to-end wrapper classifier, at least in the sense that some CNN-based architectures are able to perform feature extraction based on the classification result and improve the performance of the machine learning model in a virtuous circle. As a complement to the capability of CNN-based architectures to capture significant features from a two-dimensional or three-dimensional matrix, an RNN has the potential of encoding long-term interactions within the input data, which is usually a one-dimensional vector, such as the encoding of English words. In some example implementations of embodiments of the invention discussed and otherwise disclosed herein, the advantages of the CNN and the RNN are combined by using the CNN to conduct feature extraction and dimensionality compression starting from the relevant raw two-dimensional encoding matrices, and by using the RNN to extract the states or parameters that characterize the studied system.
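The CNN-plus-RNN combination described in paragraph [0081] can be sketched as follows. This PyTorch example is illustrative only; the class name CnnRnnEncoder, the layer sizes, and the 16x16 dummy input are assumptions, and the sketch shows one plausible way for a small CNN to compress a two-dimensional matrix into a feature sequence that a recurrent layer then encodes into state estimates.

```python
import torch
import torch.nn as nn

class CnnRnnEncoder(nn.Module):
    def __init__(self, state_dim=4):
        super().__init__()
        self.cnn = nn.Sequential(               # feature extraction and compression
            nn.Conv2d(1, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.rnn = nn.GRU(input_size=8, hidden_size=16, batch_first=True)
        self.head = nn.Linear(16, state_dim)    # predicted system states/parameters

    def forward(self, x):                       # x: (batch, 1, H, W)
        features = self.cnn(x)                  # (batch, 8, H/2, W/2)
        batch, channels, h, w = features.shape
        seq = features.permute(0, 2, 3, 1).reshape(batch, h * w, channels)
        _, hidden = self.rnn(seq)               # encode the feature sequence
        return self.head(hidden[-1])            # (batch, state_dim)

# Example usage with a dummy 16x16 input matrix.
encoder = CnnRnnEncoder()
estimate = encoder(torch.zeros(1, 1, 16, 16))
```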

[0082] In overcoming some of the technical challenges associated with predicting the proper classification of a state, example embodiments of the invention discussed and otherwise disclosed herein address aspects of function prediction as a classification problem with a tree structure in the label space, which can be viewed and treated as a hierarchical classification challenge. By viewing the prediction of the classification of a state as both a multi-label classification challenge and as a multiclass classification challenge, three approaches to implementing a solution are possible: a flat classification approach, a local classifier approach, and a global classifier approach. Example implementations of embodiments of the invention disclosed and otherwise described herein reflect an advanced local classifier approach, at least in the sense that example implementations involve the construction of one classifier for each relevant internal node as part of the overall classification strategy.
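The local classifier approach described above, with one classifier per relevant internal node, can be sketched as a top-down walk over the label tree. The data structures, the example tree, and the ConstantClassifier stand-in below are assumptions made for illustration and are not the classifiers actually used.

```python
class ConstantClassifier:
    """Stand-in for a fitted per-node classifier; always picks one child."""
    def __init__(self, child):
        self.child = child
    def predict(self, x):
        return self.child

def predict_hierarchical(x, tree, classifiers, node="root"):
    # tree: internal node -> list of children; classifiers: internal node -> model
    while node in tree:                      # stop once a leaf label is reached
        node = classifiers[node].predict(x)  # local decision at this internal node
    return node

# Example: a two-level label tree with hypothetical labels.
tree = {"root": ["stable", "unstable"], "unstable": ["limit-cycle", "chaotic"]}
classifiers = {"root": ConstantClassifier("unstable"),
               "unstable": ConstantClassifier("limit-cycle")}
print(predict_hierarchical(x=None, tree=tree, classifiers=classifiers))
```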

[0083] It will be understood that, although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first object or step could be termed a second object or step, and, similarly, a second object or step could be termed a first object or step, without departing from the scope of the present disclosure. The first object or step and the second object or step are both objects or steps, respectively, but they are not to be considered the same object or step.

[0084] The terminology used in the description herein is for the purpose of describing particular embodiments and is not intended to be limiting. As used in this description and the appended claims, the singular forms "a," "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the term "and/or" as used herein refers to and encompasses any possible combinations of one or more of the associated listed items. It will be further understood that the terms "includes," "including," "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. Further, as used herein, the term "if" may be construed to mean "when" or "upon" or "in response to determining" or "in response to detecting," depending on the context.

[0085] The disclosed embodiments provide real-time control of highly nonlinear systems in an industrial environment. The proposed method is a learning-based adaptation law for adapting the controller parameters of nonlinear systems. The method applies model-free reinforcement learning to learn an effective parameter adaptation law while maintaining safe system operation by including a safety layer. It should be understood that this description is not intended to limit the invention. On the contrary, the embodiments are intended to cover alternatives, modifications and equivalents, which are included in the spirit and scope of the invention as defined by the appended claims. Further, in the detailed description of the embodiments, numerous specific details are set forth in order to provide a comprehensive understanding of the claimed invention. However, one skilled in the art would understand that various embodiments may be practiced without such specific details.

[0086] Although the features and elements of the present embodiments are described in the embodiments in particular combinations, each feature or element can be used alone without the other features and elements of the embodiments or in various combinations with or without other features and elements disclosed herein.

[0087] This written description uses examples of the subject matter disclosed to enable any person skilled in the art to practice the same, including making and using any devices or systems and performing any incorporated methods. The patentable scope of the subject matter is defined by the claims, and may include other examples that occur to those skilled in the art. Such other examples are intended to be within the scope of the claims.

References

The entire content of all the publications listed herein is incorporated by reference in this patent application.

[1] M. Krstic and H.-H. Wang, "Stability of extremum seeking feedback for general nonlinear dynamic systems," Automatica, vol. 36, no. 4, pp. 595-601, 2000.

[2] R. S. Sutton, A. G. Barto, and R. J. Williams, "Reinforcement learning is direct adaptive optimal control," IEEE Control Systems Magazine, vol. 12, no. 2, pp. 19-22, 1992.

[3] T. P. Lillicrap, J. J. Hunt, A. Pritzel, et al., "Continuous control with deep reinforcement learning," arXiv preprint arXiv:1509.02971, 2015.

[4] V. Mnih, K. Kavukcuoglu, D. Silver, et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529-533, 2015.

[5] B. Recht, "A tour of reinforcement learning: The view from continuous control," Annual Review of Control, Robotics, and Autonomous Systems, vol. 2, pp. 253-279, 2019.

[6] M. G. Bellemare, S. Candido, P. S. Castro, et al., "Autonomous navigation of stratospheric balloons using reinforcement learning," Nature, vol. 588, no. 7836, pp. 77-82, 2020.

[7] E. Bohn, S. Gros, S. Moe, and T. A. Johansen, "Reinforcement learning of the prediction horizon in model predictive control," IFAC-PapersOnLine, vol. 54, no. 6, pp. 314-320, 2021.

[8] M. Sedighizadeh and A. Rezazadeh, "Adaptive PID controller based on reinforcement learning for wind turbine control," in Proceedings of World Academy of Science, Engineering and Technology, Citeseer, vol. 27, 2008, pp. 257-262.

[9] G. Dalal, K. Dvijotham, M. Vecerik, T. Hester, C. Paduraru, and Y. Tassa, "Safe exploration in continuous action spaces," arXiv preprint arXiv:1801.08757, 2018.

[10] T. Haarnoja, A. Zhou, K. Hartikainen, et al., "Soft actor-critic algorithms and applications," arXiv preprint arXiv:1812.05905, 2018.

[11] J. Li and A. S. Morgans, "Time domain simulations of nonlinear thermoacoustic behaviour in a simple combustor using a wave-based approach," J. Sound Vib., vol. 346, pp. 345-360, 2015.