

Title:
ACTOR-CRITIC LEARNING AGENT PROVIDING AUTONOMOUS OPERATION OF A TWIN ROLL CASTING MACHINE
Document Type and Number:
WIPO Patent Application WO/2024/015601
Kind Code:
A1
Abstract:
A twin roll casting system comprises counter-rotating casting rolls having a nip between the casting rolls and capable of delivering cast strip downwardly from the nip, a casting roll controller configured to adjust at least one process control setpoint between the casting rolls in response to control signals, a cast strip sensor capable of measuring at least one parameter of the cast strip, and a controller coupled to the cast strip sensor to receive cast strip measurement signals from the cast strip sensor and coupled to the casting roll controller to provide control signals to the casting roll controller, the controller comprising a reinforcement learning (RL) Agent. The RL Agent further comprises a model-free actor-critic agent having a value function and a policy function, the RL Agent having been trained on a plurality of casting system operation datasets composed of casting runs executed by a plurality of different human operators.

Inventors:
RUAN JIANQI (US)
CHIU GEORGE T C (US)
SUNDARAM NEERA JAIN (US)
NOONING ROBERT GERARD (US)
PARKES IVAN DAVID (US)
BLEJDE WALTER N (US)
Application Number:
PCT/US2023/027813
Publication Date:
January 18, 2024
Filing Date:
July 14, 2023
Assignee:
NUCOR CORP (IL)
International Classes:
B22D11/06; B22D11/16; G05B13/02
Foreign References:
US20190091761A12019-03-28
US6085183A2000-07-04
US20180260692A12018-09-13
US20030116301A12003-06-26
US20090090480A12009-04-09
Claims:
What is claimed is:

1. A twin roll casting system, comprising: a pair of counter-rotating casting rolls having a nip between the casting rolls and capable of delivering cast strip downwardly from the nip; a casting roll controller configured to adjust at least one process control setpoint between the casting rolls in response to control signals; a cast strip sensor capable of measuring at least one parameter of the cast strip; and a controller coupled to the cast strip sensor to receive cast strip measurement signals from the cast strip sensor and coupled to the casting roll controller to provide control signals to the casting roll controller, the controller comprising a reinforcement learning (RL) Agent; the RL Agent further comprising a model-free actor-critic agent having a value function and a policy function, the RL Agent having been trained on a plurality of casting system operation datasets composed of casting runs executed by a plurality of different human operators.

2. The twin roll casting system of claim 1 wherein the RL Agent further comprises an advantage function which calculates an advantage value for a selected action as an immediate reward value for a selected action plus a discounted value of a subsequent state for the selected action minus a value of current state; and wherein the advantage value is used to train the policy function.

3. The twin roll casting system of claim 2 wherein the policy function is configured to evaluate the advantage function in a way that values an action from the plurality of casting system operation datasets having a negative advantage value over actions that are not found in the plurality of casting system operation datasets.

4. The twin roll casting system of claim 1 wherein the RL Agent further comprises an advantage function which calculates an advantage value for a selected action as an immediate reward value for a selected action plus a discounted value of a subsequent state for the selected action minus a value of current state; and wherein the natural exponent of the advantage value is used to train the policy function.

5. The twin roll casting system of claim 1, wherein the cast strip sensor comprises a thickness gauge that measures a thickness of the cast strip in intervals across a width of the cast strip.

6. The twin roll casting system of claim 1, wherein the process control setpoint comprises a force setpoint between the casting rolls; and wherein the parameter of the cast strip comprises chatter.

7. The twin roll casting system of claim 1, wherein the RL Agent further comprises a reward function calculating an immediate reward as a piecewise defined reward function based on weighted chatter and edge spike terms and user-defined thresholds for the chatter and edge spike parameters.

8. The twin roll casting system of claim 1 further comprising an advantage function which calculates an advantage value as an immediate reward value for a selected action plus a discounted value of a subsequent state for the selected action minus a value of current state; wherein the immediate reward is calculated by a reward function calculating an immediate reward as a weighted piecewise defined reward function based on user-defined thresholds for the chatter and edge spike parameters.

9. The twin roll casting system of claim 1, wherein the at least one parameter of the cast strip comprises chatter and at least one strip profile parameter.

10. The twin roll casting system of claim 9, wherein the at least one strip profile parameter is selected from the group consisting of edge bulge, edge ridge, maximum peak, and high edge flag.

11. The twin roll casting system of claim 1, wherein the policy function comprises a stochastic policy function.

12. The twin roll casting system of claim 1, wherein the policy function includes a dependency on a previous step’s action.

13. The twin roll casting system of claim 1, wherein for each step in an operation dataset, recurrence from the previous step is embedded to improve the actor training process.

Description:
ACTOR-CRITIC LEARNING AGENT PROVIDING

AUTONOMOUS OPERATION OF A TWIN ROLL CASTING MACHINE

BACKGROUND

[0001] Twin-roll casting (TRC) is a near-net shape manufacturing process that is used to produce strips of steel and other metals. During the process, molten metal is poured onto the surface of two casting rolls that simultaneously cool and solidify the metal into a strip at close to its final thickness. This process is characterized by rapid thermo-mechanical dynamics that are difficult to control in order to achieve desired characteristics of the final product. This is true not only for steady-state casting, but even more so during “start-up,” the transient period of casting that precedes steady-state casting. Strip metal produced during start-up often contains an unacceptable amount of defects. For example, strip chatter is a phenomenon where the casting machine vibrates around 35 Hz and 65 Hz. More specifically, the vibration causes variation in the solidification process and results in surface defects, as shown in Figures 1A and 1B. Chatter needs to be brought below an upper boundary before commercially acceptable strip metals can be made.

[0002] During both the start-up and steady-state casting processes, human operators are tasked with manually adjusting certain process control setpoints. During the start-up process, the operators' goal is to stabilize the production of the steel strip, including reducing chatter, as quickly as possible so as to minimize the length of the start-up period subject to certain strip quality metrics being satisfied, thus increasing product yield by minimizing process start-up losses. They do this through a series of binary decisions (turning switches on/off) and the continuous adjustment of multiple setpoints. In total, operators control over twenty switches and setpoints; for the latter, operators must determine when, and by how much, to adjust the setpoint.

[0003] Among the setpoints that operators adjust, the casting roll separation force setpoint (to be referred to as the “force setpoint” from here onward) is the most frequently adjusted setpoint during the start-up process. It may be adjusted tens of times in an approximately five-minute period. Operators consider many factors when adjusting the force setpoint, but foremost is the strip chatter, a strip defect induced by the natural frequencies of the casting machine.

[0004] Operators use various policies for adjusting the force setpoint. One is to consider a threshold for the chatter measurement; when the chatter value increases above the threshold, operators will start to decrease the force. However, individual operators use different threshold values based on their own experience, as well as factors including the specific grade of steel or width being cast. On the other hand, decreasing the force too much can lead to other quality issues within the steel strip; therefore, operators are generally trained to maintain as high a force as possible subject to chatter mitigation.

[0005] Attempts have been made to improve various industrial processes, including twin roll casting. In recent years, human-in-the-loop control systems have become increasingly popular. Instead of considering the human as an exogenous signal, such as a disturbance, human-in-the-loop systems treat humans as a part of the control system. Human-in-the-loop applications may be categorized into three main categories: human control, human monitoring, and a hybrid of these two. Human control is when a human directly controls the process; this may also be referred to as direct control. Supervisory control is a hybrid approach in which human operators adjust specific setpoints and otherwise oversee a predominantly automatically controlled process. Supervisory control commonly occurs in industry and has, up to now, been the predominant regime for operating twin roll casting machines. However, variation between human operators, for example in their personality traits, past experiences, skill level, or even their current mood, as well as varying, uncharacteristic process factors, continue to cause inconsistencies in process operation.

[0006] Modeling human behavior as a black box problem has been considered. More specifically, researchers agree that system identification techniques can be useful for modeling human behavior in human-in-the-loop control systems. These generally reference predictive models of human behavior and, subsequently, controller designs based on the identified models. The effectiveness of this approach of first identifying a model of the human’s behavior and then designing a model-based controller is dependent upon the available data. Disadvantageously, if the human data contains multiple distinct operator behaviors, due to significant variations between different operators, any identified model will likely underfit the data and lead to a poorly performing controller.

[0007] Moreover, proposed approaches have been aimed at characterizing the human operator’s role as a feedback controller in a system, but instead of modeling the human operator’s behavior, they identify an optimal control policy based on the system model. In other words, they do not directly learn from the policy used by experienced human operators. In some industrial applications, especially during highly transient periods of operation, such as process start-up, system modeling can be extremely difficult and not all control objectives can be quantified. Thus, automating such a process using model-based methods is not trivial; instead, a methodology is needed for determining the optimal operation policy according to both explicit control objectives and implicit control objectives revealed by human operator behavior.

SUMMARY

[0008] A twin roll casting system comprises a pair of counter-rotating casting rolls having a nip between the casting rolls and capable of delivering cast strip downwardly from the nip, a casting roll controller configured to adjust at least one process control setpoint between the casting rolls in response to control signals, a cast strip sensor capable of measuring at least one parameter of the cast strip, and a controller coupled to the cast strip sensor to receive cast strip measurement signals from the cast strip sensor and coupled to the casting roll controller to provide control signals to the casting roll controller, the controller comprising a reinforcement learning (RL) Agent. The RL Agent further comprises a model-free actor-critic agent having a value function and a policy function, the RL Agent having been trained on a plurality of casting system operation datasets composed of casting runs executed by a plurality of different human operators.

[0009] In some embodiments, the RL Agent further comprises an advantage function which calculates an advantage value for a selected action as an immediate reward value for a selected action plus a discounted value of a subsequent state for the selected action minus a value of current state; and the advantage value is used to train the policy function. In some embodiments, the policy function is configured to evaluate the advantage function in a way that values an action from the plurality of casting system operation datasets having a negative advantage value over actions that are not found in the plurality of casting system operation datasets.

[0010] In some embodiments, the RL Agent further comprises an advantage function which calculates an advantage value for a selected action as an immediate reward value for a selected action plus a discounted value of a subsequent state for the selected action minus a value of current state; and the natural exponent of the advantage value is used to train the policy function.

[0011] The cast strip sensor may comprise a thickness gauge that measures a thickness of the cast strip in intervals across a width of the cast strip. The process control setpoint may comprise a force setpoint between the casting rolls, and the parameter of the cast strip may comprise chatter.

[0012] In some embodiments, the RL Agent further comprises a reward function calculating an immediate reward as a weighted piecewise defined reward function based on user-defined thresholds for the chatter and edge spike parameters. In some embodiments, the RL Agent further comprises an advantage function which calculates an advantage value as the immediate reward value for a selected action plus a discounted value of a subsequent state for the selected action minus a value of current state.

[0013] The at least one parameter of the cast strip may comprise chatter and at least one strip profile parameter. The at least one strip profile parameter may be selected from the group consisting of edge bulge, edge ridge, maximum peak, and high edge flag.

[0014] The policy function may comprise a stochastic policy function. The policy function may further include a dependency on a previous step's action.

[0015] The data in an operational dataset may be augmented. In this embodiment, for each step in an operation dataset, recurrence from the previous step is embedded to improve the actor training process.

BRIEF DESCRIPTION OF THE DRAWINGS

[0016] Figure 1A is a strip profile without chatter defects.

[0017] Figure 1B is a strip profile with chatter defects.

[0018] Figure 2 is an illustration of a twin roll caster according to at least one aspect of the invention.

[0019] Figure 3 is an illustration of details of the twin roll caster illustrated in Figure 2.

[0020] Figure 4 is a graph of mean force trajectory of clusters of training datasets.

[0021] Figure 5A is a graph of examples of force trajectory of cluster 1 in Figure 4.

[0022] Figure 5B is a graph of examples of force trajectory of cluster 2 in Figure 4.

[0023] Figure 5C is a graph of examples of force trajectory of cluster 3 in Figure 4.

[0024] Figure 6A is a graph of maximum chatter amplitude spectrum of cluster 1 in Figure 4.

[0025] Figure 6B is a graph of maximum chatter amplitude spectrum of cluster 2 in Figure 4.

[0026] Figure 6C is a graph of maximum chatter amplitude spectrum of cluster 3 in Figure 4.

[0027] Figure 7 is a plot of an RL Agent's force setpoint value trajectory and the associated chatter trajectory.

[0028] Figure 8 is a plot comparing two RL Agents' force setpoint value trajectories and the associated chatter trajectory.

[0029] Figure 9 is a second plot comparing two RL Agents' force setpoint value trajectories and the associated chatter trajectory.

[0030] Figure 10 is a third plot comparing two RL Agents' force setpoint value trajectories and the associated chatter trajectory.

[0031] Figure 11 is a fourth plot comparing two RL Agents' force setpoint value trajectories and the associated chatter trajectory.

[0032] Figure 12 is a plot comparing an RL Agent's force setpoint value trajectories to an operator's force setpoint value trajectories and the associated chatter trajectory.

[0033] Figure 13 is a second plot comparing an RL Agent's force setpoint value trajectories to an operator's force setpoint value trajectories and the associated chatter trajectory.

[0034] Figure 14A is an illustration of thickness variation along a length of a cast strip.

[0035] Figure 14B is an illustration of ripple defects in a cast strip profile, including edge spike.

[0036] Figure 15 is a schematic of a cast strip cross section describing edge spike parameters.

[0037] Figure 16 is a graph of average silhouette width versus number of clusters for operational data sets.

[0038] Figure 17 shows silhouette width of each sample under different clustering settings.

[0039] Figure 18A illustrates an example of force trajectories in a first cluster.

[0040] Figure 18B illustrates an example of force trajectories in a second cluster.

[0041] Figure 19A illustrates RL Agent verification under a first case of initial edge spikes.

[0042] Figure 19B illustrates RL Agent verification under a second case of initial edge spikes, the second case having lower edge spikes than the first case.

[0043] Figure 20A illustrates RL Agent verification under a third case of initial edge spikes.

[0044] Figure 20B illustrates RL Agent verification under a fourth case of initial edge spikes, the fourth case having similar edge spikes to the third case.

[0045] Figure 21 is a simplified drawing of a twin roll caster showing the relationship of Roll Separation Force to cast steel strip.

[0046] Figure 22 illustrates trajectories of force and control objectives of human operator and RL Agent without augmented dataset control under a fifth case of operating conditions.

[0047] Figure 23 illustrates trajectories of force and thickness with corresponding losses of human operator and RL Agent without augmented dataset control under the fifth case of operating conditions.

[0048] Figure 24 illustrates trajectories of force and control objectives of human operator and RL Agent with augmented dataset control under a fifth case of operating conditions.

[0049] Figure 25 illustrates trajectories of force and thickness with corresponding losses of human operator and RL Agent with augmented dataset control under the fifth case of operating conditions.

DETAILED DESCRIPTION

[0050] Referring to Figures 2 and 3, a twin-roll caster is denoted generally by 11, which produces thin cast steel strip 12 which passes in a transient path across a guide table 13 to a pinch roll stand 14. After exiting the pinch roll stand 14, thin cast strip 12 passes into and through hot rolling mill 16 comprised of back up rolls 16B and upper and lower work rolls 16A, where the thickness of the strip is reduced. The strip 12, upon exiting the rolling mill 16, passes onto a run out table 17 where it may be force cooled by water (or water/air) jets 18, and then through pinch roll stand 20 comprising a pair of pinch rolls 20A and on to a coiler 19.

[0051] Twin-roll caster 11 comprises a main machine frame 21 which supports a pair of laterally positioned casting rolls 22 having casting surfaces 22A and forming a nip 27 between them. Molten metal is supplied during a casting campaign from a ladle (not shown) to a tundish 23, through a refractory shroud 24 to a removable tundish 25 (also called distributor vessel or transition piece), and then through a metal delivery nozzle 26 (also called a core nozzle) between the casting rolls 22 above the nip 27. Molten steel is introduced into removable tundish 25 from tundish 23 via an outlet of shroud 24. The tundish 23 is fitted with a slide gate valve (not shown) to selectively open and close the outlet 24 and effectively control the flow of molten metal from the tundish 23 to the caster. The molten metal flows from removable tundish 25 through an outlet and optionally to and through the core nozzle 26.

[0052] Molten metal thus delivered to the casting rolls 22 forms a casting pool 30 above nip 27 supported by casting roll surfaces 22A. This casting pool is confined at the ends of the rolls by a pair of side dams or plates 28, which are applied to the ends of the rolls by a pair of thrusters (not shown) comprising hydraulic cylinder units connected to the side dams. The upper surface of the casting pool 30 (generally referred to as the “meniscus” level) may rise above the lower end of the delivery nozzle 26 so that the lower end of the delivery nozzle 26 is immersed within the casting pool.

[0053] Casting rolls 22 are internally water cooled by coolant supply (not shown) and driven in counter rotational direction by drives (not shown) so that shells solidify on the moving casting roll surfaces and are brought together at the nip 27 to produce the thin cast strip 12, which is delivered downwardly from the nip between the casting rolls.

[0054] Below the twin roll caster 11, the cast steel strip 12 passes within a sealed enclosure 10 to the guide table 13, which guides the strip through an X-ray gauge used to measure strip profile to a pinch roll stand 14 through which it exits sealed enclosure 10. The seal of the enclosure 10 may not be complete, but is appropriate to allow control of the atmosphere within the enclosure and access of oxygen to the cast strip within the enclosure. After exiting the sealed enclosure 10, the strip may pass through further sealed enclosures (not shown) after the pinch roll stand 14.

[0055] A casting roll controller 94 is coupled to actuators that control all casting roll operation functions. One of the controls is the force setpoint adjustment. This determines how much force is applied to the strip as it is being cast and solidified between the casting rolls. Oscillations in feedback from the force actuators are indicative of chatter. Force actuator feedback may be provided to the casting roll controller or logged by separate equipment or software.

[0056] A controller 92 comprising a trained RL Agent is coupled to the casting roll controller 94 by, for example, a computer network. The controller 92 provides force actuator control inputs to the casting roll controller 94 and receives force actuator feedback. The force actuator feedback may be from commercially available data logging software or the casting roll controller 94.

[0057] In some embodiments, before the strip enters the hot roll stand, the transverse thickness profile is obtained by thickness gauge 44 and communicated to controller 92.

[0058] The present invention avoids disadvantages of known control systems by employing a model-free reinforcement learning engine, such as a deep Q network (DQN) that has been trained on metrics from a manually controlled process, including operator actions and casting machine responses, as the RL Agent in controller 92. A DQN is a neural network that approximates the action value of each state-action pair.

[0059] In a first embodiment provided below, the configuration and training of an RL Agent having one action and a reward function having one casting machine quality metric is provided. However, this is for clarity of the disclosure, and additional actions and casting machine feedback responses may be incorporated in the RL Agent. Additional actions include rolling mill controls. Additional metrics may include cast strip profile measurements and flatness measurements, for example. Also, while the various embodiments disclosed herein use an RL Agent as an example, other model-free adaptive and/or learning agents may also be suitable and may be substituted therefor in any of the disclosed embodiments.

[0060] In a first embodiment, the DQN is a function mapping the state to the action values of all actions in the action set, as shown in Equation 1, q_k = Q(S_k), where Q is the neural network, S_k is the state information of a sample, and q_k corresponds to the action values of the N elements in the action set.
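For illustration only, the following minimal sketch shows a network of the kind Equation 1 describes: a small fully connected network mapping the four-dimensional state to N = 4 action values. The layer sizes, the use of PyTorch, and the numeric values are assumptions made for the sketch, not details taken from the disclosure.

```python
# Minimal sketch of Equation 1, q_k = Q(S_k): a small fully connected network
# mapping the 4-dimensional state [C, dC, F, dF] to N = 4 action values.
# Layer sizes and example numbers are illustrative assumptions.
import torch
import torch.nn as nn

N_ACTIONS = 4          # three force-reduction rates + "keep force unchanged"
STATE_DIM = 4          # chatter, change in chatter, force, change in force

q_network = nn.Sequential(
    nn.Linear(STATE_DIM, 32),
    nn.ReLU(),
    nn.Linear(32, 32),
    nn.ReLU(),
    nn.Linear(32, N_ACTIONS),   # one action value per admissible action
)

state = torch.tensor([[0.42, 0.03, 1200.0, -5.0]])  # hypothetical sample S_k
action_values = q_network(state)                    # q_k, shape (1, N_ACTIONS)
print(action_values)
```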

[0061] In some embodiments, the state at time step i is defined as S_i = [C, dC, F, dF], where C and dC are the chatter and change in chatter over one time step, respectively, and F and dF are the force and change in force over one time step, respectively. In some embodiments, the casting data is recorded at 10 Hz. The force setpoint adjustment made by operators may be downsampled to 0.2 Hz based on the observation that operators generally do not adjust the force setpoint more frequently than this. Given the noise characteristics of the chatter signal, every 50 consecutive samples may be averaged (i.e., average chatter over a 5 second period) to obtain the chatter value C. In some embodiments, non-overlapping 5 second blocks are used. Two index subscripts are used to represent a data sample, namely i and k. The time index i denotes the time step within a single cast sequence. The sample index k denotes the unique index of a sample in the dataset, which contains samples from all cast sequences.
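A minimal preprocessing sketch of the downsampling and averaging described above follows; the array names and example signals are hypothetical, and only the 10 Hz logging rate, the 50-sample (5 second) blocks, and the [C, dC, F, dF] state layout are taken from the text.

```python
# Sketch of the state construction described above: 10 Hz chatter and force
# logs are averaged over non-overlapping 5 s (50-sample) blocks, and the
# per-block deltas dC and dF are formed. Variable names are illustrative.
import numpy as np

def build_states(chatter_10hz, force_10hz, block=50):
    """Return an array of states [C, dC, F, dF] sampled at 0.2 Hz."""
    n_blocks = min(len(chatter_10hz), len(force_10hz)) // block
    c = chatter_10hz[:n_blocks * block].reshape(n_blocks, block).mean(axis=1)
    f = force_10hz[:n_blocks * block].reshape(n_blocks, block).mean(axis=1)
    dc = np.diff(c, prepend=c[0])   # change in chatter over one time step
    df = np.diff(f, prepend=f[0])   # change in force over one time step
    return np.column_stack([c, dc, f, df])

# Hypothetical 10 Hz signals from one cast sequence
rng = np.random.default_rng(0)
chatter = np.abs(rng.normal(0.3, 0.05, 3000))
force = 1200.0 + np.cumsum(rng.normal(0, 0.5, 3000))
states = build_states(chatter, force)
print(states.shape)   # (60, 4): one state every 5 seconds
```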

[0062] In some embodiments, the action is defined as the change in the force setpoint value between the current time step and the next time step. Unlike the state, which is continuous-valued, the action is chosen from a discrete set {A_i}, i = 1, 2, ..., N. In the problem considered here, N = 4; there are three frequently used force reduction rates and the last action stands for keeping the force value unchanged.

[0063] In reinforcement learning (RL), the reward reflects what the user values and what the user avoids. In the context of using RL to design a policy for adjusting a process setpoint, there are two types of information that can be used: 1) the behavior of “expert” operators and 2) performance metrics defined explicitly in terms of the states. Each plays a distinct role in defining a reward function that incentivizes the desired behavior.

[0064] Given that human operators may control this process based on general rules of thumb and their individual experience with the process, a reward function that aims to emulate the behavior of operators is a way to capture their expertise without needing a model of their decision-making. On the other hand, if the reward function were designed only to emulate their behavior, then the trained RL Agent would not necessarily be able to improve upon the operators’ actions. To do the latter, it is useful to consider a second component of the reward function that places value on explicit performance metrics. For example, in the force setpoint adjustment problem addressed in this first embodiment, the desired performance objectives are a short start-up time below some upper bound and a low chatter level below some upper bound, as discussed below.

[0065] In some embodiments, implicit characterization of performance objectives includes the following. To better characterize different force setpoint adjustment behaviors, a k-means clustering algorithm may be applied to cluster over 200 individual cast sequences based on the force setpoint trajectory implemented by operators during each cast. For a given metal grade and strip width, all of the cast sequences represent the same metal grade and strip width to ensure that differences identified through clustering are a function of the behaviors of the human operator working during each casting campaign for that grade and width.

[0066] Additional grades and widths may be characterized in a similar fashion. Alternatively, additional grades and widths can use the same trained RL Agent, but with different starting points assigned to the different grades and widths.

[0067] In the example herein, the force setpoint adjustment behavior is characterized by a 500-second period force setpoint trajectory after an initial, automatic adjustment. In one example, among the available cast data sequences, a total of 6 different operators' behavior is represented. During a given cast, the process is operated by a crew of 2 operators, with one responsible for the force setpoint adjustments. To account for distinct force setpoint adjustment behaviors by different crews, training data sets are clustered and preferred behaviors identified. In some embodiments, k ∈ {3, 4, 5, 6} for the k-means algorithm. The clustering result is the most stable for k = 3 for the data set in this example. Only 2% of the cast sequences keep shifting from one cluster to another. Other values of k may be appropriate for other data sets. Figure 4 shows the mean force trajectories, computed by averaging each time step's value in the force trajectories of each cluster, separately. Figures 5(a)-5(c) show examples from each of the three clusters. Figures 6(a)-6(c) show histograms of chatter amplitude for each of three clusters. According to Table I, Cluster 3 has the shortest mean start-up time but not the smallest start-up time variation; Cluster 1 has the smallest start-up time variation but not the shortest mean start-up time.
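The clustering step above can be sketched as follows; the 100-point trajectory length (500 seconds at 0.2 Hz), the use of scikit-learn, and the random stand-in data are illustrative assumptions rather than details of the disclosure.

```python
# Sketch of clustering operator force-setpoint behavior with k-means, assuming
# each cast sequence is summarized by a fixed-length force trajectory
# (here 100 points = 500 s at 0.2 Hz). scikit-learn is used purely for
# illustration; the patent does not name a specific toolkit.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Hypothetical stand-in for >200 cast sequences x 100 trajectory points
force_trajectories = rng.normal(1200.0, 30.0, size=(200, 100))

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(force_trajectories)

# Mean force trajectory of each cluster (compare Figure 4)
for c in range(3):
    mean_traj = force_trajectories[labels == c].mean(axis=0)
    print(f"cluster {c}: {np.sum(labels == c)} casts, "
          f"mean final force {mean_traj[-1]:.1f}")
```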

[0068] Cluster 3 is also characterized by the most aggressive setpoint adjustment behavior, both in terms of the rate at which the force setpoint is decreased as well as the total magnitude by which it is decreased. Another feature of the cast sequences belonging to Cluster 3 is that they cover a wider range of force setpoint values due to the aggressive adjustment of the setpoint. Cluster 3 is preferred because it has the shortest average start-up time and the lowest overall chatter level among the three force behavior clusters.

[0069] In addition to rewarding emulation of certain operator setpoint adjustment behaviors, the reward function should explicitly incentivize desired performance metrics. With respect to achieving a short start-up time, it is important to equally reward or penalize each time step, because it is not known whether decisions made near the start of the cast do or do not lead to a short start-up time. To emphasize that cast sequences with different start-up times should be rewarded differently, in some embodiments, the time reward for each step is an exponential function of the start-up time relative to the upper bound on the start-up time deemed acceptable by the user. The exponential function leads to an increasing penalty rate as the sequence start-up time approaches the upper bound.

[0070] In this embodiment, the second performance objective is to maintain a chatter value below some user-defined threshold. Therefore, a maximum acceptable chatter value is defined; if the chatter value is lower than this threshold, there is no chatter penalty assigned to that step. Mathematically, the chatter reward can be expressed as a penalty applied only when the chatter exceeds the threshold. Decreasing the force too much, at the expense of decreasing chatter, can lead to other quality issues with the steel strip. Therefore, a lower bound on the acceptable force is also enforced.

[0071] The total reward function is shown in Equation 2:

[0072] In addition to the implicit and explicit performance objectives described above, a constant reward is applied at each sample using the first term of the reward function. According to the casting campaign records, it may be observed that the operators often refrain from decreasing the force setpoint at a given time step when both the chatter value and start-up time are within acceptable levels at a given sample. To incentivize the RL Agent to learn from this behavior, a constant reward is assigned to each sample obtained from operators' cast records. If, for a sample, the sum of both time and chatter penalties (negative rewards) is less than the constant, the net reward of this sample is still positive. Furthermore, to emphasize that there is a specific type of behavior that is desirable for the RL Agent to learn from, an extra constant reward may be assigned to samples in a cast sequence from the preferred cluster of force behavior, and the net reward of each of these samples will be positive. Associated with a modified training algorithm below, these positive net rewards motivate the RL Agent to follow the operator's behavior under certain situations.
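Because Equation 2 itself is not reproduced above, the following is only a hedged sketch of the reward structure described in paragraphs [0069] through [0072]: an exponential start-up time penalty, a chatter penalty applied above a threshold, a constant per-sample reward, and an extra constant for samples from the preferred cluster. The functional forms, weights, and names are assumptions.

```python
# Hedged sketch of the reward structure described in [0069]-[0072]; the exact
# form of Equation 2 is not reproduced in the text, so the functional forms
# and weights below are illustrative assumptions only.
import numpy as np

def step_reward(chatter, startup_time, t_ub, c_ub,
                base_reward=1.0, preferred_cluster=False, bonus=0.5):
    # Exponential time penalty: grows as the cast's start-up time
    # approaches the user-defined upper bound t_ub.
    time_penalty = -np.exp(startup_time / t_ub - 1.0)
    # Chatter penalty: applied only when chatter exceeds the threshold c_ub.
    chatter_penalty = -max(chatter - c_ub, 0.0)
    # Constant reward for every sample taken from operator records,
    # plus an extra constant for samples from the preferred cluster.
    reward = base_reward + (bonus if preferred_cluster else 0.0)
    return reward + time_penalty + chatter_penalty

print(step_reward(chatter=0.6, startup_time=420.0, t_ub=500.0, c_ub=0.5,
                  preferred_cluster=True))
```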

[0073] In a typical DQN training process, the RL Agent executes additional trials based on the updated value function and collects more data from new trials. However, the expense of operating an actual twin roll strip steel casting machine, including the materials consumed and produced, renders training the RL Agent by executing trials on an actual casting machine infeasible. In this case, all available samples from operator-controlled casting campaigns are collected from the casts to train the value function Q in each training step. Training may be continued on an actual operating casting machine.

[0074] In some embodiments, the DQN is initialized and trained using a MATLAB deep learning toolbox. However, other reinforcement learning networks and tools may be used. Specifically, as shown in Algorithm 1, the train() function is employed, and the states of all samples are used as network inputs and their corresponding action values q_k are used as labels to train the parameter set φ of the value function.

Algorithm 1. Pseudocode of deep Q-network learning process (modified version)

[0075] Another modification in the training process is the update of the action values q_k. q_k is a 1-by-N vector, and each entry of it represents the action value of one action option, as shown in Equation 3, q_k = R_k · a_onehot + (1 − d) · γ · max Q(S_{k+1}) · ones, where a_onehot is the one-hot encoding of the action (a 1-by-N vector with the entry of the selected action being one and the rest being zeros), d is a binary indicator indicating whether the current state is the terminal state of a trajectory, γ is the discount factor, ones is a 1-by-N vector with all entries being ones, and S_{k+1} is the state one time step after the current state. This equation updates the action value of the selected action as the sum of the immediate reward and a discounted maximum value of the state at the next time step. However, for those actions not being selected, instead of approximating their action values by using the value function from the previous iteration, their action values are set as zero plus the discounted maximum value of the next state. This update works more like a labeling process of a classification problem. If the immediate reward is positive, the trained Agent is more likely to act as the operator does, and increasing the immediate reward raises the likelihood of emulating the operator's behavior. Conversely, if the immediate reward is negative, the action selected by the operator is less likely to be selected than the other N−1 actions not being selected. In addition, the likelihood of selecting each of the N−1 actions increases equally.
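A minimal sketch of this modified labeling step follows; it mirrors the reconstruction of Equation 3 above, and the discount factor, shapes, and example values are assumptions.

```python
# Sketch of the modified action-value labeling of Equation 3. The selected
# action is labeled with R + gamma * max Q(S') while unselected actions are
# labeled with 0 + gamma * max Q(S'). Names and shapes are illustrative.
import numpy as np

def q_labels(reward, action_index, next_q_values, done, gamma=0.95, n_actions=4):
    onehot = np.zeros(n_actions)
    onehot[action_index] = 1.0                     # a_onehot
    bootstrap = (1.0 - done) * gamma * np.max(next_q_values)
    # Selected action: immediate reward + discounted next-state value.
    # Unselected actions: zero + discounted next-state value.
    return reward * onehot + bootstrap * np.ones(n_actions)

next_q = np.array([0.2, 0.5, 0.1, 0.4])            # Q(S_{k+1}) from previous iteration
print(q_labels(reward=1.2, action_index=2, next_q_values=next_q, done=0.0))
```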

[0076] By combining the DQN with a greedy policy and selecting the most valuable action under each given state, the trained RL Agent can adjust the force setpoint. The RL Agent is asked to provide force setpoint adjustments based on available cast sequence data and record the force setpoint trajectory for each cast sequence in the validation set. A more specific testing process is shown in Algorithm 2.
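For illustration, a sketch of this greedy deployment step follows; the action magnitudes, the stand-in network, and the state values are placeholders, not values from the disclosure.

```python
# Greedy policy sketch: at each step, evaluate the trained Q-network on the
# current state and apply the highest-valued force setpoint adjustment.
# The action set values are illustrative placeholders, not patent values.
import torch
import torch.nn as nn

FORCE_ACTIONS = [-20.0, -10.0, -5.0, 0.0]   # hypothetical force adjustments

def greedy_force_adjustment(q_network, state):
    """Pick the highest-valued admissible force setpoint adjustment."""
    with torch.no_grad():
        q_values = q_network(torch.tensor([state], dtype=torch.float32))
    return FORCE_ACTIONS[int(torch.argmax(q_values))]

# Stand-in for a trained DQN of the kind sketched earlier
q_network = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 4))
print(greedy_force_adjustment(q_network, [0.62, 0.05, 1180.0, -10.0]))
```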

[0077] Algorithm 2 is used to calculate and collect each RL Agent's force decision-making trajectories under different chatter scenarios. Figure 7 contains the RL Agent's force setpoint value trajectory and the associated chatter trajectory under which these force adjustments are made, for a start-up time upper bound of 500, a maximum acceptable chatter value of 0.5, and with a preference for operator behavior described by Cluster 3. The RL Agent begins to reduce the force setpoint as the chatter exceeds the specified threshold and/or the chatter has an increasing trend; similarly, the RL Agent halts further reduction of the force setpoint as the chatter decreases below the threshold and/or the chatter shows a decreasing trend. As expected, these results are consistent with the design of the reward function.

[0078] To demonstrate the sensitivity of the trained RL Agent to the operator data used for training, two different preferred clusters are created. The first contains only cast sequences from the most aggressive cluster (Cluster 3 from the k-means clustering results) while the second contains cast sequences from both the most aggressive cluster (Cluster 3) and the moderate cluster (Cluster 2). Both RL Agents are trained with the same dataset but different preferred cluster settings. Cast sequences belonging to Cluster 3 are considered as preference in both training settings because these data include system operation across the full range of possible force state values, whereas data belonging to Clusters 1 and 2 did not.

[0079] Figures 8 and 9 give examples of RL Agent reactions under different chatter scenarios. RL Agent A, the one trained with the reward function preferring the most aggressive operator behavior, chooses to decrease the force setpoint more rapidly than RL Agent B, which was trained with the reward function preferring both moderate and aggressive operator behavior. These results are consistent with the design of the reward function and demonstrate how the choice of operator behavior used for training influences each RL Agent.

[0080] To demonstrate the sensitivity of the reward function to changes in the performance specifications, the other parameters in the reward function may be fixed while the maximum acceptable chatter value is varied, and two RL Agents are trained. Table II shows details of the reward function settings of the two RL Agents.

[0081] Figures 10 and 11 provide examples of RL Agent reactions under different chatter scenarios. RL Agent C, trained with a lower maximum acceptable chatter value, displays a more aggressive force adjustment behavior than RL Agent D, the one trained with a higher maximum acceptable chatter value. This is again consistent with the design of the reward function and demonstrates how the performance specifications affect each RL Agent's behavior even when the same data is used to train each RL Agent.

[0082] Ultimately, the purpose of training an RL Agent to automatically adjust the force setpoint is to improve the performance and consistency of the twin-roll strip casting process (or other process as may be applicable). To validate the trained RL Agent before implementing the RL Agent on an operating twin-roll caster, the trained RL Agent's behavior is directly compared to that of different human operators. Because the RL Agent is not implemented on an online casting machine for validation purposes, the comparison is between the past actions of the operator (in which their decisions impacted the force state and, in turn, the chatter) and what the RL Agent would do given those particular force and chatter measurements. Nonetheless, this provides some basis for assessing the differences between the human operator and the machine RL Agent.

[0083] In one example, RL Agent C is compared with a human operator's behavior in two different casts. In Figure 12, the operator does not reduce the force setpoint even though the chatter shows a strong increasing trend. In Figure 13, the operator starts to reduce the force before the chatter begins to increase. Engineers with expertise in twin-roll strip casting evaluated these comparisons and deemed the RL Agent's behavior to be preferable over that of the human operator. However, it is important to note that in each case, the human operator may be considering other factors, beyond chatter, affecting the quality of the strip that may explain their decision-making during these casts.

[0084] In some embodiments, additional casting machine responses are added to the reward function. For example, in some embodiments, strip profile is measured by gauge 44 and provided to the RL Agent 92. Gauge 44 may be located between the casting rolls and the hot rolling mill 16. Strip profile parameters may include edge bulge, edge ridge, maximum peak versus 100mm, and high edge flag. Each of these may be assigned an upper boundary. As with the chatter reward function, reward functions for profile parameters are designed to assign negative reward as the measured parameters approach their respective upper bound. These reward functions may be scaled, for example, to assign equal weight to each parameter, and then summed. The sum may be scaled to ensure the chatter reward term is dominant, at least during start-up. An example of such a reward function is shown in Equation 4, where C is chatter, bg is edge bulge, rg is edge ridge, mp is max peak versus 100mm, and fg is high edge flag. This results in the reward function having a chatter score and a profile score. Additional profile parameters that may be measured and included in a reward function include overall thickness profile, profile crown, and repetitive periodic disturbances related to the rotational frequency of the casting rolls.
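Because Equation 4 is not reproduced above, the following is a hedged sketch of a composite reward of the kind described: per-parameter penalties that grow as each parameter approaches its upper bound, equally weighted and summed into a profile score that is scaled so the chatter score dominates. The exact weights and functional forms are assumptions.

```python
# Hedged sketch of the composite reward described around Equation 4: each
# profile parameter gets a penalty that grows as it approaches its upper
# bound, the profile penalties are equally weighted and summed, and the
# result is scaled so that the chatter term dominates. Exact weights and
# functional forms are assumptions, not the patent's Equation 4.
def composite_reward(chatter, bg, rg, mp, fg, bounds, profile_scale=0.25):
    def penalty(value, upper_bound):
        # Negative reward that approaches -1 as the value reaches its bound.
        return -min(max(value, 0.0) / upper_bound, 1.0)

    chatter_score = penalty(chatter, bounds["chatter"])
    profile_score = profile_scale * sum(
        penalty(v, bounds[name])
        for name, v in (("bg", bg), ("rg", rg), ("mp", mp), ("fg", fg))
    ) / 4.0
    return chatter_score + profile_score

bounds = {"chatter": 0.5, "bg": 0.05, "rg": 0.05, "mp": 0.08, "fg": 1.0}
print(composite_reward(0.4, 0.02, 0.03, 0.05, 0.0, bounds))
```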

[0085] In another embodiment, each of the embodiments described above can be extended to operating the casting machine in a steady state condition, after the start-up time has passed. In some embodiments, the reward function is modified, for example, to eliminate the start-up time term. For example, in the embodiment having both chatter and profile terms provided above, the reward function may be modified as shown in Equation 5:

The relative weights of the chatter and profile reward functions may also be adjusted.

[0086] In other embodiments, a different reward function is developed for steady state operation and a different RL Agent is trained for steady state operations. In other embodiments, a model-based A.I. agent is developed and trained for steady state operation. In some embodiments, one or more model-based controllers are operated concurrently with a trained model-free RL Agent. For example, an Iterative Learning Controller may control wedge to reduce periodic disturbances as in WO 2019/060717A1, which is incorporated by reference, and any of the RL Agents described herein may effectuate the actions to reduce chatter and/or profile defects.

[0087] In the Deep Q Network RL Agent above, it is shown that the trained RL Agent can independently adjust one setpoint based on a single objective signal. However, it may be desirable to extend the RL Agent to multiple objective signals, and with a reward function containing multiple time-varying objectives, determining and applying an offset can be impractical. In addition, since the training process only uses a finite dataset from human records, an imbalanced dataset can also impact the agent's behavior negatively.

[0088] Accordingly, in another embodiment of an RL Agent, a modified actor-critic algorithm is applied to a control problem in which multiple control objectives are defined. Similar to the modified DQN algorithm above, the modified actor-critic algorithm trains the RL Agent with only the human records. The trained agent is also expected to take the most rewardable action taken by some operators under a similar situation. However, instead of applying an offset to the reward function, an actor-critic algorithm is employed which trains the policy function as a multiple-class classification problem, so that cost-sensitive methods can be applied to update the policy function based on both the reward and the action distribution in the dataset. In addition, this method is applied to learn a setpoint control strategy in a twin-roll casting process, and it is shown that the trained agent can independently make reasonable and consistent setpoint adjustments under a given scenario.

[0089] The nomenclature provided in Table III below is followed for the discussion of the actor-critic algorithms.

Table III. Nomenclature

[0090] The RL Agent using an actor-critic algorithm includes two main functions, a value function and a policy function. The value (critic) function V maps a state to its value, which is defined as the expected long-term reward starting from the given state.

The policy (actor) function π maps a state-action pair to a probability value between 0 and 1, which represents, under this policy, how likely the action A is to be taken at the given state S. The RL Agent interacts with the real or simulated environment according to the policy function π and collects the current state S_t, the action A_t, the state at the next time step S_t+1, and the immediate reward R_t to update both the value and the policy function. The immediate reward R may be calculated as shown in Equations 2, 4, or 5 above, the piecewise defined reward function of Equation 9 below, or another suitable reward function. Considering a finite training dataset, the value function can be evaluated as shown in Algorithm 3.

[0091] If any new observations are collected, one can always include them in the dataset D and increase training iterations. However, in this example a finite training set is used, and the converged value function V will be fixed and used for training the policy function. The training process of the policy function involves updating the likelihood of choosing a certain action under the given state according to an advantage value a_t. As shown in the advantage function in Equation 6, a_t = R_t + γV(S_t+1) − V(S_t); if the sum of the immediate reward R_t and the discounted value of the subsequent state γV(S_t+1) is greater than the value of the current state V(S_t), then the advantage value a_t is positive, the action A_t is considered a valuable one, and its likelihood given S_t should be increased based on how much the advantage is. However, if the advantage value a_t is negative, the updated policy function is less likely to choose A_t when encountering S_t. When free exploration in a real or simulated environment is not accessible, a negative advantage value may increase the likelihood of the policy selecting an action not represented in the dataset. In other words, the consequence of that action in terms of the resultant state is unknown. To mitigate this issue, exp(a_t) is used to determine how much to increase the likelihood of A_t. Since exp(a_t) is always positive, a less valuable action observed in the dataset will still have a higher chance of being selected compared to those actions that have never been taken given a certain state.
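A minimal sketch of the advantage computation and the exp(a) weighting discussed above follows; the value estimates and discount factor are hypothetical numbers, not disclosure values.

```python
# Sketch of the advantage of Equation 6, a_t = R_t + gamma*V(S_{t+1}) - V(S_t),
# with exp(a_t) used as an always-positive weight so that actions actually
# observed in the dataset retain some probability even when their advantage
# is negative. The value estimates below are stand-ins, not a trained critic.
import numpy as np

def advantage(reward, value_s, value_s_next, gamma=0.95):
    return reward + gamma * value_s_next - value_s

def policy_weight(adv):
    # exp(.) preserves the ordering of advantages but is strictly positive.
    return np.exp(adv)

# Hypothetical numbers for one sample (S_t, A_t, R_t, S_{t+1})
a = advantage(reward=-0.2, value_s=1.0, value_s_next=1.1)
print(a, policy_weight(a))   # a slightly negative advantage still gets a
                             # positive, if small, policy update weight
```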

[0092] In addition, the finite training dataset might have an uneven distribution in terms of the actions taken by the human operators. To effectively learn from an imbalanced dataset, researchers have developed methods such as re-sampling, random forests, and cost-sensitive methods. Re-sampling is not a challenge when free exploration is available, since the agent can interact with the environment and up-sample those actions which are less common. However, when free exploration is not possible, the cost-sensitive method is an effective methodology to implement in the policy function update scenario. One may define the likelihood of each action appearing within the training dataset D. The loss function depends on both this likelihood and the advantage value. As shown in Equation 7, if an action is frequently taken in the training dataset and has little or negative advantage value, its weight will be low in the loss function. The training process of the policy function is shown in Algorithm 4.
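Since Equation 7 is not reproduced above, the following sketch shows one plausible cost-sensitive weighting consistent with the description: each sample's policy loss term is weighted by exp(advantage) divided by the empirical frequency of the chosen action in the dataset. The exact weighting used in the disclosure may differ.

```python
# Hedged sketch of a cost-sensitive policy loss in the spirit of Equation 7:
# each sample's cross-entropy term is weighted by exp(advantage) divided by
# the empirical frequency of the chosen action, so frequent actions with
# little or negative advantage contribute little to the update.
import numpy as np

def cost_sensitive_loss(action_probs, actions, advantages, action_freq):
    # action_probs: (n_samples, n_actions) policy outputs pi(A | S)
    # actions:      (n_samples,) indices of actions taken by operators
    # advantages:   (n_samples,) advantage values a_t
    # action_freq:  (n_actions,) likelihood of each action in the dataset
    weights = np.exp(advantages) / action_freq[actions]
    log_likelihood = np.log(action_probs[np.arange(len(actions)), actions])
    return -np.mean(weights * log_likelihood)

probs = np.array([[0.7, 0.1, 0.1, 0.1], [0.25, 0.25, 0.25, 0.25]])
print(cost_sensitive_loss(probs, np.array([0, 2]),
                          np.array([0.3, -0.1]),
                          np.array([0.9, 0.03, 0.04, 0.03])))
```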

[0093] During the start-up process, the casting roll separation force setpoint (to be referred to as the “force setpoint”) is the most frequently adjusted setpoint. Operators adjust the force setpoint to respond to different profile issues as set forth above. The strip chatter (C), a non-negative value indicating the thickness variation along the cast length direction, is a major factor in adjusting the force setpoint. In addition, operators might adjust the force setpoint to respond to another category of profile imperfection, edge spikes. Unlike chatter, which describes profile imperfections along the cast length direction, edge spikes are profile imperfections that lie along the strip cross section. Four parameters are used to characterize different edge spike problems:

(1) Edge bulge (bg): within the 0mm to 25mm edge region from the outer end, the thickness range from the peak to the closest minima in the direction away from the outer end. It is a non-negative value.

(2) Edge ridge (rg): within the 25mm to 50mm edge region from the outer end, the thickness range from the peak to the closest minima in the direction away from the outer end. It is a non-negative value.

(3) Maximum peak (mp): maximum thickness between the edge bulge and edge ridge locations with respect to the inner end of the edge region. It is a real value.

(4) High edge flag (fg): a binary value indicating whether either edge region is thicker than the cross section center thickness. Figure 15 shows a scenario where the edge region is thicker than the center region.

See Figures 14 and 15 for illustrations of edge bulge, edge ridge, and maximum peak.

[0094] Generally, increasing the force setpoint increases the force applied on the strip surface and reduces the amount of the semi-solid material (also known as “mushy” material) between the solidified shells, which mitigates some edge spike problems. However, the mushy material functions as a damper, which reduces the strip vibration. Therefore, the reduction of the mushy material results in less damping and more vibration in the strip, which in turn worsens the chatter problem. Therefore, there is a trade-off between mitigating chatter versus mitigating edge spike problems.

[0095] Given that modeling the system dynamics during the start-up process can be difficult, the reinforcement learning agent considered here is designed to learn by only observing the record of human operation and then suggest the optimal setpoint adjustment (value and timing) to the human operator. The state at each time step is composed of the relevant process measurements and their changes, where each change is the difference between the values of the current and previous time steps. Cast data is recorded at 1 Hz and smoothed with a 10-second moving-average filter. In addition, based on the observation that human operators do not adjust the force setpoint more frequently than 0.2 Hz, the data may be further downsampled to 0.2 Hz to adapt to the force setpoint adjustment frequency used by human operators.
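A short sketch of this signal conditioning follows; the array names and example signal are hypothetical, and only the 1 Hz rate, the 10-second moving average, and the 0.2 Hz downsampling are taken from the text.

```python
# Sketch of the signal conditioning described above: 1 Hz cast data is
# smoothed with a 10-second moving-average filter and then downsampled to
# 0.2 Hz (one sample every 5 s) to match the operators' adjustment rate.
import numpy as np

def condition_signal(signal_1hz, window=10, step=5):
    kernel = np.ones(window) / window
    smoothed = np.convolve(signal_1hz, kernel, mode="valid")  # 10 s moving average
    return smoothed[::step]                                    # 0.2 Hz samples

chatter_1hz = np.abs(np.random.default_rng(2).normal(0.3, 0.05, 600))
print(condition_signal(chatter_1hz).shape)
```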

[0096] It has also been observed that operators typically adjust the force setpoint by one of eight fixed values. Therefore, at a given time step, the agent is permitted to adjust the force setpoint by one of these eight values. Among these actions, three represent decreasing the force setpoint, four represent increasing the force setpoint, and one is defined as keeping the force setpoint unchanged. A challenging aspect of the specific problem under consideration is that when human operators keep the force setpoint constant, it is not known whether that action was taken deliberately, or if it represents more passive behavior that resulted from an operator being distracted by other operation tasks. How to address this ambiguity is described in more detail below.

[0097] The reward function explicitly incentivizes desired performance metrics. Edge spike and chatter are major problems that can be addressed with force setpoint adjustments during a start-up process. The chatter problem is characterized by the chatter parameter value, and the edge spike is characterized by the edge bulge, edge ridge, and maximum peak parameters. The high edge flag parameter is not used to characterize the edge spike problem because it is a binary value and is not comparable to the other three parameters related to edge spike. However, the high edge flag information is embedded in the state vector to provide the agent with extra information to make a decision. It is desirable to have low values of chatter, edge bulge, edge ridge, and maximum peak, and a decreasing trend of these parameters. However, once the value of a parameter decreases below a user-defined threshold, continuing to decrease its value is not necessary. Based on these observations, an edge spike parameter is defined, and a piecewise defined reward function is constructed for the performance objectives as shown in Equation 9, where weights are used to scale the two terms into the ranges [−1, 1] and [−2, 2], respectively, and user-defined thresholds are specified for the chatter and edge spike parameters.
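Because Equation 9 is not reproduced above, the following is a hedged sketch of a piecewise reward of the kind described: no penalty below the user-defined threshold, and a weighted penalty on both the level and the increasing trend above it. The scaling and exact piecewise form are assumptions.

```python
# Hedged sketch of the piecewise reward described around Equation 9: below the
# user-defined threshold a parameter contributes no penalty; above it, both the
# parameter level and its trend are penalized, with separate weights on the
# chatter and edge-spike terms. The precise piecewise form in the patent is
# not reproduced here.
def piecewise_reward(chatter, d_chatter, edge_spike, d_edge_spike,
                     c_ub, e_ub, w_c=1.0, w_e=2.0):
    def term(value, delta, upper_bound, weight):
        if value <= upper_bound:
            return 0.0                     # objective satisfied: no penalty
        # Penalize the excess over the threshold and any increasing trend.
        return -weight * min((value - upper_bound) / upper_bound + max(delta, 0.0), 1.0)

    return term(chatter, d_chatter, c_ub, w_c) + term(edge_spike, d_edge_spike, e_ub, w_e)

print(piecewise_reward(chatter=0.7, d_chatter=0.02,
                       edge_spike=0.04, d_edge_spike=-0.01,
                       c_ub=0.5, e_ub=0.05))
```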

[0098] To categorize different force setpoint adjustment behaviors, a k-means clustering algorithm is employed to cluster 95 individual cast sequences in the training dataset. The start-up process of each sequence is operated by one of the six human operators. All of the cast sequences represent the same steel grade and strip width and are collected from the same cast machine to prevent any behavior variation caused by differences in the cast conditions.

[0099] The force setpoint adjustment behavior is characterized by a force setpoint trajectory after the manual mode of the force setpoint begins. Since there are 6 operators in the data set of this example, the clustering is evaluated for a range of values of k. The average silhouette width indicates that both k = 2 and k = 3 have an average silhouette width higher than 0.5. According to Figure 17, there is no major difference between k = 2 and k = 3. Therefore, for simplicity, the clustering results of k = 2 are used. Figure 17 also shows an uneven distribution in the clustering. Combined with the force trajectory examples shown in Figure 18A (Cluster 1) and Figure 18B (Cluster 2), over 70% of the sequences have Cluster 1 force behavior, which is less aggressive in both force adjustment range and frequency. In addition, over 90% of the samples in the training dataset have the zero-force-change action.

[0100] Both the value function and the policy function are represented as neural networks. The selection of the neural network architectures is heuristic and shown in Table IV. In one example, the value function has 701 learnable parameters, and the policy function has 848 learnable parameters. The total number of samples used to train these two neural networks is 4594.

Table IV. Neural network architectures of the value function and the policy function
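Since the contents of Table IV are not reproduced here, the following actor and critic definitions are purely illustrative; they mirror the described structure (a critic mapping the state to a scalar value and a stochastic actor mapping the state to probabilities over the eight admissible actions), but the layer sizes and state dimension are assumptions and do not reproduce the 701 and 848 parameter counts.

```python
# Illustrative actor/critic architectures. The exact layer sizes of Table IV
# are not reproduced, so the networks below are assumptions; they only mirror
# the structure described in the text.
import torch
import torch.nn as nn

STATE_DIM = 8      # assumed state vector length (chatter, edge spike terms, force, ...)
N_ACTIONS = 8      # three decreases, four increases, one "hold" action

value_function = nn.Sequential(        # critic: V(S)
    nn.Linear(STATE_DIM, 24), nn.ReLU(),
    nn.Linear(24, 16), nn.ReLU(),
    nn.Linear(16, 1),
)

policy_function = nn.Sequential(       # actor: pi(A | S), a stochastic policy
    nn.Linear(STATE_DIM, 24), nn.ReLU(),
    nn.Linear(24, 16), nn.ReLU(),
    nn.Linear(16, N_ACTIONS), nn.Softmax(dim=-1),
)

print(sum(p.numel() for p in value_function.parameters()),
      sum(p.numel() for p in policy_function.parameters()))
```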

[0101] In the testing process, ninety-five cast sequences are used for training the reinforcement learning agent, and another 8 cast sequences with the same metal grade and width condition are used for testing. Except for the force setpoint values chosen by the human operator, the other defined states are provided to the agent at each time step. At the initial time step, the agent observes the initial force setpoint value and is required to adjust it based on the state information; the decision made by the agent affects the subsequent step's force setpoint value. The goal of this test is to verify whether the trained agent reacts to the twin-roll casting process in a manner that is intuitive given the presence of a particular imperfection in the steel strip. Figures 19 and 20 show two pairs of testing sequence comparisons. The actual force (blue curve) represents the human operator's actual force trajectory, and the force prediction (black curve) represents the agent's force trajectory.

[0102] These comparisons demonstrate two important points. The first point is demonstrated in Figures 19A (Case 1) and 19B (Case 2). Case 1 exhibits higher edge spike values compared to Case 2. Because the process is behaving differently between the two casts, the RL Agent makes different setpoint decisions; this is desired and expected. In contrast, the underlying human operator trajectories were similar despite the differences in how the process was behaving. The second point is demonstrated in Figures 20A (Case 3) and 20B (Case 4). When the objective-related parameters are similar between two casts, the agent likewise makes consistent decisions in the two casts. This is in contrast to what the human operator did in the actual casts, which was to make different force setpoint value decisions despite the process behaving similarly. Although these results do not represent closed-loop interaction between the agent and the twin-roll casting process, they provide valuable insight into how the agent would behave under different casting scenarios.

[0103] In one aspect of the present invention, the actor-critic algorithm is modified to better accommodate learning from multiple human experts given the following constraints on the class of settings under consideration:

1) During the algorithm training phase, the only available data are generated by human experts.

2) Multiple experts' data are mixed in a dataset. All experts can stabilize the closed-loop supervisory control system.

3) When the experts' performance is assessed based on a given criterion, their performances may not be equally preferred.

[0104] Given that the reinforcement learning agent is trained from human data only (and without a process model), the following are expected to hold true:

1) If the human experts' behaviors are very consistent, such that the state-action mapping is 1-to-1, the reinforcement learning agent should learn this mapping exactly.

2) If there exists inconsistency, such that multiple actions are observed being taken under a certain state, the agent should learn to pick the most preferable one.

[0105] The exploration nature of the reinforcement learning algorithm is temporarily prohibited by replacing the advantage a_i with its natural exponential exp(a_i), because a negative a_i causes the action taken by the policy function to depart from the action taken by the human expert. The function exp(a_i) has the same monotonicity as a_i. Therefore, if a sample has a high positive advantage, the corresponding exp(a_i) is also high, so the sample is considered preferable. On the contrary, if a sample has a low positive or negative advantage, its corresponding exp(a_i) becomes low, and the sample is considered less preferable.

[0106] In addition, a deterministic policy function is desired, but due to the concern of inconsistencies in the training dataset, which is generated by multiple experts, a stochastic policy function is employed to characterize a conditional distribution of actions. This policy function plays the role of a sensitivity weight to deal with the imbalanced training dataset. The modified loss function is shown in equation 10.
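
Equation 10 is not reproduced in this text; the following is a minimal sketch under the assumption that the modified actor loss is the negative log-likelihood of the expert actions weighted by the natural exponential of the advantage, as paragraphs [0105] and [0106] describe. The function and argument names are hypothetical.

import torch

def actor_loss(policy_dist, expert_actions, advantages):
    """Exp-advantage-weighted negative log-likelihood (assumed form of eq. 10).

    policy_dist    -- torch.distributions.Normal over each action dimension,
                      conditioned on the state (and, with recurrence, the previous action)
    expert_actions -- actions taken by the human experts in the dataset
    advantages     -- advantage values a_i computed from the critic
    """
    weights = torch.exp(advantages.detach())           # exp(a_i) >= 0, preserves ordering of preference
    log_prob = policy_dist.log_prob(expert_actions).sum(dim=-1)
    return -(weights * log_prob).mean()                 # preferable samples weigh more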

[0107] Recurrence from the previous step is embedded to improve the actor training process. Samples in the training dataset D are reconstructed such that every sample also contains the action taken in the previous step together with its advantage. Because this data reconstruction is mainly for the actor training, it is assumed that a fixed value function has been determined and that the corresponding advantages have been calculated.

[0108] The policy function is also redesigned with a dependency on the previous step's action, such that the action at each step is generated from the current state and the previous step's action (equation 11),

[0109] where the action is the one taken by the policy function under the given condition. This is sufficient if only a teacher-forcing technique is considered. However, the agent is also expected to perform more robustly, meaning that the agent should be able to tolerate mistakes that it made in previous steps. Therefore, the augmented data is constructed as follows. Provided a sample d_i is not the last step of a trajectory, its corresponding augmented sample replaces the previous-step action of the following sample with the action taken by the policy function at the current step (equation 12).
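
A minimal sketch of the recurrent sample reconstruction and augmentation described in paragraphs [0107] through [0109]; the dictionary field names, the trajectory-ordering assumption, and the way the perturbed previous action is generated are assumptions.

def build_augmented_dataset(samples, policy):
    """samples: trajectory-ordered list of dicts with keys 's' (state),
    'a' (expert action), 'a_prev' (expert action at the previous step),
    'adv' (precomputed advantage) and 'is_last' (end-of-trajectory flag).
    For every sample that is not the last step of its trajectory, a copy of
    the next sample is added whose previous-action field is replaced by the
    policy's own output at the current step, so the actor learns to tolerate
    its own earlier mistakes rather than relying on teacher forcing alone."""
    augmented = list(samples)
    for i, d in enumerate(samples):
        if d['is_last']:
            continue
        a_hat = policy(d['s'], d['a_prev'])      # action the agent would take at this step
        nxt = samples[i + 1]
        augmented.append({'s': nxt['s'], 'a': nxt['a'],
                          'a_prev': a_hat,        # agent's own action as the recurrence input
                          'adv': nxt['adv'], 'is_last': nxt['is_last']})
    return augmented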

[0110] In each iteration, the training process first determines the policy actions based on equation 11 and forms the augmented samples based on equation 12. Then, it determines and updates the parameter set, which satisfies equation 13. The policy function training process with the usage of an augmented dataset is illustrated in Algorithm 5.

[0111] In this embodiment, the focus is on two setpoints: roll separation force and entry gauge thickness. As shown in Figure 21, the roll separation force setpoint directly affects the force applied to the rollers and therefore to the steel strip. The entry gauge thickness setpoint affects the casting speed; the smaller the setpoint, the faster the rollers. Hereinafter, these setpoints are referred to as the "force" and "thickness" setpoints.

[0112] Surface quality and thickness profile uniformity are two of the major concerns in steel strip manufacturing. This includes chatter, a surface imperfection, and edge spikes, a thickness profile non-uniformity. Chatter, as shown in Figure 1B, is the thickness variation along the cast length direction. Based on the vibration frequency, chatter is separated into high and medium frequency chatter.
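
The frequency bands separating high and medium frequency chatter are not specified in this text; the following is a minimal sketch, with hypothetical band edges and sampling rate, of measuring chatter as the magnitude of thickness variation within a frequency band along the cast length.

import numpy as np

def chatter_band_magnitude(thickness, sample_rate_hz, band):
    """Magnitude of thickness variation within a frequency band (illustrative).
    band is a (low_hz, high_hz) tuple; the band edges used in practice are not
    specified here."""
    x = thickness - np.mean(thickness)
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sample_rate_hz)
    mask = (freqs >= band[0]) & (freqs < band[1])
    return np.sqrt(np.sum(np.abs(spectrum[mask]) ** 2)) / len(x)

# Hypothetical usage with assumed band edges:
# medium = chatter_band_magnitude(trace, 100.0, (5.0, 15.0))
# high   = chatter_band_magnitude(trace, 100.0, (15.0, 50.0))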

[0113] Edge spikes characterize thickness imperfections along the cross-section of the strip, as shown in Figure 15. Four quantities are used to characterize edge spike problems. They are as follows (a computational sketch is provided after this list):

1) Edge bulge (bg): within the 0 to 25 mm edge region from the outer end, the thickness range from the peak to the closest minimum in the direction away from the outer end. It is a non-negative value.

2) Edge ridge (eg): within the 25 mm to 50 mm edge region from the outer end, the thickness range from the peak to the closest minimum in the direction away from the outer end. It is a non-negative value.

3) Maximum peak (mp): maximum thickness between the edge bulge and edge ridge locations with respect to the inner end of the edge region. It is a real value.

4) High edge flag (fg): a binary value indicating whether either edge region is thicker than the cross-section center thickness.
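
A minimal sketch of computing these four quantities from a measured cross-section thickness profile of one edge region. The "closest minimum" in the definitions above is approximated by the minimum thickness inward of the peak, and the reference thickness used for the maximum peak is assumed to be the thickness at the inner end of the edge region; both are assumptions, so this is illustrative only.

import numpy as np

def edge_spike_features(pos_mm, thickness, center_thickness):
    """pos_mm: distance from the outer end of one edge (mm), ascending;
    thickness: cross-section thickness at those positions;
    center_thickness: thickness at the cross-section center."""
    def peak_and_range(lo, hi):
        m = (pos_mm >= lo) & (pos_mm < hi)
        t = thickness[m]
        i_peak = int(np.argmax(t))
        return float(t[i_peak]), float(t[i_peak] - np.min(t[i_peak:]))

    peak_bg, bg = peak_and_range(0.0, 25.0)     # edge bulge region, bg >= 0
    peak_eg, eg = peak_and_range(25.0, 50.0)    # edge ridge region, eg >= 0
    inner_end = float(thickness[pos_mm < 50.0][-1])  # thickness at the inner end of the edge region
    mp = max(peak_bg, peak_eg) - inner_end      # maximum peak (real valued, approximation)
    fg = int(max(peak_bg, peak_eg) > center_thickness)  # high edge flag
    return bg, eg, mp, fg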

[0114] In some embodiments the state, action, and reward function are constructed as follows.

[0115] State: With a fixed number of state elements, we prefer to encode more information about the dynamics. Therefore, the state vector includes the high and medium frequency chatter values of the sample, the allowed minimum thickness value, the time t with respect to the time at which a human operator can begin adjusting setpoints, and, for each such element, the difference between two consecutive steps. The time and the allowed minimum thickness are also included in the state vector because a desired strip thickness is part of the final product requirement. Any decision causing the thickness setpoint to be less than the allowed minimum thickness should result in a penalty. As the time increases, the penalty also increases.
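
A minimal sketch of assembling such a state vector. The field names and the ordering of elements are assumptions, and the edge spike quantities are assumed to be among the included control objectives.

import numpy as np

def build_state(cur, prev, t, min_thickness):
    """cur/prev: dicts of the current and previous measurements with assumed keys
    'hf_chatter', 'mf_chatter', 'bg', 'eg', 'mp', 'fg'.  The state holds each
    control objective, its change between consecutive steps, the time since
    manual setpoint adjustment became available, and the allowed minimum thickness."""
    keys = ['hf_chatter', 'mf_chatter', 'bg', 'eg', 'mp', 'fg']
    values = [cur[k] for k in keys]
    deltas = [cur[k] - prev[k] for k in keys]
    return np.array(values + deltas + [t, min_thickness], dtype=float)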

[0116] Action: The action is simply defined as the force (F) and thickness (Th) setpoint values at the next time step.

[0117] Reward: The reward function is a function of all control objectives in the state vector, including every element except t and Th, which are considered separately below. Furthermore, the reward is a varying weighted sum of all control objectives,

[0118] where W(s_t) is the piece-wise linear weighting function for the state vector. When a control objective x_i in the state vector is lower than its threshold, the weights corresponding to both the objective x_i and its change Δx_i decrease. The weighting function is always non-negative, and the negative sign in front of it makes lower values and decreasing trends of control objectives result in higher rewards.
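
The reward expression is not reproduced in this text; the following is a minimal sketch of a reward of the form described in paragraphs [0117] and [0118], using a piecewise-constant simplification of the piece-wise linear weighting. The thresholds and the weight values (with w_lo smaller than w_hi) are hypothetical.

import numpy as np

def reward(objectives, deltas, thresholds, w_hi, w_lo):
    """objectives, deltas: arrays of control-objective values x_i and their
    step-to-step changes; thresholds: per-objective thresholds; w_hi/w_lo:
    non-negative weights used above/below the threshold (all hypothetical)."""
    w = np.where(objectives > thresholds, w_hi, w_lo)   # weights drop once an objective falls below threshold
    # Negative sign: lower values and decreasing trends earn higher reward.
    return -float(np.sum(w * objectives) + np.sum(w * deltas))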

[0119] The time-dependent thickness penalty is directly encoded in a loss function.

[0120] Note that the thickness setpoint adjustment is decided by the policy function. The thickness penalty loss is then used to determine the parameter set simply by replacing J in equation 13 with the loss defined in equation 19.
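
The loss expression itself is not reproduced in this text; the following is a minimal sketch assuming the time-dependent thickness penalty grows with the elapsed time and activates only when the policy-selected thickness setpoint falls below the allowed minimum, consistent with paragraphs [0115] and [0119]. The function name and the weight are hypothetical.

import torch

def thickness_penalty(th_setpoint, th_min, t, weight=1.0):
    """Penalty that grows with elapsed time t whenever the policy-selected
    thickness setpoint drops below the allowed minimum (assumed form)."""
    violation = torch.clamp(th_min - th_setpoint, min=0.0)
    return weight * t * violation

# The penalized actor loss would then be, e.g., actor_loss(...) + thickness_penalty(...).mean()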

As discussed above, the training process relies only on data generated by human experts, because there is not yet an available simulator due to the system complexity. However, we still want to assess and compare the trained agents prior to actual implementation. Accordingly, a method to evaluate an agent's performance without a simulator is provided. Then an agent trained with the recurrent augmented dataset may be compared to an agent trained without the augmented dataset.

[0121] Similar to the sequence-to-sequence RNN, the policy function is asked to generate a setpoint trajectory based on a state trajectory. Since the policy function has its recurrence from the previous output, as shown in equation 11, the action taken by the agent at time step k is generated from the state at step k and the agent's action at the previous step, and at the initial time step the initial action is given.

[0122] Suppose the K-step state trajectory results from a setpoint trajectory generated by a human expert. As mentioned earlier, if all human experts share the same consistent control policy, then the agent is expected to learn the policy perfectly, and the setpoint trajectories generated by the human expert and by the agent should be similar. However, if there exists policy inconsistency, which may cause an imperfect imitation of the expert's control policy, then the agent should prioritize learning from samples with higher advantages. Therefore, for each time step k, the advantage can be calculated based on equation 22, in which a binary indicator shows whether the step is the end step of a sequence. A validation loss is then defined as a weighted tracking error between the agent-determined and expert setpoint trajectories.

[0123] Two agents are compared herein using eight unseen testing sequences. Trajectory plots of one sequence are shown in detail, and the loss statistics of all eight sequences are shown and discussed. Figure 22 shows the force trajectory of the agent trained without the augmented dataset. The presented cast sequence has increasing edge spikes from the start of the casting sequence. Correspondingly, the human expert increases the force setpoint. After about 100 seconds, the edge spike values start to decrease. The agent-determined force trajectory starts to deviate from the actual force trajectory chosen by the human expert at about 50 seconds, and the difference between the two trajectories increases as time increases. Correspondingly, in Figure 23, the loss of the force tracking increases over the sequence. The agent-determined thickness follows the human-selected thickness well, so the loss of the thickness tracking remains low.
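
A minimal sketch of the simulator-free evaluation described in paragraphs [0121] and [0122]: the policy rolls out its own setpoint trajectory along a recorded state trajectory, and the tracking error against the expert setpoints is weighted per step. The exponential-of-advantage weighting and the function names are assumptions, since equations 21 and 22 are not reproduced here.

import numpy as np

def validation_loss(policy, states, expert_actions, advantages, a0):
    """Roll the policy along a recorded K-step state trajectory using its own
    previous output as the recurrence input, then accumulate an advantage-
    weighted tracking error against the expert's setpoints (assumed form)."""
    a_prev, losses = a0, []
    for s, a_expert, adv in zip(states, expert_actions, advantages):
        a_agent = policy(s, a_prev)                       # recurrent rollout
        losses.append(np.exp(adv) * (a_agent - a_expert) ** 2)
        a_prev = a_agent                                  # feed back the agent's own action
    return np.mean(losses, axis=0)                        # per-setpoint (force, thickness) loss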

[0124] Figure 24 shows the force trajectory of the agent trained with the augmented dataset. Although the agent with the augmented dataset also keeps the force setpoint unchanged at the beginning of the cast, at about 75 seconds, as the edge spikes go over 1 and continue increasing, the agent starts to increase the force setpoint. Turning to the loss corresponding to this agent in Figure 25, the loss of the force tracking still increases as time increases, although the difference between the agent-determined force and the actual force does not increase. That is because the loss is a weighted difference between the agent-determined force setpoint and the true force, according to equation 21. In this sequence, the advantage increases as k increases. Therefore, although the tracking error remains unchanged, the loss increases. Table IV shows the loss statistics of all testing sequences. By training with augmented data, both losses corresponding to the force and the thickness tracking are improved in most testing sequences.

TABLE IV

LOSS STATISTICS OF TESTING SEQUENCES

[0125] In this embodiment, recurrent features are embedded to improve the performance of a reinforcement learning controller for a complex supervisory control scenario. As in other embodiments, the problem setting assumes no available system model, and the reinforcement learning algorithm is expected to evaluate, select, and learn from the data of multiple human experts. Augmented datasets are constructed iteratively to perturb the output recurrence and thereby enhance the robustness of the action learning process at later steps in sequences. In the context of a supervisory control problem with a twin-roll casting example, an agent trained with recurrent augmented datasets performs better in advantageous action tracking over the testing sequences than an agent trained without any recurrent augmented dataset.

[0126] Additional actions may also be assigned to the RL Agent. For example, the RL Agent may be trained to reduce periodic disturbances by controlling wedge control for the casting rolls. Some embodiments include localized temperature control of the casting rolls to control casting roll shape and thereby cast strip profile. See, for example, WO 2019/217700, which is incorporated by reference. In some embodiments, the strip profile measurements are used in a reward function so the RL Agent can control the localized heating and/or cooling of the casting rolls to control strip profile.

[0127] Actions may also be extended to other portions of the twin roll caster process equipment, including control of the hot rolling mill 16 and water jets 18. For example, various controls have been developed for shaping the work rolls of the hot rolling mill to reduce flatness defects. For example, work roll bending jacks have been provided to effect symmetrical changes in the roll gap profile in the central region of the work rolls relative to regions adjacent the edges. The roll bending is capable of correcting symmetrical shape defects that are common to the central region and both edges of the strip. Also, roll force cylinders can effect asymmetrical changes in the roll gap profile on one side relative to the other side. The roll force cylinders are capable of skewing or tilting the roll gap profile to correct for shape defects in the strip that occur asymmetrically at either side of the strip, with one side being tighter and the other side being looser than the average tension stress across the strip. In some embodiments, an RL Agent is trained to provide actions to each of these controls in response to measurements of the cast strip before and/or after hot rolling the strip to reduce thickness.

[0128] Another method of controlling a shape of a work roll (and thus the elongation of cast strip passing between the work rolls) is by localized, segmented cooling of the work rolls. See, for example, U.S. Pat. No. 7,181,822, which is incorporated by reference. By controlling the localized cooling of the work surface of the work roll, both the upper and lower work roll profiles can be controlled by thermal expansion or contraction of the work rolls to reduce shape defects and localized buckling. Specifically, the control of localized cooling can be accomplished by increasing the relative volume or velocity of coolant sprayed through nozzles onto the work roll surfaces in the zone or zones of an observed strip shape buckle area, causing the work roll diameter of either or both of the work rolls in that area to contract, increasing the roll gap profile and effectively reducing elongation in that zone. Conversely, decreasing the relative volume or velocity of the coolant sprayed by the nozzles onto the work surfaces of the work rolls causes the work roll diameter in that area to expand, decreasing the roll gap profile and effectively increasing elongation. Alternatively or in combination, the control of localized cooling can be accomplished by internally cooling the work surface of the work roll in zones across the work roll by localized control of the temperature or volume of water circulated through the work rolls adjacent the work surfaces. In some embodiments, an RL Agent is trained to provide localized, segmented cooling of the work rolls in response to casting mill metrics, such as flatness defects.

[0129] In some embodiments, the RL Agent in any of the above embodiments receives reinforcement learning not only from casting campaigns controlled manually by operators, but also from the RL Agent's own operation of a physical casting machine. That is, in operation, the RL Agent continues to learn through reinforcement learning, including real-time casting machine metrics in response to the RL Agent's control actions, thereby improving the RL Agent's and the casting machine's performance.

[0130] In some embodiments, intelligent alarms are included to alert operators to intervene if necessary. For example, the RL Agent may direct a step change but receive an unexpected response. This may occur, for example, if a sensor or an actuator fails.

[0131] The functional features that enable an RL Agent to effectively drive all process setpoints, and that also enable process and machine condition monitoring, constitute an autonomously driven twin roll casting machine in which an operator is required to intervene only in instances where there is a machine component breakdown or a process emergency (such as failure of a key refractory element).

[0132] It is appreciated that any method described herein utilizing any reinforcement learning agent as described or contemplated, along with any associated algorithm, may be performed using one or more controllers with the reinforcement learning agent stored as instructions on any memory storage device. The instructions are configured to be performed (executed) using one or more processors in combination with a twin roll casting machine to control the formation of thin metal strip by twin roll casting. Any such controller, as well as any processor and memory storage device, may be arranged in operable communication with any component of the twin roll casting machine as may be desired, which includes being arranged in operable communication with any sensor and actuator. A sensor as used herein may generate a signal that may be stored in a memory storage device and used by the processor to control certain operations of the twin roll casting machine as described herein. An actuator as used herein may receive a signal from the controller, processor, or memory storage device to adjust or alter any portion of the twin roll casting machine as described herein.

[0133] To the extent used, the terms "comprising," "including," and "having," or any variation thereof, as used in the claims and/or specification herein, shall be considered as indicating an open group that may include other elements not specified. The terms "a," "an," and the singular forms of words shall be taken to include the plural form of the same words, such that the terms mean that one or more of something is provided. The terms "at least one" and "one or more" are used interchangeably. The term "single" shall be used to indicate that one and only one of something is intended. Similarly, other specific integer values, such as "two," are used when a specific number of things is intended. The terms "preferably," "preferred," "prefer," "optionally," "may," and similar terms are used to indicate that an item, condition or step being referred to is an optional (i.e., not required) feature of the embodiments. Ranges that are described as being "between a and b" are inclusive of the values for "a" and "b" unless otherwise specified.

[0134] While various improvements have been described herein with reference to particular embodiments thereof, it shall be understood that such description is by way of illustration only and should not be construed as limiting the scope of any claimed invention. Furthermore, it is understood that the features of any specific embodiment discussed herein may be combined with one or more features of any one or more embodiments otherwise discussed or contemplated herein unless otherwise stated.