

Title:
SAMPLE-EFFICIENT REINFORCEMENT LEARNING
Document Type and Number:
WIPO Patent Application WO/2023/171102
Kind Code:
A1
Abstract:
There is a continued need in the art for a sample-efficient method of controlling a complex process control task such as a chemical plant. According to some embodiments, a method for accelerating reinforcement learning training based on a form of assisted policy search is provided. Training the reinforcement learning controller for process control utilizes guidance regarding which regions of the state space to explore when making decisions. The method comprises: collecting state trajectories from existing controllers, whose guidance examples are useful to speed up the training process; a method to extract and relabel "controller examples" generated by suboptimal existing controllers; a replay memory to store controller examples; an episodic memory to store rollouts with large cumulative rewards; a mechanism to update the episodic memory; a method to sample examples with priority from the replay memory and the episodic memory; training an ensemble of neural networks to distinguish controller examples and collected rollouts from agent examples; determining an exploration bonus based on aggregated predictions from the ensemble of neural networks; and setting a control objective for the system which maximizes an expected total reward. Advantageously, the method improves sample efficiency in several respects. In particular, any existing controller, such as a model predictive control or proportional-integral-derivative controller, can be used to guide the agent's learning towards states/regions that potentially take the agent in the appropriate direction for solving the task at hand. Such guidance encourages the agent to learn how to correct disturbances and/or maintain the system in a steady state. As a result, the training data can be explored more efficiently, because the invention leverages this guidance to limit the search space to a promising area, thus improving sample efficiency during training.
The term "controller examples" as used above refers to a set of training examples generated by the existing (suboptimal) controllers.
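The abstract's ensemble-and-bonus mechanism could be sketched roughly as follows. This is an illustrative toy, not the application's actual implementation: the `LogisticDiscriminator` class, the logistic-regression members, the bootstrap-sampling scheme, and the mean-aggregation rule are all assumptions chosen for brevity; the application presumably trains deeper networks on real state trajectories. Each ensemble member learns to distinguish controller-generated states from agent-generated ones, and the exploration bonus aggregates the members' predictions.

```python
import numpy as np

rng = np.random.default_rng(0)

class LogisticDiscriminator:
    """One ensemble member: estimates P(state came from a controller example)."""
    def __init__(self, dim, lr=0.5):
        self.w = rng.normal(0.0, 0.1, dim)
        self.b = 0.0
        self.lr = lr

    def prob(self, x):
        # sigmoid of a linear score; x has shape (batch, dim)
        return 1.0 / (1.0 + np.exp(-(x @ self.w + self.b)))

    def update(self, x, y):
        # one SGD step on binary cross-entropy (y = 1 for controller states)
        g = self.prob(x) - y
        self.w -= self.lr * np.mean(g[:, None] * x, axis=0)
        self.b -= self.lr * np.mean(g)

def exploration_bonus(ensemble, state, scale=1.0):
    """Aggregate the ensemble's predictions into a bonus: large when the
    ensemble agrees the state resembles controller-visited regions."""
    preds = np.array([m.prob(state[None, :])[0] for m in ensemble])
    return scale * preds.mean()

# Toy data: controller examples cluster near +1, agent rollouts near -1.
controller_states = rng.normal(+1.0, 0.3, size=(256, 2))
agent_states = rng.normal(-1.0, 0.3, size=(256, 2))

ensemble = [LogisticDiscriminator(dim=2) for _ in range(5)]
for _ in range(200):
    for m in ensemble:
        # Each member trains on its own bootstrap sample, keeping the
        # ensemble diverse (an assumption; other diversification works too).
        ci = rng.integers(0, 256, 64)
        ai = rng.integers(0, 256, 64)
        x = np.vstack([controller_states[ci], agent_states[ai]])
        y = np.concatenate([np.ones(64), np.zeros(64)])
        m.update(x, y)

# A state in the controller-visited region earns a larger bonus than one far
# from it, steering exploration toward the promising area the abstract describes.
near = exploration_bonus(ensemble, np.array([1.0, 1.0]))
far = exploration_bonus(ensemble, np.array([-1.0, -1.0]))
```

In use, the bonus would be added to the environment reward at each step, so the control objective maximizes the expected total of reward plus bonus, as in the abstract's final step.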

Inventors:
BOUGIE NICOLAS BRUNO ALEXANDRE (JP)
Application Number:
PCT/JP2022/048600
Publication Date:
September 14, 2023
Filing Date:
December 28, 2022
Assignee:
AIST (JP)
International Classes:
G06N3/0895; G05B13/02; G05B23/02; G05D13/60; G06F18/214; G06N3/045; G06N3/0475; G06N3/082; G06N3/092; G06N3/094; G06N20/20
Other References:
XU, Pei; KARAMOUZAS, Ioannis: "A GAN-Like Approach for Physics-Based Imitation Learning and Interactive Control", Proceedings of the ACM on Computer Graphics and Interactive Techniques, vol. 4, no. 3, 27 September 2021, pages 1-22, XP058627772, DOI: 10.1145/3480148
ZHA, Daochen; LAI, Kwei-Herng; ZHOU, Kaixiong; HU, Xia: "Experience Replay Optimization", arXiv.org, Cornell University Library, 20 June 2019, XP081378294
TSURUMINE, Yoshihisa; CUI, Yunduan; YAMAZAKI, Kimitoshi; MATSUBARA, Takamitsu: "Generative Adversarial Imitation Learning with Deep P-Network for Robotic Cloth Manipulation", 2019 IEEE-RAS 19th International Conference on Humanoid Robots (Humanoids), IEEE, 15 October 2019, pages 274-280, XP033740706, DOI: 10.1109/Humanoids43949.2019.9034991
Attorney, Agent or Firm:
NISHIZAWA Kazuyoshi et al. (JP)