Title:
NEURAL ARCHITECTURE SEARCH WITH IMITATION LEARNING
Document Type and Number:
WIPO Patent Application WO/2021/226709
Kind Code:
A1
Abstract:
Automated machine learning (AML) refers to a class of techniques that, given a problem, can find an optimal set of model architectures, properties, and parameters. An algorithm is provided for guided neural architecture search with imitation learning, termed Guided-NASIL, that takes advantage of imitation learning and trains a learning agent from a set of expert designed structures instead of starting from scratch.

Inventors:
FARD FARZANEH SHEIKHNEZHAD (CA)
TOMAR VIKRANT SINGH (CA)
Application Number:
PCT/CA2021/050647
Publication Date:
November 18, 2021
Filing Date:
May 10, 2021
Assignee:
FLUENT AI INC (CA)
International Classes:
G06N3/08; G06N3/04
Other References:
GUO MINGHAO, ZHONG ZHAO, WU WEI, LIN DAHUA, YAN JUNJIE: "IRLAS: Inverse Reinforcement Learning for Architecture Search", 2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), IEEE, 1 June 2019 (2019-06-01) - 20 June 2019 (2019-06-20), pages 9021 - 9029, XP055872137, ISBN: 978-1-7281-3293-8, DOI: 10.1109/CVPR.2019.00923
HAN CAI; TIANYAO CHEN; WEINAN ZHANG; YONG YU; JUN WANG: "Efficient Architecture Search by Network Transformation", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 16 July 2017 (2017-07-16), XP081282155
Attorney, Agent or Firm:
SPRIGINGS, Mark et al. (CA)
Claims:
WHAT IS CLAIMED IS:

1 . A method of automatically designing a set of neural architectures for a given task comprising: training a learning agent from a set of expert designed neural architectures using imitation learning; outputting a learning option using the trained learning agent, the learning agent using a current state of a neural architecture to output the learning option; electing an option from a plurality of options including the output learning option to create a next state of a neural architecture based on the current state of the neural architecture; adding the option to the current state of the neural architecture to create the next state of the neural architecture; and evaluating the next state of the neural architecture.

2. The method of claim 1, wherein a network dictionary stores information of trained neural architectures that have been evaluated.

3. The method of claim 1, wherein the option comprises a neural architecture layer and acceptable neural architectures are built by stacking up suitable layers that provide a valid structure for any given task.

4. The method of claim 3, wherein discovered neural architectures meet a computational footprint restriction.

5. The method of claim 1, wherein the plurality of options comprise a set of probing options including learning and planning options that can be added to the current state of the neural architecture as a selected layer or cell.

6. The method of claim 5, wherein the planning options are found based on past experiences and visited neural architectures.

7. The method of claim 5, wherein the planning options comprise one or more of: a random layer; a layer that has obtained highest accuracy at a certain depth; a layer that has obtained highest relative improvement at the certain depth; and a layer that has obtained highest relative improvement overall.

8. The method of claim 1, wherein the learning option is predicted by a trained RL-agent based on past experiences and visited neural architectures.

9. The method of claim 8, wherein the agent uses an actor-critic method, such as soft actor-critic or DDPG, for learning and predicting appropriate layers at each stage.

10. The method of claim 8, wherein the agent uses an actor only or critic only learning method.

11. The method of claim 9, wherein Hindsight experience replay (HER) memory is implemented to make it more sample efficient.

12. The method of claim 1, wherein knowledge of an expert can be conveyed to the RL-agent via imitation learning.

13. The method of claim 12, wherein expert designed structures are provided to the agent as good examples.

14. The method of claim 12, wherein by demonstrating expert designed structures, the RL-agent learns values of sequences of layers given as expert designed neural architectures.

15. The method of claim 1, wherein transfer learning is used to accelerate training time of new neural structures via reusing a subset of weights of already trained structures.

16. A non transitory computer readable medium having embedded thereon instructions, which when executed by a computing device configure the computing device to provide a method of automatically designing a set of neural architectures for a given task, the method comprising: training a learning agent from a set of expert designed neural architectures using imitation learning; outputting a learning option using the trained learning agent, the learning agent using a current state of a neural architecture to output the learning option; electing an option from a plurality of options including the output learning option to create a next state of a neural architecture based on the current state of the neural architecture; adding the option to the current state of the neural architecture to create the next state of the neural architecture; and evaluating the next state of the neural architecture.

17. The non transitory computer readable medium of claim 16, wherein a network dictionary stores information of trained neural architectures that have been evaluated.

18. The non transitory computer readable medium of claim 16, wherein the option comprises a neural architecture layer and acceptable neural architectures are built by stacking up suitable layers that provide a valid structure for any given task.

19. The non transitory computer readable medium of claim 18, wherein discovered neural architectures meet a computational footprint restriction.

20. The non transitory computer readable medium of claim 16, wherein the plurality of options comprise a set of probing options including learning and planning options that can be added to the current state of the neural architecture as a selected layer or cell.

21. The non transitory computer readable medium of claim 20, wherein the planning options are found based on past experiences and visited neural architectures.

22. The non transitory computer readable medium of claim 20, wherein the planning options comprise one or more of: a random layer; a layer that has obtained highest accuracy at a certain depth; a layer that has obtained highest relative improvement at the certain depth; and a layer that has obtained highest relative improvement overall.

23. The non transitory computer readable medium of claim 16, wherein the learning option is predicted by a trained RL-agent based on past experiences and visited neural architectures.

24. The non transitory computer readable medium of claim 23, wherein the agent uses an actor-critic method, such as soft actor-critic or DDPG, for learning and predicting appropriate layers at each stage.

25. The non transitory computer readable medium of claim 23, wherein the agent uses an actor only or critic only learning method.

26. The non transitory computer readable medium of claim 24, wherein Hindsight experience replay (HER) memory is implemented to make it more sample efficient.

27. The non transitory computer readable medium of claim 16, wherein knowledge of an expert can be conveyed to the RL-agent via imitation learning.

28. The non transitory computer readable medium of claim 27, wherein expert designed structures are provided to the agent as good examples.

29. The non transitory computer readable medium of claim 27, wherein by demonstrating expert designed structures, the RL-agent learns values of sequences of layers given as expert designed neural architectures.

30. The non transitory computer readable medium of claim 16, wherein transfer learning is used to accelerate training time of new neural structures via reusing a subset of weights of already trained structures.

31. A computing device comprising: a processor for executing instructions; and a memory for storing instructions, which when executed by the processor configure the computing device to provide a method of automatically designing a set of neural architectures for a given task, the method comprising: training a learning agent from a set of expert designed neural architectures using imitation learning; outputting a learning option using the trained learning agent, the learning agent using a current state of a neural architecture to output the learning option; electing an option from a plurality of options including the output learning option to create a next state of a neural architecture based on the current state of the neural architecture; adding the option to the current state of the neural architecture to create the next state of the neural architecture; and evaluating the next state of the neural architecture.

32. The computing device of claim 31, wherein a network dictionary stores information of trained neural architectures that have been evaluated.

33. The computing device of claim 31, wherein the option comprises a neural architecture layer and acceptable neural architectures are built by stacking up suitable layers that provide a valid structure for any given task.

34. The computing device of claim 33, wherein discovered neural architectures meet a computational footprint restriction.

35. The computing device of claim 31, wherein the plurality of options comprise a set of probing options including learning and planning options that can be added to the current state of the neural architecture as a selected layer or cell.

36. The computing device of claim 35, wherein the planning options are found based on past experiences and visited neural architectures.

37. The computing device of claim 35, wherein the planning options comprise one or more of: a random layer; a layer that has obtained highest accuracy at a certain depth; a layer that has obtained highest relative improvement at the certain depth; and a layer that has obtained highest relative improvement overall.

38. The computing device of claim 31, wherein the learning option is predicted by a trained RL-agent based on past experiences and visited neural architectures.

39. The computing device of claim 38, wherein the agent uses an actor-critic method, such as soft actor-critic or DDPG, for learning and predicting appropriate layers at each stage.

40. The computing device of claim 38, wherein the agent uses an actor only or critic only learning method.

41. The computing device of claim 39, wherein Hindsight experience replay (HER) memory is implemented to make it more sample efficient.

42. The computing device of claim 31, wherein knowledge of an expert can be conveyed to the RL-agent via imitation learning.

43. The computing device of claim 42, wherein expert designed structures are provided to the agent as good examples.

44. The computing device of claim 42, wherein by demonstrating expert designed structures, the RL-agent learns values of sequences of layers given as expert designed neural architectures.

45. The computing device of claim 31, wherein transfer learning is used to accelerate training time of new neural structures via reusing a subset of weights of already trained structures.

Description:
NEURAL ARCHITECTURE SEARCH WITH IMITATION LEARNING

RELATED APPLICATION

[0001] The current application claims priority to US Provisional Application 63/022,854 entitled "NEURAL ARCHITECTURE SEARCH WITH IMITATION LEARNING," filed May 11, 2020, which is incorporated herein by reference in its entirety for all purposes.

FIELD OF THE INVENTION

[0002] The present disclosure relates to methods and systems to automate neural network architecture designing and hyper-parameter tuning to prepare suitable solutions for a given machine learning problem. These methods are referred to as automated machine learning (AML).

BACKGROUND OF THE INVENTION

[0003] Designing an optimal machine learning solution is a challenging task. It takes a considerable amount of time and effort to find the optimal set of hyper-parameters. Moreover, machine learning solutions are often data dependent, and a solution designed for a particular dataset may not be proper for another similar dataset. Demand for using machine learning solutions on small footprint and low power devices has also drawn a lot of attention in recent years. These factors have resulted in an increased interest in methods for automatic search and hyper-parameter optimization, including tuning and compression, for machine learning algorithms. Collectively, these methods are referred to as automated machine learning (AML). An ideal AML method should be able to quickly find a number of practical structures or models under a predefined set of constraints such that a large majority of the discovered models give high accuracy on a given task; we refer to these as "desirable AML characteristics" in the rest of this work.

[0004] Neural architecture search (NAS) is one of the successful methods that uses an auto-regressive recurrent neural network (RNN) and the REINFORCE algorithm to produce a sequence of layers in a network. Improvements to NAS have been suggested by changing the search space, search strategy, or performance estimation strategy.

[0005] NASNet changed the search space to design better building blocks, with MobileNet and Xception as good examples of architectures built by stacking particular building blocks. Searching for such cells is faster since the search space is smaller; however, not all neural ML solutions follow such a block based structure.

[0006] Progressive neural architecture search (PNAS) uses a surrogate function that is trained to predict the performance of a structure without training it, which makes the learning process faster. However, training the surrogate function is itself another problem that needs to be solved.

[0007] Differentiable architecture search (DARTS) considered a large continuous search space with all different types of operation between the nodes. Using such a continuous search space, the architecture can be optimized with respect to its validation set performance by gradient descent. Some other methods also suggested starting with an over-parameterized structure, removing unimportant parameters while learning, and pruning unimportant paths later. A fast solution has been proposed that designs an over-parameterized network and searches via a single-path search space for the best subset of weights over the over-parameterized kernel, called the "super-kernel". This solution is fast but it is limited to using only three squared kernel sizes (e.g. 1x1, 3x3, and 5x5) in a certain structure (e.g. MobileNets).

[0008] Energy-aware pruning, NetAdapt, AMC and gate decorator are some of the successful solutions that enable machine learning practitioners to compress their neural architectures through kernel pruning. Therefore, some AML solutions first focus on accuracy and then apply compression and pruning techniques to compress discovered structures by up to 50 percent without accuracy loss.

[0009] So far, different search strategies have been used in AML solutions, including reinforcement learning (RL). Recent works in RL have shown significant improvements in solving a wide range of problems; however, as best understood by the current inventors, these solutions have not been tried on AML. For example, soft actor-critic (SAC) is an actor-critic structure that achieves significant improvement on a wide range of problems and is currently the state of the art for various problems.

[0010] To accelerate the learning process of an RL agent, providing good examples is helpful. One way of providing useful examples is hindsight experience replay (HER), which overcomes the sparseness of the reward function. HER considers every attempt a successful attempt by assuming the current state of the agent is the target location. This technique provides more valuable information to the agent and makes generalization faster.

[0011] Another useful technique to provide good examples is imitation learning, or learning by demonstration. Imitation learning can happen by showing good trajectories or a good policy. Through demonstration, the knowledge of the expert can be conveyed to the agent, which leads to faster learning. Through imitation learning, the agent learns the value of the sequence of actions provided by the expert. For instance, normalized actor-critic (NAC) learns from imperfect demonstration. NAC first learns the initial policy through demonstrations and then refines it in the environment.

[0012] Recently, SoRB has been introduced, which uses graph search on the replay buffer with a SAC agent; combining planning with learning in this way leads to better performance. In SoRB, the search generates a sequence of sub-goals and quickly obtains a proper generalization over a sparse reward function. However, depending on the problem, defining a set of sub-goals is very difficult and sometimes impractical.

[0013] Additional, alternative and/or improved techniques to automate neural network architecture design are desirable.

BRIEF DESCRIPTION OF THE DRAWINGS

[0014] For a more complete understanding of the invention, reference is made to the following description and accompanying drawings, in which:

[0015] FIG. 1 illustrates an environment for Guided-NASIL;

[0016] FIG. 2 illustrates a general overview of Guided-NASIL;

[0017] FIG. 3 shows how probing options contribute to making a better decision;

[0018] FIG. 4 is an example of an actor-critic agent;

[0019] FIG. 5 is an example of creating a neural architecture based on a taken action and a previous state; and

[0020] FIG. 6 depicts a method of automatically designing a set of neural architectures for a given task.

DETAILED DESCRIPTION

[0021] In accordance with the present disclosure there is provided a method of automatically designing a set of neural architectures for a given task comprising: training a learning agent from a set of expert designed neural architectures using imitation learning; outputting a learning option using the trained learning agent, the learning agent using a current state of a neural architecture to output the learning option; electing an option from a plurality of options including the output learning option to create a next state of a neural architecture based on the current state of the neural architecture; adding the option to the current state of the neural architecture to create the next state of the neural architecture; and evaluating the next state of the neural architecture.

[0022] In a further embodiment of the method, a network dictionary stores information of trained neural architectures that have been evaluated.

[0023] In a further embodiment of the method, the option comprises a neural architecture layer and acceptable neural architectures are built by stacking up suitable layers that provide a valid structure for any given task.

[0024] In a further embodiment of the method, discovered neural architectures meet a computational footprint restriction.

[0025] In a further embodiment of the method, the plurality of options comprise a set of probing options including learning and planning options that can be added to the current state of the neural architecture as a selected layer or cell.

[0026] In a further embodiment of the method, the planning options are found based on past experiences and visited neural architectures.

[0027] In a further embodiment of the method, the planning options comprise one or more of: a random layer; a layer that has obtained highest accuracy at a certain depth; a layer that has obtained highest relative improvement at the certain depth; and a layer that has obtained highest relative improvement overall.

[0028] In a further embodiment of the method, the learning option is predicted by a trained RL-agent based on past experiences and visited neural architectures.

[0029] In a further embodiment of the method, the agent uses an actor-critic method, such as soft actor-critic or DDPG, for learning and predicting appropriate layers at each stage.

[0030] In a further embodiment of the method, the agent uses an actor only or critic only learning method.

[0031] In a further embodiment of the method, Hindsight experience replay (HER) memory is implemented to make it more sample efficient.

[0032] In a further embodiment of the method, knowledge of an expert can be conveyed to the RL-agent via imitation learning.

[0033] In a further embodiment of the method, expert designed structures are provided to the agent as good examples.

[0034] In a further embodiment of the method, by demonstrating expert designed structures, the RL-agent learns values of sequences of layers given as expert designed neural architectures.

[0035] In a further embodiment of the method, transfer learning is used to accelerate training time of new neural structures via reusing a subset of weights of already trained structures.

[0036] In accordance with the present disclosure there is further provided a non transitory computer readable medium having embedded thereon instructions, which when executed by a computing device configure the computing device to provide a method of automatically designing a set of neural architectures for a given task, the method comprising: training a learning agent from a set of expert designed neural architectures using imitation learning; outputting a learning option using the trained learning agent, the learning agent using a current state of a neural architecture to output the learning option; electing an option from a plurality of options including the output learning option to create a next state of a neural architecture based on the current state of the neural architecture; adding the option to the current state of the neural architecture to create the next state of the neural architecture; and evaluating the next state of the neural architecture.

[0037] In a further embodiment of the non transitory computer readable medium, a network dictionary stores information of trained neural architectures that have been evaluated.

[0038] In a further embodiment of the non transitory computer readable medium, the option comprises a neural architecture layer and acceptable neural architectures are built by stacking up suitable layers that provide a valid structure for any given task.

[0039] In a further embodiment of the non transitory computer readable medium, discovered neural architectures meet a computational footprint restriction.

[0040] In a further embodiment of the non transitory computer readable medium, the plurality of options comprise a set of probing options including learning and planning options that can be added to the current state of the neural architecture as a selected layer or cell.

[0041] In a further embodiment of the non transitory computer readable medium, the planning options are found based on past experiences and visited neural architectures.

[0042] In a further embodiment of the non transitory computer readable medium, the planning options comprise one or more of: a random layer; a layer that has obtained highest accuracy at a certain depth; a layer that has obtained highest relative improvement at the certain depth; and a layer that has obtained highest relative improvement overall.

[0043] In a further embodiment of the non transitory computer readable medium, the learning option is predicted by a trained RL-agent based on past experiences and visited neural architectures.

[0044] In a further embodiment of the non transitory computer readable medium, the agent uses an actor-critic method, such as soft actor-critic or DDPG, for learning and predicting appropriate layers at each stage.

[0045] In a further embodiment of the non transitory computer readable medium, the agent uses an actor only or critic only learning method.

[0046] In a further embodiment of the non transitory computer readable medium, Hindsight experience replay (HER) memory is implemented to make it more sample efficient.

[0047] In a further embodiment of the non transitory computer readable medium, knowledge of an expert can be conveyed to the RL-agent via imitation learning.

[0048] In a further embodiment of the non transitory computer readable medium, expert designed structures are provided to the agent as good examples.

[0049] In a further embodiment of the non transitory computer readable medium, by demonstrating expert designed structures, the RL-agent learns values of sequences of layers given as expert designed neural architectures.

[0050] In a further embodiment of the non transitory computer readable medium, transfer learning is used to accelerate training time of new neural structures via reusing a subset of weights of already trained structures.

[0051] In accordance with the present disclosure there is further provided a computing device comprising: a processor for executing instructions; and a memory for storing instructions, which when executed by the processor configure the computing device to provide a method of automatically designing a set of neural architectures for a given task, the method comprising: training a learning agent from a set of expert designed neural architectures using imitation learning; outputting a learning option using the trained learning agent, the learning agent using a current state of a neural architecture to output the learning option; electing an option from a plurality of options including the output learning option to create a next state of a neural architecture based on the current state of the neural architecture; adding the option to the current state of the neural architecture to create the next state of the neural architecture; and evaluating the next state of the neural architecture.

[0052] In a further embodiment of the computing device, a network dictionary stores information of trained neural architectures that have been evaluated.

[0053] In a further embodiment of the computing device, the option comprises a neural architecture layer and acceptable neural architectures are built by stacking up suitable layers that provide a valid structure for any given task.

[0054] In a further embodiment of the computing device, discovered neural architectures meet a computational footprint restriction.

[0055] In a further embodiment of the computing device, the plurality of options comprise a set of probing options including learning and planning options that can be added to the current state of the neural architecture as a selected layer or cell.

[0056] In a further embodiment of the computing device, the planning options are found based on past experiences and visited neural architectures.

[0057] In a further embodiment of the computing device, the planning options comprise one or more of: a random layer; a layer that has obtained highest accuracy at a certain depth; a layer that has obtained highest relative improvement at the certain depth; and a layer that has obtained highest relative improvement overall.

[0058] In a further embodiment of the computing device, the learning option is predicted by a trained RL-agent based on past experiences and visited neural architectures.

[0059] In a further embodiment of the computing device, the agent uses an actor-critic method, such as soft actor-critic or DDPG, for learning and predicting appropriate layers at each stage.

[0060] In a further embodiment of the computing device, the agent uses an actor only or critic only learning method.

[0061] In a further embodiment of the computing device, Hindsight experience replay (HER) memory is implemented to make it more sample efficient.

[0062] In a further embodiment of the computing device, knowledge of an expert can be conveyed to the RL-agent via imitation learning.

[0063] In a further embodiment of the computing device, expert designed structures are provided to the agent as good examples.

[0064] In a further embodiment of the computing device, by demonstrating expert designed structures, the RL-agent learns values of sequences of layers given as expert designed neural architectures.

[0065] In a further embodiment of the computing device, transfer learning is used to accelerate training time of new neural structures via reusing a subset of weights of already trained structures.

[0066] It is desirable to have an AML method capable of designing a plurality of suitable neural networks, where each neural network is a different combination of layers. Each of the layers is a basic neural network component that comprises a set of connected neurons. Each layer may be of a particular type, including for example convolutional, recurrent, residual, etc. Most of the neural networks designed by the AML method should perform with high accuracy on a target task, as well as respect the predefined constraints such as power and computation limitations. The AML method described further below is able to design a plurality of suitable models in a shorter time compared to traditional design techniques.

[0067] There are at least two reasons why previous AML techniques were time consuming. One is the need to train each structure produced by the AML in order to evaluate how good it is. The second reason is the size of the search space. Some techniques, such as weight sharing and training a surrogate function as an estimator, have been introduced that accelerate training and evaluation time of produced neural architectures. Other previous methods have tried to make the search space smaller by searching for proper building blocks and stacking them up to make a complete neural network instead of searching for a complete neural network. Many others, however, tried different search strategies to accelerate the process. The approach described herein, referred to as Guided-NASIL, overcomes these challenges and accelerates finding suitable neural networks.

[0068] A first way the Guided-NASIL approach may speed up the search and design process is by taking advantage of imitation learning. The use of imitation learning helps the AML solution to find better architectures using guidance from one or more previous expert designed structures. Usually, each ML solution depends on specific data; however, a solution for one ML problem could also obtain acceptable performance on similar data or a similar ML problem. Therefore, learning from available solutions can be beneficial. Guided-NASIL uses imitation learning to convey the expert knowledge to a learning agent. In this manner, contrary to most AML solutions suggested in the literature, Guided-NASIL does not try to design a neural network from scratch; instead it can take advantage of well-known expert designed neural networks on similar problems. Some examples of expert designed structures include MobileNet, Xception, VGG-Net, etc. Without imitation learning, the learning agent needs a lot of experience to be able to generalize well. Note that conveying expert knowledge can be a difficult task. Therefore, a relabeling framework may be used in Guided-NASIL for reproducing and training the given expert designed neural networks by the learning agent.

[0069] The Guided-NASIL approach may also accelerate the learning process of the learning agent by using an actor-critic algorithm with sample efficient memory replays. Most of the related works in the AML field use some sort of reinforcement learning (RL) algorithm; and typically, these utilize either actor-only or critic-only paradigms. In contrast, actor-critic methods combine the strengths of the two strategies and have shown better and more consistent performance in other RL applications. Guided-NASIL may also use an actor-critic structure as an RL-agent for AML. In one implementation, Guided-NASIL may use, for example, soft actor-critic (SAC), which is an actor-critic structure. In another implementation, deep deterministic policy gradient (DDPG) may be used. Optionally, Guided-NASIL can also make use of hindsight experience replay (HER), which provides a more sample efficient solution that further speeds up the learning process by the RL-agent. Typically, an RL agent takes a series of actions and when it reaches the target or end state a delayed reward is provided. Therefore, it usually takes more time for an agent to learn the value of each action. The agent takes an action and transfers to the next state. In HER, the next state is assumed to be the target; therefore the agent receives an immediate reward for the taken action, which makes HER more sample efficient. Without actor-critic structures and HER, the learning agent needs more time to obtain enough good experiences.

[0070] The Guided-NASIL approach may also use transfer learning for quicker training and evaluation of the agent designed neural networks. An auxiliary memory, termed a "network dictionary", is designed and used to keep track of all relevant information of visited and trained neural networks. Therefore, it can be used to accelerate training and evaluation of newly designed neural networks when a subset of them has been visited before. Since the neural networks are designed incrementally by adding layers, the new neural network benefits from transfer learning and takes less time for training. These smaller networks that are designed in each episode may be referred to as "intermediate networks".

[0071] The Guided-NASIL approach may also expedite the generalization process of the reward function by the RL-agent via planning options. Guided-NASIL may use probing options, which are a combination of learning and planning options, to provide a set of valid options at each step for the Guided-NASIL agent to choose from in order to create the next layer. The planning options are extracted from the network dictionary that stores all past experiences, while the learning option is provided by the RL-agent. The learning option is learned by the learning agent, while planning options are derived via search on the network dictionary. Since the RL-agent is trained using all experiences, it also benefits from these planning options and will train on these experiences too, which helps the RL-agent to generalize better.

[0072] FIG. 1 illustrates an environment for Guided-NASIL. The environment includes one or more computing devices 102 that comprise one or more processing units 104 capable of executing instructions which may be stored in one or more memory units 106. The computing devices 102 may also include non-volatile storage 108 and one or more input/output (I/O) interfaces 110 for connecting one or more periphery devices to the computing devices 102. The processing units 104 execute instructions which configure the computing devices 102 to provide various functionality 112, including Guided-NASIL functionality 114. As described in further detail below, the Guided-NASIL functionality 114 can generate one or more neural networks 116, which may be used by one or more other devices for different applications.

[0073] The environment is further depicted as including an application device 118. The application device 118 may comprise one or more processing units 120 capable of executing instructions which may be stored in one or more memory units 122. The application device 118 may also include non-volatile storage 124 and one or more input/output (I/O) interfaces 126 for connecting one or more periphery devices to the application device 118. The processing units 120 execute instructions which configure the application device 118 to provide various functionality 128, including application functionality 130. The application functionality 130 uses the neural network 132 generated by the Guided-NASIL functionality 114. The application functionality 130 may further train and use the neural network 132. The application functionality may cover a wide range of functionality, from audio or voice related applications, including vocal user interfaces, translation, transcription, etc. Additionally or alternatively, the various applications may include video or image processing applications, including for example detection, classification, recognition, tagging, etc. Further still, the various applications may include natural language processing, text applications, etc. It will be understood that the above examples of applications are intended only to be illustrative and other applications are possible.

[0074] FIG. 2 illustrates a general conceptual overview of Guided-NASIL. The arrangement in FIG. 2 shows an exemplary arrangement of a Guided-NASIL agent 201. The agent 201 interacts with the environment 202 by taking an action 203 and receiving a reward 204 from the environment 202. The Guided-NASIL agent 201 designs a neural structure by adding suitable layers one by one incrementally. The Guided-NASIL agent 201 also takes advantage of imitation learning 206 by training on expert designed architectures 205. The Guided-NASIL agent 201 also keeps track of all visited structures in the network dictionary 207, which is used to design planning options as well as for transfer learning. The action 203 may be considered as the new layer, or the set of parameters defining the new layer, to be added to the neural network, and the reward 204 may be considered as the accuracy of the new neural network on test data or another measurement of how well the new neural network performs. In order to provide a better search strategy, the Guided-NASIL agent may also randomly choose to take one or multiple actions at every step. When Guided-NASIL takes multiple actions together, an assumption is made that the "intermediate actions" perform well and the system is not evaluated on them. For example, if in an implementation Guided-NASIL adds two new layers at once instead of adding them one by one, the intermediate layer, l_i, is chosen based on previous learning without explicitly evaluating the resultant models. The resultant models are evaluated after adding the two layers together. In such an implementation, l_i would be the learning option predicted by the RL agent while l_i+1 would be the best option among all probing options, including the learning option and planning options. The environment 202 may comprise the training data or testing data used in evaluating the neural networks.
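
As a rough illustration of the interaction described above, the following Python sketch shows one way the incremental design loop could be organized. It is a minimal sketch under assumed interfaces; all class, method and function names (run_episode, select_option, add_layer, evaluate, observe) are illustrative placeholders and not taken from the disclosure.

    # Minimal sketch of the FIG. 2 interaction loop: the agent adds suitable
    # layers one by one and receives the accuracy of the resulting network as
    # a reward.  Names and interfaces are assumptions.

    def run_episode(agent, environment, max_depth=10):
        state = environment.reset()                     # start from an empty architecture
        for depth in range(max_depth):
            option = agent.select_option(state)         # a learning or planning option (a layer)
            next_state = environment.add_layer(state, option)
            reward = environment.evaluate(next_state)   # e.g. accuracy on held-out data
            agent.observe(state, option, reward, next_state)
            state = next_state
        return state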

[0075] FIG. 3 illustrates the arrangement of the Guided-NASIL agent and shows the process of designing a set of probing options at each step. Each of the options may be considered as a layer of a neural network. These options include a learning option 301 and a plurality of planning options 302. The learning option 301 is generated by the reinforcement learning (RL) agent 303. To produce the learning option 301, Guided-NASIL trains the RL-agent 303. Each of the plurality of the planning options 302 may be generated based on the Guided-NASIL agent's past experiences stored in the network dictionary 207. For example, one or more of the planning options 302 may include a randomly taken action, an action that provided the highest accuracy among discovered neural networks with the same depth as the current neural network, an action that provided the highest relative improvement among discovered neural networks with the same depth as the current neural network, and an action that obtained the highest accuracy among all discovered neural networks up to that point in time.

[0076] The basic reinforcement learning problem may be modeled as a finite discounted Markov decision process (MDP) by a tuple M = (S, A, P, R), where S is a set of states, A is a set of actions, P : S × A × S → [0, 1] is a transition probability distribution, γ ∈ [0, 1] is a discount factor, and R is a reward function that is often assumed to be a deterministic function of the states and actions. A policy π defines the action to be taken at a given state.

[0077] In Guided-NASIL, the state space, which includes the current state of the neural network 306 and the next state of the neural network 308, is the set of any feasible neural network that can be built by stacking various layers together. The action space, A 203, includes all types of layers and cells available. A cell is a combination of a plurality of individual layers. Accordingly, the action may be to stack one or more layers together on the current neural network. The reward, R 204, is defined as the accuracy of the model on a validation set. The RL-agent will learn the reward function r_t = R(s_t, a_t), that is, the value of taking each action a_t ∈ A at any state s_t ∈ S, which leads to learning an optimum policy π*(s_t). Here s_t, a_t and r_t denote the state, action and reward at time step t, and the optimal policy is an optimal sequence of actions to reach a desired target or maximize the total reward.

[0078] The search space and transition model in AML problems are very complex. Adding a layer to a current structure will lead to a new structure which can be one of these cases: 1) an impractical structure, e.g., one with a negative receptive field; 2) an unacceptable structure, e.g., one that does not meet the device specifications; 3) an impractical structure that could become practical by changing only one parameter, e.g., an invalid skip connection; and 4) a practical structure, i.e., a structure that meets all the design requirements and is also a valid structure.
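
The four cases above could be distinguished by a simple classification routine such as the following sketch; the helper predicates (has_negative_receptive_field, fits_device, has_invalid_skip_connection) are assumptions introduced only for illustration.

    # Illustrative classification of a newly produced structure into the four
    # cases described above; the helper predicates are assumed to exist.

    def classify_structure(structure, device_spec):
        if has_negative_receptive_field(structure):
            return "impractical"       # case 1: cannot form a valid network
        if not fits_device(structure, device_spec):
            return "unacceptable"      # case 2: violates the device specifications
        if has_invalid_skip_connection(structure):
            return "repairable"        # case 3: becomes practical by changing one parameter
        return "practical"             # case 4: valid and meets all design requirements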

[0079] The RL-agent may use any RL algorithm; however, actor-critic methods are more favorable due to their strength compared to actor-only or critic-only methods. FIG. 4 demonstrates a general actor-critic structure with experience replay memory. Guided-NASIL components are discussed further below with regard to four processes: 1) learning process, 2) imitation learning, 3) planning process, and 4) transfer learning.

[0080] As depicted, the RL-agent provides a learning option 301, while the plurality of planning options 302 may be generated based on the network dictionary 207, which stores details of previous neural networks. A decision component 307 selects one of the options as the next action to add to the current state of the neural network. An experience replay memory 305 stores details of previous options, which may be used by the RL agent.

Learning process:

[0081] In one implementation, depicted in FIG. 4, the RL-agent 303 can use the deep deterministic policy gradient method (DDPG), which is an actor-critic technique, where the actor 401 learns the policy function to predict an optimal action at a current state while the critic 402 is used to evaluate the policy function via a temporal difference (TD) algorithm.

[0082] The goal of the critic 402 is to learn the reward function according to the current state 306 and the taken action 203. The critic is learned through temporal difference (TD) learning.

TD_t = r_t + γ Q(s_t+1, a_t+1) − Q(s_t, a_t)    (1)

Q(s_t, a_t) ← Q(s_t, a_t) + α_c · TD_t    (2)

where α_c is the learning rate of the critic network, r_t is the actual immediate reward received from the environment at time t, γ is a discount factor, and Q represents the estimation of the value of a state-action pair. DDPG uses target networks 406, 407 that help to avoid oscillation. In DDPG there are two target networks, one for the actor 406 and one for the critic 407. The actor target network 406 and critic target network 407 are copies of the main actor network 401 and main critic network 402, respectively. While the main networks 401, 402 are used for training and are modified at every step, the target networks 406, 407 are updated less frequently. However, for action prediction and reward estimation these target networks 406, 407 are used. The target networks may be updated using a hard update technique or a soft update technique. For hard updates, the target network is a copy of the main network after a certain number of steps. For example, after every 5 steps, the critic target network is updated with a copy of the critic main network and the actor target network is updated with a copy of the actor main network. For soft updates, the target network is updated gradually at every step with respect to the main network, for example: a' = τ·a + (1 − τ)·a', where a denotes the main network parameters, a' denotes the target network parameters and τ indicates the scale of the updating.
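
As a minimal sketch of the two update schemes just described, the hard and soft target updates might be written as follows, assuming the network parameters are held as name-to-array dictionaries (e.g. NumPy arrays); this is illustrative only.

    # Sketch of hard and soft target-network updates, assuming parameters are
    # stored as name -> array dictionaries (an assumption for illustration).

    def hard_update(target_params, main_params):
        # every few steps, the target becomes an exact copy of the main network
        for name, value in main_params.items():
            target_params[name] = value.copy()

    def soft_update(target_params, main_params, tau=0.005):
        # a' = tau * a + (1 - tau) * a', applied to every parameter at every step
        for name, value in main_params.items():
            target_params[name] = tau * value + (1.0 - tau) * target_params[name]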

[0083] The loss function here is defined as the error between the predicted reward 404 and the actual reward 204 received from the environment 202, where the new neural network is applied to testing data:

L_Q = (1/N) Σ_i (TD_i)^2    (3)

[0084] The actor network 401 (π) receives the current state 306 (s_t) and predicts the action 203 to be taken (a_t):

π(s_t; θ_π) = a_t    (4)

[0085] The actor 401 may be trained using the deterministic policy gradient method. Similar to the critic 402, another target network is also used for the actor 406. The main actor network 401 is updated at every step according to the deterministic policy gradient, θ_π ← θ_π + α_a ∇_a Q(s_t, a)|_a=π(s_t) ∇_θ_π π(s_t; θ_π), where α_a is the learning rate of the actor 401, while the actor target network 406 is updated less frequently using the hard or soft update described above. ε-greedy exploration is also used to let the agent explore the environment 202.
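
For concreteness, a generic DDPG update step is sketched below in PyTorch. This is not the specific implementation of the disclosure: the critic is assumed to take the state and action as two arguments, and terminal-state masking is omitted for brevity.

    import torch
    import torch.nn.functional as F

    def ddpg_update(batch, actor, critic, actor_target, critic_target,
                    actor_opt, critic_opt, gamma=0.99):
        """One generic DDPG update on a sampled batch of (s, a, r, s_next) tensors."""
        s, a, r, s_next = batch

        # Critic: regress Q(s, a) toward the TD target r + gamma * Q'(s', pi'(s')).
        with torch.no_grad():
            td_target = r + gamma * critic_target(s_next, actor_target(s_next))
        critic_loss = F.mse_loss(critic(s, a), td_target)
        critic_opt.zero_grad()
        critic_loss.backward()
        critic_opt.step()

        # Actor: deterministic policy gradient, i.e. increase Q(s, pi(s)).
        actor_loss = -critic(s, actor(s)).mean()
        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()
        return critic_loss.item(), actor_loss.item()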

[0086] In another implementation, Guided-NASIL employs SAC, which is also an actor-critic structure; similarly, the actor 401 learns the policy function to predict an optimal action at a current state while the critic 402 is updated via the temporal difference (TD) algorithm. The main advantage of SAC compared to other actor-critic methods is that SAC considers a more general maximum entropy objective by augmenting the expected sum of rewards with the expected entropy of the policy at the current state:

J(π) = Σ_t E_(s_t, a_t)~ρ_π [ r(s_t, a_t) + α H(π(· | s_t)) ]

where α is the temperature parameter that adjusts the importance of the entropy term against the reward value, ρ_π is the probability of the trajectory distribution under policy π, and H(π(· | s_t)) is the entropy of the policy at the current state. The policy improvement step in SAC can be expressed in terms of a Kullback-Leibler divergence, consistent with this maximum entropy objective.
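
The following toy calculation, with made-up numbers, simply illustrates how the entropy-augmented objective combines the per-step reward with α times the policy entropy; discounting is omitted for simplicity.

    # Toy numeric illustration of the entropy-augmented return used by the
    # maximum entropy objective above.  The numbers are made up.

    rewards   = [0.60, 0.72, 0.78]     # e.g. validation accuracies of intermediate networks
    entropies = [1.10, 0.80, 0.40]     # H(pi(.|s_t)) at each visited state
    alpha = 0.2                        # temperature parameter

    soft_return = sum(r + alpha * h for r, h in zip(rewards, entropies))
    # soft_return = (0.60 + 0.22) + (0.72 + 0.16) + (0.78 + 0.08) = 2.56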

[0087] SAC updates the Q-function and policy as follows, where the partition function Z^π_old(s_t) normalizes the distribution:

Q(s_t, a_t) ← r(s_t, a_t) + γ E_s_t+1, a_t+1 [ Q(s_t+1, a_t+1) − α log π(a_t+1 | s_t+1) ]

π_new = argmin_π' D_KL( π'(· | s_t) || exp((1/α) Q^π_old(s_t, ·)) / Z^π_old(s_t) )

[0088] SAC has two soft Q-functions that are trained independently. SAC takes the minimum value of the two Q-functions to moderate the impact of bias on policy improvement. SAC also uses target networks 406, 407 to avoid oscillation.

[0089] In addition to the above two actor-critic methods, it is possible to use Guided-NASIL with any variant of actor-critic approaches. In fact, while it may be advantageous to use actor-critic algorithms, the benefits of Guided-NASIL are not limited to the method used. The other advantages of Guided-NASIL, such as imitation learning and planning, can also be used in combination with actor only (e.g. policy gradient) or critic only (e.g. Q-learning) methods; however, the effectiveness of these methods would be limited as compared to the aforementioned actor-critic methods.

[0090] The learning agent in Guided-NASIL, similar to many other RL agents, uses an experience replay memory 305 which provides a memory to store and reuse past experiences, that is, the details of previous layers. The replay memory holds tuples of the form (s_t, a_t, r_t, s_t+1), where t is the current time step and t+1 is the next time step. Usually in RL settings, the agent receives a delayed reward, which makes the learning more time consuming and challenging. Since the RL agent receives the reward 204 at the end of each episode, it needs many more experiences to be able to learn the value of each action taken. To overcome this issue, hindsight experience replay is introduced.

[0091] The learning agent in Guided-NASIL can also benefit from HER and obtain a more sample efficient solution. Sparse reward functions lead to long and time consuming training and need many more episodes to obtain valuable experiences. By using HER, instead of receiving the delayed reward at the end of the episode, the agent receives the reward at its current state. Guided-NASIL incrementally adds layers to a neural network in each episode, and employing HER helps the Guided-NASIL agent to learn the value of each layer in each state. In fact, the agent evaluates the currently produced trajectory, i.e., the neural network, and returns the reward that is associated with the taken action. The neural networks that are built at each step of the episode are called intermediate networks (e.g. FIG. 5 shows two intermediate networks 306, 308) and are stored in the replay memory 305 as well as the network dictionary 207. As depicted in FIG. 5, the action or layer(s) that are added may be added between an input layer (i.e. L-5 504) and an output layer (i.e. L-OUT 505).
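
One possible way to organize the replay memory and the HER-style immediate rewards described above is sketched below. The class, helper names and the assumption that architectures are represented as hashable tuples are illustrative, not the disclosed implementation.

    # Sketch: each added layer yields an (s_t, a_t, r_t, s_t+1) transition, and
    # the intermediate network is evaluated right away so the reward is not
    # delayed to the end of the episode.

    from collections import deque

    class ReplayMemory:
        """Stores (state, action, reward, next_state) transitions."""
        def __init__(self, capacity=10000):
            self.buffer = deque(maxlen=capacity)

        def add(self, state, action, reward, next_state):
            self.buffer.append((state, action, reward, next_state))

    def record_step(memory, network_dictionary, state, action, next_state, evaluate):
        reward = evaluate(next_state)                  # accuracy of the intermediate network
        memory.add(state, action, reward, next_state)
        network_dictionary[next_state] = {"accuracy": reward}   # architecture assumed hashable
        return reward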

Imitation learning:

[0092] Another innovation in Guided-NASIL is the use of imitation learning 206. Instead of starting the search for suitable neural networks from scratch, it learns the value of available neural networks designed by experts to bootstrap the learning paradigm. In Guided-NASIL, expert designed neural networks may be given in the form of a list of tuples. Each tuple has a dictionary of parameters that defines a specific layer or cell. Guided-NASIL, in a very first phase, takes the expert designed neural networks and trains them by incrementally adding layers to the neural networks (FIG. 5). These intermediate networks, along with the final structure, are stored in the network dictionary as well as the experience replay memory 305. Therefore, Guided-NASIL starts its learning process by updating the policy and Q-functions using good examples instead of random samples. In this manner, it takes advantage of available solutions, such as MobileNet, Xception, or any other expert designed structures, to fill the memory with useful examples.

[0093] To convey the expert's knowledge to the agent, the hand designed neural networks are provided as a list to the Guided-NASIL agent in the beginning phase. The agent takes the provided list and builds intermediate networks one by one, stacking up the layers from the list with the classifier layer, i.e. L-OUT 505 in FIG. 5, on top of each intermediate structure. After the imitation learning phase, Guided-NASIL may train and evaluate those neural networks that meet the device specifications, such as memory and/or computation limitations. Since this restriction is applied after imitation learning, the agent is still able to learn from expert designed structures that are larger than the specified device limitations for a given task; however, Guided-NASIL only discovers neural networks that meet the predefined limitations, such as power or memory limitations.
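
An illustrative sketch of this demonstration phase follows. The exact format of the expert layer list, the dictionary fields, and the helper names are assumptions made for illustration; the memory interface matches the replay-memory sketch given earlier.

    # Sketch of the imitation-learning phase: an expert designed network is
    # given as a list of layer-parameter dictionaries (format assumed) and is
    # rebuilt layer by layer, with the classifier layer placed on top of every
    # intermediate structure before evaluation.

    expert_network = [
        {"type": "conv", "filters": 32, "kernel": 3},
        {"type": "conv", "filters": 64, "kernel": 3},
        {"type": "dense", "units": 128},
    ]

    def demonstrate(expert_network, memory, network_dictionary, evaluate, classifier_layer):
        state = ()                                        # empty architecture
        for layer in expert_network:
            action = tuple(sorted(layer.items()))         # hashable layer descriptor
            next_state = state + (action,)
            reward = evaluate(next_state + (classifier_layer,))   # accuracy with L-OUT on top
            memory.add(state, action, reward, next_state)
            network_dictionary[next_state] = {"accuracy": reward}
            state = next_state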

Action-selection process:

[0094] Typically, an RL-agent trades off between exploration (taking a random action) and exploitation (predicting the action using a policy). While many RL-agents use an ε-greedy strategy, like DDPG, that adjusts this trade off, some others, like option-critic, deploy different policies to provide good options; however, defining and learning these options is difficult. The SAC agent manages this trade off via entropy maximization; however, it still takes time to generalize well. Therefore, search on the replay buffer (SoRB) combines learning with planning over the agent's past experiences.
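
For reference, the ε-greedy trade-off mentioned above can be expressed as the following minimal, purely illustrative sketch.

    # Minimal epsilon-greedy trade-off between exploration (random action) and
    # exploitation (action predicted by the policy).

    import random

    def epsilon_greedy(policy_action, action_space, epsilon=0.1):
        if random.random() < epsilon:
            return random.choice(action_space)   # explore: pick a random layer/cell
        return policy_action                     # exploit: use the layer predicted by the policy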

[0095] Inspired by these ideas, in Guided-NASIL a set of "probing options" is provided that can be extracted from past experiences (see FIG. 3). These probing options are a combination of learned and planned options that are designed through the planning and learning processes. For example, in one implementation these options are one of: 1) the action predicted by the learning process, 2) a random action, 3) an action that obtained the highest accuracy at the same depth, 4) an action that obtained the highest relative improvement at the same depth, and 5) an action that obtained the highest relative improvement overall.

[0096] Here, the first probing option is a learning based option 301, whereas the other four probing options, namely the planning options 302, require no learning. Depth refers to the number of layers that have been added to the current neural network that the agent is designing at the current step. Relative improvement refers to the amount of improvement in accuracy obtained by adding the last layer. The set of probing options helps the agent to find a better neural network greedily. This leads to better generalization of the reward function and therefore a higher speed of finding good neural networks. In fact, the options provide a beam search method that combines depth-first search with breadth-first search and gives the Guided-NASIL agent a better view over the trade-off between exploration and exploitation; additionally, the random action may be provided as one independent option. In another implementation a different set of planning options may be used.
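
A rough sketch of how the five probing options listed above might be assembled and one of them elected is given below. The dictionary fields ("depth", "accuracy", "improvement", "last_layer") and the agent/environment interfaces are assumptions for illustration, not the disclosed implementation.

    # Sketch of assembling probing options from the RL-agent and the network
    # dictionary, then greedily electing the best one.

    import random

    def probing_options(rl_agent, network_dictionary, state, depth, action_space):
        options = [rl_agent.predict(state),               # 1) learning option
                   random.choice(action_space)]           # 2) random action
        same_depth = [e for e in network_dictionary.values() if e["depth"] == depth]
        if same_depth:
            options.append(max(same_depth, key=lambda e: e["accuracy"])["last_layer"])     # 3)
            options.append(max(same_depth, key=lambda e: e["improvement"])["last_layer"])  # 4)
        if network_dictionary:
            best_overall = max(network_dictionary.values(), key=lambda e: e["improvement"])
            options.append(best_overall["last_layer"])     # 5) best relative improvement overall
        return options

    def elect_option(options, state, environment):
        # Greedy, beam-search-like step: evaluate each candidate next state and keep the best.
        return max(options, key=lambda o: environment.evaluate(environment.add_layer(state, o)))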

Transfer learning:

[0097] Besides arranging the probing options and mediating the expert knowledge, the network dictionary 207 is used for transfer learning. By adding a new layer on top of the old network (before the output layer), it is possible to transfer trained weights to the equivalent layers in the new network. In the next steps, these trained weights are frozen for a few iterations. After a few iterations, the whole neural network may be trained for a small number of iterations. It is possible to train every intermediate network for a fixed number of epochs or iterations. In order to use transfer learning for accelerating the training of new neural networks, all trained neural networks are stored along with their locations, accuracy, validation loss, and other information in the network dictionary which may be used in generating subsequent planning options.
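
A sketch of the network dictionary entries and the weight transfer described above follows. The stored fields and the Keras-style layer interface (get_weights, set_weights, trainable) are assumptions, and matching layers are assumed to have compatible shapes.

    # Sketch of the network dictionary and transfer of trained weights into a
    # newly designed network that extends an already trained parent network.

    network_dictionary = {}       # architecture signature -> stored information

    def store(architecture, model, accuracy, val_loss, location):
        network_dictionary[architecture] = {
            "weights": [layer.get_weights() for layer in model.layers],
            "accuracy": accuracy,
            "val_loss": val_loss,
            "location": location,
        }

    def transfer_from_parent(new_model, parent_architecture):
        """Reuse the parent's trained weights for the layers shared with the new model."""
        entry = network_dictionary.get(parent_architecture)
        if entry is None:
            return new_model
        parent_weights = entry["weights"]                 # per-layer weights, output layer last
        for layer, weights in zip(new_model.layers, parent_weights[:-1]):
            layer.set_weights(weights)                    # copy the already trained weights
            layer.trainable = False                       # freeze for the first few iterations
        return new_model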

[0098] The above has described an agent that is able to quickly generate and train neural networks for a given task. The agent uses previous neural network designs in order to speed the generation of new neural networks. Further, the agent may use a combination of learning and planning options to generate the next layers to use in generating the neural networks.

[0099] FIG. 6 depicts a method for automatically designing a set of neural architectures for a given task. The method 600 trains a learning agent from a set of expert designed neural architectures (602). The training may be done using imitation learning. The trained learning agent outputs a learning option based on a current state of a neural architecture (604). The method 600 further includes electing an option from a plurality of options including the learning option to create a next state of a neural architecture based on the current state of the neural architecture (606). The elected option is added to the current state of the neural architecture to create the next state of the neural architecture (608). Once the next state is created, it is evaluated (610). The evaluation may include evaluating the current state of the neural architecture against various criteria to determine if the created architecture is acceptable. The evaluation criteria may include performance criteria and/or computational resources criteria. If the created neural architecture meets or exceeds the evaluation criteria, the neural architecture may be used for the particular task.

[0100] Although certain components and steps have been described, it is contemplated that individually described components, as well as steps, may be combined together into fewer components or steps or the steps may be performed sequentially, non-sequentially or concurrently. Further, although described above as occurring in a particular order, one of ordinary skill in the art having regard to the current teachings will appreciate that the particular order of certain steps relative to other steps may be changed. Similarly, individual components or steps may be provided by a plurality of components or steps. One of ordinary skill in the art having regard to the current teachings will appreciate that the components and processes described herein may be provided by various combinations of software, firmware and/or hardware, other than the specific implementations described herein as illustrative examples.

[0101] The techniques of various embodiments may be implemented using software, hardware and/or a combination of software and hardware. Various embodiments are directed to apparatus, e.g. a node which may be used in a communications system or data storage system. Various embodiments are also directed to non-transitory machine, e.g., computer, readable medium, e.g., ROM, RAM, CDs, hard discs, etc., which include machine readable instructions for controlling a machine, e.g., processor to implement one, more or all of the steps of the described method or methods.

[0102] Some embodiments are directed to a computer program product comprising a computer-readable medium comprising code for causing a computer, or multiple computers, to implement various functions, steps, acts and/or operations, e.g. one or more or all of the steps described above. Depending on the embodiment, the computer program product can, and sometimes does, include different code for each step to be performed. Thus, the computer program product may, and sometimes does, include code for each individual step of a method, e.g., a method of operating a communications device, e.g., a wireless terminal or node. The code may be in the form of machine, e.g., computer, executable instructions stored on a computer-readable medium such as a RAM (Random Access Memory), ROM (Read Only Memory) or other type of storage device. In addition to being directed to a computer program product, some embodiments are directed to a processor configured to implement one or more of the various functions, steps, acts and/or operations of one or more methods described above. Accordingly, some embodiments are directed to a processor, e.g., CPU, configured to implement some or all of the steps of the method(s) described herein. The processor may be for use in, e.g., a communications device or other device described in the present application.

[0103] Numerous additional variations on the methods and apparatus of the various embodiments described above will be apparent to those skilled in the art in view of the above description. Such variations are to be considered within the scope.