Title:
ANOMALOUS DESIGN HANDLING IN SINGLE STEP DESIGN SPACE EXPLORATION
Document Type and Number:
WIPO Patent Application WO/2023/205820
Kind Code:
A2
Abstract:
A system and method for design space exploration where the design space includes a bounded set of dimensions includes anomalous design handling. An intelligent agent is trained to produce optimal design configurations for a design space using feedback provided by an evaluator to a neural network. The feedback is provided by a design evaluator which provides a scalar reward for each of the sample configurations and at least one anomalous design indicator. The scalar reward is calculated based on at least one objective design metric and a weight assigned to the objective metric. The anomalous design indicator identifies that one of the sample configurations in the batch comprises an anomalous design. Policy gradient algorithms in deep reinforcement learning evaluate the rewards in conjunction with an anomalous design reward created from the indicator to update the neural network.

Inventors:
MORTAZAVI MASOOD (US)
Application Number:
PCT/US2023/070272
Publication Date:
October 26, 2023
Filing Date:
July 14, 2023
Assignee:
FUTUREWEI TECHNOLOGIES INC (US)
Domestic Patent References:
WO2022147583A2 (2022-07-07)
Foreign References:
US 63/391,495 (filed July 22, 2022)
US 63/393,614 (filed July 29, 2022)
US 63/399,345 (filed August 19, 2022)
US 63/419,552 (filed October 26, 2022)
PCT/US2021/064634 (filed December 21, 2021)
Other References:
D. P. Kingma and J. Ba, "Adam: A Method for Stochastic Optimization," International Conference on Learning Representations (ICLR), 2015.
Y. S. Shao, S. L. Xi, V. Srinivasan, G.-Y. Wei, and D. Brooks, "Co-designing Accelerators and SoC Interfaces Using gem5-Aladdin," Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2016, pp. 1-12.
Attorney, Agent or Firm:
VIERRA, Larry E. (US)
Claims:
CLAIMS

What is claimed is:

1. A computer implemented method, comprising: outputting, by a neural network, a fixed number of probability distributions for a design having the fixed number of dimensions; generating a batch of sample configurations for the design based on the probability distributions, wherein each sample configuration of the batch of sample configurations corresponds to a different, complete configuration of the fixed number of dimensions; outputting the batch of sample configurations to an evaluator; receiving from the evaluator a batch of scalar rewards with at least a scalar reward for each of the sample configurations in the batch, each scalar reward calculated based on at least one objective design metric and a weight assigned to the objective metric, and an anomalous design indicator identifying that one of the sample configurations in the batch comprises an anomalous design; calculating an anomalous design reward based on the anomalous design indicator; and updating parameters of the neural network based on the scalar rewards and the anomalous design reward.

2. The computer implemented method of claim 1, further comprising: repeating the outputting the fixed number of probability distributions, the generating the batch of sample configurations, the outputting the batch of sample configurations, and the updating the parameters for a number of iterations, wherein the anomalous design indicator is received in one or more of the iterations.

3. The computer implemented method of any of claims 1-2, wherein the generating a sample design configuration comprises a single-step compound action.

4. The computer implemented method of any of claims 1-2, wherein the generating a sample design configuration comprises a multi-step reinforcement learning-based analysis.

5. The computer implemented method of any of claims 1-4, wherein the anomalous design is a physically impossible design based on objective design policy constraints.

6. The computer implemented method of any of claims 1-5, wherein the anomalous design reward discourages further generation of anomalous designs by the neural network.

7. The computer implemented method of any of claims 1-6, wherein objective metrics are combined with economic weights in order to arrive at a single weighted-sum, scalar reward for a given batch.

8. The computer implemented method of any of claims 1-7, further including calculating a running reward based on one or more statistics calculated for the batch of scalar rewards; calculating an advantage for each scalar reward in the batch of scalar rewards; and updating parameters of the neural network based on a statistical loss derived based on the advantage.

9. The computer implemented method of claim 8, wherein the anomalous design reward comprises the lesser of a batch average reward and a baseline running reward, minus a penalty margin determined as a percentage of the current running reward.

10. The computer implemented method of any of claims 8-9, wherein the running reward is updated based on a current running reward and the mean batch reward.

11. The computer implemented method of any of claims 1-10, further including outputting an optimal design with the fixed number of dimensions.

12. An apparatus, comprising: a storage medium comprising computer instructions; and one or more processors coupled to communicate with the storage medium, wherein the one or more processors execute the instructions to cause the system to: output, by a neural network, a fixed number of probability distributions for a design having the fixed number of dimensions; generate a batch of sample configurations for the design based on the probability distributions, wherein each sample configuration of the batch of sample configurations corresponds to a different, complete configuration of the fixed number of dimensions; output the batch of sample configurations to an evaluator; receive from the evaluator a batch of scalar rewards with at least a scalar reward for each of the sample configurations in the batch, each scalar reward calculated based on at least one objective design metric and a weight assigned to the objective metric, and at least one anomalous design indicator identifying that one of the sample configurations in the batch comprises an anomalous design; calculate an anomalous design reward based on the anomalous design indicator; and update parameters of the neural network based on the scalar rewards and the anomalous design reward.

13. The apparatus of claim 12, wherein the one or more processors execute the instructions to further cause the system to: repeat the output of the fixed number of probability distributions, the generation of the batch of sample configurations, the output of the batch of sample configurations, and the update of the parameters for a number of iterations, wherein the anomalous design indicator is received in one or more of the iterations.

14. The apparatus of any of claims 12-13, wherein the one or more processors execute the instructions to cause the system to generate a sample design configuration as a single-step compound action.

15. The apparatus of any of claims 12-14, wherein the one or more processors execute the instructions to cause the system to generate a sample design configuration as a multi-step reinforcement learning-based analysis.

16. The apparatus of any of claims 12-15, wherein the anomalous design is a physically impossible design based on objective design policy constraints.

17. The apparatus of any of claims 12-16, wherein the anomalous design reward discourages further generation of anomalous designs by the neural network.

18. The apparatus of any of claims 12-17, wherein the one or more processors execute the instructions to further cause the system to apply objective metrics combined with economic weights in order to arrive at a single weighted-sum, scalar reward for a given batch.

19. The apparatus of any of claims 12-18, wherein the one or more processors execute the instructions to further cause the system to calculate a running reward based on one or more statistics calculated for the batch of scalar rewards; calculate an advantage for each scalar reward in the batch of scalar rewards; and update the parameters of the neural network based on a statistical loss derived based on the advantage.

20. The apparatus of any of claims 12-15, wherein the anomalous design reward comprises the lesser of a batch average reward and a baseline running reward.

21. The apparatus of claim 20, wherein the one or more processors execute the instructions to further cause the system to update the running reward based on a current running reward and the mean batch reward.

22. The apparatus of any of claims 12-21, wherein the one or more processors execute the instructions to further cause the system to output an optimal design with the fixed number of dimensions.

23. A non-transitory computer-readable medium storing computer instructions for rendering images, that when executed by one or more processors, cause the one or more processors to perform the steps of: outputting, by a neural network, a fixed number of probability distributions for a design having the fixed number of dimensions; generating a batch of sample configurations for the design based on the probability distributions, wherein each sample configuration of the batch of sample configurations corresponds to a different, complete configuration of the fixed number of dimensions; outputting the batch of sample configurations to an evaluator; receiving from the evaluator a batch of scalar rewards with at least a scalar reward for each of the sample configurations in the batch, each scalar reward calculated based on at least one objective design metric and a weight assigned to the objective metric, and at least one anomalous design indicator identifying that one of the sample configurations in the batch comprises an anomalous design; calculating an anomalous design reward based on the anomalous design indicator; and updating parameters of the neural network based on the scalar rewards and the anomalous design reward.

24. The non-transitory computer-readable medium of claim 23, wherein the instructions cause the one or more processors to perform the steps of: repeating the outputting the fixed number of probability distributions, the generating the batch of sample configurations, the outputting the batch of sample configurations, and the updating the parameters for a number of iterations, wherein the anomalous design indicator is received in one or more of the iterations.

25. The non-transitory computer-readable medium of any of claims 23-24, wherein the generating a sample design configuration comprises a single-step compound action.

26. The non-transitory computer-readable medium of any of claims 23-24, wherein the generating a sample design configuration comprises a multi-step reinforcement learning-based analysis.

27. The non-transitory computer-readable medium of any of claims 23-25, wherein the anomalous design is a physically impossible design based on objective design policy constraints.

28. The non-transitory computer-readable medium of any of claims 23-27, wherein the anomalous design reward discourages further generation of anomalous designs by the neural network.

29. The non-transitory computer-readable medium of any of claims 23-25, wherein objective metrics are combined with economic weights in order to arrive at a single weighted-sum, scalar reward for a given batch.

30. The non-transitory computer-readable medium of any of claims 23-29, wherein the instructions cause the one or more processors to perform the steps of calculating a running reward based on one or more statistics calculated for the batch of scalar rewards; calculating an advantage for each scalar reward in the batch of scalar rewards; and updating parameters of the neural network based on a statistical loss derived based on the advantage.

31. The non-transitory computer-readable medium of claim 30, wherein the anomalous design reward comprises the lesser of a batch average reward and a baseline running reward, minus a penalty margin determined as a percentage of the current running reward.

32. The non-transitory computer-readable medium of any of claims 30-31, wherein the instructions cause the one or more processors to perform the steps of updating the running reward based on a current running reward and the mean batch reward.

33. The non-transitory computer-readable medium of any of claims 23-32, wherein the instructions cause the one or more processors to perform the steps of outputting an optimal design with the fixed number of dimensions.

Description:
ANOMALOUS DESIGN HANDLING IN SINGLE STEP DESIGN SPACE EXPLORATION

Inventor: Masood Mortazavi

CLAIM OF PRIORITY

[0001] This application claims priority to U.S. Provisional Patent Application No. 63/391,495, entitled "Theta-Resonance: A Deep Reinforcement Learning Approach to Design Space Exploration", filed July 22, 2022.

[0002] This application claims further priority to U.S. Provisional Patent Application No. 63/393,614, entitled "Theta-Resonance: A Deep Reinforcement Learning Approach to Design Space Exploration", filed July 29, 2022.

[0003] This application claims further priority to U.S. Provisional Patent Application No. 63/399,345, entitled "Theta-Resonance: A Deep Reinforcement Learning Approach to Design Space Exploration", filed August 19, 2022.

[0004] This application claims further priority to U.S. Provisional Patent Application No. 63/419,552, entitled "Theta-Resonance: A Single-Step Reinforcement Learning Method for Design Space Exploration", filed October 26, 2022.

[0005] Each of the foregoing listed applications is incorporated by reference herein in its entirety.

FIELD

[0006] The disclosure generally relates to systems and methods for design space exploration in fixed dimension spaces, and in particular to handling anomalous designs in design space exploration systems.

BACKGROUND

[0007] Design Space Exploration (DSE) is the process of exploring and evaluating possible design alternatives and configurations of a system or product. Each design space may have a series of dimensions and a set of one or more values within each dimension. Each combination of values within a set of dimensions is a different, specific design. Design space exploration examines these designs against a design objective defined by metrics which set out the objectives of the design. The goal of DSE is to identify the best design choices among all possible designs, considering the interrelationships between the dimensions and values within the dimensions.

[0008] A design space represents the entire range of possible design decisions or solutions. Some system designs have very large design spaces because the number of configuration dimensions is large, and/or the number of available options within each dimension is large. For example, for a system whose configuration has 10 dimensions with each dimension having 8 options, the number of different configurations for the system can reach 10^8. Different configuration dimensions for a system can have different numbers of options. In design space problems, each dimension may include a discrete number of options, or the options may be continuous (or the design space may include both discrete and continuous options in different dimensions).

SUMMARY

[0009] In one aspect a computer implemented method is provided. The computer implemented method includes outputting, by a neural network, a fixed number of probability distributions for a design having the fixed number of dimensions. A batch of sample configurations is generated for the design based on the probability distributions, wherein each sample configuration of the batch of sample configurations corresponds to a different, complete configuration of the fixed number of dimensions. The method further includes outputting the batch of sample configurations to an evaluator and receiving from the evaluator a batch of scalar rewards with at least a scalar reward for each of the sample configurations in the batch.
Each scalar reward is calculated based on at least one objective design metric and a weight assigned to the objective metric. At least one anomalous design indicator identifying that one of the sample configurations in the batch comprises an anomalous design is also received. The method further includes calculating an anomalous design reward based on the anomalous design indicator; and updating parameters of the neural network based on the scalar rewards and the anomalous design reward. Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods.

[0010] Implementations may include the computer implemented method further comprising repeating the outputting the fixed number of probability distributions, the generating the batch of sample configurations, the outputting the batch of sample configurations, and the updating the parameters for a number of iterations, wherein the scalar reward indicating that one of the sample configurations comprises an anomalous design is received in one or more of the iterations. Implementations may include the computer implemented method of any of the foregoing embodiments wherein the generating a sample design configuration comprises a single-step compound action. Implementations may include the computer implemented method of any of the foregoing embodiments wherein the anomalous design is a physically impossible design based on objective design policy constraints. Implementations may include the computer implemented method of any of the foregoing embodiments wherein the anomalous design indicator discourages further generation of anomalous designs by the neural network. Implementations may include the computer implemented method of any of the foregoing embodiments wherein objective metrics are combined with economic weights in order to arrive at a single weighted-sum, scalar reward for a given batch. Implementations may include the computer implemented method of the foregoing embodiment further including calculating a running reward based on one or more statistics calculated for the batch of scalar rewards; calculating an advantage for each scalar reward in the batch of scalar rewards; and updating parameters of the neural network based on a statistical loss derived based on the advantage. Implementations may include the computer implemented method of any of the foregoing embodiments wherein the anomalous design indicator comprises the lesser of a batch average reward and a baseline running reward. Implementations may include the computer implemented method of any of the foregoing embodiments wherein the running reward is updated based on a current running reward and the mean batch reward. Implementations may include the computer implemented method of any of the foregoing embodiments further including outputting an optimal design with the fixed number of dimensions.

[0011] Another aspect comprises an apparatus for design space exploration which includes a storage medium comprising computer instructions; and one or more processors coupled to communicate with the storage medium.
The one or more processors execute the instructions to cause the system to: output, by a neural network, a fixed number of probability distributions for a design having the fixed number of dimensions; generate a batch of sample configurations for the design based on the probability distributions, wherein each sample configuration of the batch of sample configurations corresponds to a different, complete configuration of the fixed number of dimensions; output the batch of sample configurations to an evaluator; receive from the evaluator a batch of scalar rewards with at least a scalar reward for each of the sample configurations in the batch, each scalar reward calculated based on at least one objective design metric and a weight assigned to the objective metric, and at least one anomalous design indicator identifying that one of the sample configurations in the batch comprises an anomalous design; calculate an anomalous design reward based on the anomalous design indicator; and update parameters of the neural network based on the scalar rewards and the anomalous design reward.

[0012] Implementations may include an apparatus wherein the one or more processors execute the instructions to further cause the system to: repeat the output of the fixed number of probability distributions, the generation of the batch of sample configurations, the output of the batch of sample configurations, and the update of the parameters for a number of iterations, wherein the scalar reward indicating that one of the sample configurations comprises an anomalous design is received in one or more of the iterations. Implementations may include an apparatus of any of the foregoing embodiments wherein the one or more processors execute the instructions to cause the system to generate a sample design configuration as a single-step compound action. Implementations may include an apparatus of any of the foregoing embodiments wherein the anomalous design is a physically impossible design based on objective design policy constraints. Implementations may include an apparatus of any of the foregoing embodiments wherein the anomalous design reward discourages further generation of anomalous designs by the neural network. Implementations may include an apparatus of any of the foregoing embodiments wherein the one or more processors execute the instructions to further cause the system to apply objective metrics combined with economic weights in order to arrive at a single weighted-sum, scalar reward for a given batch. Implementations may include an apparatus of any of the foregoing embodiments wherein the one or more processors execute the instructions to further cause the system to calculate a running reward based on one or more statistics calculated for the batch of scalar rewards; calculate an advantage for each scalar reward in the batch of scalar rewards; and update the parameters of the neural network based on a statistical loss derived based on the advantage. Implementations may include an apparatus of any of the foregoing embodiments wherein the anomalous design reward comprises the lesser of a batch average reward and a baseline running reward. Implementations may include an apparatus of the foregoing embodiment wherein the one or more processors execute the instructions to further cause the system to update the running reward based on a current running reward and the mean batch reward.
Implementations may include an apparatus of any of the foregoing embodiments wherein the one or more processors execute the instructions to further cause the system to output an optimal design with the fixed number of dimensions.

[0013] A still further aspect comprises a non-transitory computer-readable medium storing computer instructions for rendering images, that when executed by one or more processors, cause the one or more processors to perform steps of: outputting, by a neural network, a fixed number of probability distributions for a design having the fixed number of dimensions; generating a batch of sample configurations for the design based on the probability distributions, wherein each sample configuration of the batch of sample configurations corresponds to a different, complete configuration of the fixed number of dimensions; outputting the batch of sample configurations to an evaluator; receiving from the evaluator a batch of scalar rewards with at least a scalar reward for each of the sample configurations in the batch, each scalar reward calculated based on at least one objective design metric and a weight assigned to the objective metric, and at least one anomalous design indicator identifying that one of the sample configurations in the batch comprises an anomalous design; calculating an anomalous design reward based on the anomalous design indicator; and updating parameters of the neural network based on the scalar rewards and the anomalous design reward.

[0014] Implementations may include an embodiment wherein the instructions cause the one or more processors to perform the steps of: repeating the outputting the fixed number of probability distributions, the generating the batch of sample configurations, the outputting the batch of sample configurations, and the updating the parameters for a number of iterations, wherein the scalar reward indicating that one of the sample configurations comprises an anomalous design is received in one or more of the iterations. Implementations may include an embodiment of any of the foregoing embodiments wherein the generating a sample design configuration comprises a single-step compound action. Implementations may include an embodiment of any of the foregoing embodiments wherein the anomalous design is a physically impossible design based on objective design policy constraints. Implementations may include an embodiment of any of the foregoing embodiments wherein the anomalous design reward discourages further generation of anomalous designs by the neural network. Implementations may include an embodiment of any of the foregoing embodiments wherein objective metrics are combined with economic weights in order to arrive at a single weighted-sum, scalar reward for a given batch. Implementations may include an embodiment of any of the foregoing embodiments wherein the instructions cause the one or more processors to perform the steps of calculating a running reward based on one or more statistics calculated for the batch of scalar rewards; calculating an advantage for each scalar reward in the batch of scalar rewards; and updating parameters of the neural network based on a statistical loss derived based on the advantage. Implementations may include an embodiment of the foregoing embodiment wherein the anomalous design reward comprises the lesser of a batch average reward and a baseline running reward.
Implementations may include an embodiment of any of the foregoing embodiments wherein the instructions cause the one or more processors to perform the steps of updating the running reward based on a current running reward and the mean batch reward. Implementations may include an embodiment of any of the foregoing embodiments wherein the instructions cause the one or more processors to perform the steps of outputting an optimal design with the fixed number of dimensions.

[0015] This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter. The claimed subject matter is not limited to implementations that solve any or all disadvantages noted in the Background.

BRIEF DESCRIPTION OF THE DRAWINGS

[0016] Aspects of the present disclosure are illustrated by way of example and are not limited by the accompanying figures, in which like references indicate the same or similar elements.

[0017] FIG. 1 is a graphical illustration of a design space problem.

[0018] FIG. 2 illustrates a conventional reinforcement learning approach relative to episodes of a state-action-state-reward sequence.

[0019] FIG. 3 illustrates an improved application scheme of RL for searching for an optimal configuration in a design space.

[0020] FIG. 4 illustrates a general scheme of using the policy-gradient family of RL algorithms for design space exploration.

[0021] FIG. 5 illustrates an architecture for implementing a DSE system.

[0022] FIG. 6 is a flowchart illustrating a method of operating a DSE system.

[0023] FIG. 7 is pseudo-code which illustrates additional details of the DSE method and policy optimization algorithm.

[0024] FIG. 8 illustrates an embodiment of a DSE system which accounts for failed or "anomalous" designs.

[0025] FIG. 9 illustrates a method for the setup of the DSE system.

[0026] FIG. 10A compares genetic algorithm implementations with the DSE system disclosed herein in a synthetic DSE task.

[0027] FIG. 10B illustrates a method of creating an artificial problem and design solution to determine an optimal network configuration.

[0028] FIG. 11 illustrates an example of an artificial problem/target input and design solution.

[0029] FIG. 12 illustrates an example design space exploration problem comprising a system on chip (SoC) design.

[0030] FIG. 13 is a graph illustrating the impact of policy network architectural variations.

[0031] FIG. 14 is a graph illustrating the effect of some PPO batching variations when using the Transformer policy network.

[0032] FIG. 15 is a block diagram illustrating an example of a network processing device that can be used to implement various embodiments of the technology described herein.

DETAILED DESCRIPTION

[0033] The present disclosure and embodiments are directed to a system and methods for design space exploration (DSE) where the design space includes a bounded set of dimensions. A DSE system is described herein which produces optimal design configurations. The DSE system operates using an intelligent agent to produce progressively more optimal design configurations until a satisfactory design is achieved. Reinforcement learning (RL) is a machine learning approach that involves an agent learning to interact with an environment in order to maximize a reward signal.
In the context of DSE, RL can be employed to aid in the search and optimization of design configurations. In embodiments of a DSE system, a neural network Netθ consumes a constant input tensor and produces a policy πθ as a set of conditional probability density functions (PDFs) for sampling each design dimension. Policy gradient algorithms in deep reinforcement learning (D-RL) are used to evaluate feedback (in terms of cost, penalty or reward) to update Netθ. Multiple neural architectures (for Netθ) may be utilized. The DSE system described herein, and embodiments of systems shown and described in WO2022147583A2, entitled "System and method for optimal placement of interacting objects on continuous (or discretized or mixed) domains", describe an application scheme of RL as a single-step compound action. In embodiments, the system and methods for anomalous design identification can be used in single-step compound action RL DSE systems as well as multi-step RL-based DSE systems.

[0034] All such systems described with a single step compound action produce designs without knowledge of practical constraints on the design. As a result, systems using a single step compound action can result in incorrect or physically impossible designs. Given an evaluation environment (e.g., a simulator) which can evaluate sample configurations of a specified design template, embodiments of the technology utilize an anomalous design indicator as part of the feedback from the evaluation environment to train an intelligent agent to produce progressively more optimal design configurations until a satisfactory design is achieved.

[0035] In DSE, a dimension comprises a set of one or more choices of individual parameters (or a continuous dimension of choice, or mixed discrete-continuous dimensions of choice where some dimensions are continuous), and each dimension represents a different aspect of a design that can be adjusted. Each dimension in the design space corresponds to a specific design choice or configuration. For example, in the case of designing a computer chip, dimensions could include the number of cores, clock frequency, cache size, power consumption, or communication bandwidth. Each of these dimensions can take on different values, and the combination of values across all dimensions defines a specific design configuration within the design space. A bounded design space includes a finite and fixed number of dimensions. There may be a choice of a number of value choices within each dimension, or the choices may be continuous.

[0036] FIG. 1 graphically illustrates a problem of configuring a design in a design space. In the example of FIG. 1, a design space 100 has three dimensions 102, 104, 106. In the context of DSE, a dimension is a distinct variable or attribute that can be varied to explore different design options. Dimensions may be modified independently, but a change to one dimension may impact the choice of another dimension in the exploration. In the illustration of FIG. 1, three dimensions are shown to illustrate a DSE problem. It should be understood that in many DSE problems there are significantly more dimensions. In the example of FIG. 1, each element in the dimension is shown as having a different pattern, representing a different parameter for that dimension. Dimension 102 has five (5) choices, dimension 104 has seven (7) choices and dimension 106 has three (3) choices.
Dimensions define the parameters or factors that influence the design and affect its performance, behavior, or appearance. They can include both quantitative and qualitative variables, such as size, shape, color, material, weight, functionality, or any other relevant property. Each dimension typically has a range or set of possible values that can be assigned to it.

[0037] Each design space 100 includes a template 120. Template 120 illustrates that there is one choice for each dimension, and thus the total possible number of completed designs is 5*7*3 or 105. The design template comprises a set of guidelines, principles, or rules that govern the exploration and decision-making process within the context of design. The template describes the search space and the type (i.e., continuous, discrete or mixed) and range of each dimension of choice. The sampling space is then mapped onto the template. A completed design 125 makes one choice for each dimension. A goal of DSE is to explore different combinations of values, and design possibilities, and evaluate their impact on various design objectives or constraints. Design space exploration involves systematically varying these dimensions to understand the relationships between them and to identify optimal or desirable design configurations.

[0038] FIG. 2 shows a conventional reinforcement learning approach relative to episodes of a state-action-state-reward sequence. The methods that rely on RL to guide sampling and discovery of design space solutions generally conceive of the design space exploration as a series of multi-step episodes of the form s0 → x1 → s1 → x2 → s2 → ... → xD → sD. Each episode begins with an incomplete design template s0 and, after D design decisions, arrives at a terminal, "design complete" state: sD = sT. The episodes start from an initial state S0 and end in a terminal state ST (e.g., one episode has all dimensions configured, such as going through the path once from S0 to ST in FIG. 2). The general RL theory and algorithms may be devised for episodes of arbitrary length. A configuration selection is made in each of the configuration dimensions. For example, at state S0, the action may be configuring Ci to select an option value of dimension 1 of the configuration; and so on. For a system design, the number of configuration dimensions is fixed (e.g., the number of dimensions in the example shown in FIG. 2 is a fixed number, T in FIG. 2). In each dimension, various (types and values of) options may be available.

[0039] Some application schemes of RL, such as the scheme described above with respect to FIG. 2, can experience technical difficulties and complexities. The reward (which is used to measure design goodness) can only be evaluated once all configuration dimensions are configured (with one of the available options in each dimension). That is, only a fully configured system can be evaluated by a system evaluation mechanism (e.g., power/performance/area (PPA) in chip design, for a chip design that has multiple possible configuration dimensions).

[0040] FIG. 3 illustrates an improved application scheme of RL for searching for the optimal configuration in a design space. FIG. 3 illustrates a single-step (compound-action) Markov decision process (MDP) reinforcement-learning model for DSE: s0 → X → sD, where X is a compound action <x1, x2, ..., xD> and where the cost (penalty or reward) signal has all that is required to learn about compound-action inter-dependencies.
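To make the single-step compound action concrete, the following minimal Python sketch (with hypothetical dimension names, and uniform sampling standing in for the learned policy πθ) fills every dimension of a FIG. 1-style template at once and counts the 5*7*3 = 105 possible completed designs.

```python
import random

# Hypothetical template mirroring FIG. 1: three dimensions with 5, 7 and 3 choices.
template = {"dim_102": 5, "dim_104": 7, "dim_106": 3}

# Total number of completed designs: one choice per dimension.
total_designs = 1
for n_choices in template.values():
    total_designs *= n_choices
print(total_designs)  # 5 * 7 * 3 = 105

# A single-step compound action X = <x1, ..., xD> selects one option in every
# dimension at once, producing a complete design (sampled uniformly here).
compound_action = {dim: random.randrange(n) for dim, n in template.items()}
print(compound_action)
```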
Because the design space episode is of a fixed horizon (e.g., the number of the dimensions, or the number of the configuration parameters, is fixed), and because the reward is only collected at the end, the application scheme of RL can be considered as a single-step compound action. For this embodiment, different permutations of configuration options for different dimensions may be evaluated to determine which produces the best reward. As shown in FIG. 3, the scheme renders the RL system to a single-step stateless episode. The single-step compound action 304 in FIG. 3 is the combination of configuring all dimensions (e.g., the combination of "Configure C1 for dimension 1, Configure C2 for dimension 2, Configure C3 for dimension 3, Configure C4 for dimension 4, ..., Configure CT-1 for dimension T-1, Configure CT for dimension T"). With the single-step compound action 304, the application scheme is "stateless" in that there are no intermediate states (e.g., states S1, S2, S3, ..., ST-1 as shown in FIG. 2). There are only two states in the improved scheme shown in FIG. 3: a blank state 302 with no dimension configured and a fully configured state 306 with all dimensions configured. At the blank state 302, the single-step compound action 304 is performed. Then, the state transitions to the fully configured state 306, and reward RT is collected. Other application schemes of RL (e.g., the scheme shown in FIG. 2) need to consider constraints between dimensions. In contrast, with the application scheme of RL in FIG. 3, if an early Ci constrains Cj (i < j), techniques described herein ignore such constraints by modeling joint probabilities instead of the more complex conditional probabilities.

[0041] While single-step arrival at a terminal state is rarely used in developing RL theory or applications, embodiments herein use simplified policy-gradient algorithms in order to allow for "weight" (θ) resonance, a single-step, compound-action RL approach, which is also identical to a single-step Markov decision process.

[0042] FIG. 4 illustrates a general scheme of using the policy-gradient family of RL algorithms for design space exploration. An RL agent 405 (including a neural network 407) is used for generating sample configurations. The RL agent 405 includes a policy gradient algorithm (proximal policy optimization (PPO)) 410, and, as discussed herein, a reward statistics module 420. The RL agent 405 produces single-step, completed designs, one choice per dimension. Sampled configurations 425 comprise one configuration 430 of the template, and a potential design choice in the design space. Each sample configuration 430 is sent to an evaluator 440. The evaluator 440 may run in parallel with the RL agent 405 (e.g., when the RL agent 405 generates a new sample configuration, the evaluator 440 may evaluate a previously generated sample configuration), and multiple instances of the evaluator 440 may analyze batches of design samples. After evaluating the sample configurations, the evaluator 440 generates the rewards 450 for the sample configurations. Each reward 450 may be generated based on the multiple performance metrics generated by the evaluator 440. In the DSE system, the scalar reward 450 is the weighted average of the metrics used to measure the optimality of the design. Usually, these metrics are traded against each other, and each has its own costs. The weights can be considered a unit cost or price of each of the performance metrics.
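As a rough sketch of how a scalar reward can be assembled from objective metrics and economic weights, the snippet below computes a weighted sum; the metric names, values and sign conventions are hypothetical and stand in for whatever a design evaluator would actually report.

```python
def scalar_reward(metrics, weights):
    """Weighted sum of objective metrics; the weights act as unit costs/prices.

    Both arguments are hypothetical dicts keyed by metric name.
    """
    return sum(weights[name] * value for name, value in metrics.items())

# Illustrative values only; a real evaluator (simulator) would produce the metrics.
metrics = {"throughput": 3.2, "power": -1.5, "area": -0.8}
weights = {"throughput": 1.0, "power": 0.5, "area": 0.25}
print(scalar_reward(metrics, weights))  # 3.2 - 0.75 - 0.2 = 2.25
```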
In many instances, the weights are set to reflect the relative preferences of the design engineer, since performance objective cost estimates may be unavailable (or impossible to compute in any meaningful way) when setting the weights. The policy-gradient RL algorithm 410 may use the reward 450 to update the neural network of the RL agent 405.

[0043] In embodiments herein, weight (θ) resonant methods train a sampling policy for a given design space exploration as an economic search problem. Using the reward feedback, the policy gradient algorithm is used to improve the sampling policy.

[0044] FIG. 5 illustrates one architecture for implementing a DSE system. The system includes a weight (θ) resonant neural network 510 which generates a sampling policy πθ. The policy has parameters θ: πθ = π(x1 ... xD | θ, I). A constant input I 505 is the initial state of the design template s0. Input 505 may remain the same and unchanged throughout the search for an optimal design space configuration. In some embodiments, the input 505 may be a single constant value (e.g., a non-zero number). In some embodiments, the constant input 505 may be a tensor of 1s or hot-1s (e.g., a group of bits among which the allowable combinations of values are only those with a single high (1) bit and all the others low (0)). The terminal state, sT = sD, is accomplished in a single step that involves making all the choices X at once in accord with the current policy πθ = π(x1 ... xD | θ, I), where X1 = x1 ∧ X2 = x2 ∧ ... ∧ XD = xD, with random variable Xi representing the i-th design action.

[0045] The output of the network 510 is a compound-action probability density function (PDF), or parameters of the PDF. These distributions can represent either discrete or continuous dimensions. For discrete dimensions, the discrete probabilities required to establish the (effectively, we argue, conditional) categorical distribution are produced by the resonant policy network. For continuous dimensions, the policy network produces parameters for parametric modeling of those dimensions by appropriate analytic distribution functions (e.g., Gaussian, Beta, etc.). The probability distribution gives the possibility of each outcome of a random event (e.g., a possible option is configured for a dimension). The compound-action PDF is a function used to define the probabilities of different possible occurrences (e.g., the probabilities of different options configured for each of the different dimensions) (represented at 520, 525, 530, 535). A PDF sampler 540 generates sample design configurations. Sample configurations are generated in batches, where each sample in a batch of sample configurations is a full configuration of all T dimensions, with each dimension of the total T dimensions configured with a corresponding dimension option. In some embodiments, a batch of sample configurations is generated based on the compound-action PDF such that the distribution of the configured options for dimensions in the batch of sample configurations matches the probability distribution of the compound-action PDF. In embodiments, the number of configurations in a batch and in a mini-batch, and the number of epochs, are hyperparameters that can be pre-configured in the system.

[0046] The policy network 510 decodes the constant input into the joint "action" distribution π(x1 ... xD | θ, I), expressed in its conditional form:

Equation 1
[0047] As illustrated in FIG. 5, multiple parallel instances of a simulation engine 545, 550, 555, 560, each of which is capable of evaluating batches of configurations based on performance metrics, consume batches of configuration samples and generate rewards 570. In this context, rewards 570 comprise a numerical value that provides feedback to an agent based on its actions and the current state of the environment. A reward can be positive or negative, where positive rewards indicate beneficial actions that the model should try to repeat, while negative rewards indicate actions that should be avoided in the future. Each simulation engine acts as an evaluator on ones of the fully configured designs in the batches of sample configurations. Each engine 545, 550, 555, 560 extracts objective metrics 565 achieved by a given design sample, and includes a set of weights associated with the objective metrics 567 utilized to compute a single weighted-sum, real number reward Rb for each sample. The reward feedback 570 from batches of samples is used to update the weights of network 510 through the policy gradient RL algorithm (PPO 410). Essentially, if an action led to a positive reward, the network is updated to make that action more likely in the future, and if an action led to a negative reward, the network is updated to make that action less likely. This policy improvement cycle is discussed and illustrated below. By repeatedly interacting with the environment and updating the neural network weights based on the rewards received, the policy is improved and drives the system toward an optimal design.

[0048] Communication 575 of the batches of samples between the PDF sampler 540 and the simulation engines 545, 550, 555, 560, and communication of the generated rewards 570 to the network 510, may occur in one embodiment using the techniques discussed in PCT/US2021/064634, entitled PARALLEL PROCESSING USING A MEMORY-BASED COMMUNICATION MECHANISM.

[0049] The sampler 540 conditionally samples each choice dimension through the neurally predicted conditional PDFs. The policy network 510 produces and continually updates the sampling policy πθ as a collection of the conditional probabilities for each dimension by resonating on the constant tensor I, with the policy network designed with a sufficient capacity for internal resonance, particularly among the dimensions, so that inter-dependencies are learned through online exploration. The internal resonance causes the network to produce a joint distribution for an optimal design in the design space (or to approximate the optimal design) tending toward a multi-dimensional delta function δ(x*1, x*2, ..., x*D) concentrated on an optimal point. In embodiments, network 510 comprises a Transformer or multi-layer perceptron policy network, though other types of policy networks (Recurrent Neural Networks (RNNs), Convolutional Neural Networks (CNNs), etc.) may be used in accordance with embodiments of the technology.
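A minimal sketch of the resonant policy network and PDF sampler of FIG. 5, assuming a PyTorch implementation: a small MLP consumes a constant input tensor and emits one categorical distribution per (discrete) design dimension, from which a batch of complete configurations is sampled. The layer sizes, class name and two-layer body are illustrative only; Transformer policies and parametric distributions for continuous dimensions would follow the same pattern.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class ResonantPolicy(nn.Module):
    """Illustrative stand-in for the policy network Netθ (510)."""
    def __init__(self, input_size, dim_sizes, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(input_size, hidden), nn.ReLU())
        # One output head per design dimension, sized by its number of choices.
        self.heads = nn.ModuleList([nn.Linear(hidden, d) for d in dim_sizes])

    def forward(self, const_input):
        h = self.body(const_input)
        return [Categorical(logits=head(h)) for head in self.heads]

dim_sizes = [5, 7, 3]                 # choices per dimension, as in FIG. 1
I = torch.ones(1, 8)                  # constant input tensor of 1s (size arbitrary)
policy = ResonantPolicy(input_size=8, dim_sizes=dim_sizes)

# Sample a batch of complete configurations, one choice per dimension.
batch_size = 4
dists = policy(I)
batch = torch.stack([d.sample((batch_size,)).squeeze(-1) for d in dists], dim=1)
print(batch.shape)  # torch.Size([4, 3])
```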
[0050] FIGs. 6 and 7 illustrate methods in accordance with this disclosure. FIG. 6 is a flowchart illustrating a method of operating a DSE system. In embodiments, the method of FIG. 6 introduces the concept of updating the policy network 510 based on a running reward which guides exploration of the design space. At 605, for any design space having a fixed number of dimensions, evaluation objective metrics are provided at 610. The evaluator (simulation engine) may output multi-dimensional performance metrics for each fully configured system using a sample configuration of the batch of sample configurations. The multi-dimensional performance metrics are used to generate the scalar reward. The metrics depend on the design space but, by way of example, may include efficiency metrics such as computational, power or material efficiency, performance metrics such as speed, accuracy and throughput, and economic metrics such as cost and return on investment. At 615, the DSE system is configured with certain initial parameters and hyperparameters including the input I, the network (Netθ) input size, a target reward Rτ, a target time (or number of epochs) tτ, a design simulator Sim, design objectives Obj and design weights W. In addition, the policy network type (Transformer, multi-layer perceptron, etc.) and batch size are configured.

[0051] At 620, the neural network is initialized based on the constant input I and an initial set of network weights θ0. At 625, the method consumes the input I and moves to continually evaluate sample configurations produced by the policy network 510, updating network weights and interdependencies until an optimal configuration is reached (or until an ending event (665), for example a maximum run time or number of epochs). At 640, a constant input I is input to the neural network. At 645, the neural network generates a batch of sample configurations, each with the fixed number of dimensions. At 650, each sample configuration is evaluated by the simulation engine (design evaluator). At 655, a scalar reward for each design is returned to network 510 by the simulation engine. At 660, the neural network 510 is updated (by the policy gradient network (PPO)) and, unless an ending event 665 is detected, the process loops to step 640 until an optimal design 670 for the design space is produced. Updating of the weights at 660, described with respect to FIG. 7 below, includes computing reward statistics at 660a. The reward statistics are used to update a running reward at 660b, which is in turn used to update an advantage at 660c. This statistically generated running reward removes bias from the exploration. As explained further below, the running reward is an improving baseline for measuring the (negative or positive) advantage of generated design samples. This running reward, and the use of the advantage computed based on the running reward, allows the network 510 to resonate to an optimal state. The advantage is used to compute a statistical risk 660d which is used to update the weights of the policy network at 660e. In embodiments, an optimal design 670 may be determined by the system reaching a design which returns the target reward Rτ. In some instances, a target reward for an optimal design is not always known, and in multi-dimensional objectives, where reward is generated as a weighted sum of measured objective metrics, the user of the system may change weights. In such cases, the target reward may be set to be above what is expected so that the system settles on an optimal state it can discover for that combination of objective weights.
[0052] FIG. 7 illustrates additional details of the DSE method and policy optimization algorithm. As noted above, the DSE system is configured by a design engineer (i.e., a user of the DSE system) with certain initial parameters including the input I, the network (Netθ) input size, a target reward Rτ, and a target time (or number of epochs) tτ. A design simulation engine Sim is provided by a design engineer utilizing the DSE system and requires design objectives (metrics) and objective weights W.

[0053] At line 2 in FIG. 7, F ≡ <f(x1 | x2, ..., xD), f(x2 | x3, ..., xD), ..., f(xD)> represents the conditional PDFs produced from the joint PDF πθ, with the relationship given above in Equation 1. The conditional PDFs Fi for each dimension i are used to sample a batch of possible designs (where the batch of possible designs is represented as <ξ1, ξ2, ..., ξB>). Additional symbols represent the set of objective metrics, a batch of rewards, and a running reward, used as described below. Each design sample ξb is simulated in one of the parallel simulation engines 560.

[0054] With reference to FIG. 7, initialization of the DSE system (lines 7-11) includes setting the input I to a tensor of 1s or hot-1s (or a constant value), setting the neural network initial weights θ0, and setting the initial time to zero.

[0055] The design sample generation and policy improvement cycle begins at line 13 of FIG. 7 with a loop which runs while a best reward sample Rbestsample (a reward computed by the system which is the output closest to the target reward Rτ) is less than or equal to the target reward Rτ or the current time t is less than the target time tτ. At line 14, the neural network produces the joint PDF πθ as a set of conditional PDFs based on the state of the weights θt and the constant input I. The conditional probabilities are then sampled to produce a batch of possible designs. Each conditional probability for each dimension i is used to sample a given batch of possible designs (line 15). Each design sample ξb in a given batch is simulated by one of the simulation engines (in one embodiment, operating in parallel as described above) and creates a result set Sim(ξb). Each result set Sim(ξb) produced by each sample simulation is filtered to extract the objective metrics achieved by the corresponding design sample ξb in the batch. The objective metrics are combined with economic weights in order to arrive at a single weighted-sum, scalar reward Rb for a given batch:

Equation 2

where Ojb is the j-th objective metric evaluated for the b-th sample in the batch.

[0056] The running reward is then updated based on the current running reward and reward statistics Rstatistics. Based on the batch of rewards obtained for the batch of samples, a set of reward statistics Rstatistics at batch level is calculated (660a) in order to update the running reward (line 19, 660b):

Equation 3

where αrenew is a hyperparameter that controls the contribution of different terms in the running reward. A surrogate advantage Ab is then calculated from the running reward, which provides a measure of the quality of the sample:

Equation 4
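Equations 3 and 4 are not reproduced in this text, so the sketch below shows one plausible reading consistent with the surrounding description and with claims 10 and 21: the running reward is an exponential moving average driven by the mean batch reward, and the surrogate advantage measures each sample's reward against that baseline. The parameter value and exact forms are assumptions.

```python
def update_running_reward(running_reward, batch_rewards, alpha_renew=0.9):
    """Assumed exponential-moving-average update of the running reward,
    with alpha_renew controlling the contribution of each term."""
    batch_mean = sum(batch_rewards) / len(batch_rewards)
    return alpha_renew * running_reward + (1.0 - alpha_renew) * batch_mean

def surrogate_advantages(batch_rewards, running_reward):
    """Assumed form A_b = R_b - running_reward: advantage of each sample
    measured against the running-reward baseline."""
    return [r - running_reward for r in batch_rewards]

rewards = [1.2, 0.7, 1.9, 0.4]          # illustrative batch rewards
baseline = update_running_reward(1.0, rewards)
print(baseline, surrogate_advantages(rewards, baseline))
```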
[0057] This surrogate advantage Ab is a departure from conventional RL algorithms which use a critic. In RL, a critic is a component of an agent's architecture which estimates the value or quality (the reward) of a given state. The primary function of the critic is to approximate the expected return or future cumulative reward that the agent can achieve starting from a particular state. In embodiments herein, the running reward is an exponentially decaying baseline for the surrogate advantage Ab, which itself scales adjustments to the probability of a sampled design ξb that led to the batch reward Rb. This running reward, and the use of the surrogate advantage, eliminates the need for a value estimation network, and allows the network 510 to resonate to an optimal state.

[0058] The running reward, then, comprises an improving baseline for measuring the positive or negative advantage of generated design samples and provides the feedback to the PPO on how to adjust the probability density function.

[0059] The policy network Netθ is then updated to Netθt+1 (line 22, 660e) based on a policy gradient statistical risk Ltotal(θt) (660d) using an optimization algorithm (such as Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions based on adaptive estimates of lower-order moments (D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun, Eds., 2015)) and an auto-differentiation tool supporting back-propagation, such as TensorFlow, PyTorch, MXNet, JAX, etc., to produce πθ.

[0060] The policy gradient update algorithm thus uses the running reward to reshape πθ toward an optimal sampling policy π* = πθ* = δ(x*1, x*2, ..., x*D). The statistical risk Ltotal(θ) is comprised of a conditional update loss, a surrogate conditional entropy loss and a Kullback-Leibler (KL) reverse divergence (KLrev-divergence). The conditional loss is given by:

Equation 5

[0061] The KLrev-divergence regulates how rapidly a policy is allowed to change and uses the sum of all conditional probabilities' KLrev-divergence losses as a surrogate KL regularizer:

Equation 6

In embodiments, the sum of all conditional probabilities' KL is a rough estimate of the joint KL divergence and serves as a "surrogate" regularizer so that estimated conditional distributions do not change too quickly from one learning iteration to the next. Finally, the surrogate conditional entropy loss is given by:

Equation 7

where a normalizing factor balances the relative contributions of conditional entropy and di represents the cardinality of the i-th categorical design-choice dimension. The entropy is a "surrogate" as it comprises a rough estimate of the joint entropy, which may be as difficult to compute as enumerating all the possible combinations. Hence, this provides a minimal regularizer for some control over the extent of exploration. The total statistical risk is thus:

Equation 8

[0062] As the policy network produces sample configurations that approach optimal configurations, too great an entropy may adversely push the policy toward creating less optimal designs. Thus, in embodiments herein, an entropy penalty is designed to maintain entropy in the system high during the initial runs of the system and is adjusted to control the degree of exploration from an initial value to a minimum final value:

Equation 9

where βe,0, βmin and rdecay are hyper-parameters used to adjust the entropy penalty βe,t and control the degree of exploration, starting from an initial value and decaying to a minimum final value.

[0063] As noted in FIG. 7, the time t is incremented, and the loop returns to line 13 to repeat sample creation and policy updating until the time reaches its limit.
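The loss terms of Equations 5-8 are likewise not reproduced here, but the decaying entropy penalty of Equation 9 and the way the three surrogate terms might be combined can be sketched as follows; the exponential schedule, the default values and the sign of the entropy term are assumptions chosen only to match the stated intent of starting with high exploration and tapering it off.

```python
def entropy_penalty(t, beta_e0=0.05, beta_min=0.001, r_decay=0.99):
    """Assumed schedule for Equation 9: decay from beta_e0 toward beta_min."""
    return max(beta_min, beta_e0 * (r_decay ** t))

def total_risk(cond_loss, kl_rev_loss, entropy_loss, t, beta_kl=1.0):
    """Sketch of Equation 8: conditional update loss plus the surrogate KL_rev
    regularizer, minus the entropy term weighted by the decaying penalty.
    The weighting scheme is an assumption."""
    return cond_loss + beta_kl * kl_rev_loss - entropy_penalty(t) * entropy_loss

print(entropy_penalty(0), entropy_penalty(500))  # high early, near beta_min later
```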
[0064] As the DSE system generates single-step configurations, there is a potential for a generated design to be a failed or “anomalous” design. An example of a failed or “anomalous” design is one that is physically impossible, such as a combination of elements in different dimensions that cannot co-exist in a design. For example, suppose a DSE concerns the design of an internal combustion engine and the dimensions include the placement, spacing and size of cylinder bores in an engine block. Because a DSE system does not have knowledge of the constraints on cylinder placement, the DSE system could produce a design where cylinder bores are too close to each other or overlap. Such a design would thus be anomalous. Because the DSE system does not have any knowledge of such constraints in any given DSE problem, it is possible that the system will generate one or more of these “anomalous” designs at any point. [0065] FIG. 8 illustrates an embodiment of a DSE system which accounts for failed or “anomalous” designs. In the system of FIG. 8, the simulation engine (evaluator) provides an anomalous design indication to the DSE system which is used by the DSE system to update the policy network 440. In FIG. 8, only portions of the system that are relevant to anomalous design handling are shown, with the example showing the processing of a single sample. It should be understood that the various other system components described herein are included in the DSE system and that, in embodiments, batches of samples are processed. When a given sample design ξ_b is processed by an evaluator (simulation engine 560), the evaluator is tasked with determining whether the design is an anomalous design at 805. If the design is not anomalous at 805, the system proceeds as described above, determining a scalar reward 570 which is provided to the update algorithm 410 and reward statistics module 420. [0066] If the evaluator identifies a sample configuration as anomalous at 805, an anomalous label or indicator 810 is applied to the configuration. For example, in embodiments, an anomalous reward assignment may comprise a simple label associated with a particular sample design ξ_b, the label indicating to the policy gradient network that the sample design ξ_b was anomalous. The label can be reflected in the reward returned for the sample design ξ_b or as a non-scalar flag associated with the design which marks it as anomalous. When using, for example, the aforementioned communication system 575 described in PCT/US2021/064634, data files and signal files are used for communication between systems. A label or indicator may comprise a flag which is associated with the design in either of such data or signal files. In embodiments, these anomalous designs are identified to the policy gradient optimization algorithm by the anomalous reward assignment module 825 assigning an appropriate reward R_a in order to discourage their further generation. In one embodiment, R_a is defined as the lesser of a batch average reward and a baseline running reward: R_a = min(R_avg, R̄) (Equation 10), where R_avg is the average reward over the batch and R̄ is the baseline running reward. [0067] The structure of Equation 10 provides for a more graceful dynamic reward assignment in anomalous cases where a simulator rejects a design. Other more specific methods for “anomalous” reward assignment may be appropriate in a given scenario. [0068] It should be recognized that the embodiment of FIG. 8 is not limited to use in the embodiments herein.
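By way of illustration only, the anomalous reward assignment of Equation 10 may be sketched in Python as follows. Averaging only over the non-anomalous samples of the batch, the fallback when every sample is flagged, and the function and variable names are illustrative assumptions rather than requirements of the embodiments.

    # Illustrative sketch only: assign R_a, the lesser of the batch-average
    # reward and the baseline running reward, to samples flagged as anomalous.
    import numpy as np

    def assign_anomalous_rewards(rewards, anomalous_mask, running_reward):
        # rewards: shape (B,); anomalous_mask: shape (B,), True where the
        # evaluator flagged the sample as an anomalous design.
        valid = rewards[~anomalous_mask]
        # R_a discourages anomalous samples without imposing an arbitrarily
        # harsh penalty (cf. Equation 10).
        if valid.size:
            r_a = min(float(valid.mean()), float(running_reward))
        else:
            r_a = float(running_reward)   # assumed fallback: whole batch flagged
        out = rewards.astype(float)
        out[anomalous_mask] = r_a
        return out

    # Usage: the second sample was flagged anomalous by the evaluator.
    R = np.array([0.42, 0.10, 0.77, 0.55])
    flagged = np.array([False, True, False, False])
    R_adjusted = assign_anomalous_rewards(R, flagged, running_reward=0.5)
    # R_adjusted[1] becomes min(mean([0.42, 0.77, 0.55]), 0.5) = 0.5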
For example, the anomalous design handling embodiment of FIG. 8 can be applied to any of the embodiments shown and described in WO2022147583A2, entitled “System and method for optimal placement of interacting objects on continuous (or discretized or mixed) domains”, as well as to multi-step RL-based DSE systems. Multi-step RL method analysis (as opposed to the single-step approach of the DSE system herein) may be used by a DSE engineer (user) to incorporate constraints through conditionality. In a multi-step method, if some constraints are not met and the design is impossible, it becomes anomalous and the present reward assignment mechanisms can still be used. [0069] FIG. 9 illustrates a method for the setup of the DSE system. It should be understood that the steps, while shown in an order in FIG. 9, need not be performed in the order shown. At 900, the number of dimensions is defined and the number of choices for each dimension is provided. As noted herein, the DSE system operates on a design space where the number of design dimensions and the number of options in each dimension are fixed. At 905, a policy network type is chosen. The use of an MLP or Transformer policy network has been shown to be advantageous in testing scenarios. Other neural network architectures may also be used. The neural network size and architecture depend on the complexity and structure of the problem. In embodiments, the type of network architecture used and its own design parameters may be considered a meta-design problem which is resolved through best practices. In alternative embodiments, this meta-design problem can potentially also be explored by the DSE system. The policy network size is defined at 910. As noted herein, one should design the policy network with good capacity for internal resonance among the various dimensions so that inter-dependencies can be learned efficiently. Optionally, an ending event (a timer as discussed herein or an epoch limit) may be defined at 915. Objective metrics for evaluating the design are provided at 920. The metrics may include weights defining the importance of any one or more dimensions or choices within each dimension. [0070] In order to define a policy network size and type, one may create an artificial design space with the same topology as the design space to be explored. In alternative embodiments, the DSE system described herein can be used to explore very large synthetic as well as real spaces. Artificial DSE settings are designed for rapid dimensional benchmarking and hyper-parameter configuration. One of the main challenges in comparing DSE systems is the unavailability of common benchmarks. To study large design spaces, a class of very fast exploration-space sample-efficiency benchmarks can be constructed and used for two purposes: 1) to compare the system performance of alternate algorithms in particular design space topologies, and 2) to explore best-practice values for hyper-parameters when using those algorithms in a given design space topology. As a very simple example, let us consider a synthetic decision space of 20 dimensions with 64 categorical choices available in each dimension. (This simple synthetic example is for demonstration purposes only.) This space contains 1.329228 × 10^36 possible design configurations. Unbeknownst to the automatic explorer, an arbitrary point is selected as a “black-box” point in the exploration space. This selection allows an artificial distance-based reward to be used as quick black-box feedback to the automatic explorer. FIG. 10A compares genetic algorithm (GA) implementations (e.g.,
combinations of normal greedy mutation, differential evolution and uniform greedy mutation) with the DSE system disclosed herein on this synthetic DSE task. Again, the DSE system is able to reach the optimal point, while the combination GA algorithms never find the optimal point although they are capable of getting close to it. This offers another confirmation of the advantage of the single-step MDP (θ-Resonance) approach. [0071] FIG. 10B illustrates a method of creating an artificial problem and design solution to determine an optimal network configuration. Because the design engineer is tasked with choosing the type of network architecture used and its design parameters, FIG. 10B illustrates a method of exploring this meta-design problem by the DSE system. One creates an artificial problem with synthetic benchmarks and an artificial target. An example of an artificial problem/target input is shown in FIG. 11. As illustrated therein, and with reference to FIG. 10B, the method of providing synthetic benchmarks for a dimensionality analysis begins by setting the number of parameters at 925 and the choices for each dimension. In the example of FIG. 11, the number of dimensions is eighteen (18) and the choices in each dimension vary. Next, at 930, an artificial target is defined, a network type is chosen and a benchmark penalty is defined. A least (minimum) penalty of 0.1 is defined, and an artificial target is specified as shown in FIG. 11. The network simulation and policy update are then run at 935, which yields a performance time for the network to reach the artificial target. By experimenting with different network types and sizes using the same artificial problem, one can determine the optimal network type and size for the particular design space problem under consideration. [0072] FIG. 12 illustrates an example design space exploration problem comprising a system on chip (SoC) design. FIG. 12 allows for an SoC design with different CPU configurations and parameters. Given the number of options available across the 18 choice dimensions in FIG. 11, the number of possible designs equals 2×12×3×2×3×3×4×7×13×10×5×10×5×6×7×5×7×2 = 3.4673184 × 10^12. A simulator such as that described in Y. S. Shao, S. L. Xi, V. Srinivasan, G.-Y. Wei, and D. Brooks, “Codesigning accelerators and soc interfaces using gem5-aladdin,” in 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2016, pp. 1–12, may be used as the design evaluator. In this DSE, the number of samples the system generates is compared before the system reaches a defined level of cost. Practical applications of the disclosed DSE system include exploring designs in macro placement for silicon devices and in SoC design. [0073] The design space of the example in FIG. 12 is composed of a number of possible discrete (categorical) choices. However, the DSE system is also applicable to cases where choice dimensions include continuous dimensions of choice or mixed discrete-continuous dimensions of choice, where some dimensions involve continuous choices and other dimensions have discrete choices. For continuous dimensions, there is an innumerable set of possibilities. For continuous design decision dimensions, one can use the analytic continuous probability distributions that are most suitable for a given problem at hand, e.g., the Beta distribution for bounded continuous dimensions (which has two parameters for each continuous, bounded dimension) or the Gaussian distribution for unbounded continuous dimensions.
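By way of illustration only, a synthetic benchmark of the kind discussed above in connection with FIGs. 10A and 10B, in which a hidden “black-box” target point provides a distance-based reward, may be sketched in Python as follows. The use of a negative normalized Hamming distance as the artificial reward, and all names, are illustrative assumptions rather than requirements of the embodiments.

    # Illustrative sketch only: a hidden target configuration is drawn at random
    # in a 20-dimension, 64-choice space and an assumed distance-based reward is
    # returned as black-box feedback to the automatic explorer.
    import numpy as np

    rng = np.random.default_rng(0)
    num_dims, num_choices = 20, 64            # 64**20 ~ 1.33e36 configurations
    target = rng.integers(0, num_choices, size=num_dims)   # hidden from the explorer

    def black_box_reward(configuration):
        # configuration: length-20 sequence of categorical choices (0..63 each).
        mismatches = np.sum(np.asarray(configuration) != target)
        return -float(mismatches) / num_dims   # 0.0 only at the hidden target

    # A random configuration typically scores near -1; the hidden target scores 0.
    print(black_box_reward(rng.integers(0, num_choices, size=num_dims)))
    print(black_box_reward(target))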
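Similarly, and again by way of illustration only, the handling of a bounded continuous choice dimension with a Beta distribution, as introduced above and elaborated below, may be sketched as follows. The helper name and the affine mapping shown are illustrative assumptions.

    # Illustrative sketch only: the policy emits two Beta parameters for a
    # bounded continuous dimension; a sample over (0, 1) is mapped affinely
    # into the dimension's range, here [-10, 10].
    import numpy as np

    rng = np.random.default_rng(0)

    def sample_bounded_dimension(alpha, beta, low=-10.0, high=10.0):
        # Sample the Beta distribution over (0, 1), then map to [low, high].
        u = rng.beta(alpha, beta)
        return low + (high - low) * u

    # With alpha = beta = 1 the Beta distribution is uniform over (0, 1), so the
    # mapped sample is uniform over [-10, 10]; other parameter values concentrate
    # the probability mass as learned by the policy network.
    x = sample_bounded_dimension(alpha=2.0, beta=5.0)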
For such continuous dimensions, the conditional action probabilities would be generated by the policy network as the parameters of the corresponding analytic probability density function. For example, if one dimension of choice is a real value between -10 and 10, a Beta distribution (over the bounded region 0 to 1) may be used, and samples in this bounded region are then mapped to samples in the bounded region from -10 to 10. For dimensions that are semi-infinite (semi-bounded), one of the many distributions that model domains with semi-infinite support may be used. [0074] FIG. 13 illustrates the impact of policy network architectural variations. Results are shown for a Transformer network with 12.67 million parameters and a PPO batch/minibatch/epoch configuration of 8/8/4 (T12.67M−8−8−4), and for multilayer perceptrons (MLPs) with 13.72 million, 8.88 million and 3.58 million parameters. These results are found using a random seed. With the exception of the smallest MLP, both the MLP and Transformer architectures lead to the design with the least penalty found. The Transformer policy network appears most efficient in learning joint dependencies and converges to the best design with the fewest samples. [0075] FIG. 14 illustrates the effect of some PPO batching variations when using the Transformer policy network. Larger sample batches under the same policy result in slower learning but are also generally more stable. [0076] FIG. 15 is a block diagram of a network processing device 1500 that can be used to implement various embodiments described herein and on which the methods described herein may be operated, including, for example, a service host. Specific network devices may utilize all of the components shown, or only a subset of the components, and levels of integration may vary from device to device. [0077] The network device 1500 may comprise a central processing unit (CPU) 1510, a memory 1520, a mass storage device 1530, and an I/O interface 1560 connected to a bus 1570. The bus 1570 may be one or more of any type of several bus architectures including a memory bus or memory controller, a peripheral bus, or the like. A network interface 1550 enables the network processing device to communicate over a network 1580 with other processing devices such as those described herein. The I/O interface may be connected to a keyboard, mouse, touch screen, display device and/or an image capture device. [0078] The CPU 1510 may comprise any type of electronic data processor. Memory 1520 may comprise any type of system memory such as static random-access memory (SRAM), dynamic random-access memory (DRAM), synchronous DRAM (SDRAM), read-only memory (ROM), a combination thereof, or the like. In an embodiment, memory 1520 may include ROM for use at boot-up, and DRAM for program and data storage for use while executing programs. In embodiments, the memory 1520 is non-transitory. In one embodiment, memory 1520 includes computer readable instructions that are executed by the processor(s) 1510 to implement embodiments of the disclosed technology, including the design space exploration described herein. [0079] The mass storage device 1530 may comprise any type of storage device configured to store data, programs, and other information and to make the data, programs, and other information accessible via the bus 1570. The mass storage device 1530 may comprise, for example, one or more of a solid-state drive, a hard disk drive, a magnetic disk drive, an optical disk drive, or the like.
One or more stored applications (such as neural network 510a, PPO 410a, statistics calculation module 420a, design configurations 425a and a reward buffer 455a) may be resident in the mass storage and accessed by the CPU (as illustrated in FIG. 15) and loaded into memory 1520 for execution as needed. Multiple instances of each of the applications may be present in memory 1520. [0080] Embodiments of the simulation engines discussed herein may communicate with device 1500 using network 1580 and via the network interface. [0081] As a consequence of the distinct features of the DSE system, the system scales to far larger design spaces than previous systems. For the purposes of this document, it should be noted that the dimensions of the various features depicted in the figures may not necessarily be drawn to scale. [0082] For purposes of this document, reference in the specification to “an embodiment,” “one embodiment,” “some embodiments,” or “another embodiment” may be used to describe different embodiments or the same embodiment. [0083] For the purposes of this document, a connection may be a direct connection or an indirect connection (e.g., via one or more other parts). In some cases, when an element is referred to as being connected or coupled to another element, the element may be directly connected to the other element or indirectly connected to the other element via intervening elements. When an element is referred to as being directly connected to another element, then there are no intervening elements between the element and the other element. Two devices are “in communication” if they are directly or indirectly connected so that they can communicate electronic signals between them. [0084] Although the present disclosure has been described with reference to specific features and embodiments thereof, it is evident that various modifications and combinations can be made thereto without departing from the scope of the disclosure. The specification and drawings are, accordingly, to be regarded simply as an illustration of the disclosure as defined by the appended claims, and are contemplated to cover any and all modifications, variations, combinations, or equivalents that fall within the scope of the present disclosure. [0085] The technology described herein can be implemented using hardware, software, or a combination of both hardware and software. The software used is stored on one or more of the processor readable storage devices described above to program one or more of the processors to perform the functions described herein. The processor readable storage devices can include computer readable media such as volatile and non-volatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer readable storage media and communication media. Computer readable storage media may be implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Examples of computer readable storage media include RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information, and which can be accessed by a computer. A computer readable medium or media does (do) not include propagated, modulated, or transitory signals.
[0086] Communication media typically embodies computer readable instructions, data structures, program modules or other data in a propagated, modulated, or transitory data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as RF and other wireless media. Combinations of any of the above are also included within the scope of computer readable media. [0087] In alternative embodiments, some or all of the software can be replaced by dedicated hardware logic components. For example, and without limitation, illustrative types of hardware logic components that can be used include Field-programmable Gate Arrays (FPGAs), Application-specific Integrated Circuits (ASICs), Application-specific Standard Products (ASSPs), System-on-a-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), special purpose computers, etc. In one embodiment, software (stored on a storage device) implementing one or more embodiments is used to program one or more processors. The one or more processors can be in communication with one or more computer readable media/ storage devices, peripherals and/or communication interfaces. [0088] It is understood that the present subject matter may be embodied in many different forms and should not be construed as being limited to the embodiments set forth herein. Rather, these embodiments are provided so that this subject matter will be thorough and complete and will fully convey the disclosure to those skilled in the art. Indeed, the subject matter is intended to cover alternatives, modifications, and equivalents of these embodiments, which are included within the scope and spirit of the subject matter as defined by the appended claims. Furthermore, in the following detailed description of the present subject matter, numerous specific details are set forth in order to provide a thorough understanding of the present subject matter. However, it will be clear to those of ordinary skill in the art that the present subject matter may be practiced without such specific details. [0089] Aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatuses (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable instruction execution apparatus, create a mechanism for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. [0090] The description of the present disclosure has been presented for purposes of illustration and description but is not intended to be exhaustive or limited to the disclosure in the form disclosed. 
Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the disclosure. The aspects of the disclosure herein were chosen and described in order to best explain the principles of the disclosure and the practical application, and to enable others of ordinary skill in the art to understand the disclosure with various modifications as are suited to the particular use contemplated. [0091] For purposes of this document, each process associated with the disclosed technology may be performed continuously and by one or more computing devices. Each step in a process may be performed by the same or different computing devices as those used in other steps, and each step need not necessarily be performed by a single computing device. [0092] Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.