Title:
FLUID SIMULATIONS USING ONE OR MORE NEURAL NETWORKS
Document Type and Number:
WIPO Patent Application WO/2021/101548
Kind Code:
A1
Abstract:
Approaches in accordance with various embodiments provide for fluid simulation with substantially reduced time and memory requirements with respect to conventional approaches. In particular, various embodiments can perform time- and energy-efficient, large-scale fluid simulation on processing hardware using a method that does not solve the Navier-Stokes equations to enforce incompressibility. Instead, various embodiments generate a density tensor and rigid body map tensor for a large number of particles contained in a sub-domain. Collectively, the density tensor and rigid body map may represent input channels of a network with three spatial dimensions. The network may apply a series of operations to the input channels to predict an updated position and updated velocity for each particle at the end of a frame. Such approaches can handle tens of millions of particles within a virtually unbounded simulation domain, as compared to classical approaches that solve the Navier-Stokes equations.

Inventors:
TUMANOV EVGENII (US)
KOROBCHENKO DMITRY (US)
SOLOVEY ALEXEY (US)
Application Number:
PCT/US2019/062551
Publication Date:
May 27, 2021
Filing Date:
November 21, 2019
Assignee:
NVIDIA CORP (US)
International Classes:
G06T7/174; G06T7/215; G06T7/246
Foreign References:
US8452721B2 2013-05-28
US20160246903A1 2016-08-25
US5615677A 1997-04-01
Attorney, Agent or Firm:
LOHR, Jason (US)
Claims:
CLAIMS

WHAT IS CLAIMED IS:

1. A computer-implemented method, comprising: determining, for a set of frames, a simulation to be performed on particles located in a domain, the simulation including an incompressibility constraint, and the set of frames including at least a first frame and a second frame; obtaining, for a first frame, motion information of each particle located in the domain, the motion information including a first position and a first velocity; and determining a final position and a final velocity of each particle in the domain at an end of the first frame based at least in part on the motion information, a density tensor, and a rigid body mask tensor; outputting the final position and the final velocity of each particle in the domain for use in the second frame; and performing, using one or more processors, the simulation for the first frame and the second frame based at least in part on the final position and the final velocity of each particle.

2. The computer-implemented method of claim 1, wherein obtaining the motion information comprises: determining a set of external forces applied to each of the particles in the first frame, the motion information being obtained based at least in part on the set of external forces.

3. The computer-implemented method of claim 1, further comprising: utilizing at least one trained neural network to determine the final position and the final velocity of each particle in the domain.

4. The computer-implemented method of claim 3, wherein the obtaining the density tensor for the first frame comprises: spatially dividing the domain into voxels according to at least one capability of one or more processors to perform the simulation, each of the voxels representing subvolumes of particles for the simulation; determining a number of particles located within each of the voxels based at least in part on the incompressibility constraint; and generating the density tensor as a position map of the particles located in the domain based at least in part on the number of particles determined for each of the voxels.

5. The computer-implemented method of claim 4, further comprising obtaining the rigid body mask tensor for the first frame.

6. The computer-implemented method of claim 5, wherein the obtaining the rigid body mask tensor for the first frame comprises: determining a first subset of the voxels of the density tensor is covered by a rigid body and a second subset of the voxels of the density tensor is not covered by the rigid body; and assigning a first value to each of the first subset of voxels in the density tensor and a second value to each of the second subset of voxels in the density tensor.

7. The computer-implemented method of claim 6, wherein the determining the final position and the final velocity of each particle in the domain at an end of the first frame comprises: inputting the density tensor as a first channel and the rigid body mask tensor as a second channel into a network, the network having three spatial dimensions; and obtaining, using the one or more processors, an intermediate tensor for the set of frames based at least in part on the first channel and the second channel; and obtaining, using the one or more processors, a correction factor for each of the particles based at least in part on a depth-to-space operation and the intermediate tensor.

8. A system comprising: one or more processors; and memory including instructions that, when executed by the one or more processors, cause the system to: determine, for a set of frames, a simulation to be performed on particles located in a domain, the simulation including an incompressibility constraint, and the set of frames including at least a first frame and a second frame; obtain, for a first frame, motion information of each particle located in the domain, the motion information including a first position and a first velocity; and determine a final position and a final velocity of each particle in the domain at an end of the first frame based at least in part on the motion information, a density tensor, and a rigid body mask tensor; output the final position and the final velocity of each particle in the domain for use in the second frame; and perform, using the one or more processors, the simulation for the first frame and the second frame based at least in part on the final position and the final velocity of each particle.

9. The system of claim 8, wherein the one or more processors are configured to obtain the motion information by: determining a set of external forces applied to each of the particles in the first frame, the motion information being obtained based at least in part on the set of external forces.

10. The system of claim 8, wherein the one or more processors are further configured to obtain the density tensor for the first frame.

11. The system of claim 10, wherein the one or more processors are configured to obtain the density tensor for the first frame by: spatially dividing the domain into voxels according to at least one capability of one or more processors to perform the simulation, each of the voxels representing subvolumes of particles for the simulation; determining a number of particles located within each of the voxels based at least in part on the incompressibility constraint; and generating the density tensor as a position map of the particles located in the domain based at least in part on the number of particles determined for each of the voxels.

12. The system of claim 11, wherein the one or more processors are further configured to obtain the rigid body mask tensor for the first frame.

13. The system of claim 12, wherein the one or more processors are configured to obtain the rigid body mask tensor for the first frame by: determining a first subset of the voxels of the density tensor is covered by a rigid body and a second subset of the voxels of the density tensor is not covered by the rigid body; and assigning a first value to each of the first subset of voxels in the density tensor and a second value to each of the second subset of voxels in the density tensor.

14. The system of claim 13, wherein one or more processors are configured to determine the final position and the final velocity of each particle in the domain at an end of the first frame by: inputting the density tensor as a first channel and the rigid body mask tensor as a second channel into a network, the network having three spatial dimensions; and obtaining, using the one or more processors, an intermediate tensor for the set of frames based at least in part on the first channel and the second channel; and obtaining, using the one or more processors, a correction factor for each of the particles based at least in part on a depth-to-space operation and the intermediate tensor.

15. A non-transitory machine-readable medium having stored thereon a set of instructions, which if performed by one or more processors, cause the one or more processors to at least: determine, for a set of frames, a simulation to be performed on particles located in a domain, the simulation including an incompressibility constraint, and the set of frames including at least a first frame and a second frame; obtain, for a first frame, motion information of each particle located in the domain, the motion information including a first position and a first velocity; and determine a final position and a final velocity of each particle in the domain at an end of the first frame based at least in part on the motion information, a density tensor, and a rigid body mask tensor; output the final position and the final velocity of each particle in the domain for use in the second frame; and perform, using one or more processors, the simulation for the first frame and the second frame based at least in part on the final position and the final velocity of each particle.

16. The non-transitory machine-readable medium of claim 15, wherein instructions when performed to obtain the motion information further cause the one or more processors to: determine a set of external forces applied to each of the particles in the first frame, the motion information being obtained based at least in part on the set of external forces.

17. The non-transitory machine-readable medium of claim 15, wherein instructions when performed further cause the one or more processors to obtain the density tensor for the first frame.

18. The non-transitory machine-readable medium of claim 17, wherein instructions when performed to obtain the density tensor for the first frame further cause the one or more processors to: spatially divide the domain into voxels according to at least one capability of one or more processors to perform the simulation, each of the voxels representing subvolumes of particles for the simulation; and generate the density tensor as a position map of the particles located in the domain for each of the voxels.

19. The non-transitory machine-readable medium of claim 18, wherein instructions when performed to obtain the rigid body mask tensor for the first frame further cause the one or more processors to: determine a first subset of the voxels of the density tensor is covered by a rigid body and a second subset of the voxels of the density tensor is not covered by the rigid body; and assign a first value to each of the first subset of voxels in the density tensor and a second value to each of the second subset of voxels in the density tensor.

20. The non-transitory machine-readable medium of claim 19, wherein instructions when performed to determine the final position and the final velocity of each particle in the domain at an end of the first frame further cause the one or more processors to: input the density tensor as a first channel and the rigid body mask tensor as a second channel into a network, the network having three spatial dimensions; and obtain, using the one or more processors, an intermediate tensor for the set of frames based at least in part on the first channel and the second channel; and obtain, using the one or more processors, a correction factor for each of the particles based at least in part on a depth-to-space operation and the intermediate tensor.

Description:
FLUID SIMULATIONS USING ONE OR MORE NEURAL NETWORKS

BACKGROUND

[0001] Fluid simulation is a long-standing problem in computer graphics, applied mathematics, and other fields. Computing high-resolution fluid simulations using traditional techniques is challenging in that fluid simulations require an undesirable amount of computational resources to compute a scene with a large number of water particles, such as a scene with a river or body of water. The main computational bottleneck is associated with a time step size that is small enough to provide stability. The simulation time for fluid simulations using traditional techniques may be in the range of hours to days, and hence, such techniques may be unable to achieve high-resolution fluid simulation in real-time. Consequently, computing high-resolution fluid simulation using traditional techniques may be impractical for certain use cases, such as computer simulations and video games.

BRIEF DESCRIPTION OF THE DRAWINGS

[0002] Various embodiments in accordance with the present disclosure will be described with reference to the drawings, in which:

[0003] FIG. 1 illustrates an example of a fluid simulation for a computer graphics application that can be performed in accordance with various embodiments.

[0004] FIG. 2 illustrates an example fluid incompressibility constraint that can be utilized in accordance with various embodiments.

[0005] FIG. 3A illustrates an example approach to generating a density tensor that can be utilized in accordance with various embodiments.

[0006] FIG. 3B illustrates an example rigid body mask tensor that can be utilized in accordance with various embodiments.

[0007] FIG. 4 illustrates an example network that can be utilized in accordance with various embodiments.

[0008] FIG. 5 illustrates an example inner depth-to-space operation that can be performed in accordance with various embodiments.

[0009] FIG. 6 illustrates a first example process for performing a simulation that can be utilized in accordance with various embodiments.

[0010] FIG. 7 illustrates inference and/or training logic, according to at least one embodiment;

[0011] FIG. 8 illustrates inference and/or training logic, according to at least one embodiment;

[0012] FIG. 9 illustrates a data center system, according to at least one embodiment;

[0013] FIG. 10 illustrates a computer system, according to at least one embodiment;

[0014] FIG. 11 illustrates a computer system, according to at least one embodiment;

[0015] FIG. 12 illustrates a computer system, according to at least one embodiment;

[0016] FIG. 13 illustrates a computer system, according to at least one embodiment; and

[0017] FIG. 14 illustrates a computer system, according to at least one embodiment.

DETAILED DESCRIPTION

[0018] In the following description, various embodiments will be described. For purposes of explanation, specific configurations and details are set forth in order to provide a thorough understanding of the embodiments. However, it will also be apparent to one skilled in the art that the embodiments may be practiced without the specific details. Furthermore, well-known features may be omitted or simplified in order not to obscure the embodiment being described.

[0019] As mentioned above, fluid simulation is a long-standing problem in computer graphics, applied mathematics, and other fields due to the computational bottleneck caused by the time step size needed for high-resolution fluid simulation in real-time. Conventional approaches to fluid simulation include the Eulerian computational paradigm (grid based) and the Lagrangian computational paradigm (particle based). Methods that perform fluid simulation using the Eulerian computational paradigm operate with scalar and vector physical fields on a regular grid, and solve the Navier-Stokes equations for pressure, viscosity, surface tension, and incompressibility. Methods that operate using the Lagrangian computational paradigm approximate continuous quantities in the Navier-Stokes equations with quantities known at discrete entities called particles.

[0020] While incompressibility can be enforced in a fluid simulation using the Navier-Stokes equations, these computations are very time consuming, and hence, may be impractical for use in real-time fluid simulation. Thus, there is a need to compute high-resolution fluid simulations that enforce the incompressibility of a fluid without solving the Navier-Stokes equations.

[0021] Approaches in accordance with various embodiments provide for fluid simulation with substantially reduced time and memory requirements with respect to conventional approaches. In particular, various embodiments generate a density tensor and rigid body map tensor for a large number of particles contained in a given sub-domain of a particle domain. The density tensor may be represented in a number of different formats, and may include NVIDIA® GVDB Voxels. The rigid body map tensor may be represented as a binary determination of which of the NVIDIA® GVDB Voxels in the density tensor are covered by a rigid body during a particular frame in a simulated scene. Collectively, the density tensor and rigid body map may represent input channels of a network (e.g., a trained neural network) with three spatial dimensions. The network may apply a series of operations (e.g., convolutions, dilated convolutions, three-dimensional pixel shuffling, etc.) to the input channels to predict an updated position and updated velocity for each particle at the end of a frame. Such approaches can handle tens of millions of particles within a virtually unbounded simulation domain, as compared to classical approaches that solve the Navier-Stokes equations. Data in some embodiments can be collected using a Position Based Fluids method, although other solvers can be used as well within the scope of various embodiments.

[0022] Various other functions can be implemented within the various embodiments as well as discussed and suggested elsewhere herein.

[0023] As mentioned, it may be desirable in various instances to perform a high-resolution fluid simulation in real-time. For example, as illustrated in the example image 100 of FIG. 1, a computer graphics application for a movie or video game might involve display objects 102 dropping into or moving through a body or basin 106 of water. While “basin” is used in various examples, it should be understood that a particle domain can be split into a number of sub-domains through which objects can move. In at least some embodiments, as discussed in more detail elsewhere herein, density and rigid body masks can be constructed, and a network run, for each sub-domain, and these tasks are not limited to a specific basin. In order to model and render the movement and displacement of water in a realistic fashion, a particle simulation can be used wherein “droplets” of water are modeled as individual particles 104 and the behavior of the particles is determined by the particle simulation, which can take into account factors such as the positions, velocities, and incompressibility of the various particles. In many instances, the simulation can be performed on one or more processors, such as a set of graphics processing units (GPUs). As mentioned, a high-resolution fluid simulation can be performed on GPU hardware using a methodology that enforces the incompressibility of the fluid without solving the Navier-Stokes equations. A simulation might involve tens of millions of particles within a virtually unbounded simulation domain. In some embodiments, the fluid can be modeled using an incompressibility constraint m that dictates that when at rest (x, = 0, v, = 0), the number of particles located within radius R from one another cannot be larger than m. As illustrated in the example image 200 in FIG. 2, the incompressibility constraint m is that no more than 10 particles can be located within radius R from any one particle 104 when at rest. However, other incompressibility constraints may be used without departing from the scope of the present disclosure.
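The rest-state constraint described in paragraph [0023] can be illustrated with a brute-force neighbor count. The following is a minimal sketch in Python with NumPy, not the disclosed implementation. The function names `neighbor_counts` and `satisfies_rest_constraint` are hypothetical, and the choice not to count a particle as its own neighbor is an assumption.

```python
import numpy as np

def neighbor_counts(positions, radius):
    """Count, for each particle, the other particles within `radius`.

    positions: (N, 3) array. Brute force O(N^2), for illustration only.
    """
    positions = np.asarray(positions, dtype=float)
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    within = dist <= radius
    # Assumption: a particle is not counted as its own neighbor.
    np.fill_diagonal(within, False)
    return within.sum(axis=1)

def satisfies_rest_constraint(positions, radius, m):
    """True if no particle has more than m neighbors within `radius`."""
    return bool(np.all(neighbor_counts(positions, radius) <= m))
```

A practical simulation with millions of particles would replace the quadratic distance matrix with a spatial grid or similar structure; the sketch only shows what the constraint measures.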

[0024] When a fluid is simulated as a large set of particles, each particle may have vector quantities of position and velocity associated therewith. In some embodiments, the fluid simulation may be divided into a set of steps that may be performed for each frame of a scene being rendered. The set of steps may be performed to update the position and velocity for each of the particles throughout the duration of a frame. At the end of a frame, the updated positions and velocities may be used as the initial positions and velocities for the particles during the subsequent frame. Details associated with each of the steps are set forth below in connection with FIGs. 3A-6.

[0025] At the beginning of a frame n, the one or more processors may determine a motion, also referred to as an advection in some embodiments, of individual particles 104 that is caused by the flow of the bulk fluid being simulated. As seen below, the velocity ($v_i^{n*}$) may be determined for particle i in frame n using equation 1, and the position ($x_i^{n*}$) for particle i at frame n may be determined using equation 2. As seen in equation 1, velocity ($v_i^{n*}$) may be equal to the initial velocity ($v_i^n$) for particle i at the start of frame n plus the sum of external forces (e.g., gravity, viscosity, surface tension, pressure, etc.) applied during time period $\Delta t$, where time period $\Delta t$ is the duration of a frame. For example, if a scene is run at 60 frames / second, then time period $\Delta t$ is approximately 16.7 ms.

[0026] As seen in equation 2, the position ($x_i^{n*}$) may be equal to the initial position ($x_i^n$) of particle i (e.g., at the start of frame n) plus an average of the initial velocity ($v_i^n$) and the velocity ($v_i^{n*}$) applied during time period $\Delta t$, where time period $\Delta t$ is the duration of frame n. The one or more processors may maintain a particle positions table that correlates one or more of the velocity ($v_i^{n*}$) and position ($x_i^{n*}$) for each of the particles 104 in relation to all of the other particles 104 in frame n. Once the velocity ($v_i^{n*}$) and the position ($x_i^{n*}$) for each of the particles are determined, the one or more processors may perform collision detection, as seen below in equation 3. In certain implementations, the one or more processors may perform collision detection by estimating viscosity and surface tension on a grid using, for example, signed distance fields (SDFs).

$v_i^{n*} = v_i^n + \Delta t \sum f_{\mathrm{ext}}$ (equation 1)

$x_i^{n*} = x_i^n + \Delta t \, \frac{v_i^n + v_i^{n*}}{2}$ (equation 2)

$x_i^{n*} = \mathrm{CollisionDetection}(x_i^{n*})$ (equation 3)
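The advection step of paragraphs [0025] and [0026] can be sketched as follows in Python with NumPy. This is an illustrative sketch under stated assumptions: the position update is assumed to start from the particle's initial position, the external forces are assumed to be supplied as a per-particle acceleration `accel` (force per unit mass), and the collision detection of equation 3 is omitted. The name `advect` is hypothetical.

```python
import numpy as np

def advect(x, v, accel, dt):
    """Predict intermediate velocity and position for one frame.

    x, v, accel: (N, 3) arrays of positions, velocities, and the summed
    external accelerations (assumption: forces are given per unit mass).
    dt: frame duration in seconds.
    """
    x = np.asarray(x, dtype=float)
    v = np.asarray(v, dtype=float)
    accel = np.asarray(accel, dtype=float)
    # Equation 1: initial velocity plus external forces applied over dt.
    v_star = v + dt * accel
    # Equation 2: initial position plus the average of the initial and
    # intermediate velocities applied over dt.
    x_star = x + dt * 0.5 * (v + v_star)
    return x_star, v_star
```

For a scene rendered at 60 frames per second, `dt` would be `1.0 / 60.0`.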

[0027] To determine a final velocity ($v_i^{n+1}$) and final position ($x_i^{n+1}$) for particle i at the end of frame n (e.g., which also serve as the initial velocity and initial position of particle i at the start of frame n+1), the one or more processors may determine a correction factor ($\delta_i$) in order to account for any other forces acting on the particles 104 during frame n. The other forces may include, for example, an object falling into the basin 106. For example, in frame n a ball begins falling into the basin 106 and colliding with a set of particles 104. In order to provide a realistic high-resolution fluid simulation, the one or more processors determine a correction factor $\delta_i$ in order to predict and display the movement of water caused by the force imparted on the set of particles 104 by the falling object.

[0028] As seen below, the final velocity ($v_i^{n+1}$) may be determined for particle i in frame n using equation 4, and the final position ($x_i^{n+1}$) for particle i in frame n using equation 5. As seen in equation 4, the final velocity ($v_i^{n+1}$) may be equal to the velocity ($v_i^{n*}$) for particle i in frame n plus a correction factor ($\delta_i$). As seen in equation 5, the final position ($x_i^{n+1}$) for particle i at the end of frame n may be equal to the position ($x_i^{n*}$) plus half of the correction factor ($\delta_i$) applied during time period $\Delta t$, where time period $\Delta t$ is the duration of frame n.

$v_i^{n+1} = v_i^{n*} + \delta_i$ (equation 4)

$x_i^{n+1} = x_i^{n*} + \frac{\Delta t}{2} \, \delta_i$ (equation 5)
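The correction step of paragraph [0028] can be sketched in a few lines of Python with NumPy. This is a minimal sketch assuming the correction factor is a per-particle velocity correction, so that half of it applied over the frame duration yields the position change. The name `apply_correction` is hypothetical.

```python
import numpy as np

def apply_correction(x_star, v_star, delta, dt):
    """Apply per-particle correction factors to intermediate state.

    x_star, v_star, delta: (N, 3) arrays; dt: frame duration.
    """
    x_star = np.asarray(x_star, dtype=float)
    v_star = np.asarray(v_star, dtype=float)
    delta = np.asarray(delta, dtype=float)
    # Equation 4: final velocity is intermediate velocity plus correction.
    v_next = v_star + delta
    # Equation 5: final position is intermediate position plus half of
    # the correction applied over the frame duration.
    x_next = x_star + 0.5 * dt * delta
    return x_next, v_next
```

The returned arrays would then serve as the initial position and velocity for the next frame.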

[0029] In order to obtain the correction factor $\delta_i$ for each of the particles 104, the one or more processors may generate a density tensor and a rigid body mask tensor. The density tensor and the rigid body mask tensor may represent the input channels of a network (e.g., described below in connection with FIG. 4) that determines the correction factor $\delta_i$ for each of the particles 104 in frame n.

[0030] For example, the one or more processors may spatially divide the basin 106, or the region in the basin 106 containing particles 104, into equal voxels 302, as illustrated in the example image 300 in FIG. 3A. In certain implementations, each voxel 302 may be a cube with a width, height, and depth equal to radius R described above in connection with the incompressibility constraint m. The one or more processors may determine the number of particles in each voxel 302 as an ordinary sum, or a weighted sum, as may be based at least in part on the determined velocity ($v_i^{n*}$) and the position ($x_i^{n*}$) for each of the particles 104, as well as the incompressibility constraint m. The number of particles 104 in each voxel 302 may represent the density tensor.
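The voxel counting described in paragraph [0030] can be sketched as follows in Python with NumPy. This sketch implements only the ordinary sum, not the weighted variant. The function name `density_tensor`, the `origin` and `grid_shape` parameters, the (x, y, z) axis order, and the decision to ignore particles outside the grid are all assumptions made for illustration.

```python
import numpy as np

def density_tensor(positions, origin, voxel_size, grid_shape):
    """Count particles per voxel (ordinary sum).

    positions: (N, 3) particle positions.
    origin: (3,) corner of the gridded region.
    voxel_size: cube edge length (e.g., the radius R).
    grid_shape: (nx, ny, nz) number of voxels along each axis.
    """
    positions = np.asarray(positions, dtype=float)
    origin = np.asarray(origin, dtype=float)
    idx = np.floor((positions - origin) / voxel_size).astype(int)
    # Assumption: particles outside the gridded region are ignored.
    inside = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=1)
    density = np.zeros(grid_shape, dtype=np.float32)
    # np.add.at accumulates correctly when several particles share a voxel.
    np.add.at(density, tuple(idx[inside].T), 1.0)
    return density
```

Plain fancy-index assignment (`density[i, j, k] += 1`) would count a repeated index only once, which is why `np.add.at` is used here.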

[0031] In some embodiments, the one or more processors may obtain the rigid body mask tensor 404 by determining whether a rigid body 306a, 306b covers the center point 304 of any of the voxels 302 in frame n, as illustrated in the example image 310 in FIG. 3B. The example image 310 in FIG. 3B is a top-down view of the topmost layer of voxels 302 in the y-direction in the example image 300 of FIG. 3A. All voxels 302 that share a center point axis in the y-direction under a topmost layer voxel 302 with a center point 304 that is covered by a rigid body 306a, 306b may also be considered to be covered by the rigid body 306a, 306b. In some embodiments, the one or more processors may obtain the rigid body mask tensor by assigning a first value (e.g., 1) to each of the voxels 302 with a center point 304 that is covered by a rigid body 306a, 306b, and a second value (e.g., 0) to each of the voxels 302 with a center point 304 that remains uncovered by a rigid body 306a, 306b in frame n.
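The center-point test and column propagation of paragraph [0031] can be sketched as follows in Python with NumPy. The disclosure does not specify how a rigid body's footprint is represented, so this sketch assumes, purely for illustration, that each rigid body is a disc `(cx, cz, radius)` in the top-down x-z plane. The function name `rigid_body_mask` and its parameters are hypothetical.

```python
import numpy as np

def rigid_body_mask(grid_shape, origin, voxel_size, discs):
    """Binary mask of voxels covered by rigid bodies, seen top-down.

    grid_shape: (nx, ny, nz); y is the vertical axis.
    discs: list of (cx, cz, radius) footprints (illustrative assumption).
    """
    nx, ny, nz = grid_shape
    # Center points of the voxels along x and z.
    cx = origin[0] + (np.arange(nx) + 0.5) * voxel_size
    cz = origin[2] + (np.arange(nz) + 0.5) * voxel_size
    X, Z = np.meshgrid(cx, cz, indexing="ij")
    covered = np.zeros((nx, nz), dtype=bool)
    for dx, dz, r in discs:
        covered |= (X - dx) ** 2 + (Z - dz) ** 2 <= r ** 2
    # Every voxel in the column under a covered topmost voxel is covered.
    mask = np.repeat(covered[:, None, :], ny, axis=1)
    return mask.astype(np.float32)  # 1.0 = covered, 0.0 = uncovered
```

The resulting array has the same shape as the density tensor, so the two can be stacked as the two input channels.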

[0032] As mentioned above, the density tensor 402 and a rigid body mask tensor 404 may each represent a single input channel into an example network 400, as illustrated in the example image of FIG. 4. Each voxel in the density tensor 402 may contain information about one or more of the density of particles 104, the position of particles 104, or velocity of particles 104 located within that voxel during frame n. The rigid body mask tensor 404 may designate each voxel with a 1 or a 0 to indicate which voxels are covered by a rigid body 306a, 306b, as described above in connection with FIG. 3B. The density tensor 402 and the rigid body mask tensor 404 may be padded with 0s along the x-direction, y-direction, and z-direction, for example, to maintain the same dimensions for the output channel after applying a convolution.

[0033] In the example network 400 illustrated in FIG. 4, the one or more processors may apply a series of convolutions 406a, 406b to the density tensor 402 and rigid body mask tensor 404 input channels. The convolution 406a may have, for example, 32 filters of kernel size 3 x 3 x 3, which outputs a features map 408a with 32 channels. A second convolution 406b with, for example, 32 filters of kernel size 3 x 3 x 3 may be applied to the features map 408a, which may output a second features map 408b with 32 channels. The one or more processors may apply a dilated convolution 410 with, for example, 32 filters of kernel size 3 x 3 x 3 to the second features map 408b, and output a third features map 408c with 32 channels. The dilated convolution 410 may provide an increased receptive field of the network 400 over the information contained in the density tensor 402 and the rigid body mask tensor 404. The one or more processors may apply another convolution 406c with, for example, 32 filters of kernel size 3 x 3 x 3 to the third features map 408c and output a fourth features map 408d. Another convolution 406d with, for example, 32 filters of kernel size 1 x 1 x 1 may be applied to the fourth features map 408d, and output a fifth features map 408e with 32 channels. Another convolution 406e with, for example, 24 filters of kernel size 1 x 1 x 1 may be applied to the fifth features map 408e, and output a sixth features map 408f with 24 channels. The sixth features map 408f may have dimensions of $D \times H \times W \times C \cdot r^3$, where D, H, and W are the spatial dimensions, C is a constant, and $r^3$ is the number of channels.
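The layer sequence of paragraph [0033] can be traced with a short pure-Python sketch that tracks the channel count and the receptive field along one axis after each layer. The layer list follows the example values in the text. The dilation rate of 2 for the dilated convolution 410 and a stride of 1 throughout are assumptions, since the text does not state them. The names `LAYERS` and `trace` are hypothetical.

```python
# (name, kernel size per axis, dilation, output channels)
LAYERS = [
    ("conv 406a", 3, 1, 32),
    ("conv 406b", 3, 1, 32),
    ("dilated conv 410", 3, 2, 32),  # dilation rate 2 is an assumption
    ("conv 406c", 3, 1, 32),
    ("conv 406d", 1, 1, 32),
    ("conv 406e", 1, 1, 24),
]

def trace(in_channels=2):
    """Return (name, in_channels, out_channels, receptive_field) per layer.

    Assumes stride 1 everywhere, so each layer widens the receptive
    field by (kernel - 1) * dilation voxels along each axis.
    """
    rows = []
    rf = 1
    channels = in_channels
    for name, kernel, dilation, out_channels in LAYERS:
        rf += (kernel - 1) * dilation
        rows.append((name, channels, out_channels, rf))
        channels = out_channels
    return rows

for row in trace():
    print(row)
```

Under these assumptions, each output voxel depends on an 11 x 11 x 11 neighborhood of input voxels. The 24 output channels are consistent with C = 3 and r = 2 in the C · r^3 expression, although the text does not state those values.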

[0034] In order to obtain information about spatially neighboring channels in the sixth features map 408f, the one or more processors may perform a depth-to-space operation 412 in which information from one channel is related to spatially neighboring channels, as seen in the example image 500 of FIG. 5. The depth-to-space operation 412 may produce a seventh features map 414 with dimensions $rD \times rH \times rW \times C$. The one or more processors may perform nearest predictions pooling 418 based on information in the particle positions map 416 and the seventh features map 414. The nearest predictions pooling 418 may output a table 420 with correction factors $\delta_i$ for each of the particles 104 in frame n.

[0035] In some embodiments, the one or more processors may determine the final velocity ($v_i^{n+1}$) and the final position ($x_i^{n+1}$), respectively using equations 4 and 5 above, and the correction factors $\delta_i$ listed in the table 420. By extracting channels of information from the density tensor 402 and the rigid body mask tensor 404 using the network described in connection with FIG. 4, the one or more processors of the present disclosure may perform high-resolution fluid simulation without solving the Navier-Stokes equations, and hence may be used in real-time.
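The depth-to-space operation and nearest predictions pooling of paragraph [0034] can be sketched as follows in Python with NumPy. This is an illustrative sketch, not the disclosed implementation. The ordering of the r^3 sub-blocks within the channel axis is an assumption, as is the interpretation of nearest predictions pooling as a lookup of the fine-grid cell that contains each particle. The names `depth_to_space_3d` and `nearest_predictions` are hypothetical.

```python
import numpy as np

def depth_to_space_3d(t, r):
    """Rearrange a (D, H, W, C * r^3) tensor into (r*D, r*H, r*W, C)."""
    D, H, W, ch = t.shape
    if ch % r**3 != 0:
        raise ValueError("channel count must be divisible by r^3")
    C = ch // r**3
    # Assumption: channels are laid out as (r, r, r, C) blocks.
    t = t.reshape(D, H, W, r, r, r, C)
    t = t.transpose(0, 3, 1, 4, 2, 5, 6)  # (D, r, H, r, W, r, C)
    return t.reshape(D * r, H * r, W * r, C)

def nearest_predictions(fine, positions, origin, voxel_size, r):
    """Look up, for each particle, the prediction in its fine-grid cell."""
    positions = np.asarray(positions, dtype=float)
    origin = np.asarray(origin, dtype=float)
    cell = voxel_size / r  # the fine grid is r times denser per axis
    idx = np.floor((positions - origin) / cell).astype(int)
    # Clamp particles on or beyond the boundary to the nearest cell.
    idx = np.clip(idx, 0, np.array(fine.shape[:3]) - 1)
    return fine[idx[:, 0], idx[:, 1], idx[:, 2]]
```

With 24 input channels and r = 2, the result has 3 channels per fine cell, which could hold one three-component correction vector per cell; that reading is an inference from the channel counts rather than something the text states.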

[0036] FIG. 6 illustrates an example process 600 for performing a simulation that can be utilized in accordance with various embodiments. It should be understood for this and other processes discussed herein that there can be additional, alternative, or fewer steps performed in similar or alternative orders, or in parallel, within the scope of the various embodiments unless otherwise stated. Further, although fluid simulations are utilized as a primary use case for description purposes, various other simulations and applications can take advantage of aspects of the various embodiments as discussed and suggested herein, as may relate to hair simulation and the like. In this example, a particle simulation to be performed is determined 602. This can include, for example, determining a fluid simulation to be performed as part of a computer graphics application, among other such options. Motion information associated with each particle located in a basin or domain may be determined 604, such as is described above in connection with equations 1-3. A density tensor for frame n may be obtained 604 or generated, such as described above in connection with FIG. 3A. The density tensor may represent the positions and/or velocities of all particles with respect to voxels. A rigid body mask may be obtained 606 or generated for frame n, such as is described above in connection with FIG. 3B. The final position and final velocity for each of the particles in frame n may be determined 610, such as described above in connection with equations 4-5 and FIG. 4. The final position and final velocity for each of the particles in frame n may be output for use as the initial velocity and initial position in frame n + 1, as described above in connection with equations 4-5 and FIG. 4. The simulation for the first and second frames may be performed based, at least in part, on the final velocity and final position determined for each of the particles.

INFERENCE AND TRAINING LOGIC

[0037] FIG. 7 illustrates inference and/or training logic 715 used to perform inferencing and/or training operations associated with at least one embodiment. Details regarding inference and/or training logic 715 are provided below in conjunction with FIGs. 7 and 8.

[0038] In at least one embodiment, inference and/or training logic 715 may include, without limitation, code and/or data storage 701 to store forward and/or output weight and/or input/output data, and/or other parameters to configure neurons or layers of a neural network trained and/or used for inferencing in aspects of one or more embodiments. In at least one embodiment, training logic 715 may include, or be coupled to, code and/or data storage 701 to store graph code or other software to control timing and/or order in which weight and/or other parameter information is to be loaded to configure logic, including integer and/or floating point units (collectively, arithmetic logic units (ALUs)). In at least one embodiment, code, such as graph code, loads weight or other parameter information into processor ALUs based on an architecture of a neural network to which this code corresponds. In at least one embodiment, code and/or data storage 701 stores weight parameters and/or input/output data of each layer of a neural network trained or used in conjunction with one or more embodiments during forward propagation of input/output data and/or weight parameters during training and/or inferencing using aspects of one or more embodiments. In at least one embodiment, any portion of code and/or data storage 701 may be included with other on-chip or off-chip data storage, including a processor’s L1, L2, or L3 cache or system memory.

[0039] In at least one embodiment, any portion of code and/or data storage 701 may be internal or external to one or more processors or other hardware logic devices or circuits. In at least one embodiment, code and/or data storage 701 may be cache memory, dynamic randomly addressable memory (“DRAM”), static randomly addressable memory (“SRAM”), non-volatile memory (e.g., Flash memory), or other storage. In at least one embodiment, choice of whether code and/or data storage 701 is internal or external to a processor, for example, or comprised of DRAM, SRAM, Flash or some other storage type may depend on available storage on-chip versus off-chip, latency requirements of training and/or inferencing functions being performed, batch size of data used in inferencing and/or training of a neural network, or some combination of these factors.

[0040] In at least one embodiment, inference and/or training logic 715 may include, without limitation, a code and/or data storage 705 to store backward and/or output weight and/or input/output data corresponding to neurons or layers of a neural network trained and/or used for inferencing in aspects of one or more embodiments. In at least one embodiment, code and/or data storage 705 stores weight parameters and/or input/output data of each layer of a neural network trained or used in conjunction with one or more embodiments during backward propagation of input/output data and/or weight parameters during training and/or inferencing using aspects of one or more embodiments. In at least one embodiment, training logic 715 may include, or be coupled to, code and/or data storage 705 to store graph code or other software to control timing and/or order in which weight and/or other parameter information is to be loaded to configure logic, including integer and/or floating point units (collectively, arithmetic logic units (ALUs)). In at least one embodiment, code, such as graph code, loads weight or other parameter information into processor ALUs based on an architecture of a neural network to which this code corresponds. In at least one embodiment, any portion of code and/or data storage 705 may be included with other on-chip or off-chip data storage, including a processor’s L1, L2, or L3 cache or system memory. In at least one embodiment, any portion of code and/or data storage 705 may be internal or external to one or more processors or other hardware logic devices or circuits. In at least one embodiment, code and/or data storage 705 may be cache memory, DRAM, SRAM, non-volatile memory (e.g., Flash memory), or other storage. In at least one embodiment, choice of whether code and/or data storage 705 is internal or external to a processor, for example, or comprised of DRAM, SRAM, Flash or some other storage type may depend on available storage on-chip versus off-chip, latency requirements of training and/or inferencing functions being performed, batch size of data used in inferencing and/or training of a neural network, or some combination of these factors.

[0041] In at least one embodiment, code and/or data storage 701 and code and/or data storage 705 may be separate storage structures. In at least one embodiment, code and/or data storage 701 and code and/or data storage 705 may be same storage structure. In at least one embodiment, code and/or data storage 701 and code and/or data storage 705 may be partially same storage structure and partially separate storage structures. In at least one embodiment, any portion of code and/or data storage 701 and code and/or data storage 705 may be included with other on-chip or off-chip data storage, including a processor’s L1, L2, or L3 cache or system memory.

[0042] In at least one embodiment, inference and/or training logic 715 may include, without limitation, one or more arithmetic logic unit(s) (“ALU(s)”) 910, including integer and/or floating point units, to perform logical and/or mathematical operations based, at least in part on, or indicated by, training and/or inference code (e.g., graph code), a result of which may produce activations (e.g., output values from layers or neurons within a neural network) stored in an activation storage 820 that are functions of input/output and/or weight parameter data stored in code and/or data storage 801 and/or code and/or data storage 805. In at least one embodiment, activations stored in activation storage 820 are generated according to linear algebraic and/or matrix-based mathematics performed by ALU(s) in response to performing instructions or other code, wherein weight values stored in code and/or data storage 805 and/or code and/or data storage 801 are used as operands along with other values, such as bias values, gradient information, momentum values, or other parameters or hyperparameters, any or all of which may be stored in code and/or data storage 805 or code and/or data storage 801 or another storage on or off-chip.
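
As a non-limiting illustration of the preceding paragraph, the following Python sketch computes activations as a function of input data and stored weight and bias values. The names `parameter_storage`, `activation_storage`, and `dense_forward`, the layer sizes, and the choice of a ReLU nonlinearity are assumptions made only for this example and are not taken from this disclosure.

```python
# Hypothetical stand-ins for code and/or data storage and activation storage.
parameter_storage = {
    "weights": [[1.0, -1.0], [0.5, 2.0], [0.0, 1.0]],  # 3 outputs x 2 inputs
    "bias": [0.0, -1.0, 0.5],
}
activation_storage = {}


def dense_forward(weights, bias, inputs):
    """Compute ReLU(W x + b), one output neuron per weight row."""
    activations = []
    for row, b in zip(weights, bias):
        pre_activation = sum(w * x for w, x in zip(row, inputs)) + b
        activations.append(max(pre_activation, 0.0))  # ReLU is an assumed choice
    return activations


layer_input = [2.0, 1.0]
activation_storage["layer_0"] = dense_forward(
    parameter_storage["weights"], parameter_storage["bias"], layer_input
)
```

In this sketch the stored weights and bias are the operands, and the result is written to a separate activation store, mirroring the separation between parameter storage and activation storage described above.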

[0043] In at least one embodiment, ALU(s) are included within one or more processors or other hardware logic devices or circuits, whereas in another embodiment, ALU(s) may be external to a processor or other hardware logic device or circuit that uses them (e.g., a coprocessor). In at least one embodiment, ALUs may be included within a processor’s execution units or otherwise within a bank of ALUs accessible by a processor’s execution units either within same processor or distributed between different processors of different types (e.g., central processing units, graphics processing units, fixed function units, etc.). In at least one embodiment, code and/or data storage 801, code and/or data storage 805, and activation storage 820 may be on same processor or other hardware logic device or circuit, whereas in another embodiment, they may be in different processors or other hardware logic devices or circuits, or some combination of same and different processors or other hardware logic devices or circuits. In at least one embodiment, any portion of activation storage 820 may be included with other on-chip or off-chip data storage, including a processor’s L1, L2, or L3 cache or system memory. Furthermore, inferencing and/or training code may be stored with other code accessible to a processor or other hardware logic or circuit and fetched and/or processed using a processor’s fetch, decode, scheduling, execution, retirement and/or other logical circuits.

[0044] In at least one embodiment, activation storage 820 may be cache memory, DRAM, SRAM, non-volatile memory (e.g., Flash memory), or other storage. In at least one embodiment, activation storage 820 may be completely or partially within or external to one or more processors or other logical circuits. In at least one embodiment, choice of whether activation storage 820 is internal or external to a processor, for example, or comprised of DRAM, SRAM, Flash or some other storage type may depend on available storage on-chip versus off-chip, latency requirements of training and/or inferencing functions being performed, batch size of data used in inferencing and/or training of a neural network, or some combination of these factors. In at least one embodiment, inference and/or training logic 715 illustrated in FIG. 8 may be used in conjunction with an application-specific integrated circuit (“ASIC”), such as Tensorflow® Processing Unit from Google, an inference processing unit (IPU) from Graphcore™, or a Nervana® (e.g., “Lake Crest”) processor from Intel Corp. In at least one embodiment, inference and/or training logic 715 illustrated in FIG. 8 may be used in conjunction with central processing unit (“CPU”) hardware, graphics processing unit (“GPU”) hardware or other hardware, such as field programmable gate arrays (“FPGAs”).

[0045] FIG. 8 illustrates inference and/or training logic 815, according to at least one or more embodiments. In at least one embodiment, inference and/or training logic 815 may include, without limitation, hardware logic in which computational resources are dedicated or otherwise exclusively used in conjunction with weight values or other information corresponding to one or more layers of neurons within a neural network. In at least one embodiment, inference and/or training logic 815 illustrated in FIG. 8 may be used in conjunction with an application-specific integrated circuit (ASIC), such as Tensorflow® Processing Unit from Google, an inference processing unit (IPU) from Graphcore™, or a Nervana® (e.g., “Lake Crest”) processor from Intel Corp. In at least one embodiment, inference and/or training logic 815 illustrated in FIG. 8 may be used in conjunction with central processing unit (CPU) hardware, graphics processing unit (GPU) hardware or other hardware, such as field programmable gate arrays (FPGAs). In at least one embodiment, inference and/or training logic 815 includes, without limitation, code and/or data storage 801 and code and/or data storage 805, which may be used to store code (e.g., graph code), weight values and/or other information, including bias values, gradient information, momentum values, and/or other parameter or hyperparameter information. In at least one embodiment illustrated in FIG. 8, each of code and/or data storage 801 and code and/or data storage 805 is associated with a dedicated computational resource, such as computational hardware 802 and computational hardware 806, respectively. In at least one embodiment, each of computational hardware 802 and computational hardware 806 comprises one or more ALUs that perform mathematical functions, such as linear algebraic functions, only on information stored in code and/or data storage 801 and code and/or data storage 805, respectively, result of which is stored in activation storage 820.

[0046] In at least one embodiment, each of code and/or data storage 801 and 805 and corresponding computational hardware 802 and 806, respectively, correspond to different layers of a neural network, such that resulting activation from one “storage/computational pair 801/802” of code and/or data storage 801 and computational hardware 802 is provided as an input to “storage/computational pair 805/806” of code and/or data storage 805 and computational hardware 806, in order to mirror conceptual organization of a neural network. In at least one embodiment, each of storage/computational pairs 801/802 and 805/806 may correspond to more than one neural network layer. In at least one embodiment, additional storage/computation pairs (not shown) subsequent to or in parallel with storage computation pairs 801/802 and 805/806 may be included in inference and/or training logic 815.
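
The chaining of storage/computational pairs described above can be sketched, under stated assumptions, as follows. The class `StorageComputePair`, the helper `run_pairs`, the weight values, and the ReLU nonlinearity are hypothetical and serve only to show the activation of one pair being provided as the input to the next pair.

```python
class StorageComputePair:
    """Hypothetical pairing of a parameter store with compute that reads only that store."""

    def __init__(self, weights, bias):
        self.weights = weights  # one row of weights per output neuron
        self.bias = bias

    def compute(self, inputs):
        # Dense layer followed by ReLU; both are assumptions for this sketch.
        return [
            max(sum(w * x for w, x in zip(row, inputs)) + b, 0.0)
            for row, b in zip(self.weights, self.bias)
        ]


def run_pairs(pairs, inputs):
    """Feed the activation produced by each pair into the next pair in order."""
    activation = inputs
    for pair in pairs:
        activation = pair.compute(activation)
    return activation


pair_a = StorageComputePair([[1.0, 0.0], [0.0, 1.0]], [0.0, -3.0])
pair_b = StorageComputePair([[2.0, 1.0]], [0.5])
result = run_pairs([pair_a, pair_b], [1.0, 2.0])
```

Additional pairs can be appended to the list passed to `run_pairs` without changing either existing pair, which is one way to read the statement that further storage/computation pairs may follow those shown.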

[0047] Inference and training logic such as that discussed with respect to FIG. 8 can be used to train one or more neural networks to analyze a media stream or file of video data, as may correspond to gameplay data. These neural networks, which can run on described hardware structures 815, can be used for inferencing for detecting objects or occurrences, as well as inferring types of events based upon data for these detected objects or occurrences. Different neural networks or machine learning models can be trained for different games, types of game, types of video content, or types of events, among other such options.

DATA CENTER

[0048] FIG. 9 illustrates an example data center 900, in which at least one embodiment may be used. In at least one embodiment, data center 900 includes a data center infrastructure layer 910, a framework layer 920, a software layer 930, and an application layer 940.

[0049] In at least one embodiment, as shown in FIG. 9, data center infrastructure layer 910 may include a resource orchestrator 912, grouped computing resources 914, and node computing resources (“node C.R.s”) 916(1)-916(N), where “N” represents any whole, positive integer. In at least one embodiment, node C.R.s 916(1)-916(N) may include, but are not limited to, any number of central processing units (“CPUs”) or other processors (including accelerators, field programmable gate arrays (FPGAs), graphics processors, etc.), memory devices (e.g., dynamic read-only memory), storage devices (e.g., solid state or disk drives), network input/output (“NW I/O”) devices, network switches, virtual machines (“VMs”), power modules, and cooling modules, etc. In at least one embodiment, one or more node C.R.s from among node C.R.s 916(1)-916(N) may be a server having one or more of above-mentioned computing resources.

[0050] In at least one embodiment, grouped computing resources 914 may include separate groupings of node C.R.s housed within one or more racks (not shown), or many racks housed in data centers at various geographical locations (also not shown). Separate groupings of node C.R.s within grouped computing resources 914 may include grouped compute, network, memory or storage resources that may be configured or allocated to support one or more workloads. In at least one embodiment, several node C.R.s including CPUs or processors may be grouped within one or more racks to provide compute resources to support one or more workloads. In at least one embodiment, one or more racks may also include any number of power modules, cooling modules, and network switches, in any combination.

[0051] In at least one embodiment, resource orchestrator 912 may configure or otherwise control one or more node C.R.s 916(1)-916(N) and/or grouped computing resources 914. In at least one embodiment, resource orchestrator 912 may include a software design infrastructure (“SDI”) management entity for data center 900. In at least one embodiment, resource orchestrator may include hardware, software or some combination thereof.

[0052] In at least one embodiment, as shown in FIG. 9, framework layer 920 includes a job scheduler 1022, a configuration manager 1024, a resource manager 1026 and a distributed file system 1028. In at least one embodiment, framework layer 920 may include a framework to support software 932 of software layer 930 and/or one or more application(s) 942 of application layer 940. In at least one embodiment, software 932 or application(s) 942 may respectively include web-based service software or applications, such as those provided by Amazon Web Services, Google Cloud and Microsoft Azure. In at least one embodiment, framework layer 920 may be, but is not limited to, a type of free and open-source software web application framework such as Apache Spark™ (hereinafter “Spark”) that may utilize distributed file system 1028 for large-scale data processing (e.g., “big data”). In at least one embodiment, job scheduler 1022 may include a Spark driver to facilitate scheduling of workloads supported by various layers of data center 900. In at least one embodiment, configuration manager 1024 may be capable of configuring different layers such as software layer 930 and framework layer 920 including Spark and distributed file system 1028 for supporting large-scale data processing. In at least one embodiment, resource manager 1026 may be capable of managing clustered or grouped computing resources mapped to or allocated for support of distributed file system 1028 and job scheduler 1022. In at least one embodiment, clustered or grouped computing resources may include grouped computing resource 914 at data center infrastructure layer 910. In at least one embodiment, resource manager 1026 may coordinate with resource orchestrator 912 to manage these mapped or allocated computing resources.

[0053] In at least one embodiment, software 932 included in software layer 930 may include software used by at least portions of node C.R.s 916(1)-916(N), grouped computing resources 914, and/or distributed file system 1028 of framework layer 920. One or more types of software may include, but are not limited to, Internet web page search software, e-mail virus scan software, database software, and streaming video content software.

[0054] In at least one embodiment, application(s) 942 included in application layer 940 may include one or more types of applications used by at least portions of node C.R.s 916(1)-916(N), grouped computing resources 914, and/or distributed file system 1028 of framework layer 920. One or more types of applications may include, but are not limited to, any number of a genomics application, a cognitive compute, and a machine learning application, including training or inferencing software, machine learning framework software (e.g., PyTorch, TensorFlow, Caffe, etc.) or other machine learning applications used in conjunction with one or more embodiments.

[0055] In at least one embodiment, any of configuration manager 1024, resource manager 1026, and resource orchestrator 912 may implement any number and type of self-modifying actions based on any amount and type of data acquired in any technically feasible fashion. In at least one embodiment, self-modifying actions may relieve a data center operator of data center 900 from making possibly bad configuration decisions and possibly avoiding underutilized and/or poor performing portions of a data center.

[0056] In at least one embodiment, data center 900 may include tools, services, software or other resources to train one or more machine learning models or predict or infer information using one or more machine learning models according to one or more embodiments described herein. For example, in at least one embodiment, a machine learning model may be trained by calculating weight parameters according to a neural network architecture using software and computing resources described above with respect to data center 900. In at least one embodiment, trained machine learning models corresponding to one or more neural networks may be used to infer or predict information using resources described above with respect to data center 900 by using weight parameters calculated through one or more training techniques described herein.

[0057] In at least one embodiment, data center may use CPUs, application-specific integrated circuits (ASICs), GPUs, FPGAs, or other hardware to perform training and/or inferencing using above-described resources. Moreover, one or more software and/or hardware resources described above may be configured as a service to allow users to train or perform inferencing of information, such as image recognition, speech recognition, or other artificial intelligence services.

[0058] Inference and/or training logic 715 are used to perform inferencing and/or training operations associated with one or more embodiments. Details regarding inference and/or training logic 715 are provided below in conjunction with FIGs. 8 and/or 9. In at least one embodiment, inference and/or training logic 715 may be used in system FIG. 9 for inferencing or predicting operations based, at least in part, on weight parameters calculated using neural network training operations, neural network functions and/or architectures, or neural network use cases described herein. A data center 900 such as that illustrated in FIG. 9 can be used to perform at least some of the processing and/or storage discussed herein remote from a device on which the video is generated or analyzed, or upon which the game is being played. Such a data center may also include servers that host these games or other relevant applications in at least some embodiments.

COMPUTER SYSTEMS

[0059] FIG. 10 is a block diagram illustrating an exemplary computer system, which may be a system with interconnected devices and components, a system-on-a-chip (SOC) or some combination thereof 1000 formed with a processor that may include execution units to execute an instruction, according to at least one embodiment. In at least one embodiment, computer system 1000 may include, without limitation, a component, such as a processor 1002 to employ execution units including logic to perform algorithms for processing data, in accordance with present disclosure, such as in embodiment described herein. In at least one embodiment, computer system 1000 may include processors, such as PENTIUM® Processor family, Xeon™, Itanium®, XScale™ and/or StrongARM™, Intel® Core™, or Intel® Nervana™ microprocessors available from Intel Corporation of Santa Clara, California, although other systems (including PCs having other microprocessors, engineering workstations, set-top boxes and like) may also be used. In at least one embodiment, computer system 1000 may execute a version of WINDOWS’ operating system available from Microsoft Corporation of Redmond, Wash., although other operating systems (UNIX and Linux for example), embedded software, and/or graphical user interfaces, may also be used.

[0060] Embodiments may be used in other devices such as handheld devices and embedded applications. Some examples of handheld devices include cellular phones, Internet Protocol devices, digital cameras, personal digital assistants (“PDAs”), and handheld PCs. In at least one embodiment, embedded applications may include a microcontroller, a digital signal processor (“DSP”), system on a chip, network computers (“NetPCs”), set-top boxes, network hubs, wide area network (“WAN”) switches, or any other system that may perform one or more instructions in accordance with at least one embodiment.

[0061] In at least one embodiment, computer system 1000 may include, without limitation, processor 1002 that may include, without limitation, one or more execution units 1008 to perform machine learning model training and/or inferencing according to techniques described herein. In at least one embodiment, computer system 1000 is a single processor desktop or server system, but in another embodiment computer system 1000 may be a multiprocessor system. In at least one embodiment, processor 1002 may include, without limitation, a complex instruction set computer (“CISC”) microprocessor, a reduced instruction set computing (“RISC”) microprocessor, a very long instruction word (“VLIW”) microprocessor, a processor implementing a combination of instruction sets, or any other processor device, such as a digital signal processor, for example. In at least one embodiment, processor 1002 may be coupled to a processor bus 1010 that may transmit data signals between processor 1002 and other components in computer system 1000.

[0062] In at least one embodiment, processor 1002 may include, without limitation, a Level 1 (“L1”) internal cache memory (“cache”) 1004. In at least one embodiment, processor 1002 may have a single internal cache or multiple levels of internal cache. In at least one embodiment, cache memory may reside external to processor 1002. Other embodiments may also include a combination of both internal and external caches depending on particular implementation and needs. In at least one embodiment, register file 1006 may store different types of data in various registers including, without limitation, integer registers, floating point registers, status registers, and instruction pointer register.

[0063] In at least one embodiment, execution unit 1008, including, without limitation, logic to perform integer and floating point operations, also resides in processor 1002. In at least one embodiment, processor 1002 may also include a microcode (“ucode”) read only memory (“ROM”) that stores microcode for certain macro instructions. In at least one embodiment, execution unit 1008 may include logic to handle a packed instruction set 1009. In at least one embodiment, by including packed instruction set 1009 in an instruction set of a general-purpose processor 1002, along with associated circuitry to execute instructions, operations used by many multimedia applications may be performed using packed data in a general-purpose processor 1002. In one or more embodiments, many multimedia applications may be accelerated and executed more efficiently by using full width of a processor’s data bus for performing operations on packed data, which may eliminate need to transfer smaller units of data across processor's data bus to perform one or more operations one data element at a time.
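
By way of illustration only, the idea of operating on packed data can be emulated in software as shown in the following Python sketch, which adds four 8-bit values held in one 32-bit word in a single pass instead of one data element at a time. The function names, the 8-bit lane width, and the wrap-around (modulo 256) behavior are assumptions for this example. A packed instruction set would perform such an operation in hardware, and this sketch does not represent any particular instruction set.

```python
LANE_HIGH_BITS = 0x80808080  # top bit of each 8-bit lane
LANE_LOW_BITS = 0x7F7F7F7F   # lower seven bits of each 8-bit lane


def pack_u8x4(values):
    """Pack four integers in 0..255 into one 32-bit word, first value in the low byte."""
    word = 0
    for index, value in enumerate(values):
        word |= (value & 0xFF) << (8 * index)
    return word


def unpack_u8x4(word):
    """Split a 32-bit word back into its four 8-bit lanes."""
    return [(word >> (8 * index)) & 0xFF for index in range(4)]


def packed_add_u8x4(a, b):
    """Add four 8-bit lanes at once; each lane wraps modulo 256."""
    # Add the low seven bits of every lane; carries cannot cross a lane boundary.
    low_sum = (a & LANE_LOW_BITS) + (b & LANE_LOW_BITS)
    # Fold in the top bit of each lane without propagating a carry out of the lane.
    return low_sum ^ ((a ^ b) & LANE_HIGH_BITS)


packed_sum = packed_add_u8x4(pack_u8x4([1, 2, 3, 250]), pack_u8x4([10, 20, 30, 10]))
lanes = unpack_u8x4(packed_sum)
```

The last lane shows the assumed wrap-around: 250 plus 10 yields 4 rather than carrying into a neighboring lane.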

[0064] In at least one embodiment, execution unit 1008 may also be used in microcontrollers, embedded processors, graphics devices, DSPs, and other types of logic circuits. In at least one embodiment, computer system 1000 may include, without limitation, a memory 1020. In at least one embodiment, memory 1020 may be implemented as a Dynamic Random Access Memory (“DRAM”) device, a Static Random Access Memory (“SRAM”) device, flash memory device, or other memory device. In at least one embodiment, memory 1020 may store instruction(s) 1019 and/or data 1021 represented by data signals that may be executed by processor 1002.

[0065] In at least one embodiment, system logic chip may be coupled to processor bus 1010 and memory 1020. In at least one embodiment, system logic chip may include, without limitation, a memory controller hub (“MCH”) 1016, and processor 1002 may communicate with MCH 1016 via processor bus 1010. In at least one embodiment, MCH 1016 may provide a high bandwidth memory path 1018 to memory 1020 for instruction and data storage and for storage of graphics commands, data and textures. In at least one embodiment, MCH 1016 may direct data signals between processor 1002, memory 1020, and other components in computer system 1000 and bridge data signals between processor bus 1010, memory 1020, and a system I/O 1022. In at least one embodiment, system logic chip may provide a graphics port for coupling to a graphics controller. In at least one embodiment, MCH 1016 may be coupled to memory 1020 through a high bandwidth memory path 1018 and graphics/video card 1112 may be coupled to MCH 1016 through an Accelerated Graphics Port (“AGP”) interconnect 1114.

[0066] In at least one embodiment, computer system 1000 may use system I/O 1022 that is a proprietary hub interface bus to couple MCH 1016 to I/O controller hub (“ICH”) 1030. In at least one embodiment, ICH 1030 may provide direct connections to some I/O devices via a local I/O bus. In at least one embodiment, local I/O bus may include, without limitation, a high-speed I/O bus for connecting peripherals to memory 1020, chipset, and processor 1002. Examples may include, without limitation, an audio controller 1029, a firmware hub (“flash BIOS”) 1028, a wireless transceiver 1026, a data storage 1024, a legacy I/O controller 1023 containing user input and keyboard interfaces 1025, a serial expansion port 1027, such as Universal Serial Bus (“USB”), and a network controller 1034. Data storage 1024 may comprise a hard disk drive, a floppy disk drive, a CD-ROM device, a flash memory device, or other mass storage device.

[0067] In at least one embodiment, FIG. 10 illustrates a system, which includes interconnected hardware devices or “chips”, whereas in other embodiments, FIG. 10 may illustrate an exemplary System on a Chip (“SoC”). In at least one embodiment, devices illustrated in FIG. 10 may be interconnected with proprietary interconnects, standardized interconnects (e.g., PCIe) or some combination thereof. In at least one embodiment, one or more components of computer system 1000 are interconnected using compute express link (CXL) interconnects.

[0068] Inference and/or training logic 715 are used to perform inferencing and/or training operations associated with one or more embodiments. Details regarding inference and/or training logic 715 are provided below in conjunction with FIGs. 8 and/or 9. In at least one embodiment, inference and/or training logic 715 may be used in system FIG. 10 for inferencing or predicting operations based, at least in part, on weight parameters calculated using neural network training operations, neural network functions and/or architectures, or neural network use cases described herein.

[0069] In an example use case, there may be a number of players playing in a multiplayer game using a single server, such as may involve using Nvidia GeForce Now on that server. In at least some embodiments, a fluid simulation can be executed within the GeForce Now infrastructure on one or even many computers or client devices, and that fluid state can be shared with the game server. In such a use case, the client devices for the players can provide renderings from their cameras within a given level. Approaches discussed herein are thus scalable, and can be executed on many machines in parallel, where each machine can work on its respective subdomain and perform a respective inference.
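
The per-subdomain parallelism described above might be sketched as follows. This is a minimal Python example under several assumptions: a uniform cubic subdomain grid, worker threads standing in for separate machines, and a placeholder function `infer_subdomain` that merely advances each particle by its velocity in place of the neural network inference. All function and parameter names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def assign_subdomains(positions, subdomain_size):
    """Group particle indices by the cell of an assumed uniform subdomain grid."""
    groups = {}
    for index, (x, y, z) in enumerate(positions):
        key = (int(x // subdomain_size), int(y // subdomain_size), int(z // subdomain_size))
        groups.setdefault(key, []).append(index)
    return groups


def infer_subdomain(particles, dt):
    """Placeholder for per-subdomain inference: moves each particle by velocity * dt."""
    return [
        (tuple(p + v * dt for p, v in zip(position, velocity)), velocity)
        for position, velocity in particles
    ]


def step_all_subdomains(positions, velocities, subdomain_size, dt):
    """Process every subdomain independently and merge the per-particle results."""
    groups = assign_subdomains(positions, subdomain_size)
    new_positions, new_velocities = list(positions), list(velocities)
    with ThreadPoolExecutor() as pool:  # threads stand in for separate machines
        futures = {
            key: pool.submit(
                infer_subdomain, [(positions[i], velocities[i]) for i in indices], dt
            )
            for key, indices in groups.items()
        }
        for key, future in futures.items():
            for i, (position, velocity) in zip(groups[key], future.result()):
                new_positions[i], new_velocities[i] = position, velocity
    return new_positions, new_velocities


positions = [(0.5, 0.5, 0.5), (2.5, 0.5, 0.5), (0.25, 0.25, 0.25)]
velocities = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 2.0)]
next_positions, next_velocities = step_all_subdomains(positions, velocities, 1.0, 0.5)
```

Because each subdomain is processed from its own list of particles, the work items share no mutable state, which is the property that would allow them to be placed on different machines.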

[0070] FIG. 11 is a block diagram illustrating an electronic device 1100 for utilizing a processor 1110, according to at least one embodiment. In at least one embodiment, electronic device 1100 may be, for example and without limitation, a notebook, a tower server, a rack server, a blade server, a laptop, a desktop, a tablet, a mobile device, a phone, an embedded computer, or any other suitable electronic device.

[0071] In at least one embodiment, system 1100 may include, without limitation, processor 1110 communicatively coupled to any suitable number or kind of components, peripherals, modules, or devices. In at least one embodiment, processor 1110 may be coupled using a bus or interface, such as an I²C bus, a System Management Bus (“SMBus”), a Low Pin Count (LPC) bus, a Serial Peripheral Interface (“SPI”), a High Definition Audio (“HDA”) bus, a Serial Advance Technology Attachment (“SATA”) bus, a Universal Serial Bus (“USB”) (versions 1, 2, 3), or a Universal Asynchronous Receiver/Transmitter (“UART”) bus. In at least one embodiment, FIG. 11 illustrates a system, which includes interconnected hardware devices or “chips”, whereas in other embodiments, FIG. 11 may illustrate an exemplary System on a Chip (“SoC”). In at least one embodiment, devices illustrated in FIG. 11 may be interconnected with proprietary interconnects, standardized interconnects (e.g., PCIe) or some combination thereof. In at least one embodiment, one or more components of FIG. 11 are interconnected using compute express link (CXL) interconnects.

[0072] In at least one embodiment, FIG. 12 may include a display 1124, a touch screen 1125, a touch pad 1130, a Near Field Communications unit (“NFC”) 1145, a sensor hub 1140, a thermal sensor 1246, an Express Chipset (“EC”) 1135, a Trusted Platform Module (“TPM”) 1138, BIOS/firmware/flash memory (“BIOS, FW Flash”) 1122, a DSP 1160, a drive 1120 such as a Solid State Disk (“SSD”) or a Hard Disk Drive (“HDD”), a wireless local area network unit (“WLAN”) 1150, a Bluetooth unit 1152, a Wireless Wide Area Network unit (“WWAN”) 1156, a Global Positioning System (GPS) 1155, a camera (“USB 3.0 camera”) 1154 such as a USB 3.0 camera, and/or a Low Power Double Data Rate (“LPDDR”) memory unit (“LPDDR3”) 1112 implemented in, for example, LPDDR3 standard. These components may each be implemented in any suitable manner.

[0073] In at least one embodiment, other components may be communicatively coupled to processor 1110 through components discussed above. In at least one embodiment, an accelerometer 1141, Ambient Light Sensor (“ALS”) 1142, compass 1143, and a gyroscope 1144 may be communicatively coupled to sensor hub 1140. In at least one embodiment, thermal sensor 1139, a fan 1137, a keyboard 1246, and a touch pad 1130 may be communicatively coupled to EC 1135. In at least one embodiment, speaker 1163, headphones 1164, and microphone (“mic”) 1165 may be communicatively coupled to an audio unit (“audio codec and class d amp”) 1162, which may in turn be communicatively coupled to DSP 1160. In at least one embodiment, audio unit 1162 may include, for example and without limitation, an audio coder/decoder (“codec”) and a class D amplifier. In at least one embodiment, SIM card (“SIM”) 1157 may be communicatively coupled to WWAN unit 1156. In at least one embodiment, components such as WLAN unit 1150 and Bluetooth unit 1152, as well as WWAN unit 1156 may be implemented in a Next Generation Form Factor (“NGFF”).

[0074] Inference and/or training logic 715 are used to perform inferencing and/or training operations associated with one or more embodiments. Details regarding inference and/or training logic 715 are provided below in conjunction with FIGs. 8 and/or 9. In at least one embodiment, inference and/or training logic 715 may be used in system FIG. 11 for inferencing or predicting operations based, at least in part, on weight parameters calculated using neural network training operations, neural network functions and/or architectures, or neural network use cases described herein.

[0075] FIG. 12 illustrates a computer system 1200, according to at least one embodiment. In at least one embodiment, computer system 1200 is configured to implement various processes and methods described throughout this disclosure.

[0076] In at least one embodiment, computer system 1200 comprises, without limitation, at least one central processing unit (“CPU”) 1202 that is connected to a communication bus 1210 implemented using any suitable protocol, such as PCI (“Peripheral Component Interconnect”), peripheral component interconnect express (“PCI-Express”), AGP (“Accelerated Graphics Port”), HyperTransport, or any other bus or point-to-point communication protocol(s). In at least one embodiment, computer system 1200 includes, without limitation, a main memory 1204, and control logic (e.g., implemented as hardware, software, or a combination thereof) and data are stored in main memory 1204, which may take form of random access memory (“RAM”). In at least one embodiment, a network interface subsystem (“network interface”) 1222 provides an interface to other computing devices and networks for receiving data from and transmitting data to other systems from computer system 1200.

[0077] In at least one embodiment, computer system 1200 includes, without limitation, input devices 1208, parallel processing system 1212, and display devices 1206 which can be implemented using a cathode ray tube (“CRT”), liquid crystal display (“LCD”), light emitting diode (“LED”), plasma display, or other suitable display technologies. In at least one embodiment, user input is received from input devices 1208 such as keyboard, mouse, touchpad, microphone, and more. In at least one embodiment, each of foregoing modules can be situated on a single semiconductor platform to form a processing system.

[0078] Inference and/or training logic 715 are used to perform inferencing and/or training operations associated with one or more embodiments. Details regarding inference and/or training logic 715 are provided below in conjunction with FIGs. 8 and/or 9. In at least one embodiment, inference and/or training logic 715 may be used in system FIG. 12 for inferencing or predicting operations based, at least in part, on weight parameters calculated using neural network training operations, neural network functions and/or architectures, or neural network use cases described herein.

[0079] FIG. 13 illustrates an exemplary architecture in which a plurality of GPUs 1310-1313 is communicatively coupled to a plurality of multi-core processors 1305-1306 over high-speed links 1340-1343 (e.g., buses, point-to-point interconnects, etc.). In one embodiment, high-speed links 1340-1343 support a communication throughput of 4GB/s, 30GB/s, 80GB/s or higher. Various interconnect protocols may be used including, but not limited to, PCIe 4.0 or 5.0 and NVLink 2.0.

[0080] In addition, and in one embodiment, two or more of GPUs 1310-1313 are interconnected over high-speed links 1329-1330, which may be implemented using same or different protocols/links than those used for high-speed links 1340-1343. Similarly, two or more of multi-core processors 1305-1306 may be connected over high-speed link 1328, which may be symmetric multi-processor (SMP) buses operating at 20GB/s, 30GB/s, 120GB/s or higher. Alternatively, all communication between various system components shown in FIG. 13 may be accomplished using same protocols/links (e.g., over a common interconnection fabric).

[0081] In one embodiment, each multi-core processor 1305-1306 is communicatively coupled to a processor memory 1301-1302, via memory interconnects 1326-1327, respectively, and each GPU 1310-1313 is communicatively coupled to GPU memory 1320-1323 over GPU memory interconnects 1350-1353, respectively. Memory interconnects 1326-1327 and 1350-1353 may utilize same or different memory access technologies. By way of example, and not limitation, processor memories 1301-1302 and GPU memories 1320-1323 may be volatile memories such as dynamic random access memories (DRAMs) (including stacked DRAMs), Graphics DDR SDRAM (GDDR) (e.g., GDDR5, GDDR6), or High Bandwidth Memory (HBM) and/or may be non-volatile memories such as 3D XPoint or Nano-Ram. In one embodiment, some portion of processor memories 1301-1302 may be volatile memory and another portion may be non-volatile memory (e.g., using a two-level memory (2LM) hierarchy).

[0082] As described below, although various processors 1305-1306 and GPUs 1310-1313 may be physically coupled to a particular memory 1301-1302, 1320-1323, respectively, a unified memory architecture may be implemented in which a same virtual system address space (also referred to as “effective address” space) is distributed among various physical memories. For example, processor memories 1301-1302 may each comprise 64GB of system memory address space and GPU memories 1320-1323 may each comprise 32GB of system memory address space (resulting in a total of 256GB addressable memory in this example).
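As a rough illustration of the arithmetic in this example, the following Python sketch lays the six physical memories end to end in one shared address space and resolves an address back to the memory that backs it. The contiguous layout, the ordering of memories, and the names `build_address_map` and `resolve` are assumptions made for illustration only; the disclosure does not specify how the address space is partitioned.

```python
GB = 1 << 30

# Sizes come from the example in the text; ordering and contiguity are assumptions.
MEMORIES = [
    ("processor memory 1301", 64 * GB),
    ("processor memory 1302", 64 * GB),
    ("GPU memory 1320", 32 * GB),
    ("GPU memory 1321", 32 * GB),
    ("GPU memory 1322", 32 * GB),
    ("GPU memory 1323", 32 * GB),
]


def build_address_map(memories):
    """Give each physical memory a contiguous range of the shared address space."""
    ranges = []
    base = 0
    for name, size in memories:
        ranges.append((base, base + size, name))
        base += size
    return ranges, base  # base now equals the total addressable size


def resolve(ranges, address):
    """Return (memory name, offset within that memory) for a unified address."""
    for start, end, name in ranges:
        if start <= address < end:
            return name, address - start
    raise ValueError("address outside unified address space")


ADDRESS_MAP, TOTAL_BYTES = build_address_map(MEMORIES)
```

Under these assumptions, `TOTAL_BYTES` is 2 × 64GB + 4 × 32GB = 256GB, matching the total stated above, and an address at offset 128GB falls at the start of the first GPU memory.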

[0083] FIG. 14 illustrates additional details for an interconnection between a multi-core processor 1407 and a graphics acceleration module 1446 in accordance with one exemplary embodiment. Graphics acceleration module 1446 may include one or more GPU chips integrated on a line card which is coupled to processor 1407 via high-speed link 1340. Alternatively, graphics acceleration module 1446 may be integrated on a same package or chip as processor 1407.

[0084] In at least one embodiment, illustrated processor 1407 includes a plurality of cores 1460A-1460D, each with a translation lookaside buffer 1461A-1461D and one or more caches 1462A-1462D. In at least one embodiment, cores 1460A-1460D may include various other components for executing instructions and processing data which are not illustrated. Caches 1462A-1462D may comprise level 1 (L1) and level 2 (L2) caches. In addition, one or more shared caches 1456 may be included in caches 1462A-1462D and shared by sets of cores 1460A-1460D. For example, one embodiment of processor 1407 includes 24 cores, each with its own L1 cache, twelve shared L2 caches, and twelve shared L3 caches. In this embodiment, one or more L2 and L3 caches are shared by two adjacent cores. Processor 1407 and graphics acceleration module 1446 connect with system memory 1414, which may include processor memories 1301-1302 of FIG. 13.

[0085] Coherency is maintained for data and instructions stored in various caches 1462A-1462D, 1456 and system memory 1514 via inter-core communication over a coherence bus 1464 as illustrated in FIG. 14. For example, each cache may have cache coherency logic/circuitry associated therewith to communicate over coherence bus 1464 in response to detected reads or writes to particular cache lines. In one implementation, a cache snooping protocol is implemented over coherence bus 1464 to snoop cache accesses.
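The general idea of snooping can be sketched in a few lines of Python. The disclosure does not name a particular protocol, so this sketch assumes a simple write-through, write-invalidate scheme: every write is announced on a shared bus, and every other cache that observes it drops its own copy of that line. The class names `CoherenceBus` and `Cache` and the dictionary standing in for system memory are hypothetical.

```python
class CoherenceBus:
    """Shared bus on which every attached cache observes other caches' writes."""

    def __init__(self):
        self.caches = []

    def attach(self, cache):
        self.caches.append(cache)

    def broadcast_write(self, writer, line):
        # Every cache except the writer snoops the write.
        for cache in self.caches:
            if cache is not writer:
                cache.snoop_write(line)


class Cache:
    """Write-through cache that invalidates a line when it snoops a remote write."""

    def __init__(self, name, bus, memory):
        self.name = name
        self.bus = bus
        self.memory = memory  # dict standing in for system memory (assumption)
        self.lines = {}       # cache line address -> cached value
        bus.attach(self)

    def read(self, line):
        if line not in self.lines:  # miss: fill from system memory
            self.lines[line] = self.memory[line]
        return self.lines[line]

    def write(self, line, value):
        self.bus.broadcast_write(self, line)  # announce before updating
        self.lines[line] = value
        self.memory[line] = value             # write-through to memory

    def snoop_write(self, line):
        self.lines.pop(line, None)  # drop a possibly stale copy


memory = {0x40: 1}
bus = CoherenceBus()
core_a = Cache("1462A", bus, memory)
core_b = Cache("1462B", bus, memory)
core_a.read(0x40)
core_b.read(0x40)
core_a.write(0x40, 7)  # core_b's copy of line 0x40 is invalidated
```

After the write, the second cache no longer holds line 0x40, so its next read misses and fetches the updated value from memory. Real protocols (for example MESI-style schemes) track more states per line; this sketch only shows the snoop-and-invalidate step.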

[0086] In one embodiment, a proxy circuit 1425 communicatively couples graphics acceleration module 1446 to coherence bus 1464, allowing graphics acceleration module 1446 to participate in a cache coherence protocol as a peer of cores 1460A-1460D. In particular, an interface 1435 provides connectivity to proxy circuit 1425 over high-speed link 1440 (e.g., a PCIe bus, NVLink, etc.) and an interface 1437 connects graphics acceleration module 1446 to link 1440.

[0087] In one implementation, an accelerator integration circuit 1436 provides cache management, memory access, context management, and interrupt management services on behalf of a plurality of graphics processing engines 1431, 1432, N of graphics acceleration module 1446. Graphics processing engines 1431, 1432, N may each comprise a separate graphics processing unit (GPU). Alternatively, graphics processing engines 1431, 1432, N may comprise different types of graphics processing engines within a GPU such as graphics execution units, media processing engines (e.g., video encoders/decoders), samplers, and blit engines. In at least one embodiment, graphics acceleration module 1446 may be a GPU with a plurality of graphics processing engines 1431-1432, N or graphics processing engines 1431-1432, N may be individual GPUs integrated on a common package, line card, or chip.

[0088] In one embodiment, accelerator integration circuit 1436 includes a memory management unit (MMU) 1439 for performing various memory management functions such as virtual-to-physical memory translations (also referred to as effective-to-real memory translations) and memory access protocols for accessing system memory 1514. MMU 1439 may also include a translation lookaside buffer (TLB) (not shown) for caching virtual/effective to physical/real address translations. In one implementation, a cache 1438 stores commands and data for efficient access by graphics processing engines 1431-1432, N. In one embodiment, data stored in cache 1438 and graphics memories 1433-1434, M is kept coherent with core caches 1462A-1462D, 1456, and system memory 1514. As mentioned above, this may be accomplished via proxy circuit 1425 on behalf of cache 1438 and memories 1433-1434, M (e.g., sending updates to cache 1438 related to modifications/accesses of cache lines on processor caches 1462A-1462D, 1456, and receiving updates from cache 1438).

[0089] A set of registers 1445 in FIG. 14 stores context data for threads executed by graphics processing engines 1431-1432, N and a context management circuit 1448 manages thread contexts. For example, context management circuit 1448 may perform save and restore operations to save and restore contexts of various threads during context switches (e.g., where a first thread is saved and a second thread is restored so that the second thread can be executed by a graphics processing engine). For example, on a context switch, context management circuit 1448 may store current register values to a designated region in memory (e.g., identified by a context pointer). It may then restore register values when returning to a context. In one embodiment, an interrupt management circuit 1447 receives and processes interrupts received from system devices.
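The save and restore behavior described here can be sketched as follows. This is a minimal Python model, not the circuit itself: the class name `ContextManager`, the use of a dictionary keyed by context pointer as the "designated region in memory", and zero-initialized registers for a thread that has never run are all assumptions.

```python
class ContextManager:
    """Toy model of saving and restoring register state on a context switch."""

    def __init__(self, num_registers):
        self.registers = [0] * num_registers
        self.saved = {}  # context pointer -> saved register values

    def save(self, context_pointer):
        # Copy current register values to the region named by the context pointer.
        self.saved[context_pointer] = list(self.registers)

    def restore(self, context_pointer):
        # Reload register values previously saved for this context.
        self.registers = list(self.saved[context_pointer])

    def switch(self, from_pointer, to_pointer):
        """Save the outgoing thread's context, then load the incoming thread's."""
        self.save(from_pointer)
        if to_pointer in self.saved:
            self.restore(to_pointer)
        else:
            # Assumption: a thread with no saved context starts with cleared registers.
            self.registers = [0] * len(self.registers)


manager = ContextManager(num_registers=4)
manager.registers[0] = 11          # first thread does some work
manager.switch("ctx_a", "ctx_b")   # first thread saved, second thread starts fresh
manager.registers[0] = 22          # second thread does some work
manager.switch("ctx_b", "ctx_a")   # second thread saved, first thread restored
```

After the second switch, the registers again hold the first thread's values, while the second thread's values remain stored under its own context pointer.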

[0090] In one implementation, virtual/effective addresses from a graphics processing engine 1431 are translated to real/physical addresses in system memory 1514 by MMU 1439. One embodiment of accelerator integration circuit 1436 supports multiple (e.g., 4, 8, 16) graphics accelerator modules 1446 and/or other accelerator devices. Graphics accelerator module 1446 may be dedicated to a single application executed on processor 1407 or may be shared between multiple applications. In one embodiment, a virtualized graphics execution environment is presented in which resources of graphics processing engines 1431-1432, N are shared with multiple applications or virtual machines (VMs). In at least one embodiment, resources may be subdivided into “slices” which are allocated to different VMs and/or applications based on processing requirements and priorities associated with VMs and/or applications.

[0091] In at least one embodiment, accelerator integration circuit 1436 performs as a bridge to a system for graphics acceleration module 1446 and provides address translation and system memory cache services. In addition, accelerator integration circuit 1436 may provide virtualization facilities for a host processor to manage virtualization of graphics processing engines 1431-1432, N, interrupts, and memory management.

[0092] Because hardware resources of graphics processing engines 1431-1432, N are mapped explicitly to a real address space seen by host processor 1407, any host processor can address these resources directly using an effective address value. One function of accelerator integration circuit 1436, in one embodiment, is physical separation of graphics processing engines 1431-1432, N so that they appear to a system as independent units.

[0093] In at least one embodiment, one or more graphics memories 1433-1434, M are coupled to each of graphics processing engines 1431-1432, N, respectively. Graphics memories 1433-1434, M store instructions and data being processed by each of graphics processing engines 1431-1432, N. Graphics memories 1433-1434, M may be volatile memories such as DRAMs (including stacked DRAMs), GDDR memory (e.g., GDDR5, GDDR6), or HBM, and/or may be non-volatile memories such as 3D XPoint or Nano-Ram.

[0094] In one embodiment, to reduce data traffic over link 1440, biasing techniques are used to ensure that data stored in graphics memories 1433-1434, M is data which will be used most frequently by graphics processing engines 1431-1432, N and preferably not used by cores 1460A-1460D (at least not frequently). Similarly, a biasing mechanism attempts to keep data needed by cores (and preferably not graphics processing engines 1431-1432, N) within caches 1462A-1462D, 1456 of cores and system memory 1514.

[0095] In at least one embodiment, a single semiconductor platform may refer to a sole unitary semiconductor-based integrated circuit or chip. In at least one embodiment, multi-chip modules may be used with increased connectivity which simulate on-chip operation, and make substantial improvements over utilizing a central processing unit (“CPU”) and bus implementation. In at least one embodiment, various modules may also be situated separately or in various combinations of semiconductor platforms per desires of user.

[0096] In at least one embodiment, computer programs in form of machine-readable executable code or computer control logic algorithms are stored in main memory 1404 and/or secondary storage. Computer programs, if executed by one or more processors, enable system 1400 to perform various functions in accordance with at least one embodiment. In at least one embodiment, memory 1404, storage, and/or any other storage are possible examples of computer-readable media. In at least one embodiment, secondary storage may refer to any suitable storage device or system such as a hard disk drive and/or a removable storage drive, representing a floppy disk drive, a magnetic tape drive, a compact disk drive, digital versatile disk (“DVD”) drive, recording device, universal serial bus (“USB”) flash memory, etc. In at least one embodiment, architecture and/or functionality of various previous figures are implemented in context of CPU; parallel processing system; an integrated circuit capable of at least a portion of capabilities of both CPU and parallel processing system; a chipset (e.g., a group of integrated circuits designed to work and sold as a unit for performing related functions, etc.); and any suitable combination of integrated circuit(s).

[0097] In at least one embodiment, architecture and/or functionality of various previous figures are implemented in context of a general computer system, a circuit board system, a game console system dedicated for entertainment purposes, an application-specific system, and more. In at least one embodiment, computer system may take form of a desktop computer, a laptop computer, a tablet computer, servers, supercomputers, a smart-phone (e.g., a wireless, hand-held device), personal digital assistant (“PDA”), a digital camera, a vehicle, a head mounted display, a hand-held electronic device, a mobile phone device, a television, workstation, game consoles, embedded system, and/or any other type of logic. Such hardware can be used to execute applications and code to support various types of processing, analysis, and storage discussed herein. For example, a given computer system might execute a game and analyze video output by the game to determine events, and then upload this event data to a remote server for analysis or storage. In other embodiments, a stream of video data might be generated by a gaming server that hosts a game involving several different client devices, and this stream might be analyzed by another server or computer system for determining events that occurred within gameplay. In some embodiments, this event data can then be used by the same, or yet another, server or computer system for processing, such as to generate a highlight video or video montage.

[0098] Other variations are within spirit of present disclosure. Thus, while disclosed techniques are susceptible to various modifications and alternative constructions, certain illustrated embodiments thereof are shown in drawings and have been described above in detail.

It should be understood, however, that there is no intention to limit disclosure to specific form or forms disclosed, but on contrary, intention is to cover all modifications, alternative constructions, and equivalents falling within spirit and scope of disclosure, as defined in appended claims.

[0099] Use of terms “a” and “an” and “the” and similar referents in context of describing disclosed embodiments (especially in context of following claims) are to be construed to cover both singular and plural, unless otherwise indicated herein or clearly contradicted by context, and not as a definition of a term. Terms “comprising,” “having,” “including,” and “containing” are to be construed as open-ended terms (meaning “including, but not limited to,”) unless otherwise noted. Term “connected,” when unmodified and referring to physical connections, is to be construed as partly or wholly contained within, attached to, or joined together, even if there is something intervening. Recitation of ranges of values herein are merely intended to serve as a shorthand method of referring individually to each separate value falling within range, unless otherwise indicated herein and each separate value is incorporated into specification as if it were individually recited herein. Use of term “set” (e.g., “a set of items”) or “subset,” unless otherwise noted or contradicted by context, is to be construed as a nonempty collection comprising one or more members. Further, unless otherwise noted or contradicted by context, term “subset” of a corresponding set does not necessarily denote a proper subset of corresponding set, but subset and corresponding set may be equal.

[0100] Conjunctive language, such as phrases of form “at least one of A, B, and C,” or “at least one of A, B and C,” unless specifically stated otherwise or otherwise clearly contradicted by context, is otherwise understood with context as used in general to present that an item, term, etc., may be either A or B or C, or any nonempty subset of set of A and B and C. For instance, in illustrative example of a set having three members, conjunctive phrases “at least one of A, B, and C” and “at least one of A, B and C” refer to any of following sets: {A}, {B}, {C}, {A, B}, {A, C}, {B, C}, {A, B, C}. Thus, such conjunctive language is not generally intended to imply that certain embodiments require at least one of A, at least one of B, and at least one of C each to be present. In addition, unless otherwise noted or contradicted by context, term “plurality” indicates a state of being plural (e.g., “a plurality of items” indicates multiple items). A plurality is at least two items, but can be more when so indicated either explicitly or by context. Further, unless stated otherwise or otherwise clear from context, phrase “based on” means “based at least in part on” and not “based solely on.”
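The seven sets listed in this paragraph are exactly the nonempty subsets of a three-member set. As an illustration only, the following Python sketch enumerates them; the function name `nonempty_subsets` is hypothetical and forms no part of the disclosure.

```python
from itertools import combinations


def nonempty_subsets(items):
    """Return every nonempty subset of items, smallest subsets first."""
    items = list(items)
    return [
        set(combo)
        for size in range(1, len(items) + 1)
        for combo in combinations(items, size)
    ]


# The three-member example from the text.
subsets = nonempty_subsets(["A", "B", "C"])
```

For three members this yields 2³ − 1 = 7 subsets, from {A} through {A, B, C}, and the empty set is never included.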

[0101] Operations of processes described herein can be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context. In at least one embodiment, a process such as those processes described herein (or variations and/or combinations thereof) is performed under control of one or more computer systems configured with executable instructions and is implemented as code (e.g., executable instructions, one or more computer programs or one or more applications) executing collectively on one or more processors, by hardware or combinations thereof. In at least one embodiment, code is stored on a computer-readable storage medium, for example, in form of a computer program comprising a plurality of instructions executable by one or more processors. In at least one embodiment, a computer-readable storage medium is a non-transitory computer-readable storage medium that excludes transitory signals (e.g., a propagating transient electric or electromagnetic transmission) but includes non-transitory data storage circuitry (e.g., buffers, cache, and queues) within transceivers of transitory signals. In at least one embodiment, code (e.g., executable code or source code) is stored on a set of one or more non-transitory computer-readable storage media having stored thereon executable instructions (or other memory to store executable instructions) that, when executed (i.e., as a result of being executed) by one or more processors of a computer system, cause computer system to perform operations described herein. A set of non-transitory computer-readable storage media, in at least one embodiment, comprises multiple non-transitory computer-readable storage media and one or more of individual non-transitory storage media of multiple non-transitory computer-readable storage media lack all of code while multiple non-transitory computer-readable storage media collectively store all of code. In at least one embodiment, executable instructions are executed such that different instructions are executed by different processors; for example, a non-transitory computer-readable storage medium stores instructions and a main central processing unit (“CPU”) executes some of instructions while a graphics processing unit (“GPU”) executes other instructions. In at least one embodiment, different components of a computer system have separate processors and different processors execute different subsets of instructions.

[0102] Accordingly, in at least one embodiment, computer systems are configured to implement one or more services that singly or collectively perform operations of processes described herein and such computer systems are configured with applicable hardware and/or software that enable performance of operations. Further, a computer system that implements at least one embodiment of present disclosure is a single device and, in another embodiment, is a distributed computer system comprising multiple devices that operate differently such that distributed computer system performs operations described herein and such that a single device does not perform all operations.

[0103] Use of any and all examples, or exemplary language (e.g., “such as”) provided herein, is intended merely to better illuminate embodiments of disclosure and does not pose a limitation on scope of disclosure unless otherwise claimed. No language in specification should be construed as indicating any non-claimed element as essential to practice of disclosure.

[0104] All references, including publications, patent applications, and patents, cited herein are hereby incorporated by reference to same extent as if each reference were individually and specifically indicated to be incorporated by reference and were set forth in its entirety herein.

[0105] In description and claims, terms “coupled” and “connected,” along with their derivatives, may be used. It should be understood that these terms may not be intended as synonyms for each other. Rather, in particular examples, “connected” or “coupled” may be used to indicate that two or more elements are in direct or indirect physical or electrical contact with each other. “Coupled” may also mean that two or more elements are not in direct contact with each other, but yet still co-operate or interact with each other.

[0106] Unless specifically stated otherwise, it may be appreciated that throughout specification terms such as “processing,” “computing,” “calculating,” “determining,” or like, refer to action and/or processes of a computer or computing system, or similar electronic computing device, that manipulate and/or transform data represented as physical, such as electronic, quantities within computing system’s registers and/or memories into other data similarly represented as physical quantities within computing system’s memories, registers or other such information storage, transmission or display devices.

[0107] In a similar manner, term “processor” may refer to any device or portion of a device that processes electronic data from registers and/or memory and transforms that electronic data into other electronic data that may be stored in registers and/or memory. As non-limiting examples, “processor” may be a CPU or a GPU. A “computing platform” may comprise one or more processors. As used herein, “software” processes may include, for example, software and/or hardware entities that perform work over time, such as tasks, threads, and intelligent agents. Also, each process may refer to multiple processes, for carrying out instructions in sequence or in parallel, continuously or intermittently. Terms “system” and “method” are used herein interchangeably insofar as system may embody one or more methods and methods may be considered a system.

[0108] In present document, references may be made to obtaining, acquiring, receiving, or inputting analog or digital data into a subsystem, computer system, or computer-implemented machine. Obtaining, acquiring, receiving, or inputting analog and digital data can be accomplished in a variety of ways such as by receiving data as a parameter of a function call or a call to an application programming interface. In some implementations, process of obtaining, acquiring, receiving, or inputting analog or digital data can be accomplished by transferring data via a serial or parallel interface. In another implementation, process of obtaining, acquiring, receiving, or inputting analog or digital data can be accomplished by transferring data via a computer network from providing entity to acquiring entity. References may also be made to providing, outputting, transmitting, sending, or presenting analog or digital data. In various examples, process of providing, outputting, transmitting, sending, or presenting analog or digital data can be accomplished by transferring data as an input or output parameter of a function call, a parameter of an application programming interface or interprocess communication mechanism.

[0109] Although discussion above sets forth example implementations of described techniques, other architectures (neural network or otherwise) may be used to implement described functionality, and are intended to be within scope of this disclosure. Furthermore, although specific distributions of responsibilities are defined above for purposes of discussion, various functions and responsibilities might be distributed and divided in different ways, depending on circumstances.

[0110] Furthermore, although subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that subject matter claimed in appended claims is not necessarily limited to specific features or acts described. Rather, specific features and acts are disclosed as exemplary forms of implementing the claims.