

Title:
SYSTEMS AND METHODS FOR DISAGGREGATED ACCELERATION OF ARTIFICIAL INTELLIGENCE OPERATIONS
Document Type and Number:
WIPO Patent Application WO/2023/091398
Kind Code:
A1
Abstract:
A disclosed system may include a disaggregated artificial intelligence (AI) operation accelerator including a dense AI operation accelerator configured to accelerate dense AI operations and a sparse AI operation accelerator, physically separate from the dense AI operation accelerator, configured to accelerate sparse AI operations. The system may also include a scheduler that includes (1) a receiving module that receives an AI operation, (2) an identifying module that identifies the AI operation as a dense AI operation or sparse AI operation, and (3) a directing module that directs (a) the dense AI operation accelerator to accelerate identified dense AI operations, and (b) the sparse AI operation accelerator to accelerate identified sparse AI operations. The system may also include a physical processor that executes the receiving module, the identifying module, and the directing module. Various other methods, systems, and computer-readable media are also disclosed.

Inventors:
PETERSEN CHRISTIAN MARKUS (US)
VIJAYRAO NARSING KRISHNA (US)
Application Number:
PCT/US2022/049935
Publication Date:
May 25, 2023
Filing Date:
November 15, 2022
Assignee:
META PLATFORMS INC (US)
International Classes:
G06N3/048; G06F9/50; G06N3/063
Foreign References:
US20210240797A1 (2021-08-05)
Other References:
SOOJEONG KIM ET AL: "Parallax: Automatic Data-Parallel Training of Deep Neural Networks", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 8 August 2018 (2018-08-08), XP080902501
Attorney, Agent or Firm:
COLBY, Steven et al. (US)
Claims:
CLAIMS:

1. A system comprising: a disaggregated artificial intelligence (AI) operation accelerator comprising: a dense AI operation accelerator configured to accelerate dense AI operations; a sparse AI operation accelerator, physically separate from the dense AI operation accelerator, configured to accelerate sparse AI operations; and a scheduler comprising: a receiving module that receives an AI operation; an identifying module that identifies the AI operation as at least one of a dense AI operation or a sparse AI operation; and a directing module that directs: the dense AI operation accelerator to accelerate the AI operation when the identifying module identifies it as a dense AI operation; and the sparse AI operation accelerator to accelerate the AI operation when the identifying module identifies it as a sparse AI operation; and a physical processor that executes the receiving module, the identifying module, and the directing module.

2. The system of claim 1, wherein: the system further comprises an additional dense AI operation accelerator; and when the identifying module identifies the AI operation as a dense AI operation, the directing module directs at least one of the dense AI operation accelerator or the additional dense AI operation accelerator to accelerate the AI operation.

3. The system of claim 1, wherein: the system further comprises an additional sparse AI operation accelerator; and when the identifying module identifies the AI operation as a sparse AI operation, the directing module directs at least one of the sparse AI operation accelerator or the additional sparse AI operation accelerator to accelerate the sparse AI operation.

4. The system of claim 1, the disaggregated AI operation accelerator further comprising a high-bandwidth bus that communicatively couples the dense AI operation accelerator and the sparse AI operation accelerator.

5. The system of claim 1, the dense AI operation accelerator comprising: at least one of a wide matrix unit or a tensor unit; and a memory cache local to the dense AI operation accelerator and associated with at least one of the wide matrix unit or the tensor unit; and optionally, wherein: the identifying module identifies the AI operation as a dense AI operation; and the directing module directs the dense AI operation accelerator to accelerate the dense AI operation by: loading a set of AI operation data into the memory cache local to the dense AI operation accelerator; and directing the dense AI operation accelerator to execute the dense AI operation using the set of AI operation data loaded into the memory cache local to the dense AI operation accelerator.

6. The system of claim 1, the sparse AI operation accelerator comprising: a general-purpose compute unit; and a high-bandwidth memory local to the sparse AI operation accelerator.

7. The system of claim 6, wherein: the identifying module identifies the AI operation as a sparse AI operation; and the directing module directs the sparse AI operation accelerator to accelerate the sparse AI operation by: loading a set of AI operation data into the high-bandwidth memory local to the sparse AI operation accelerator; and directing the sparse AI operation accelerator to execute the sparse AI operation using the set of AI operation data loaded into the high-bandwidth memory local to the sparse AI operation accelerator; and/or the sparse AI operation accelerator further comprising at least one wide vector unit.

8. The system of claim 1, wherein the sparse AI operation accelerator is configured to execute an element-wise AI operation; and optionally, wherein the element-wise AI operation comprises at least one of: a rectified linear unit (ReLU) operation; a sigmoid operation; or a hyperbolic tangent (tanh) function.

9. The system of claim 1, wherein the AI operation comprises at least one of: an AI training operation; or an AI inference operation.

10. A computer-implemented method comprising: receiving, by a scheduler included in a disaggregated artificial intelligence (AI) operation accelerator, an AI operation; identifying, by the scheduler included in the disaggregated AI operation accelerator, the AI operation as at least one of a dense AI operation or a sparse AI operation; and directing, by the scheduler included in the disaggregated AI operation accelerator: a dense AI operation accelerator, included in the disaggregated AI operation accelerator and configured to accelerate dense AI operations, to accelerate the AI operation when the scheduler identifies the AI operation as a dense AI operation; and a sparse AI operation accelerator, included in the disaggregated AI operation accelerator but physically separate from the dense AI operation accelerator and configured to accelerate sparse AI operations, to accelerate the AI operation when the scheduler identifies the AI operation as a sparse AI operation.

11. The method of claim 10, wherein: identifying the AI operation comprises identifying the AI operation as a dense AI operation; and directing the dense AI operation accelerator to accelerate the AI operation comprises: loading a set of AI operation data into a memory cache local to the dense AI operation accelerator, the memory cache associated with at least one of a wide matrix unit included in the dense AI operation accelerator or a tensor unit included in the dense AI operation accelerator; and directing the dense AI operation accelerator to execute the AI operation using the set of AI operation data loaded into the memory cache local to the dense AI operation accelerator.

12. The method of claim 10, wherein: identifying the AI operation comprises identifying the AI operation as a sparse AI operation; directing the sparse AI operation accelerator to accelerate the AI operation comprises: loading a set of AI operation data into a high-bandwidth memory local to the sparse AI operation accelerator; and directing the sparse AI operation accelerator to execute the AI operation using the set of AI operation data loaded into the high-bandwidth memory local to the sparse AI operation accelerator.

13. The method of claim 10, wherein: the AI operation comprises a set of AI operation data; identifying the AI operation comprises: determining whether the set of AI operation data meets a threshold density value; when the set of AI operation data meets the threshold density value, designating the AI operation as a dense AI operation; and when the set of AI operation data does not meet the threshold density value, designating the AI operation as a sparse AI operation.

14. The method of claim 10, wherein: the AI operation comprises a set of AI operation parameters; identifying the AI operation comprises: determining whether the set of AI operation parameters correspond to a dense AI operation; when the set of AI operation parameters correspond to a dense AI operation, designating the AI operation as a dense AI operation; and when the set of AI operation parameters correspond to a sparse AI operation, designating the AI operation as a sparse AI operation.

15. A non-transitory computer-readable medium comprising computer-readable instructions that, when executed by at least one processor of a scheduler included in a disaggregated artificial intelligence (AI) operation accelerator, cause the scheduler to perform the computer-implemented method steps of any of claims 10 to 14.

Description:
SYSTEMS AND METHODS FOR DISAGGREGATED ACCELERATION OF ARTIFICIAL INTELLIGENCE OPERATIONS

TECHNICAL FIELD

[0001] This disclosure generally relates to a system, computer-implemented method and non-transitory computer-readable media for disaggregated acceleration of artificial intelligence (AI) operations.

BACKGROUND

[0002] AI models may provide increasingly important and accurate ways of making predictions based on given input data. Unfortunately, AI operations (e.g., training of AI models, making predictions using trained AI models, etc.) may be highly demanding and may require significant investments in physical computing infrastructure and/or electrical resources. In some conventional examples, hardware central processing units (CPUs) and/or hardware graphics processing units (GPUs) may be employed in devices and/or accelerators to accomplish various AI processes. Such conventional AI accelerators may incorporate various resources to perform an AI function (e.g., a training function, a prediction function, etc.) such as caches, specialized processors, complex networking hardware, and so forth. Unfortunately, such conventional devices may be inefficiently configured to perform different AI operations, and conventional AI operations may inefficiently utilize resources of such conventional accelerators.

[0003] At a high level, AI operations may be logically divided into sparse operations and dense operations. Sparse operations may refer to AI operations performed on sparse data, which may include data having a relatively low number of non-zero elements. Likewise, dense operations may refer to AI operations performed on dense data, which may include data having a relatively high number of non-zero elements. While the terms "sparse" and "dense" may be relatively loosely defined, a data element (e.g., a vector) may be referred to as k-sparse if it contains at most k non-zero entries. Put another way, such a vector's l_0 norm may be at most k.
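
By way of a concrete, non-limiting illustration of these definitions, the following Python sketch computes a vector's l_0 norm and tests k-sparsity; the function names and the use of NumPy are illustrative assumptions, not part of this disclosure.

```python
import numpy as np

def l0_norm(v: np.ndarray) -> int:
    # The l_0 "norm": the number of non-zero entries in v.
    return int(np.count_nonzero(v))

def is_k_sparse(v: np.ndarray, k: int) -> bool:
    # A vector is k-sparse if it contains at most k non-zero entries.
    return l0_norm(v) <= k

v = np.array([0.0, 3.1, 0.0, 0.0, -2.4, 0.0])
print(l0_norm(v))         # 2
print(is_k_sparse(v, 2))  # True: at most 2 non-zero entries
```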

[0004] In the context of neural networks and/or AI models, activations of units within a particular layer of an artificial neural network (ANN), weights of nodes within the ANN, and/or data within the ANN may be referred to as "sparse" or "dense". Additionally or alternatively, connectivity within portions of an ANN may be referred to as "sparse" or "dense". For example, a layer within an ANN may be referred to as having "sparse connectivity" in that only a small subset of elements within the layer may be connected to each other, whereas a layer may be referred to as having "dense connectivity" in that a relatively large subset of elements within the layer may be connected to each other.

[0005] Forcing both sparse and dense AI operations into a single conventional AI accelerator may prevent efficient use of resources included in the conventional AI accelerator. Such conventional, all-purpose AI accelerators may also have complex designs and hence may be difficult to implement, reproduce, and/or scale. Moreover, such conventional AI accelerators may have disadvantageous power usage characteristics, which may further result in a need for specialized cooling infrastructure. Hence, the systems and methods described herein identify and address a need for improved AI accelerators, systems, and/or methods.

SUMMARY OF THE INVENTION

[0006] According to a first aspect of the present disclosure, there is provided a system comprising: a disaggregated artificial intelligence (AI) operation accelerator comprising: a dense AI operation accelerator configured to accelerate dense AI operations; a sparse AI operation accelerator, physically separate from the dense AI operation accelerator, configured to accelerate sparse AI operations; and a scheduler comprising: a receiving module that receives an AI operation; an identifying module that identifies the AI operation as at least one of a dense AI operation or a sparse AI operation; and a directing module that directs: the dense AI operation accelerator to accelerate the AI operation when the identifying module identifies it as a dense AI operation; and the sparse AI operation accelerator to accelerate the AI operation when the identifying module identifies it as a sparse AI operation; and a physical processor that executes the receiving module, the identifying module, and the directing module.

[0007] In an embodiment, the system further comprises an additional dense AI operation accelerator; and when the identifying module identifies the AI operation as a dense AI operation, the directing module directs at least one of the dense AI operation accelerator or the additional dense AI operation accelerator to accelerate the AI operation.

[0008] In an embodiment, the system further comprises an additional sparse AI operation accelerator; and when the identifying module identifies the AI operation as a sparse AI operation, the directing module directs at least one of the sparse AI operation accelerator or the additional sparse AI operation accelerator to accelerate the sparse AI operation.

[0009] In an embodiment, the disaggregated AI operation accelerator further comprises a high-bandwidth bus that communicatively couples the dense AI operation accelerator and the sparse AI operation accelerator.

[0010] In an embodiment, the dense AI operation accelerator comprises: at least one of a wide matrix unit or a tensor unit; and a memory cache local to the dense AI operation accelerator and associated with at least one of the wide matrix unit or the tensor unit.

[0011] In an embodiment, the identifying module identifies the AI operation as a dense AI operation; and the directing module directs the dense AI operation accelerator to accelerate the dense AI operation by: loading a set of AI operation data into the memory cache local to the dense AI operation accelerator; and directing the dense AI operation accelerator to execute the dense AI operation using the set of AI operation data loaded into the memory cache local to the dense AI operation accelerator.

[0012] In an embodiment, the sparse AI operation accelerator comprises: a general-purpose compute unit; and a high-bandwidth memory local to the sparse AI operation accelerator.

[0013] In an embodiment, the identifying module identifies the AI operation as a sparse AI operation; and the directing module directs the sparse AI operation accelerator to accelerate the sparse AI operation by: loading a set of AI operation data into the high-bandwidth memory local to the sparse AI operation accelerator; and directing the sparse AI operation accelerator to execute the sparse AI operation using the set of AI operation data loaded into the high-bandwidth memory local to the sparse AI operation accelerator.

[0014] In an embodiment, the sparse AI operation accelerator further comprises at least one wide vector unit.

[0015] In an embodiment, the sparse AI operation accelerator is configured to execute an element-wise AI operation.

[0016] In an embodiment, the element-wise AI operation comprises at least one of: a rectified linear unit (ReLU) operation; a sigmoid operation; or a hyperbolic tangent (tanh) function.

[0017] In an embodiment, the AI operation comprises at least one of: an AI training operation; or an AI inference operation.

[0018] According to a second aspect of the present disclosure, there is provided a computer-implemented method comprising: receiving, by a scheduler included in a disaggregated artificial intelligence (AI) operation accelerator, an AI operation; identifying, by the scheduler included in the disaggregated AI operation accelerator, the AI operation as at least one of a dense AI operation or a sparse AI operation; and directing, by the scheduler included in the disaggregated AI operation accelerator: a dense AI operation accelerator, included in the disaggregated AI operation accelerator and configured to accelerate dense AI operations, to accelerate the AI operation when the scheduler identifies the AI operation as a dense AI operation; and a sparse AI operation accelerator, included in the disaggregated AI operation accelerator but physically separate from the dense AI operation accelerator and configured to accelerate sparse AI operations, to accelerate the AI operation when the scheduler identifies the AI operation as a sparse AI operation.

[0019] In an embodiment, identifying the AI operation comprises identifying the AI operation as a dense AI operation; and directing the dense AI operation accelerator to accelerate the AI operation comprises: loading a set of AI operation data into a memory cache local to the dense AI operation accelerator, the memory cache associated with at least one of a wide matrix unit included in the dense AI operation accelerator or a tensor unit included in the dense AI operation accelerator; and directing the dense AI operation accelerator to execute the AI operation using the set of AI operation data loaded into the memory cache local to the dense AI operation accelerator.

[0020] In an embodiment, identifying the AI operation comprises identifying the AI operation as a sparse AI operation; directing the sparse AI operation accelerator to accelerate the AI operation comprises: loading a set of AI operation data into a high-bandwidth memory local to the sparse AI operation accelerator; and directing the sparse AI operation accelerator to execute the AI operation using the set of AI operation data loaded into the high-bandwidth memory local to the sparse AI operation accelerator.

[0021] In an embodiment, the AI operation comprises a set of AI operation data; identifying the AI operation comprises: determining whether the set of AI operation data meets a threshold density value; when the set of AI operation data meets the threshold density value, designating the AI operation as a dense AI operation; and when the set of AI operation data does not meet the threshold density value, designating the AI operation as a sparse AI operation.

[0022] In an embodiment, the AI operation comprises a set of AI operation parameters; identifying the AI operation comprises: determining whether the set of AI operation parameters correspond to a dense AI operation; when the set of AI operation parameters correspond to a dense AI operation, designating the AI operation as a dense AI operation; and when the set of AI operation parameters correspond to a sparse AI operation, designating the AI operation as a sparse AI operation.

[0023] According to a third aspect of the present disclosure, there is provided a non-transitory computer-readable medium comprising computer-readable instructions that, when executed by at least one processor of a scheduler included in a disaggregated artificial intelligence (AI) operation accelerator, cause the scheduler to: receive an AI operation; identify the AI operation as at least one of a dense AI operation or a sparse AI operation; and direct: a dense AI operation accelerator, included in the disaggregated AI operation accelerator and configured to accelerate dense AI operations, to accelerate the AI operation when it is identified as a dense AI operation; and a sparse AI operation accelerator, included in the disaggregated AI operation accelerator but physically separate from the dense AI operation accelerator and configured to accelerate sparse AI operations, to accelerate the AI operation when it is identified as a sparse AI operation.

[0024] The third aspect of the present disclosure may comprise a non-transitory computer-readable medium comprising computer-readable instructions that, when executed by at least one processor of a scheduler included in a disaggregated artificial intelligence (AI) operation accelerator, cause the scheduler to perform any of the computer-implemented method steps of the second aspect.

[0025] In an embodiment, the third aspect further comprises computer-readable instructions that, when executed by the at least one processor of the scheduler, cause the scheduler to: identify the AI operation as a dense AI operation; and direct the dense AI operation accelerator to accelerate the dense AI operation by: loading a set of AI operation data into a memory cache local to the dense AI operation accelerator, the memory cache associated with at least one of a wide matrix unit included in the dense AI operation accelerator or a tensor unit included in the dense AI operation accelerator; and directing the dense AI operation accelerator to execute the dense AI operation using the set of AI operation data loaded into the memory cache local to the dense AI operation accelerator.

[0026] In an embodiment, the third aspect further comprises computer-readable instructions that, when executed by the at least one processor of the scheduler, cause the scheduler to: identify the AI operation as a sparse AI operation; direct the sparse AI operation accelerator to accelerate the sparse AI operation by: loading a set of AI operation data into a high-bandwidth memory local to the sparse AI operation accelerator; and directing the sparse AI operation accelerator to execute the sparse AI operation using the set of AI operation data loaded into the high-bandwidth memory local to the sparse AI operation accelerator.

BRIEF DESCRIPTION OF THE DRAWINGS

[0027] The accompanying drawings illustrate a number of exemplary embodiments and are a part of the specification. Together with the following description, these drawings demonstrate and explain various principles of the instant disclosure.

[0028] FIG. 1 is a block diagram of an example system that includes a disaggregated artificial intelligence (AI) operation accelerator.

[0029] FIG. 2 is a block diagram of an example system that includes a disaggregated AI operation accelerator.

[0030] FIG. 3 is a block diagram of an example system that includes a disaggregated AI operation accelerator.

[0031] FIG. 4 is a block diagram of an example disaggregated AI operation accelerator.

[0032] FIG. 5 is a block diagram of an example disaggregated AI operation accelerator having a plurality of dense accelerators and/or a plurality of sparse accelerators.

[0033] FIG. 6 is a block diagram of an example scheduler system for disaggregated acceleration of artificial intelligence operations.

[0034] FIG. 7 is a flow diagram of an example method for disaggregated acceleration of artificial intelligence operations.

[0035] Throughout the drawings, identical reference characters and descriptions indicate similar, but not necessarily identical, elements. While the exemplary embodiments described herein are susceptible to various modifications and alternative forms, specific embodiments have been shown by way of example in the drawings and will be described in detail herein. However, the exemplary embodiments described herein are not intended to be limited to the particular forms disclosed. Rather, the instant disclosure covers all modifications, equivalents, and alternatives falling within the scope of the appended claims.

DETAILED DESCRIPTION OF EXEMPLARY EMBODIMENTS

[0036] The present disclosure is generally directed to systems and methods for disaggregated acceleration of artificial intelligence operations. As will be explained in greater detail below, embodiments of the instant disclosure may include a disaggregated AI operation accelerator. The disaggregated AI operation accelerator may include a dense AI operation accelerator (also "dense AI accelerator" herein) configured to accelerate dense AI operations. The disaggregated AI operation accelerator may also include a sparse AI operation accelerator (also "sparse AI accelerator" herein), physically separate from the dense AI accelerator, and configured to accelerate sparse AI operations. Embodiments may also include a scheduler that may include various modules that may perform and/or direct various operations involving the disaggregated AI operation accelerator. For example, the scheduler may include a receiving module that may receive an AI operation and an identifying module that may identify the AI operation as a dense AI operation and/or a sparse AI operation. The scheduler may also include a directing module that may direct the dense AI accelerator to accelerate the AI operation when the identifying module identifies the AI operation as a dense AI operation and/or may direct the sparse AI accelerator to accelerate the AI operation when the identifying module identifies the AI operation as a sparse AI operation. In some embodiments, the scheduler may be implemented as part of a system (e.g., a computing device) that includes at least one physical processor that may execute the receiving module, the identifying module, and the directing module.

[0037] Embodiments of the systems and methods described herein may therefore effectively disaggregate AI operations into separate sparse and dense portions, thus enabling development of an accelerator design that is specifically built for each type of operation and/or function. In this new approach, the sparse AI accelerator and the dense AI accelerator may scale independently of each other as needed by a particular AI operation, task, and/or model. For example, an embodiment may have more sparse resources made available when an AI operation (e.g., training of an AI model, predicting an output from input data via a trained AI model, etc.) requires more sparse resources than dense resources. Likewise, an additional or alternative embodiment may have more dense resources made available when an additional AI operation requires more dense resources than sparse resources. As an illustration, multiple sparse AI accelerators could be connected to and/or included in a system that includes only one dense AI accelerator, or vice versa. This flexibility makes the systems and methods described herein highly scalable, especially in comparison to conventional approaches. Furthermore, the high flexibility of the systems and methods described herein may make such embodiments able to accelerate AI operations involving not only existing AI models, but future AI models as well.

[0038] The following will provide, with reference to FIGS. 1-6, detailed descriptions of systems for disaggregated acceleration of artificial intelligence operations. Detailed descriptions of corresponding computer-implemented methods will also be provided in connection with FIG. 7.

[0039] FIG. 1 is a block diagram of an example system 100 that includes a disaggregated AI operation accelerator in accordance with some embodiments described herein. As shown, example system 100 includes a disaggregated AI operation accelerator 102 that includes a dense AI accelerator 104, a sparse AI accelerator 106, and a scheduler 108. As will be described in greater detail below, one or more components of disaggregated AI operation accelerator 102 (e.g., scheduler 108) may receive one or more AI operations 110. Likewise, one or more components of disaggregated AI operation accelerator 102 (e.g., dense AI accelerator 104 and/or sparse AI accelerator 106) may process one or more AI operations 110 as directed by scheduler 108 to produce an AI accelerator output 112.

[0040] In some examples, dense AI accelerator 104 may include any suitable hardware and/or software components that may enable dense AI accelerator 104 to accelerate one or more dense AI operations. For example, dense AI accelerator 104 may include one or more matrix multiplication units, wide vector units, and/or tensor units that may be configured to operate efficiently on dense data (e.g., data having a relatively high number of non-zero values) and/or to efficiently execute operations that generally may apply to and/or use dense data (e.g., compute and/or tensor operations). Hence, dense AI accelerator 104 may be primarily (though not necessarily exclusively) focused on compute and/or tensor operations involving an AI model (e.g., training the AI model, predicting a result from input data via the AI model, etc.).

[0041] In some embodiments, sparse AI accelerator 106 may include any suitable hardware and/or software components that may enable sparse AI accelerator 106 to accelerate one or more sparse AI operations. For example, sparse AI accelerator 106 may include memory, such as high-bandwidth memory (also "HBM" herein) and/or other forms of memory that may be configured to store and/or operate on sparse data (e.g., data having a relatively low number of non-zero values) and/or to efficiently execute operations that generally may apply to and/or use sparse data (e.g., element-wise operations). Sparse AI accelerator 106 may also include one or more vector units that may enable element-wise operations including, but not limited to, rectified linear unit (ReLU) operations, sigmoid operations, hyperbolic tangent (tanh) functions, and so forth. Hence, sparse AI accelerator 106 may be primarily (though not necessarily exclusively) focused on memory embedding and/or other memory operations (e.g., memory management, sparse data processing, etc.) involving an AI model.
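
As a hedged illustration of the element-wise operations just mentioned, the following Python sketch evaluates ReLU, sigmoid, and tanh over a mostly-zero activation vector; NumPy stands in for the wide vector units of a sparse AI accelerator, and all names here are illustrative rather than disclosed.

```python
import numpy as np

def relu(x: np.ndarray) -> np.ndarray:
    # Rectified linear unit: element-wise max(x, 0).
    return np.maximum(x, 0.0)

def sigmoid(x: np.ndarray) -> np.ndarray:
    # Logistic sigmoid: element-wise 1 / (1 + e^(-x)).
    return 1.0 / (1.0 + np.exp(-x))

# A mostly-zero (sparse) activation vector.
activations = np.array([0.0, -1.5, 0.0, 2.0, 0.0])
print(relu(activations))      # [0. 0. 0. 2. 0.]
print(sigmoid(activations))
print(np.tanh(activations))   # tanh is itself element-wise
```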

[0042] As shown in FIG. 1, in some examples, one or more elements of disaggregated AI operation accelerator 102 (e.g., scheduler 108) may receive AI operations 110. In some examples, AI operations 110 may be referred to as "AI operations" in that they may serve as input to one or more elements of disaggregated AI operation accelerator 102. In some examples, AI operations (e.g., AI operations 110) may include any suitable data set including, but not limited to, one or more trained AI models, one or more AI model training parameters, AI model training data, feature inputs to be run through a trained AI model, and so forth.

[0043] FIG. 2 is a block diagram of an example system 200 that includes a disaggregated AI operation accelerator in accordance with some embodiments described herein. Example system 200 may illustrate an example embodiment of a system that may be configured to use a disaggregated AI operation accelerator (e.g., disaggregated AI operation accelerator 102) to train an AI model. As shown, example system 200 includes a disaggregated AI operation accelerator 102 that includes a dense AI accelerator 104, a sparse AI accelerator 106, and a scheduler 108. As will be described in greater detail below, one or more components of disaggregated AI operation accelerator 102 (e.g., scheduler 108) may receive AI operations in a form of AI model training data 202 and/or AI model training parameters 204. Likewise, one or more components of disaggregated AI operation accelerator 102 (e.g., dense AI accelerator 104 and/or sparse AI accelerator 106) may process AI model training data 202 and/or AI model training parameters 204 as directed by scheduler 108 to produce a trained AI model 206.

[0044] As shown in FIG. 2, in some examples, one or more elements of disaggregated AI operation accelerator 102 (e.g., scheduler 108) may receive AI operations in a form of AI model training data 202 and/or AI model training parameters 204. In some examples, AI model training data 202 and/or AI model training parameters 204 may be referred to as "AI model training operations" in that they may serve as input to one or more elements of disaggregated AI operation accelerator 102. In some examples, AI model training data (e.g., AI model training data 202) may include any data set input into a training algorithm and used to train an AI model, such as training data sets, validation data sets, and/or test data sets. Likewise, in some embodiments, an AI model training parameter (e.g., AI model training parameters 204) may include any value, setting, parameter, and so forth associated with an AI model that may be predetermined in advance of a training process.

[0045] In AI and/or machine learning contexts, a model may be defined and/or represented by model parameters. Training parameters may include parameters that may control the learning process. In some examples, training parameters may be referred to as "hyperparameters" in that they may influence and/or control the learning process and the model parameters that may result therefrom. Training parameters may be determined (e.g., selected by a user, determined as a result of a selection process, etc.) in advance of training of the model. In some examples, training parameters and/or hyperparameters may be considered external to an AI model because, while used by a learning algorithm, they may not be included as part of a resulting trained model. Examples may include, without limitation, a train-test split ratio, a learning rate in an optimization algorithm (e.g., gradient descent), a choice of optimization algorithm (e.g., gradient descent, stochastic gradient descent, Adam optimizer, etc.), a choice of activation function in a neural network layer (e.g., sigmoid, ReLU, tanh), a choice of cost or loss function, a number of hidden layers in a neural network, a number of activation units in each layer, a dropout probability, a number of iterations or epochs in training of a neural network, a number of clusters in a clustering task, a kernel or filter size in a convolutional layer, a pooling size, a batch size, and so forth.
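
For illustration only, a set of such training parameters might be collected in a configuration object like the following Python sketch; the specific keys and values are hypothetical examples drawn from the list above, not parameters prescribed by this disclosure.

```python
# Hypothetical hyperparameters fixed in advance of training; every
# value here is an arbitrary example, not a recommended setting.
training_parameters = {
    "train_test_split": 0.8,      # train-test split ratio
    "learning_rate": 1e-3,        # step size used by the optimizer
    "optimizer": "adam",          # choice of optimization algorithm
    "activation": "relu",         # activation function per layer
    "loss": "cross_entropy",      # choice of cost/loss function
    "hidden_layers": 4,           # number of hidden layers
    "units_per_layer": 256,       # activation units in each layer
    "dropout_probability": 0.1,
    "epochs": 20,                 # passes over the training set
    "batch_size": 128,
}
```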

[0046] Trained AI model 206 may include any model, program, tool, algorithm, process, and so forth, based on a predefined data set, that, when provided with input data, may arrive at an inference regarding the input data. In some examples, trained AI model 206 may include a program that has been trained on a predefined training data set (also called a "training set") to recognize patterns from input data that may differ from and/or are congruent with the training data set. In some examples, trained AI model 206 may include and/or represent a supervised, unsupervised, and/or reinforcement-based machine learning model. In some examples, trained AI model 206 may include or represent, without limitation, an ANN such as a deep learning model, an autoencoder, a multilayer perceptron, a recurrent neural network, a convolutional neural network (CNN), and so forth. In some examples, trained AI model 206 may include a portion of (e.g., a layer of) another trained AI model.

[0047] Hence, in embodiments such as the example illustrated in FIG. 2, scheduler 108, included in disaggregated AI operation accelerator 102, may receive AI operations in the form of AI model training data 202 and/or AI model training parameters 204. Scheduler 108 may identify each received AI operation as a dense AI operation and/or a sparse AI operation. Scheduler 108 may then direct dense AI accelerator 104 to execute AI operations identified as dense AI operations. Likewise, scheduler 108 may also direct sparse AI accelerator 106 to accelerate AI operations identified as sparse AI operations. Acceleration of AI model training data 202 and/or AI model training parameters 204 via disaggregated AI operation accelerator 102 may thus result in trained AI model 206.

[0048] FIG. 3 is a block diagram of an example system 300 that includes a disaggregated AI operation accelerator in accordance with some embodiments described herein. Example system 300 may illustrate an example embodiment of a system that may be configured to use a disaggregated AI operation accelerator (e.g., disaggregated AI operation accelerator 102) to make an inference (e.g., inference 306) regarding input data (e.g., feature inputs 302) via a trained AI model (e.g., trained AI model 304). As shown, example system 300 includes a disaggregated AI operation accelerator 102 that includes a dense AI accelerator 104, a sparse AI accelerator 106, and a scheduler 108. As will be described in greater detail below, one or more components of disaggregated AI operation accelerator 102 (e.g., scheduler 108) may receive AI operations in a form of feature inputs 302 and/or trained AI model 304. Likewise, one or more components of disaggregated AI operation accelerator 102 (e.g., dense AI accelerator 104 and/or sparse AI accelerator 106) may process feature inputs 302 and/or trained AI model 304 as directed by scheduler 108 to produce an inference 306.

[0049] As shown in FIG. 3, in some examples, one or more elements of disaggregated AI operation accelerator 102 (e.g., scheduler 108) may receive AI operations in a form of feature inputs 302 and/or trained AI model 304. In some examples, feature inputs 302 and/or trained AI model 304 may be referred to as "AI inference operations" in that they may serve as input to one or more elements of disaggregated AI operation accelerator 102. In some examples, feature inputs (e.g., feature inputs 302) may include any data set to be input into at least a portion of a trained AI model (e.g., trained AI model 304) to produce an inference regarding and/or associated with the feature inputs. Likewise, in some embodiments, a trained AI model (e.g., trained AI model 304) may include any AI model that has been previously trained to make inferences regarding one or more feature inputs.

[0050] Like trained AI model 206, trained AI model 304 may include any model, program, tool, algorithm, process, and so forth, based on a predefined data set, that, when provided with input data, may arrive at an inference regarding the input data. Also like trained AI model 206, in some examples, trained AI model 304 may include a program that has been trained on a predefined training data set (also called a "training set") to recognize patterns from input data that may differ from and/or are congruent with the training data set. In some examples, trained AI model 304 may include and/or represent a supervised, unsupervised, and/or reinforcement-based machine learning model. In some examples, trained AI model 304 may include or represent, without limitation, an ANN such as a deep learning model, an autoencoder, a multilayer perceptron, a recurrent neural network, a CNN, and so forth. In some examples, trained AI model 304 may include a portion of (e.g., a layer of) another trained AI model.

[0051] Hence, in embodiments such as the example illustrated in FIG. 3, scheduler 108, included in disaggregated AI operation accelerator 102, may receive AI operations in the form of feature inputs 302 and/or trained AI model 304. Scheduler 108 may identify each received AI operation (e.g., each of feature inputs 302 and/or each portion of trained AI model 304) as a dense AI operation or a sparse AI operation. Scheduler 108 may then direct dense AI accelerator 104 to execute AI operations identified as dense AI operations. Likewise, scheduler 108 may also direct sparse AI accelerator 106 to accelerate AI operations identified as sparse AI operations. Acceleration of evaluation of feature inputs 302 using trained AI model 304 via disaggregated AI operation accelerator 102 may thus result in an inference 306. In some examples, inference 306 may include any suitable representation of an inference regarding and/or associated with feature inputs 302 such as, without limitation, a score, a probability, a threshold, a binary value, representations thereof, and so forth.

[0052] FIG. 4 is a block diagram of an example disaggregated AI operation accelerator 400 in accordance with some embodiments described herein. Disaggregated AI operation accelerator 400 may be an example and/or a detailed illustration of disaggregated AI operation accelerator 102. As shown, disaggregated AI operation accelerator 400 may include dense AI accelerator 104 and sparse AI accelerator 106.

[0053] As illustrated, in some embodiments, dense AI accelerator 104 may be separate and distinct from sparse AI accelerator 106. In some examples, dense AI accelerator 104 may be physically and/or logically separate from sparse AI accelerator 106. By way of illustration, in some examples, dense AI accelerator 104 may be included as part of a primary integrated circuit and sparse AI accelerator 106 may be included as part of a secondary integrated circuit. In additional examples, dense AI accelerator 104 may communicate with sparse AI accelerator 106 via a suitable high-bandwidth bus. In the example illustrated in FIG. 4, dense AI accelerator 104 may be communicatively coupled to sparse AI accelerator 106 via high-bandwidth bus 402. High-bandwidth bus 402 may include any suitable bus or communication facility that may enable separate dense AI accelerator 104 and sparse AI accelerator 106 to communicate AI operations, AI data, and/or output data one with another. For example, and not by way of limitation, high-bandwidth bus 402 may include an internal bus such as an internal data bus, a memory bus, a front-side bus, and/or an external or expansion bus.

[0054] As mentioned above and as illustrated in FIG. 4, dense AI accelerator 104 may include a vector unit 404. In some examples, vector unit 404 may include any hardware or software processor that implements an instruction set designed to operate efficiently and effectively on one-dimensional or multidimensional arrays of data called vectors. Vector units or vector processors may improve performance on certain workloads, such as some machine learning tasks. Although not illustrated in FIG. 4, dense AI accelerator 104 may also include any suitable memory and/or storage device that may receive and/or store preliminary, initial, intermediary, and/or final data for one or more vector operations supported and/or executed by vector unit 404.

[0055] Sparse AI accelerator 106 may include a general-purpose compute unit 406 and a high-bandwidth memory 408. General-purpose compute unit 406 may include any suitable processor that may be configured to efficiently execute sparse AI operations in hardware. In some examples, general-purpose compute unit 406 may include and/or may implement an instruction set directed to executing sparse AI operations. As mentioned above, sparse AI accelerator 106 may also include one or more wide vector units that may enable and/or execute element-wise operations like ReLU, sigmoid, tanh, and similar operations. As shown, sparse AI accelerator 106 may also include a high-bandwidth memory 408. High-bandwidth memory 408 may include or represent any suitable memory and/or storage device that may receive and/or store preliminary, initial, intermediary, and/or final data for one or more sparse AI operations supported by and/or executed by general-purpose compute unit 406.

[0056] It may be clear that the design of disaggregated AI operation accelerator 102 may be highly modular and may support addition of any suitable number of dense AI accelerators and/or sparse AI accelerators to efficiently accelerate a desired AI operation or process. For example, FIG. 5 shows a block diagram of an example disaggregated AI operation accelerator 500 having a plurality of dense accelerators and/or a plurality of sparse accelerators. As shown in this example, dense AI accelerator 104 may be paired with at least one additional dense AI accelerator 504. Likewise, sparse AI accelerator 106 may be paired with at least one additional sparse AI accelerator 506. The dense AI accelerator(s) (e.g., dense AI accelerator 104 and/or additional dense AI accelerator 504) may be communicatively coupled to the sparse AI accelerator(s) (e.g., sparse AI accelerator 106 and/or additional sparse AI accelerator 506) via high-bandwidth bus 402. In this way, the design of disaggregated AI operation accelerator 500 may allow sparse and dense functions of disaggregated AI operation accelerator 500 to scale independently of each other as dictated by, required by, and/or beneficial to the efficient training and/or use of an AI model by disaggregated AI operation accelerator 500.

[0057] An important feature of the systems and methods described herein may be a scheduler (e.g., scheduler 108) that effectively and efficiently orchestrates operations of the disaggregated AI operation accelerator. At a high level, such a scheduler (e.g., scheduler 108) may distinguish dense AI operations from sparse AI operations. The scheduler may also direct a suitable dense AI accelerator (e.g., dense AI accelerator 104, additional dense AI accelerator 504, etc.) to accelerate the dense AI operations and/or may direct a suitable sparse AI accelerator (e.g., sparse AI accelerator 106, additional sparse AI accelerator 506, etc.) to accelerate the sparse AI operations. The scheduler may further collect results of the accelerated AI operations. As shown in FIG. 1, scheduler 108 may include any suitable hardware and/or software system that receives AI operations, identifies dense AI operations and/or sparse AI operations, and directs dense AI accelerator 104 and/or sparse AI accelerator 106 to execute respective dense AI operations and/or sparse AI operations. In some examples, scheduler 108 may also collect results of various AI operations executed by one or more components of the disaggregated AI operation accelerator.
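
At this level of abstraction, the scheduler's receive/identify/direct/collect flow could be sketched in Python as follows; the function names, the classify callable, and the accelerate() interface are illustrative assumptions rather than the disclosed implementation.

```python
from typing import Callable, Iterable, List

DENSE, SPARSE = "dense", "sparse"

def schedule(operations: Iterable,
             dense_accel,
             sparse_accel,
             classify: Callable) -> List:
    """Receive AI operations, identify each as dense or sparse,
    direct the matching accelerator, and collect the results."""
    results = []
    for op in operations:                      # receiving
        kind = classify(op)                    # identifying
        if kind == DENSE:                      # directing
            results.append(dense_accel.accelerate(op))
        else:
            results.append(sparse_accel.accelerate(op))
    return results                             # collecting results
```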

[0058] FIG. 6 is a block diagram of an example scheduler system 600 for disaggregated acceleration of artificial intelligence operations as described herein. In some examples, example scheduler system 600 may be an example and/or implementation of scheduler 108. As illustrated in this figure, example scheduler system 600 may include one or more modules 602 for performing one or more tasks. Modules 602 may be included in a memory 620 in communication with a physical processor 630, a data store 640, and a disaggregated AI operation accelerator 650.

[0059] As will be explained in greater detail below, modules 602 may include a receiving module 604 that receives an AI operation (e.g., one of AI operations 642 included in data store 640) and an identifying module 606 that identifies the AI operation as a dense AI operation and/or a sparse AI operation. Example scheduler system 600 may also include a directing module 608 that directs a dense AI accelerator (e.g., dense AI accelerator 652 included in disaggregated AI operation accelerator 650) to accelerate the AI operation when identifying module 606 identifies the AI operation as a dense AI operation, and that directs a sparse AI accelerator (e.g., sparse AI accelerator 654 included in disaggregated AI operation accelerator 650) to accelerate the AI operation when identifying module 606 identifies the AI operation as a sparse AI operation.

[0060] As further illustrated in FIG. 6, example scheduler system 600 may also include one or more memory devices, such as memory 620. Memory 620 generally represents any type or form of volatile or non-volatile storage device or medium capable of storing data and/or computer-readable instructions. In one example, memory 620 may store, load, and/or maintain one or more of modules 602. Examples of memory 620 include, without limitation, Random Access Memory (RAM), Read Only Memory (ROM), flash memory, Hard Disk Drives (HDDs), Solid-State Drives (SSDs), optical disk drives, caches, variations or combinations of one or more of the same, or any other suitable storage memory.

[0061] As further illustrated in FIG. 6, example scheduler system 600 may also include one or more physical processors, such as physical processor 630. Physical processor 630 generally represents any type or form of hardware-implemented processing unit capable of interpreting and/or executing computer-readable instructions. In one example, physical processor 630 may access and/or modify one or more of modules 602 stored in memory 620. Additionally or alternatively, physical processor 630 may execute one or more of modules 602 to facilitate disaggregated acceleration of artificial intelligence operations. Examples of physical processor 630 may include, without limitation, microprocessors, microcontrollers, central processing units (CPUs), Field-Programmable Gate Arrays (FPGAs) that implement softcore processors, Application-Specific Integrated Circuits (ASICs), portions of one or more of the same, variations or combinations of one or more of the same, or any other suitable physical processor.

[0062] As also shown in FIG. 6, example scheduler system 600 may also include (e.g., be in communication with) one or more data stores, such as data store 640, that may receive, store, and/or maintain data. Data store 640 may represent portions of a single data store or computing device or a plurality of data stores or computing devices. In some embodiments, data store 640 may be a logical container for data and may be implemented in various forms (e.g., a database, a file, a file system, a data structure, etc.). Examples of data store 640 may include, without limitation, files, file systems, data stores, databases, and/or database management systems such as an operational data store (ODS), a relational database, a NoSQL database, a NewSQL database, and/or any other suitable organized collection of data.

[0063] In at least one example, data store 640 may include (e.g., store, host, access, maintain, etc.) AI operations 642. As explained above, in some examples, AI operations 642 may include any data that may serve as input to one or more elements of a disaggregated AI operation accelerator (e.g., disaggregated AI operation accelerator 102, disaggregated AI operation accelerator 650, etc.) such as AI model training data (e.g., AI model training data 202), AI model training parameters (e.g., AI model training parameters 204), feature inputs (e.g., feature inputs 302), trained AI models (e.g., trained AI model 304), and so forth.

[0064] As further shown in FIG. 6, example scheduler system 600 may include (e.g., may be in communication with) a disaggregated AI operation accelerator 650 that may include a dense AI accelerator 652 and a sparse AI accelerator 654. Disaggregated AI operation accelerator 650 may include and/or represent any of the disaggregated AI operation accelerators described herein (e.g., disaggregated AI operation accelerator 102, disaggregated AI operation accelerator 400, disaggregated AI operation accelerator 500, etc.).

[0065] Example scheduler system 600 in FIG. 6 may be implemented in any suitable way. For example, a computing device (e.g., a user device and/or server) having at least one processor may be programmed with one or more of modules 602. In at least one embodiment, one or more of modules 602 may, when executed by the computing device, enable the computing device to perform one or more operations to disaggregate AI model training operations. For example, receiving module 604 may cause the computing device to receive (e.g., from data store 640) an AI model training operation (e.g., one or more of AI operations 642). Furthermore, identifying module 606 may cause the computing device to identify the AI model training operation as a dense AI training operation or a sparse AI training operation. Moreover, directing module 608 may cause the computing device to direct the dense AI accelerator to accelerate the AI model training operation when identifying module 606 identifies the AI model training operation as a dense AI training operation. Directing module 608 may also cause the computing device to direct the sparse AI accelerator to accelerate the AI model training operation when identifying module 606 identifies the AI model training operation as a sparse AI training operation.

[0066] Additionally, in some examples, a scheduler device as described herein (e.g., scheduler 108, example scheduler system 600, etc.) may, when a disaggregated AI operation accelerator includes multiple dense AI accelerators and/or sparse AI accelerators, determine which AI accelerator should execute an identified AI training operation and direct the selected AI accelerator to accelerate (e.g., execute using resources of the accelerator) the identified AI training operation. Hence, a scheduler device (e.g., scheduler 108, example scheduler system 600, etc.) may also perform a load balancing function among multiple dense and/or sparse AI accelerators included as part of a disaggregated AI operation accelerator (e.g., disaggregated AI operation accelerator 102, disaggregated AI operation accelerator 400, disaggregated AI operation accelerator 500, and so forth).

[0067] For example, as described above in reference to FIG. 5, a disaggregated AI operation accelerator may be configured with a dense AI accelerator and an additional dense AI accelerator. When identifying module 606 identifies an AI training operation as a dense AI training operation, one or more of modules 602 (e.g., identifying module 606, directing module 608, etc.) may select the dense AI accelerator or the additional dense AI accelerator (e.g., based on a workload currently being accelerated by the dense AI accelerators) to direct to execute the AI training operation. Directing module 608 may then direct the selected dense AI accelerator (e.g., the dense AI accelerator or the additional dense AI accelerator) to accelerate (e.g., execute using resources of the accelerator) the AI training operation.
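
One plausible, hypothetical realization of this load-balancing selection is sketched below in Python; the current_load attribute and both function names are assumptions made for illustration, not elements of the disclosed design.

```python
# Select the least-loaded dense accelerator and direct it to
# accelerate the identified dense AI operation. `current_load` is a
# hypothetical per-accelerator workload metric.
def select_dense_accelerator(dense_accelerators):
    return min(dense_accelerators, key=lambda a: a.current_load)

def direct_dense(operation, dense_accelerators):
    chosen = select_dense_accelerator(dense_accelerators)
    return chosen.accelerate(operation)
```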

[0068] Many other devices or subsystems may be connected to example scheduler system 600 in FIG. 6. Conversely, all of the components and devices illustrated in FIG. 6 need not be present to practice the embodiments described and/or illustrated herein. The devices and subsystems referenced above may also be interconnected in different ways from those shown in FIG. 6. Example scheduler system 600 may also employ any number of software, firmware, and/or hardware configurations. For example, one or more of the example embodiments of scheduler 108 and/or example scheduler system 600 disclosed herein may be encoded as a computer program (also referred to as computer software, software applications, computer-readable instructions, and/or computer control logic) on a computer-readable medium.

[0069] FIG. 7 is a flow diagram of an example computer-implemented method 700 for disaggregated acceleration of artificial intelligence operations. The steps shown in FIG. 7 may be performed by any suitable computer-executable code and/or computing system, including scheduler 108 in FIG. 1, example scheduler system 600 in FIG. 6, and/or variations or combinations of one or more of the same. In one example, each of the steps shown in FIG. 7 may represent an algorithm whose structure includes and/or is represented by multiple sub-steps, examples of which will be provided in greater detail below.

[0070] As illustrated in FIG. 7, at step 710, one or more of the systems described herein may receive an AI operation. For example, receiving module 604 may, as part of a scheduler device (e.g., scheduler 108, example scheduler system 600, etc.), cause the scheduler device to receive at least one of AI operations 642 stored and/or maintained by data store 640. Receiving module 604 may receive the AI operation in any of the ways described herein, such as via any suitable data connection that may couple receiving module 604 to data store 640.

[0071] At step 720, one or more of the systems described herein may identify the AI operation as at least one of a dense AI operation or a sparse AI operation. For example, identifying module 606 may, as part of a scheduler device (e.g., scheduler 108, example scheduler system 600, etc.), cause the scheduler device to identify the received AI operation as a dense AI operation or a sparse AI operation.

[0072] Identifying module 606 may identify the AI operation in a variety of contexts. For example, identifying module 606 may identify the AI operation by determining that the AI operation is included in a predefined set of dense and/or sparse AI operations. By way of illustration, identifying module 606 may identify an AI operation by determining that the AI operation is an AI model training parameter specifying a matrix multiplication operation as applied to a set of AI model training data. As mentioned above, this type of AI training operation may be classified as a dense AI training operation. Hence, identifying module 606 may identify the AI model training parameter (and any associated AI model training data) as a dense AI training operation.

[0073] As an additional example, identifying module 606 may identify an additional Al model training operation as an Al model training parameter specifying a ReLU operation associated with an additional set of Al model training data. As described above, this type of operation may be classified as a sparse Al training operation. Hence, identifying module 606 may identify the Al model training parameter (and any associated Al model training data) as a sparse Al training operation.
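
As a purely illustrative sketch of this parameter-based classification, an Al operation name might be checked against predefined sets of dense and sparse operation types; the sets and function below are hypothetical examples, not an exhaustive taxonomy:

    DENSE_OPERATIONS = {"matmul", "convolution"}       # e.g., matrix multiplication
    SPARSE_OPERATIONS = {"relu", "embedding_lookup"}   # e.g., ReLU

    def identify_by_parameter(operation_name: str) -> str:
        # Classify an Al operation by membership in a predefined set.
        if operation_name in DENSE_OPERATIONS:
            return "dense"
        if operation_name in SPARSE_OPERATIONS:
            return "sparse"
        raise ValueError(f"unclassified Al operation: {operation_name}")

    print(identify_by_parameter("matmul"))  # dense
    print(identify_by_parameter("relu"))    # sparse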

[0074] In some embodiments, identifying module 606 may identify Al operation data (and any associated Al models and/or parameters) as dense Al operations or sparse Al operations based on a density of non-zero data elements included in the Al operation data. For example, receiving module 604 may receive an Al operation, and identifying module 606 may identify the Al operation as Al model training data (e.g., Al model training data 202). Identifying module 606 may analyze the Al model training data and may determine that the Al model training data has a density of non-zero data elements greater than a threshold density. Hence, identifying module 606 may identify the Al model training data (and any associated Al model training parameters) as dense Al training operations.

[0075] Conversely, identifying module 606 may analyze the Al model training data and may determine that the Al model training data has a density of non-zero elements less than or equal to the threshold density. Hence, identifying module 606 may identify the Al model training data (and any associated Al model training parameters) as sparse Al training operations.
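
As a purely illustrative sketch of this density-based identification, the density of non-zero data elements might be compared against a threshold as follows; the 0.5 threshold is an arbitrary placeholder, not a value taken from this disclosure:

    import numpy as np

    def identify_by_density(training_data: np.ndarray, threshold: float = 0.5) -> str:
        # Density of non-zero data elements in the Al model training data;
        # greater than the threshold -> dense, otherwise -> sparse.
        density = np.count_nonzero(training_data) / training_data.size
        return "dense" if density > threshold else "sparse"

    print(identify_by_density(np.ones((4, 4))))  # "dense" (density 1.0)
    print(identify_by_density(np.eye(4)))        # "sparse" (density 0.25)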

[0076] In accordance with principles disclosed herein, embodiments of the systems and methods described herein may similarly accelerate other Al operations such as inference operations. For example, receiving module 604 may receive an Al operation by receiving a feature input for a trained Al model (e.g., one or more of feature inputs 302) and/or an Al model trained to make inferences regarding feature inputs (e.g., trained Al model 304). Identifying module 606 may analyze the feature input and/or the trained Al model and may determine that the trained Al model may generate an inference regarding the feature input more efficiently using a dense Al accelerator than a sparse Al accelerator, or vice versa. Hence, identifying module 606 may identify the inference to be made regarding the feature input via the trained Al model as a dense Al operation and/or a sparse Al operation, and directing module 608 may direct dense Al accelerator 652 and/or sparse Al accelerator 654 to accelerate the inference operation.
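
As a purely illustrative sketch of this efficiency-based identification, a hypothetical cost model (not part of this disclosure) might be consulted to route an inference operation:

    def route_inference(trained_model, feature_input, cost_model):
        # cost_model is a hypothetical callable estimating execution cost of the
        # inference on each accelerator type; lower cost is more efficient.
        dense_cost = cost_model(trained_model, feature_input, target="dense")
        sparse_cost = cost_model(trained_model, feature_input, target="sparse")
        return "dense" if dense_cost <= sparse_cost else "sparse"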

[0077] By dynamically identifying Al operations (e.g., Al model training data and/or Al model training parameters) as dense or sparse Al model training operations, the systems and methods described herein may dynamically and effectively direct dense Al training operations toward purpose-built dense Al accelerators and sparse Al training operations toward purpose-built sparse Al accelerators.

[0078] Hence, returning to FIG. 7, at step 730, one or more of the systems described herein may direct a dense Al accelerator, included in a disaggregated Al operation accelerator and configured to accelerate dense Al operations, to accelerate the Al operation when it is identified as a dense Al operation. For example, directing module 608 may, as part of a scheduler device (e.g., scheduler 108, example scheduler system 600, etc.), cause the scheduler device to direct dense Al accelerator 652, included in disaggregated Al operation accelerator 650, to accelerate the identified Al operation.

[0079] Directing module 608 may direct dense Al accelerator 652 to accelerate the Al operation in any suitable way. For example, as described above, a dense Al accelerator (e.g., dense Al accelerator 104, dense Al accelerator 652, etc.) may include a vector unit, a wide matrix unit, and/or a tensor unit. The dense Al accelerator may also include a memory cache local to the dense Al accelerator and associated with the vector unit, the wide matrix unit, and/or the tensor unit. When identifying module 606 identifies an Al operation as a dense Al operation, directing module 608 may direct dense Al accelerator 652 to accelerate the dense Al operation by (1) loading a set of Al data (e.g., Al model training data, Al model training parameters, feature inputs, a trained Al model, etc.) into the memory cache local to the dense Al accelerator, and (2) directing the dense Al accelerator to execute the dense Al operation (e.g., via the vector unit, the wide matrix unit, and/or the tensor unit) using the set of Al data loaded into the memory cache local to the dense Al accelerator.
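
By way of a non-limiting illustration, this two-step direction of a dense Al accelerator might be sketched as follows; the DenseAccelerator class and its methods are hypothetical stand-ins for dense Al accelerator 652:

    class DenseAccelerator:
        def __init__(self):
            self.local_cache = None  # memory cache local to the accelerator

        def load_cache(self, ai_data):
            self.local_cache = ai_data            # step (1): load the local cache

        def execute(self, dense_operation):
            # step (2): execute on the cached data (e.g., matrix/tensor unit work)
            return dense_operation(self.local_cache)

    accelerator = DenseAccelerator()
    accelerator.load_cache([[1.0, 2.0], [3.0, 4.0]])
    result = accelerator.execute(lambda data: [[2 * x for x in row] for row in data])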

[0080] Returning to FIG. 7, at step 740, one or more of the systems described herein may direct a sparse Al accelerator, included in a disaggregated Al operation accelerator and configured to accelerate sparse Al operations, to accelerate the Al operation when it is identified as a sparse Al operation. For example, directing module 608 may, as part of a scheduler device (e.g., scheduler 108, example scheduler system 600, etc.), cause the scheduler device to direct sparse Al accelerator 654, included in disaggregated Al operation accelerator 650, to accelerate the identified Al operation.

[0081] Directing module 608 may direct sparse Al accelerator 654 to accelerate the Al operation in any suitable way. For example, as described above, sparse Al accelerator 654 may include a general-purpose compute unit and a high-bandwidth memory local to the sparse Al accelerator. When identifying module 606 identifies the Al operation as a sparse Al operation, directing module 608 may direct sparse Al accelerator 654 to accelerate the sparse Al operation by (1) loading a set of Al data (e.g., Al model training data, Al model training parameters, feature inputs, a trained Al model, etc.) into the high-bandwidth memory local to the sparse Al accelerator, and (2) directing the sparse Al accelerator to execute the sparse Al operation using the set of Al data loaded into the high-bandwidth memory local to the sparse Al accelerator.
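
By way of a non-limiting illustration, the corresponding two-step direction of a sparse Al accelerator might be sketched as follows; the SparseAccelerator class and its methods are hypothetical stand-ins for sparse Al accelerator 654:

    class SparseAccelerator:
        def __init__(self):
            self.high_bandwidth_memory = None  # HBM local to the accelerator

        def load_hbm(self, ai_data):
            self.high_bandwidth_memory = ai_data  # step (1): load the local HBM

        def execute(self, sparse_operation):
            # step (2): execute on the HBM-resident data (e.g., on a
            # general-purpose compute unit)
            return sparse_operation(self.high_bandwidth_memory)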

[0082] As discussed throughout the instant disclosure, the disclosed systems and methods may provide one or more advantages over traditional options for accelerating Al operations. For example, by disaggregating Al operations into independent sparse and dense portions, systems and methods described herein may effectively utilize two different accelerator architectures, each targeting a specific category of functions (e.g., dense functions versus sparse functions), thus resulting in a more efficient Al training solution.

[0083] In some examples, the dense accelerators described herein may include wide matrix and tensor units and an associated cache. This may simplify accelerator design tremendously and may also provide an appropriately sized solution for specific dense training applications. The sparse accelerators described herein may be built around high-bandwidth memory or other forms of memory. In addition to the memory, sparse accelerators may also include wide vector units that may enable element-wise operations like ReLU, sigmoid, hyperbolic tangent (tanh), and similar operations. The sparse accelerators described herein may be primarily focused on embedding and memory operations of Al model training and/or inference.
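
As a purely illustrative sketch of the element-wise operations such wide vector units may enable, NumPy stands in for those units here:

    import numpy as np

    x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
    relu = np.maximum(x, 0.0)           # rectified linear unit (ReLU)
    sigmoid = 1.0 / (1.0 + np.exp(-x))  # sigmoid
    tanh = np.tanh(x)                   # hyperbolic tangent (tanh)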

[0084] This may effectively disaggregate an Al training and/or inference problem into two portions (e.g., dense and/or sparse), thus enabling the development of a design that may include hardware specifically built to efficiently execute each training function. In this new approach, the dense and sparse portions may scale independently of each other as may be beneficial for a particular Al operation and/or model. For example, as illustrated in FIG. 5 above, an example disaggregated Al operation accelerator may include more or fewer dense resources and/or more or fewer sparse resources. The amounts of each resource may be based on the needs for efficient, beneficial, and/or appropriate training of and/or inferences via a particular Al model. This flexibility may further enable efficient scaling of Al infrastructures, particularly Al training and/or inference infrastructures dealing with large amounts of Al training and/or inference requests and/or extensive available Al training and/or inference data.
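
As a purely illustrative example of such independent scaling, a deployment description might specify the amounts of dense and sparse resources separately; the configuration class and field names below are hypothetical:

    from dataclasses import dataclass

    @dataclass
    class DisaggregatedAcceleratorConfig:
        dense_accelerators: int   # scaled for matrix/tensor-heavy workloads
        sparse_accelerators: int  # scaled for embedding/memory-heavy workloads

    # e.g., an embedding-heavy model might warrant more sparse resources
    config = DisaggregatedAcceleratorConfig(dense_accelerators=2, sparse_accelerators=6)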

[0085] Example Embodiments

[0086] Example 1: A system comprising (1) a disaggregated artificial intelligence (Al) operation accelerator comprising: (A) a dense Al operation accelerator configured to accelerate dense Al operations, and (B) a sparse Al operation accelerator, physically separate from the dense Al operation accelerator, configured to accelerate sparse Al operations, (2) a scheduler comprising: (A) a receiving module that receives an Al operation, (B) an identifying module that identifies the Al operation as at least one of a dense Al operation or a sparse Al operation, and (C) a directing module that directs: (i) the dense Al operation accelerator to accelerate the Al operation when the identifying module identifies it as a dense Al operation, and (ii) the sparse Al operation accelerator to accelerate the Al operation when the identifying module identifies it as a sparse Al operation, and (3) a physical processor that executes the receiving module, the identifying module, and the directing module.

[0087] Example 2: The system of example 1, wherein (1) the system further comprises an additional dense Al operation accelerator, and (2) when the identifying module identifies the Al operation as a dense Al operation, the directing module directs at least one of the dense Al operation accelerator or the additional dense Al operation accelerator to accelerate the Al operation.

[0088] Example 3: The system of any of examples 1-2, wherein (1) the system further comprises an additional sparse Al operation accelerator, and (2) when the identifying module identifies the Al operation as a sparse Al operation, the directing module directs at least one of the sparse Al operation accelerator or the additional sparse Al operation accelerator to accelerate the sparse Al operation.

[0089] Example 4: The system of any of examples 1-3, the disaggregated Al operation accelerator further comprising a high-bandwidth bus that communicatively couples the dense Al operation accelerator and the sparse Al operation accelerator.

[0090] Example 5: The system of any of examples 1-4, the dense Al operation accelerator comprising (1) at least one of a wide matrix unit or a tensor unit, and (2) a memory cache local to the dense Al operation accelerator and associated with at least one of the wide matrix unit or the tensor unit.

[0091] Example 6: The system of example 5, wherein (1) the identifying module identifies the Al operation as a dense Al operation, and (2) the directing module directs the dense Al operation accelerator to accelerate the dense Al operation by (A) loading a set of Al operation data into the memory cache local to the dense Al operation accelerator, and (B) directing the dense Al operation accelerator to execute the dense Al operation using the set of Al operation data loaded into the memory cache local to the dense Al operation accelerator.

[0092] Example 7: The system of any of examples 1-6, the sparse Al operation accelerator comprising (1) a general-purpose compute unit, and (2) a high-bandwidth memory local to the sparse Al operation accelerator.

[0093] Example 8: The system of example 7, wherein (1) the identifying module identifies the Al operation as a sparse Al operation, and (2) the directing module directs the sparse Al operation accelerator to accelerate the sparse Al operation by (A) loading a set of Al operation data into the high-bandwidth memory local to the sparse Al operation accelerator, and (B) directing the sparse Al operation accelerator to execute the sparse Al operation using the set of Al operation data loaded into the high-bandwidth memory local to the sparse Al operation accelerator.

[0094] Example 9: The system of any of examples 7-8, the sparse Al operation accelerator further comprising at least one wide vector unit.

[0095] Example 10: The system of any of examples 1-9, wherein the sparse Al operation accelerator is configured to execute an element-wise Al operation.

[0096] Example 11: The system of example 10, wherein the element-wise Al operation comprises at least one of (1) a rectified linear unit (ReLU) operation, (2) a sigmoid operation, or (3) a hyperbolic tangent (tanh) operation.

[0097] Example 12: The system of any of examples 1-11, wherein the Al operation comprises at least one of (1) an Al training operation, or (2) an Al inference operation.

[0098] Example 13: A computer-implemented method comprising (1) receiving, by a scheduler included in a disaggregated artificial intelligence (Al) operation accelerator, an Al operation, (2) identifying, by the scheduler included in the disaggregated Al operation accelerator, the Al operation as at least one of a dense Al operation or a sparse Al operation, and (3) directing, by the scheduler included in the disaggregated Al operation accelerator, (A) a dense Al operation accelerator, included in the disaggregated Al operation accelerator and configured to accelerate dense Al operations, to accelerate the Al operation when the scheduler identifies the Al operation as a dense Al operation, and (B) a sparse Al operation accelerator, included in the disaggregated Al operation accelerator but physically separate from the dense Al operation accelerator and configured to accelerate sparse Al operations, to accelerate the Al operation when the scheduler identifies the Al operation as a sparse Al operation.

[0099] Example 14: The method of example 13, wherein (1) identifying the Al operation comprises identifying the Al operation as a dense Al operation, and (2) directing the dense Al operation accelerator to accelerate the Al operation comprises (A) loading a set of Al operation data into a memory cache local to the dense Al operation accelerator, the memory cache associated with at least one of a wide matrix unit included in the dense Al operation accelerator or a tensor unit included in the dense Al operation accelerator, and (B) directing the dense Al operation accelerator to execute the Al operation using the set of Al operation data loaded into the memory cache local to the dense Al operation accelerator.

[0100] Example 15: The method of any of examples 13-14, wherein (1) identifying the Al operation comprises identifying the Al operation as a sparse Al operation, and (2) directing the sparse Al operation accelerator to accelerate the Al operation comprises (A) loading a set of Al operation data into a high-bandwidth memory local to the sparse Al operation accelerator, and (B) directing the sparse Al operation accelerator to execute the Al operation using the set of Al operation data loaded into the high-bandwidth memory local to the sparse Al operation accelerator.

[0101] Example 16: The method of any of examples 13-15, wherein (1) the Al operation comprises a set of Al operation data, (2) identifying the Al operation comprises (A) determining whether the set of Al operation data meets a threshold density value, (B) when the set of Al operation data meets the threshold density value, designating the Al operation as a dense Al operation, and (C) when the set of Al operation data does not meet the threshold density value, designating the Al operation as a sparse Al operation.

[0102] Example 17: The method of any of examples 13-16, wherein (1) the Al operation comprises a set of Al operation parameters, (2) identifying the Al operation comprises (A) determining whether the set of Al operation parameters corresponds to a dense Al operation, (B) when the set of Al operation parameters corresponds to a dense Al operation, designating the Al operation as a dense Al operation, and (C) when the set of Al operation parameters corresponds to a sparse Al operation, designating the Al operation as a sparse Al operation.

[0103] Example 18: A non-transitory computer-readable medium comprising computer-readable instructions that, when executed by at least one processor of a scheduler included in a disaggregated artificial intelligence (Al) operation accelerator, cause the scheduler to (1) receive an Al operation, (2) identify the Al operation as at least one of a dense Al operation or a sparse Al operation, and (3) direct (A) a dense Al operation accelerator, included in the disaggregated Al operation accelerator and configured to accelerate dense Al operations, to accelerate the Al operation when it is identified as a dense Al operation, and (B) a sparse Al operation accelerator, included in the disaggregated Al operation accelerator but physically separate from the dense Al operation accelerator and configured to accelerate sparse Al operations, to accelerate the Al operation when it is identified as a sparse Al operation.

[0104] Example 19: The non-transitory computer-readable medium of example 18, further comprising computer-readable instructions that, when executed by the at least one processor of the scheduler, cause the scheduler to (1) identify the Al operation as a dense Al operation, and (2) direct the dense Al operation accelerator to accelerate the dense Al operation by (A) loading a set of Al operation data into a memory cache local to the dense Al operation accelerator, the memory cache associated with at least one of a wide matrix unit included in the dense Al operation accelerator or a tensor unit included in the dense Al operation accelerator, and (B) directing the dense Al operation accelerator to execute the dense Al operation using the set of Al operation data loaded into the memory cache local to the dense Al operation accelerator.

[0105] Example 20: The non-transitory computer-readable medium of any of examples 18-19, further comprising computer-readable instructions that, when executed by the at least one processor of the scheduler, cause the scheduler to (1) identify the Al operation as a sparse Al operation, (2) direct the sparse Al operation accelerator to accelerate the sparse Al operation by (A) loading a set of Al operation data into a high-bandwidth memory local to the sparse Al operation accelerator, and (B) directing the sparse Al operation accelerator to execute the sparse Al operation using the set of Al operation data loaded into the high-bandwidth memory local to the sparse Al operation accelerator.

[0106] As detailed above, the computing devices and systems described and/or illustrated herein broadly represent any type or form of computing device or system capable of executing computer-readable instructions, such as those contained within the modules described herein. In their most basic configuration, these computing device(s) may each include at least one memory device and at least one physical processor.

[0108] Although illustrated as separate elements, the modules described and/or illustrated herein may represent portions of a single module or application. In addition, in certain embodiments one or more of these modules may represent one or more software applications or programs that, when executed by a computing device, may cause the computing device to perform one or more tasks. For example, one or more of the modules described and/or illustrated herein may represent modules stored and configured to run on one or more of the computing devices or systems described and/or illustrated herein. One or more of these modules may also represent all or portions of one or more special-purpose computers configured to perform one or more tasks.

[0109] In addition, one or more of the modules described herein may transform data, physical devices, and/or representations of physical devices from one form to another. For example, one or more of the modules recited herein may receive Al training data to be transformed, transform the Al training data, output a result of the transformation to use (e.g., make inferences regarding input data using) a trained Al model, use the result of the transformation to make a prediction using a trained Al model, and store the result of the transformation to revise and/or refine a training of a trained Al model. Additionally or alternatively, one or more of the modules recited herein may transform a processor, volatile memory, non-volatile memory, and/or any other portion of a physical computing device from one form to another by executing on the computing device, storing data on the computing device, and/or otherwise interacting with the computing device.

[0110] The term "computer-readable medium," as used herein, generally refers to any form of device, carrier, or medium capable of storing or carrying computer-readable instructions. Examples of computer-readable media include, without limitation, transmission-type media, such as carrier waves, and non-transitory-type media, such as magnetic-storage media (e.g., hard disk drives, tape drives, and floppy disks), optical-storage media (e.g., Compact Disks (CDs), Digital Video Disks (DVDs), and BLU-RAY disks), electronic-storage media (e.g., solid-state drives and flash media), and other distribution systems.

[0111] The process parameters and sequence of the steps described and/or illustrated herein are given by way of example only and can be varied as desired. For example, while the steps illustrated and/or described herein may be shown or discussed in a particular order, these steps do not necessarily need to be performed in the order illustrated or discussed. The various exemplary methods described and/or illustrated herein may also omit one or more of the steps described or illustrated herein or include additional steps in addition to those disclosed.

[0112] The preceding description has been provided to enable others skilled in the art to best utilize various aspects of the exemplary embodiments disclosed herein. This exemplary description is not intended to be exhaustive or to be limited to any precise form disclosed. Many modifications and variations are possible without departing from the scope of the instant disclosure. The embodiments disclosed herein should be considered in all respects illustrative and not restrictive. Reference should be made to the appended claims in determining the scope of the instant disclosure.

[0113] Unless otherwise noted, the terms "connected to" and "coupled to" (and their derivatives), as used in the specification and claims, are to be construed as permitting both direct and indirect (i.e., via other elements or components) connection. In addition, the terms "a" or "an," as used in the specification and claims, are to be construed as meaning "at least one of." Finally, for ease of use, the terms "including" and "having" (and their derivatives), as used in the specification and claims, are interchangeable with and have the same meaning as the word "comprising."