

Title:
HOST ENDPOINT ADAPTIVE COMPUTE COMPOSABILITY
Document Type and Number:
WIPO Patent Application WO/2024/043951
Kind Code:
A1
Abstract:
Embodiments herein describe a processor system that includes an integrated, adaptive accelerator. In one embodiment, the processor system includes multiple core complex chiplets that each contain one or more processing cores for a host CPU. In addition, the processor system includes an accelerator chiplet. The processor system can assign one or more of the core complex chiplets to the accelerator chiplet to form an IO device while the remaining core complex chiplets form the CPU for the host. In this manner, rather than the accelerator and the CPU having independent compute resources, the accelerator can be integrated into the processor system of the host so that hardware resources can be divided between the CPU and the accelerator depending on the needs of the particular application(s) executed by the host.

Inventors:
DASTIDAR JAIDEEP (US)
MITTAL MILLIND (US)
Application Number:
PCT/US2023/019312
Publication Date:
February 29, 2024
Filing Date:
April 20, 2023
Assignee:
XILINX INC (US)
International Classes:
G06F9/50; G06F15/78
Foreign References:
US20200136906A12020-04-30
US11100028B12021-08-24
Attorney, Agent or Firm:
TABOADA, Keith (US)
Claims:
CLAIMS

What is claimed is:

1. A processor system in a host, comprising: a substrate; a plurality of core complex chiplets each comprising at least one processor core; an accelerator chiplet; and a composable agent configured to assign at least one of the plurality of core complex chiplets to the accelerator chiplet to form an IO device while the remaining plurality of core complex chiplets form a central processing unit (CPU) for the host.

2. The processor system of claim 1, wherein the plurality of core complex chiplets are duplicate integrated circuits, wherein the accelerator chiplet is implemented using an integrated circuit that is different from the duplicate integrated circuits.

3. The processor system of claim 2, wherein the accelerator chiplet does not include any processor cores.

4. The processor system of claim 1, further comprising: an interconnect disposed on the substrate and implemented using an integrated circuit separate from the plurality of core complex chiplets and the accelerator chiplet, wherein the interconnect is configured to permit each of the plurality of core complex chiplets to communicate with each other and for the plurality of core complex chiplets to communicate with the accelerator chiplet.

5. The processor system of claim 1, further comprising: an interconnect configured to permit each of the plurality of core complex chiplets to communicate with each other and for the plurality of core complex chiplets to communicate with the accelerator chiplet, wherein the interconnect and the accelerator chiplet are part of a same integrated circuit.

6. The processor system of claim 1, wherein, after assigning the at least one of the plurality of core complex chiplets to the accelerator chiplet, the composable agent is configured to reassign the at least one of the plurality of core complex chiplets to the CPU such that the at least one of the plurality of core complex chiplets is no longer part of the IO device.

7. The processor system of claim 1, wherein the composable agent is configured to assign an additional one of the plurality of core complex chiplets to the IO device such that the IO device includes multiple ones of the plurality of core complex chiplets.

8. A processor system, comprising: a plurality of core complex chiplets each comprising at least one processor core; an accelerator chiplet; and an interconnect connecting the plurality of core complex chiplets to each other and to the accelerator chiplet, the interconnect comprising: a composable agent configured to assign at least one of the plurality of core complex chiplets to the accelerator chiplet to form an IO device while the remaining plurality of core complex chiplets form a central processing unit (CPU) for a host.

9. The processor system of claim 8, wherein the plurality of core complex chiplets are duplicate integrated circuits, wherein the accelerator chiplet is implemented using an integrated circuit that is different from the duplicate integrated circuits, and wherein the interconnect is implemented using an integrated circuit separate from the plurality of core complex chiplets.

10. The processor system of claim 8, wherein the accelerator chiplet does not include any processor cores.

11. The processor system of claim 8, wherein the interconnect includes first circuitry that supports interrupt semantics used by the plurality of core complex chiplets forming the CPU and interrupt semantics used by the at least one of the plurality of core complex chiplets assigned to the IO device, and second circuitry that supports memory accesses used by the plurality of core complex chiplets forming the CPU and memory accesses used by the plurality of core complex chiplets assigned to the IO device.

12. A method, comprising: selecting at least one of a plurality of core complex chiplets to assign to an IO device while the remaining ones of the plurality of core complex chiplets are assigned to a CPU of a host; removing the selected core complex chiplet as a peer of the remaining ones of the plurality of core complex chiplets; and adding the selected core complex chiplet to the IO device, wherein the plurality of core complex chiplets are disposed on a same substrate as an accelerator chiplet also assigned to the IO device.

13. The method of claim 12, wherein the plurality of core complex chiplets are duplicate integrated circuits, wherein the accelerator chiplet is implemented using an integrated circuit that is different from the duplicate integrated circuits; wherein each of the plurality of core complex chiplets contains at least one processor core, wherein the accelerator chiplet does not include any processor cores; wherein the accelerator chiplet, in combination with the at least one of the plurality of core complex chiplets, forms a DPU; and wherein an interconnect is disposed on the substrate and implemented using an integrated circuit separate from the plurality of core complex chiplets and the accelerator chiplet.

14. The method of claim 12, further comprising: determining workloads of the CPU and the IO device, wherein the workloads determine a number of the plurality of core complex chiplets to assign to the IO device and the CPU.

15. The method of claim 14, further comprising: determining that the workloads of at least one of the CPU and the IO device have changed; and adjusting the number of the plurality of core complex chiplets assigned to the IO device.

Description:
HOST ENDPOINT ADAPTIVE COMPUTE COMPOSABILITY

TECHNICAL FIELD

[0001] Examples of the present disclosure generally relate to a processor system that includes processing chiplets that can be assigned to a central processing unit (CPU) of a host or to an accelerator chiplet to form an integrated IO device.

BACKGROUND

[0002] Current acceleration devices (e.g., input/output (IO) devices) such as Data Processing Units (DPUs) include different components such as I/O gateways, processor subsystems, networks on chip (NoCs), storage and data accelerators, data processing engines, and programmable logic (PL). Currently, the DPU is connected to a processor complex of a host using a PCIe connection. The host's processing capabilities and the DPU's embedded processing capabilities are dimensioned independently. For some workloads, the DPU may have much more compute processing than is required to perform its tasks, thereby wasting power and space in the computing system. For other workloads, the DPU may not have enough compute processing to perform its tasks, and becomes a bottleneck.

SUMMARY

[0003] One embodiment describes a computing system that includes a processor system in a host that includes a substrate, a plurality of core complex chiplets each comprising at least one processor core, an accelerator chiplet, and a composable agent configured to assign at least one of the plurality of core complex chiplets to the accelerator chiplet to form an IO device while the remaining plurality of core complex chiplets form a central processing unit (CPU) for the host.

[0004] Another embodiment described herein is a processor system that includes a plurality of core complex chiplets each comprising at least one processor core, an accelerator chiplet, and an interconnect connecting the plurality of core complex chiplets to each other and to the accelerator chiplet. The interconnect includes a composable agent configured to assign at least one of the plurality of core complex chiplets to the accelerator chiplet to form an IO device while the remaining plurality of core complex chiplets form a central processing unit (CPU) for a host.

[0005] Another embodiment described herein is a method that includes selecting at least one of a plurality of core complex chiplets to assign to an IO device while the remaining ones of the plurality of core complex chiplets are assigned to a CPU of a host, removing the selected core complex chiplet as a peer of the remaining ones of the plurality of core complex chiplets, and adding the selected core complex chiplet to the IO device, wherein the plurality of core complex chiplets are disposed on a same substrate as an accelerator chiplet also assigned to the IO device.

BRIEF DESCRIPTION OF DRAWINGS

[0006] So that the manner in which the above recited features can be understood in detail, a more particular description, briefly summarized above, may be had by reference to example implementations, some of which are illustrated in the appended drawings. It is to be noted, however, that the appended drawings illustrate only typical example implementations and are therefore not to be considered limiting of its scope.

[0007] Fig. 1 illustrates a processor system that includes an integrated, adaptive accelerator, according to an embodiment.

[0008] Fig. 2 is a flowchart for adding processing cores to the integrated accelerator, according to an embodiment.

[0009] Fig. 3 is a block diagram of a processor system, according to an embodiment.

[0010] Fig. 4 illustrates a processor system that includes an integrated, adaptive accelerator, according to an embodiment.

[0011] Fig. 5 illustrates a processor system that includes an integrated, adaptive accelerator, according to an embodiment.

[0012] Fig. 6 illustrates a processor system that includes multiple integrated, adaptive accelerators, according to an embodiment.

[0013] To facilitate understanding, identical reference numerals have been used, where possible, to designate identical elements that are common to the figures. It is contemplated that elements of one example may be beneficially incorporated in other examples.

DETAILED DESCRIPTION

[0014] Various features are described hereinafter with reference to the figures. It should be noted that the figures may or may not be drawn to scale and that the elements of similar structures or functions are represented by like reference numerals throughout the figures. It should be noted that the figures are only intended to facilitate the description of the features. They are not intended as an exhaustive description of the disclosure or as a limitation on the scope of the claims. In addition, an illustrated example need not have all the aspects or advantages shown. An aspect or an advantage described in conjunction with a particular example is not necessarily limited to that example and can be practiced in any other examples even if not so illustrated, or if not so explicitly described.

[0015] Embodiments herein describe a processor system that includes an integrated, adaptive accelerator. In one embodiment, the processor system includes multiple core complex chiplets that each contain one or more processing cores for a host CPU. In addition, the processor system includes an accelerator chiplet (e.g., a SmartNIC, database accelerator, artificial intelligence (AI) accelerator, graphics processing unit (GPU), etc.). The processor system can assign one or more of the core complex chiplets to the accelerator chiplet to form an IO device while the remaining core complex chiplets form the CPU for the host. For example, the number of core complex chiplets assigned to the accelerator can be determined in response to whether the applications executed by the host will require more compute resources from the CPU or from the accelerator. For accelerator intensive applications, more core complex chiplets may be assigned to the accelerator and the IO device, while for CPU intensive applications, fewer (or none) of the core complex chiplets are assigned to the accelerator.
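The following is a minimal illustrative sketch, not part of the disclosure, of the partitioning decision described above: given an expected split of compute demand between the accelerator and the CPU, decide how many core complex chiplets to dedicate to the IO device. The names, the workload model, and the thresholds are assumptions made only for illustration.

```python
# Hypothetical sketch of workload-driven chiplet partitioning (all names and
# values are illustrative assumptions, not taken from the disclosure).
from dataclasses import dataclass

@dataclass
class WorkloadProfile:
    accelerator_share: float  # fraction of expected compute demand on the accelerator (0.0-1.0)

def partition_chiplets(total_chiplets: int, profile: WorkloadProfile) -> tuple[int, int]:
    """Return (chiplets_for_io_device, chiplets_for_cpu)."""
    io_count = round(total_chiplets * profile.accelerator_share)
    # Always leave at least one chiplet to form the host CPU.
    io_count = min(io_count, total_chiplets - 1)
    return io_count, total_chiplets - io_count

# Example: 7 core complex chiplets, with roughly 30% of the expected demand on the accelerator.
io, cpu = partition_chiplets(7, WorkloadProfile(accelerator_share=0.3))
print(f"IO device gets {io} chiplets, CPU keeps {cpu}")
```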

[0016] In this manner, rather than the accelerator and the CPU having independent compute resources, the accelerator can be integrated into the processor system of the host so that hardware resources can be divided between the CPU and the accelerator depending on the needs of the particular application(s) executed by the host.

[0017] Fig. 1 illustrates a processor system 100 that includes an integrated, adaptive accelerator, according to an embodiment. In this example, the processor system 100 includes a substrate 105 on which multiple core complex chiplets 110, an interconnect 120, and an accelerator chiplet 115 are disposed. In one embodiment, the core complex chiplets 110, the interconnect 120, and the accelerator chiplet 115 are individual integrated circuits (ICs). Each of the core complex chiplets 110 and the accelerator chiplet 115 can be connected to the interconnect 120 using traces in the substrate 105. That is, the interconnect 120 enables the core complex chiplets 110 to communicate with each other, as well as permitting the chiplets 110 to communicate with the accelerator chiplet 115. However, in other embodiments, the core complex chiplets 110 may also have direct connections to each other.

[0018] The interconnect 120 permits the core complex chiplets 110 to be assigned with others of the chiplets 110 to form a CPU 140 for the host, or be assigned to the accelerator chiplet 115 to form an IO device 130. That is, each core complex chiplet 110 can be assigned as a peer with the other chiplets 110 to form the CPU 140 or can be severed from the other chiplets 110 to be assigned to the IO device 130. When a core complex chiplet 110 is assigned to the IO device 130, the coherent connections between it and the other chiplets 110 are severed so it is no longer treated as a peer. Instead, the core complex chiplet 110 may no longer share the same memory space, or run the same operating system, as the other chiplets 110 in the CPU 140, and may rely on different interrupt semantics. This is discussed in more detail in Fig. 3.

[0019] In one embodiment, the core complex chiplets 110 are duplicate ICs - e.g., have the same number of processing cores and other hardware circuitry. However, the accelerator chiplet 115 may be different than the core complex chiplets 110. For example, the accelerator chiplet 115 may include specialized circuitry for performing an accelerator task such as a network interface (if the accelerator chiplet 115 is a DPU or a database accelerator), specialized data processing engines (e.g., for performing network or AI tasks), programmable logic for additional user configurability, a host interface for managing memory requests and interrupts with the interconnect 120, and the like. In one embodiment, the accelerator chiplet 115 does not include an embedded processor (e.g., a processor core). That is, in systems where the accelerator chiplet 115 and IO device 130 are not integrated into the processor system 100 of the host, these accelerators can include their own processor cores (e.g., general purpose processing units) as well as specialized accelerator engines. However, because the processing system 100 can assign one or more of the core complex chiplets 110 to the accelerator chiplet 115, the accelerator chiplet 115 no longer needs its own processor system, thereby saving space and power in the processor system 100. However, in other embodiments, it may be advantageous to still include a processor core in the accelerator chiplet 115.

[0020] The interconnect 120 includes a composable agent 125 that assigns the core complex chiplets 110 to the accelerator chiplet 115. In this embodiment, the agent 125 has assigned the core complex chiplets 110A and 110C to the accelerator chiplet 115 to form the IO device 130. In contrast, the composable agent 125 has assigned the core complex chiplets 110B and 110D-G to form the CPU 140. In one embodiment, the composable agent 125 makes this assignment when the computing system is booting.

[0021] In one embodiment, this assignment can be changed. For example, if the workload changes and the IO device 130 no longer needs as many compute resources to perform its acceleration tasks, the composable agent 125 can remove the core complex chiplet 110C from the IO device 130 and reassign it to the CPU 140. In this manner, each of the core complex chiplets 110 can either be assigned to the integrated IO device 130 or the CPU 140.

[0022] Further, having a separate interconnect 120 IC as shown is not necessary. In other embodiments, the functions performed by the interconnect 120 may be performed using hardware on the core complex chiplets 110 and the accelerator chiplet 115. The substrate 105 could include a network of interconnects for selectively connecting the core complex chiplets 110 and the accelerator chiplet 115. For example, portions of the interconnect 120 can also be distributed across each core complex chiplet 110, with the connections in the substrate 105 achieving the multi-core-complex connectivity to create the CPU 140.

[0023] In another embodiment, the accelerator chiplet 115, or the circuitry in the chiplet, is part of the interconnect 120. That is, the accelerator chiplet 115 can be integrated into the same IC as the interconnect 120 rather than having separate ICs as shown in Fig. 1. Or stated oppositely, the interconnect 120 (which can include the composable agent 125 and the other circuitry shown in Fig. 1) can be part of the same IC as the accelerator chiplet 115.

[0024] Fig. 2 is a flowchart of a method 200 for adding processing cores to the integrated accelerator, according to an embodiment. At block 205, the workloads of the CPU and the accelerator are determined. This determination may be performed by a user application executed on the host, a system administrator, or the processor system. For example, the composable agent may rely on historical data to determine the workloads of the CPU and the accelerator. Or the composable agent may determine what type of user applications are, or will be, executed on the host, which can then be used to estimate the workloads of the CPU and the accelerator when executing those applications. Alternatively, a silicon manufacturer may create different sets of products, each with a different mix of processor cores as part of the IO device 130 and a static configuration of the composable agent 125, with those different sets of products each targeting different workloads with different compute resources for the CPU 140.

[0025] For example, a control plane heavy workload, such as a disaggregated storage workload, can require significantly more processing than a data plane heavy workload, where the accelerator performs the bulk of the computation and the processor cores in the CPU are lightly loaded. By determining (or estimating) the workloads on the CPU and the accelerator, the composable agent can determine how many compute resources each will need.
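As a purely illustrative sketch of the distinction drawn above, the mapping below shows how a control plane heavy versus data plane heavy classification might translate into a chiplet budget for the IO device. The categories and counts are assumptions, not values from the disclosure.

```python
# Hypothetical mapping from workload class to the number of core complex
# chiplets assigned to the IO device (illustrative values only).
IO_CHIPLET_BUDGET = {
    "control_plane_heavy": 2,   # e.g., disaggregated storage: the IO device needs more processing
    "data_plane_heavy": 0,      # accelerator engines do the bulk of the work
    "balanced": 1,
}

def io_chiplets_for(workload_class: str) -> int:
    # Default to one chiplet for unknown workload classes.
    return IO_CHIPLET_BUDGET.get(workload_class, 1)

print(io_chiplets_for("control_plane_heavy"))  # -> 2
```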

[0026] At block 210, the composable agent selects at least one of the core complex chiplets to assign to the IO device. That is, the composable agent selects at least one core complex chiplet to add to the accelerator chiplet to form the IO device. The number of core complex chiplets to assign to the IO device can depend on the workload for the IO device. For heavier workloads, additional core complex chiplets can be assigned to the IO device. For lighter loads, fewer (or no) core complex chiplets are assigned to the IO device.

[0027] At block 215, the composable agent removes the selected core complex chiplet as a peer of the remaining core complex chiplets. For example, the agent may deactivate connections between the selected core complex chiplet and the chiplets used to form the CPU. These connections may be cache-coherent connections and inter-processor interrupt connections involving the selected core complex chiplet. Further, the hardware in the selected core complex chiplet may be reconfigured so it behaves as a processor core in an IO device rather than a peer processor in the CPU. For example, the selected core complex chiplet may issue read and write memory requests to an IO memory management unit (MMU) in the interconnect. Further, the selected core complex chiplet may use message signaled interrupts (MSI) to signal an interrupt to the host processing system (e.g., the interconnect) even though the selected core complex chiplet is physically part of the host processing system. The MSI may be generated by the core complex, or the interconnect may translate the core complex interrupt into an MSI to the host processing system. However, the MSI is just one example of a suitable IO interrupt protocol that can be used.

[0028] At block 220, the composable agent adds the selected core complex chiplet to the IO device. For example, the interconnect may establish communication paths between the selected core complex chiplet and the accelerator chiplet such that the core complex chiplet functions like a processor subsystem that is integrated into the accelerator. With the established communication paths, the core complex chiplet may run an operating system independent from the operating system running on the CPU. The operating system's resources may now include the accelerator chiplet using the established communication paths between the selected core complex chiplet and the accelerator chiplet.
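The sketch below models the ordering of the reassignment steps in blocks 215 and 220: sever the chiplet's coherent peer links, switch it to IO-device semantics (IO MMU accesses, MSI interrupts), then attach it to the accelerator. The Chiplet class and its methods are hypothetical and exist only to show the sequence, not any actual hardware or firmware interface.

```python
# Hypothetical model of moving a core complex chiplet from the CPU domain to
# the IO device (blocks 215 and 220); illustrative only.
class Chiplet:
    def __init__(self, name: str):
        self.name = name
        self.domain = "cpu"   # "cpu" (coherent peer) or "io" (part of the IO device)

    def sever_coherent_links(self):
        # Block 215: deactivate cache-coherent and inter-processor interrupt
        # connections to the remaining CPU chiplets.
        print(f"{self.name}: coherent/IPI links to CPU peers disabled")

    def use_io_semantics(self):
        # Block 215: route memory requests through the IO MMU and raise
        # interrupts as MSIs instead of peer core interrupts.
        print(f"{self.name}: memory via IO MMU, interrupts via MSI")

    def attach_to_accelerator(self, accelerator: str):
        # Block 220: establish communication paths so the chiplet acts as the
        # accelerator's processor subsystem, running its own operating system.
        self.domain = "io"
        print(f"{self.name}: attached to {accelerator}, booting an independent OS")

def move_to_io_device(chiplet: Chiplet, accelerator: str):
    chiplet.sever_coherent_links()
    chiplet.use_io_semantics()
    chiplet.attach_to_accelerator(accelerator)

move_to_io_device(Chiplet("core_complex_0"), "accelerator_chiplet")
```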

[0029] At block 225, the method 200 determines whether the workload has changed. This could be the workload of the CPU or the workload on the IO device. For example, the workload on the CPU may have increased such that the composable agent determines to move a core complex chiplet previously assigned to the IO device to the CPU. Or the workload on the IO device may have increased such that the composable agent moves a core complex chiplet previously assigned to the CPU to the IO device.

[0030] At block 230, the composable agent adjusts the number of core complex chiplets assigned to the IO device. This can include adding more core complex chiplets to the IO device, or removing a core complex chiplet from the IO device. This adjustment of the core complex chiplets can occur when rebooting the computing system, or it may be possible to adjust the number of core complex chiplets assigned to the IO device without having to reboot the computing system.
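A short sketch of the rebalancing decision in blocks 225 and 230 follows, under the assumption that utilization figures for the CPU and the IO device are available. The thresholds and the function itself are hypothetical, introduced only to illustrate the adjustment logic.

```python
# Hypothetical rebalancing rule (blocks 225-230): shift one chiplet toward
# whichever side is the bottleneck; thresholds are illustrative assumptions.
def rebalance(cpu_util: float, io_util: float, io_chiplets: int, cpu_chiplets: int):
    """Return the new (io_chiplets, cpu_chiplets) split."""
    if io_util > 0.9 and cpu_util < 0.5 and cpu_chiplets > 1:
        return io_chiplets + 1, cpu_chiplets - 1   # IO device is the bottleneck
    if cpu_util > 0.9 and io_util < 0.5 and io_chiplets > 0:
        return io_chiplets - 1, cpu_chiplets + 1   # CPU is the bottleneck
    return io_chiplets, cpu_chiplets               # leave the split unchanged

print(rebalance(cpu_util=0.4, io_util=0.95, io_chiplets=1, cpu_chiplets=6))  # -> (2, 5)
```

Depending on the implementation, such an adjustment could be applied at the next reboot or, as noted above, without rebooting the computing system.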

[0031] Fig. 3 is a block diagram of a processor system 300, according to an embodiment. Like the processor system 100, the processor system 300 includes core complex chiplets 110, the interconnect 120, and the accelerator chiplet 115. In this example, the core complex chiplets 110 have both coherent interconnects and IO interconnects to the interconnect 120. In one embodiment, the coherent interconnects are used when the core complex chiplets 110 are part of the CPU but the IO interconnects are used when the core complex chiplets 110 are part of the IO device.

[0032] The core complex chiplets 110 include processing cores 305, coherent hardware 310, and IO hardware 315. The coherent hardware 310 includes circuitry or firmware that is used when the core complex chiplet 110 is a peer in the CPU, while the IO hardware 315 includes circuitry or firmware used when the core complex chiplet 110 is part of the IO device. For example, when part of the CPU, the coherent hardware 310 may perform memory read and write requests as a cache coherent peer to the other core complex chiplets 110 forming the CPU. However, when part of the IO domain, the IO hardware 315 submits requests to read and write data to an IO MMU 320 in the interconnect 120, similar to the accelerator chiplet 115 and other IO devices. Moreover, rather than maintaining page tables with the other core complex chiplets 110 forming the CPU, the core complex chiplets 110 may receive page table translations from the IO MMU 320 when part of the IO device and accessing CPU memory. The composable agent 125 may assign at least one of the memory ports 330 as IO domain memory. When part of the IO domain, the core complex chiplets 110 may run an operating system independent from the operating system running on the CPU, and maintain their own page tables mapped to IO domain memory. Further, the IO hardware 315 may support IO device interrupt semantics (e.g., MSI) that are subordinate to the interrupts used by the coherent hardware 310. The interrupt semantics may send interrupts to an interrupt manager 325 in the interconnect 120 that handles the interrupts received from chiplets in the IO device.

[0033] The interconnect 120 includes the composable agent 125, the IO MMU 320, the interrupt manager 325, and the memory ports 330. As discussed above, the composable agent 125 selects and assigns the core complex chiplets 110 to the CPU or the IO device. Moreover, the composable agent 125 can inform the IO MMU 320, the interrupt manager 325, and the memory ports 330 which ones of the core complex chiplets 110 are part of the CPU and which are part of the IO device. These circuits can then use the appropriate techniques to transmit data and handle interrupts.
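The following is an illustrative software model, not the disclosed hardware, of how the interconnect might route memory accesses and interrupts differently depending on whether the issuing chiplet is currently in the CPU domain or the IO domain, as described above. Domain membership, the IO MMU, and the interrupt manager are reduced to plain Python objects.

```python
# Hypothetical routing model for the interconnect: behavior depends on which
# domain the composable agent has placed each chiplet in (illustrative only).
class Interconnect:
    def __init__(self):
        self.domain = {}  # chiplet name -> "cpu" or "io", set by the composable agent

    def assign(self, chiplet: str, domain: str):
        self.domain[chiplet] = domain

    def memory_access(self, chiplet: str, addr: int) -> str:
        if self.domain[chiplet] == "cpu":
            return f"coherent access to 0x{addr:x} as a cache-coherent peer"
        # IO-domain chiplets go through the IO MMU, like any other IO device.
        return f"IO MMU translates and issues access to 0x{addr:x}"

    def interrupt(self, chiplet: str) -> str:
        if self.domain[chiplet] == "cpu":
            return "delivered using the CPU's native core interrupt semantics"
        return "delivered to the interrupt manager as an MSI"

ic = Interconnect()
ic.assign("core_complex_0", "io")
ic.assign("core_complex_1", "cpu")
print(ic.memory_access("core_complex_0", 0x1000))
print(ic.interrupt("core_complex_1"))
```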

[0034] In one embodiment, the interconnect 120 can include programmable logic to manage the transition of the core complex chiplets 110 between the coherent and IO interconnects (or pathways), the transition between homogeneous core interrupts and IO device semantic interrupts (e.g., MSI), and the transition between a homogeneous core MMU and IO device semantics IO translation lookaside buffering (IOTLB) performed by the IO MMU 320. For example, the programmable logic can implement a bridge between the core interrupt semantics used when the core complex chiplets 110 are part of the CPU and the IO message interrupt semantics used when the core complex chiplets 110 are part of the IO device. The programmable logic can also include a bridge for converting between a core page table walk used when the core complex chiplets 110 are part of the CPU and, e.g., IO PCIe address translation service (ATS) used when the core complex chiplets 110 are part of the IO device. This bridge can translate core page table accesses into PCIe ATS messages, and then the data structure received in the ATS response is modified to appear as a native-to-core instruction set architecture (ISA) page table.
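A conceptual sketch of the page-walk-to-ATS bridge described above is shown below. Real PCIe ATS messages and ISA page table formats are far more involved; here both are reduced to simple dictionaries purely to show the two-way translation the bridge performs. Every name and field in this snippet is an assumption made for illustration.

```python
# Hypothetical bridge: translate a core page table walk into an ATS-style
# request and reshape the response to look like a native page table entry.
def bridge_page_walk(virtual_page: int, ats_lookup) -> dict:
    """Issue an ATS-style translation request and return a PTE-like result."""
    ats_request = {"type": "ats_translation_request", "untranslated_addr": virtual_page}
    ats_response = ats_lookup(ats_request)           # stands in for a PCIe ATS completion
    return {                                         # presented as a native-ISA page table entry
        "physical_page": ats_response["translated_addr"],
        "permissions": ats_response["perms"],
        "valid": True,
    }

# Hypothetical ATS backend for the example.
fake_ats = lambda req: {"translated_addr": req["untranslated_addr"] + 0x8000, "perms": "rw"}
print(bridge_page_walk(0x4000, fake_ats))
```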

[0035] The accelerator chiplet 115 includes acceleration circuits 335, a host interface 340, and a network interface 345. The acceleration circuits 335 can include acceleration engines (e.g., data processing engines, crypto engines, compression engines, AI engines, etc.), programmable logic, or combinations thereof. The acceleration circuits 335 can be any circuit that performs acceleration tasks assigned by the host (e.g., the CPU). For example, the core complex chiplets 110 forming the CPU (or the software executed on the chiplets 110) can offload acceleration tasks to the accelerator chiplet 115 using the interconnect 120.

[0036] The host interface 340 enables the accelerator chiplet 115 to communicate with the interconnect 120 via the IO interconnects. The host interface 340 can use a PCIe or Universal Chiplet Interconnect Express (UCIe) connection to communicate with the interconnect 120. The host interface 340 may use a cache coherent protocol to communicate with the interconnect 120 and the core complex chiplets 110 assigned to the CPU, such as Compute Express Link (CXL™) or Cache Coherent Interconnect for Accelerators (CCIX®). While these protocols enable the accelerator chiplet 115 and any core complex chiplets 110 that are assigned to the IO device to be cache coherent with the core complex chiplets 110 forming the CPU, they are nonetheless subordinate to those core complex chiplets 110.

[0037] The network interface 345 enables the accelerator chiplet 115 to communicate with a network (e.g., a local area network (LAN) or a wide area network (WAN)). In this example, the accelerator chiplet 115 may be a DPU or a database accelerator. However, in embodiments where the accelerator is an AI or crypto accelerator that does not communicate with a network, the network interface 345 may be omitted.

[0038] Fig. 4 illustrates a processor system that includes an integrated, adaptive accelerator, according to an embodiment. For simplicity, the processor system in Fig. 4 only illustrates the core complex chiplets 110 and the accelerator chiplets 115, but does not illustrate the interconnect, which may also be present.

[0039] In this example, the processor system includes two accelerator chiplets 115A and 115B. In one embodiment, the accelerator chiplets 115 are the same type of accelerators (e.g., two DPU chiplets or two database accelerators). In one embodiment, the accelerator chiplets 115 may be the same integrated circuits. For example, the processor system may be designed to be used to execute an application that requires a particular accelerator task performed by the accelerator chiplets 115. Thus, the processor system may include two of the accelerator chiplets 115 rather than just one as shown in Fig. 1.

[0040] Like above, one or more of the core complex chiplets 110 can be assigned to the accelerator chiplets 115 to form the IO device 405. In this example, the core complex chiplet 110A is assigned to the accelerator chiplets 115A and 115B while the core complex chiplets 110B-D form the CPU 410 for the host. However, in other embodiments, more than one of the core complex chiplets 110 can be assigned to the IO device 405, or none of the core complex chiplets 110 are assigned to the IO device 405, in which case the accelerator tasks are performed solely by the accelerator chiplets 115A and 115B.

[0041] Fig. 5 illustrates a processor system that includes an integrated, adaptive accelerator, according to an embodiment. For simplicity, the processor system in Fig. 5 only illustrates the core complex chiplets 110 and the accelerator chiplets 115, but does not illustrate the interconnect, which may also be present.

[0042] Like in Fig. 4, here the processor system includes two accelerator chiplets 115A and 115B. In one embodiment, the accelerator chiplets 115 are the same type of accelerators (e.g., two DPU chiplets or two database accelerators). In one embodiment, the accelerator chiplets 115 may be the same integrated circuits. However, unlike in Fig. 4, Fig. 5 illustrates that one of the accelerator chiplets can be assigned to be part of the CPU 510. Thus, Fig. 5 illustrates that in addition to assigning the core complex chiplets 110 to the IO device 505, the accelerator chiplets 115 can be assigned to the CPU 510. For example, the current workload of the IO device 505 may be much less than the workload on the CPU 510. Thus, the composable agent can reassign the accelerator chiplet 115B to the CPU 510 while leaving the accelerator chiplet 115A to perform the functions of the IO device 505.

[0043] The accelerator chiplet 115B can serve as a peer processor to the other core complex chiplets 110 in the processor system. To do so, the accelerator chiplets 115 may also include coherent interconnects to the core complex chiplets 110 as well as hardware that supports the interrupt and memory management protocols used by the core complex chiplets 110 when part of the CPU 510.

[0044] Although not shown in Fig. 5, if the workloads for the IO device 505 and the CPU 510 change, the composable agent can then reassign the accelerator chiplet 115B to the IO device 505. Thus, Fig. 5 illustrates that the accelerator chiplets 115 can be designed to support both coherent and IO modes of operation, like the core complex chiplets 110.

[0045] Fig. 6 illustrates a processor system that includes multiple integrated, adaptive accelerators, according to an embodiment. For simplicity, the processor system in Fig. 6 only illustrates the core complex chiplets 110 and the accelerator chiplets 115, but does not illustrate the interconnect, which may also be present.

[0046] The processor system in Fig. 6 includes accelerator chiplet 605 and accelerator chiplet 610, which are different accelerators (i.e., the chiplets 605 and 610 are different). For example, the accelerator chiplet 605 may be a different type of accelerator than the accelerator chiplet 610. For instance, the accelerator chiplet 605 may be a DPU while the accelerator chiplet 610 is an AI accelerator. Thus, Fig. 6 illustrates that a processor system can include multiple different types of accelerator chiplets.

[0047] Further, the accelerator chiplets can be used to form different IO devices. In this example, the accelerator chiplet 605 and the core complex chiplet 110A form IO device 615 while the accelerator chiplet 610 and the core complex chiplet 110B form IO device 620. The remaining core complex chiplets 110 form the CPU 625 for the host. The composable agent can assign the core complex chiplets 110 in response to the individual workloads of the IO devices. For example, if the IO device 615 has (or is expected to have) a larger workload than the IO device 620, then the composable agent may assign two of the core complex chiplets 110 to the IO device 615. Thus, Fig. 6 illustrates that a processor system can include any number of different types of accelerators to support any number of different types of IO devices. Further, these IO devices can also include any number of the core complex chiplets 110 depending on their workloads.
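A small sketch of a proportional allocation across multiple IO devices follows, under assumed workload inputs. The device names echo the figure (IO device 615 and IO device 620), but the function, its inputs, and the rounding rule are illustrative assumptions rather than anything specified in the disclosure.

```python
# Hypothetical allocation of core complex chiplets across several IO devices
# in proportion to their expected workloads, with the remainder forming the CPU.
def allocate(total_chiplets: int, io_workloads: dict[str, float], cpu_workload: float) -> dict[str, int]:
    total = cpu_workload + sum(io_workloads.values())
    plan = {dev: round(total_chiplets * w / total) for dev, w in io_workloads.items()}
    plan["cpu"] = total_chiplets - sum(plan.values())  # the CPU gets whatever remains
    return plan

# e.g., IO device 615 expects twice the work of IO device 620.
print(allocate(6, {"io_device_615": 2.0, "io_device_620": 1.0}, cpu_workload=3.0))
# -> {'io_device_615': 2, 'io_device_620': 1, 'cpu': 3}
```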

[0048] In the preceding, reference is made to embodiments presented in this disclosure. However, the scope of the present disclosure is not limited to specific described embodiments. Instead, any combination of the described features and elements, whether related to different embodiments or not, is contemplated to implement and practice contemplated embodiments. Furthermore, although embodiments disclosed herein may achieve advantages over other possible solutions or over the prior art, whether or not a particular advantage is achieved by a given embodiment is not limiting of the scope of the present disclosure. Thus, the preceding aspects, features, embodiments and advantages are merely illustrative and are not considered elements or limitations of the appended claims except where explicitly recited in a claim(s).

[0049] As will be appreciated by one skilled in the art, the embodiments disclosed herein may be embodied as a system, method or computer program product. Accordingly, aspects may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, aspects may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.

[0050] Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium is any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus or device.

[0051] A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

[0052] Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

[0053] Computer program code for carrying out operations for aspects of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).

[0054] Aspects of the present disclosure are described below with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments presented in this disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.

[0055] These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.

[0056] The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.

[0057] The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various examples of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.

[0058] The technology disclosed herein can be expressed in the following nonlimiting examples.

[0059] Example 1. A processor system in a host, comprising: a substrate; a plurality of core complex chiplets each comprising at least one processor core; an accelerator chiplet; and a composable agent configured to assign at least one of the plurality of core complex chiplets to the accelerator chiplet to form an IO device while the remaining plurality of core complex chiplets form a central processing unit (CPU) for the host.

[0060] Example 2. The processor system of example 1, wherein the plurality of core complex chiplets are duplicate integrated circuits, wherein the accelerator chiplet is implemented using an integrated circuit that is different from the duplicate integrated circuits.

[0061] Example 3. The processor system of example 2, wherein the accelerator chiplet does not include any processor cores.

[0062] Example 4. The processor system of example 3, wherein the accelerator chiplet, in combination with the at least one of the plurality of core complex chiplets, forms a data processing unit (DPU).

[0063] Example 5. The processor system of example 1, further comprising: an interconnect disposed on the substrate and implemented using an integrated circuit separate from the plurality of core complex chiplets and the accelerator chiplet, wherein the interconnect is configured to permit each of the plurality of core complex chiplets to communicate with each other and for the plurality of core complex chiplets to communicate with the accelerator chiplet.

[0064] Example 6. The processor system of example 5, wherein the interconnect includes first circuitry that supports interrupt semantics used by the plurality of core complex chiplets forming the CPU and interrupt semantics used by the at least one of the plurality of core complex chiplets assigned to the IO device, and second circuitry that supports memory accesses used by the plurality of core complex chiplets forming the CPU and memory accesses used by the at least one of the plurality of core complex chiplets assigned to the IO device.

[0065] Example 7. The processor system of example 1, further comprising: an interconnect configured to permit each of the plurality of core complex chiplets to communicate with each other and for the plurality of core complex chiplets to communicate with the accelerator chiplet, wherein the interconnect and the accelerator chiplet are part of a same integrated circuit.

[0066] Example 8. The processor system of example 1, wherein, after assigning the at least one of the plurality of core complex chiplets to the accelerator chiplet, the composable agent is configured to reassign the at least one of the plurality of core complex chiplets to the CPU such that the at least one of the plurality of core complex chiplets is no longer part of the IO device.

[0067] Example 9. The processor system of example 1, wherein the composable agent is configured to assign an additional one of the plurality of core complex chiplets to the IO device such that the IO device includes multiple ones of the plurality of core complex chiplets.

[0068] Example 10. A processor system, comprising: a plurality of core complex chiplets each comprising at least one processor core; an accelerator chiplet; and an interconnect connecting the plurality of core complex chiplets to each other and to the accelerator chiplet, the interconnect comprising: a composable agent configured to assign at least one of the plurality of core complex chiplets to the accelerator chiplet to form an IO device while the remaining plurality of core complex chiplets form a central processing unit (CPU) for a host.

[0069] Example 11. The processor system of example 10, wherein the plurality of core complex chiplets are duplicate integrated circuits, wherein the accelerator chiplet is implemented using an integrated circuit that is different from the duplicate integrated circuits, and wherein the interconnect is implemented using an integrated circuit separate from the plurality of core complex chiplets.

[0070] Example 12. The processor system of example 10, wherein the accelerator chiplet does not include any processor cores.

[0071] Example 13. The processor system of example 10, wherein the interconnect includes first circuitry that supports interrupt semantics used by the plurality of core complex chiplets forming the CPU and interrupt semantics used by the at least one of the plurality of core complex chiplets assigned to the IO device, and second circuitry that supports memory accesses used by the plurality of core complex chiplets forming the CPU and memory accesses used by the plurality of core complex chiplets assigned to the IO device.

[0072] Example 14. A method, comprising: selecting at least one of a plurality of core complex chiplets to assign to an IO device while the remaining ones of the plurality of core complex chiplets are assigned to a CPU of a host; removing the selected core complex chiplet as a peer of the remaining ones of the plurality of core complex chiplets; and adding the selected core complex chiplet to the IO device, wherein the plurality of core complex chiplets are disposed on a same substrate as an accelerator chiplet also assigned to the IO device.

[0073] Example 15. The method of example 14, wherein the plurality of core complex chiplets are duplicate integrated circuits, wherein the accelerator chiplet is implemented using an integrated circuit that is different from the duplicate integrated circuits.

[0074] Example 16. The method of example 14, wherein each of the plurality of core complex chiplets contains at least one processor core, wherein the accelerator chiplet does not include any processor cores.

[0075] Example 17. The method of example 14, wherein the accelerator chiplet, in combination with the at least one of the plurality of core complex chiplets, forms a DPU.

[0076] Example 18. The method of example 14, wherein an interconnect is disposed on the substrate and implemented using an integrated circuit separate from the plurality of core complex chiplets and the accelerator chiplet.

[0077] Example 19. The method of example 14, further comprising: determining workloads of the CPU and the IO device, wherein the workloads determine a number of the plurality of core complex chiplets to assign to the IO device and the CPU.

[0078] Example 20. The method of example 19, further comprising: determining that the workloads of at least one of the CPU and the IO device have changed; and adjusting the number of the plurality of core complex chiplets assigned to the IO device.

[0079] While the foregoing is directed to specific examples, other and further examples may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow.