Title:
LOCK CIRCUIT FOR COMPETING KERNELS IN A HARDWARE ACCELERATOR
Document Type and Number:
WIPO Patent Application WO/2020/222915
Kind Code:
A1
Abstract:
An example hardware accelerator in a computing system includes a bus interface coupled to a peripheral bus of the computing system; a lock circuit coupled to the bus interface; and a plurality of kernel circuits coupled to the lock circuit and the bus interface; wherein the plurality of kernel circuits provide lock requests to the lock circuit, the lock requests for data stored in system memory of the computing system; wherein the lock circuit is configured to process the lock requests from the plurality of kernel circuits and to issue atomic transactions over the peripheral bus through the bus interface based on the lock requests.

Inventors:
JAIN SUNITA (US)
RAO SWEATHA (US)
Application Number:
PCT/US2020/022035
Publication Date:
November 05, 2020
Filing Date:
March 11, 2020
Assignee:
XILINX INC (US)
International Classes:
G06F9/52
Foreign References:
EP2515294A2 (2012-10-24)
Other References:
YUNLONG XU ET AL: "Lock-based synchronization for GPU architectures", CF' 16 PROCEEDINGS OF THE ACM INTERNATIONAL CONFERENCE ON COMPUTING FRONTIERS, ACM, 2 PENN PLAZA, SUITE 701 NEW YORK NY 10121-0701 USA, 16 May 2016 (2016-05-16), pages 205 - 213, XP058259517, ISBN: 978-1-4503-4128-8, DOI: 10.1145/2903150.2903155
Attorney, Agent or Firm:
TABOADA, Keith (US)
Claims:
CLAIMS

What is claimed is:

1. A hardware accelerator in a computing system, comprising:

a bus interface coupled to a peripheral bus of the computing system;

a lock circuit coupled to the bus interface; and

a plurality of kernel circuits coupled to the lock circuit and the bus interface;

wherein the plurality of kernel circuits provide lock requests to the lock circuit, the lock requests for data stored in system memory of the computing system;

wherein the lock circuit is configured to process the lock requests from the plurality of kernel circuits and to issue atomic transactions over the peripheral bus through the bus interface based on the lock requests.

2. The hardware accelerator of claim 1, wherein the lock circuit is configured to maintain a kernel lock array, and wherein the kernel lock array includes a plurality of elements each having a pending request field and a lock status field.

3. The hardware accelerator of claim 2, wherein the pending request field includes a plurality of entries corresponding to the plurality of kernel circuits, and wherein the lock status field includes a plurality of entries corresponding to the plurality of kernel circuits.

4. The hardware accelerator of claim 3, wherein the lock circuit is configured to, for each of the lock requests, check the lock status field of a corresponding element in the kernel lock array, set a lock in the lock status field of the corresponding element if available, and indicate a pending lock request in the pending request field if a lock is not available.

5. The hardware accelerator of claim 1, wherein the lock circuit is configured to maintain a kernel lock array, and wherein the kernel lock array includes a plurality of elements each having a counter field and a lock status field.

6. The hardware accelerator of claim 5, wherein the counter field includes a value, and wherein the lock status field includes a plurality of entries corresponding to the plurality of kernel circuits.

7. The hardware accelerator of claim 6, wherein the lock circuit is configured to, for each of the lock requests, check the lock status field of a corresponding element in the kernel lock array, set a lock in the lock status field of the corresponding element, and increment the value in the counter field of the corresponding element.

8. A computing system, comprising:

a system memory;

a processor coupled to the system memory;

a peripheral bus coupled to the system memory; and

the hardware accelerator of claim 1 coupled to the peripheral bus.

9. The computing system of claim 8, wherein the lock circuit is configured to maintain a kernel lock array, and wherein the kernel lock array includes a plurality of elements each having a pending request field and a lock status field.

10. A method of managing locks to data stored in memory among a plurality of kernels executing in a hardware accelerator of a computing system, the method comprising:

receiving, at a lock circuit in the hardware accelerator, a lock request from a kernel of the plurality of kernels;

determining whether a lock is held by another kernel of the plurality of kernels;

indicating a pending request for the kernel in response to the lock being held by another kernel; and

issuing, from the lock circuit, an atomic request for the lock over a bus interface of the computing system to obtain the lock in response to the lock not being held by another kernel.

11. The method of claim 10, further comprising:

indicating that the kernel has the lock.

12. The method of claim 10, further comprising:

receiving a lock release request from the kernel at the lock circuit;

determining whether another kernel of the plurality of kernels has a pending lock request;

issuing, from the lock circuit, another atomic request to release the lock over the bus interface of the computing system in response to absence of a pending lock request; and

granting, by the lock circuit, the lock to another kernel of the plurality of kernels in response to presence of a pending lock request.

13. The method of claim 10, wherein the lock circuit is configured to maintain a kernel lock array, and wherein the kernel lock array includes a plurality of elements each having a pending request field and a lock status field.

14. The method of claim 13, wherein the pending request field includes a plurality of entries corresponding to the plurality of kernels, and wherein the lock status field includes a plurality of entries corresponding to the plurality of kernels.

15. The method of claim 10, wherein the lock circuit is configured to maintain a kernel lock array, and wherein the kernel lock array includes a plurality of elements each having a counter field and a lock status field.

Description:
LOCK CIRCUIT FOR COMPETING KERNELS IN A HARDWARE ACCELERATOR

TECHNICAL FIELD

[0001] Examples of the present disclosure generally relate to hardware acceleration in computing systems and, in particular, to a lock circuit for competing kernels in a hardware accelerator.

BACKGROUND

[0002] Hardware acceleration involves the use of hardware to perform some functions more efficiently than software executing on a general-purpose CPU. A hardware accelerator is special-purpose hardware designed to implement hardware acceleration for some application. Example applications include neural networks; video encoding, decoding, and transcoding; network data processing; and the like. Software executing on the computing system interacts with the hardware accelerator through various drivers and libraries. One type of hardware accelerator includes a programmable device and associated circuitry. For example, the programmable device can be a field programmable gate array (FPGA) or a system-on-chip (SOC) that includes FPGA programmable logic among other subsystems, such as a processing system, data processing engine (DPE) array, network-on-chip (NOC), and the like.

[0003] In multiprocessing systems, thread synchronization can be achieved with mutex locks to avoid race conditions. Use of mutexes is common in software environments, where mutual exclusion of shared data is achieved via atomic operations. Protocols such as Peripheral Component Interconnect Express (PCIe) and Cache Coherent Interconnect for Accelerators (CCIX) also provide support for atomic operations, which enables hardware acceleration kernels to obtain locks and compete with software threads. For systems that have multiple acceleration kernels operating in parallel, lock requests to the host computer system by the acceleration kernels can lead to unnecessary peripheral bus utilization and increased contention handling by the host computer. There is a need for a more efficient technique for handling access to shared data by multiple acceleration kernels in a hardware acceleration system.

SUMMARY

[0004] Techniques for providing a lock circuit for competing kernels in a hardware accelerator are described. In an example, a hardware accelerator in a computing system includes: a bus interface coupled to a peripheral bus of the computing system; a lock circuit coupled to the bus interface; and a plurality of kernel circuits coupled to the lock circuit and the bus interface; wherein the plurality of kernel circuits provide lock requests to the lock circuit, the lock requests for data stored in system memory of the computing system; wherein the lock circuit is configured to process the lock requests from the plurality of kernel circuits and to issue atomic transactions over the peripheral bus through the bus interface based on the lock requests.

[0005] In another example, a computing system includes a system memory; a processor coupled to the system memory; a peripheral bus coupled to the system memory; and a hardware accelerator coupled to the peripheral bus. The hardware accelerator includes a bus interface coupled to the peripheral bus; a lock circuit coupled to the bus interface; and a plurality of kernel circuits coupled to the lock circuit and the bus interface; wherein the plurality of kernel circuits provide lock requests to the lock circuit, the lock requests for data stored in the system memory; wherein the lock circuit is configured to process the lock requests from the plurality of kernel circuits and to issue atomic transactions over the peripheral bus through the bus interface based on the lock requests.

[0006] In another example, a method of managing locks to data stored in memory among a plurality of kernels executing in a hardware accelerator of a computing system includes: receiving, at a lock circuit in the hardware accelerator, a lock request from a kernel of the plurality of kernels; determining whether a lock is held by another kernel of the plurality of kernels; indicating a pending request for the kernel in response to the lock being held by another kernel; and issuing, from the lock circuit, an atomic request for the lock over a bus interface of the computing system to obtain the lock in response to the lock not being held by another kernel.

[0007] These and other aspects may be understood with reference to the following detailed description.

BRIEF DESCRIPTION OF THE DRAWINGS

[0008] So that the manner in which the above recited features can be understood in detail, a more particular description, briefly summarized above, may be had by reference to example implementations, some of which are illustrated in the appended drawings. It is to be noted, however, that the appended drawings illustrate only typical example implementations and are therefore not to be considered limiting of its scope.

[0009] Fig. 1A is a block diagram depicting a hardware acceleration system according to an example.

[0010] Fig. 1B is a block diagram depicting an accelerated application according to an example.

[0011] Fig. 1C is a block diagram depicting an acceleration circuit according to an example.

[0012] Fig. 2 is a block diagram depicting a logical view of the computing system of Fig. 1A.

[0013] Fig. 3 is a block diagram depicting an example kernel lock array.

[0014] Fig. 4 is a block diagram depicting another example kernel lock array.

[0015] Fig. 5 is a flow diagram depicting a method of managing lock requests according to an example.

[0016] Fig. 6 is a flow diagram depicting a method of managing lock releases according to an example.

[0017] Fig. 7A is a block diagram depicting a multi-integrated circuit (IC) programmable device according to an example.

[0018] Fig. 7B is a block diagram depicting a programmable IC according to an example.

[0019] Fig. 7C is a block diagram depicting a System-on-Chip (SOC) implementation of a programmable IC according to an example.

[0020] Fig. 7D illustrates a field programmable gate array (FPGA) implementation of a programmable IC according to an example.

[0021] To facilitate understanding, identical reference numerals have been used, where possible, to designate identical elements that are common to the figures. It is contemplated that elements of one example may be beneficially incorporated in other examples.

DETAILED DESCRIPTION

[0022] Various features are described hereinafter with reference to the figures. It should be noted that the figures may or may not be drawn to scale and that the elements of similar structures or functions are represented by like reference numerals throughout the figures. It should be noted that the figures are only intended to facilitate the description of the features. They are not intended as an exhaustive description of the claimed invention or as a limitation on the scope of the claimed invention. In addition, an illustrated example need not have all the aspects or advantages shown. An aspect or an advantage described in conjunction with a particular example is not necessarily limited to that example and can be practiced in any other examples even if not so illustrated or if not so explicitly described.

[0023] Techniques for providing a lock circuit for competing kernels in a hardware accelerator are described. The techniques provide an efficient way of handling atomic operations initiated by multiple hardware acceleration kernels. The lock circuit provides a central contention handling circuit in the integrated circuit (IC) having the hardware acceleration kernels. The lock circuit is responsible for initiating atomic requests over a bus interface to the host computer system in which the shared data structure is stored. This removes the need for the kernel circuits to issue atomic requests directly through the bus interface. As such, the techniques reduce the frequency of atomic transactions over the bus interface, thereby reducing contention at the host. The techniques also benefit performance by allowing different acceleration kernels to execute in parallel when multiple kernels can be granted a lock (based on use-case). These and other aspects are described below with respect to the drawings.

[0024] Fig. 1A is a block diagram depicting a hardware acceleration system 100 according to an example. The hardware acceleration system 100 includes a host computing system 102. The host computing system 102 includes a hardware platform (“hardware 104”) and a software platform (“software 106”) executing on the hardware 104. The hardware 104 includes a processing system 110, system memory 116, storage devices (“storage 118”), and a hardware accelerator 122. The software 106 includes an operating system (OS) 144, an acceleration stack 146, a host application 150, and competing threads 139.

[0025] The processing system 110 includes a microprocessor 112, support circuits 114, and a peripheral bus 115. The microprocessor 112 can be any type of general-purpose central processing unit (CPU), such as an x86-based processor, ARM®-based processor, or the like. The microprocessor 112 can include one or more cores and associated circuitry (e.g., cache memories, memory management units (MMUs), interrupt controllers, etc.). The microprocessor 112 is configured to execute program code that performs one or more operations described herein and which can be stored in the system memory 116 and/or the storage 118. The support circuits 114 include various devices that cooperate with the microprocessor 112 to manage data flow between the microprocessor 112, the system memory 116, the storage 118, the hardware accelerator 122, or any other peripheral device. For example, the support circuits 114 can include a chipset (e.g., a north bridge, south bridge, platform host controller, etc.), voltage regulators, firmware (e.g., a basic input-output system (BIOS)), and the like. The support circuits 114 manage data flow between the microprocessor 112 and the peripheral bus 115, to which various peripherals, such as the hardware accelerator 122, are connected. In some examples, the microprocessor 112 can be a System-in-Package (SiP), System-on-Chip (SOC), or the like, which absorbs all or a substantial portion of the functionality of the chipset (e.g., north bridge, south bridge, etc.). The peripheral bus 115 can implement an expansion bus standard, such as Peripheral Component Interconnect Express (PCIe) or the like.

[0026] The system memory 116 is a device allowing information, such as executable instructions and data, to be stored and retrieved. The system memory 116 can include, for example, one or more random access memory (RAM) modules, such as double-data rate (DDR) dynamic RAM (DRAM). The storage 118 includes local storage devices (e.g., one or more hard disks, flash memory modules, solid state disks, and optical disks) and/or a storage interface that enables the computing system 102 to communicate with one or more network data storage systems. The hardware 104 can include various other conventional devices and peripherals of a computing system, such as graphics cards, universal serial bus (USB) interfaces, and the like.

[0027] In an example, the hardware accelerator 122 includes a programmable device 128 and RAM 126. The hardware accelerator 122 can optionally include a non-volatile memory (NVM) 124. The programmable device 128 can be a field programmable gate array (FPGA) or an SOC having FPGA programmable logic along with other embedded subsystems. The NVM 124 can include any type of non-volatile memory, such as flash memory or the like. The RAM 126 can include DDR DRAM or the like. The RAM 126 can be organized into discrete RAM banks 127, as described further below. The programmable device 128 is coupled to the NVM 124 and the RAM 126. The programmable device 128 is also coupled to the peripheral bus 115 of the processing system 110.

[0028] The OS 144 can be any commodity operating system known in the art, such as Linux®, Microsoft Windows®, Mac OS®, or the like. The acceleration stack 146 includes drivers and libraries that provide application programming interfaces (APIs) to the hardware accelerator 122 for command and control thereof.

[0029] Fig. 1B is a block diagram depicting an accelerated application 180 according to an example. The accelerated application 180 includes the host application 150 and an acceleration circuit 130. The acceleration circuit 130 is programmed in programmable logic (PL) 3 of the programmable device 128 on the hardware accelerator 122. The host application 150 includes software executing on the microprocessor 112 that invokes the acceleration circuit 130 using API calls to the acceleration stack 146 to perform some work. The host application 150 can include neural network, video processing, network processing, or the like type applications that offload some functions to the hardware accelerator 122.

[0030] Fig. 1C is a block diagram depicting the acceleration circuit 130 according to an example. The acceleration circuit 130 includes a bus interface 141, kernels 138, and a lock circuit 140. In particular, the host application 150 calls APIs of the acceleration stack 146 to program kernel circuits (“kernel(s) 138”) in the PL 3 of the programmable device 128. The kernel(s) 138 include compute units for processing data. Once the kernel(s) 138 have been programmed, the host application 150 can access the kernel(s) 138 through the bus interface 141. The kernels 138 can process data stored in the system memory 116 and/or the RAM 126. In particular, the kernels 138 access the system memory 116 through the bus interface 141 and the RAM 126 through memory interfaces of the programmable device 128. The kernels 138 can access data in the system memory 116 in competition with the competing threads 139. Since the system memory 116 is shared between the competing threads 139 and the kernels 138, the acceleration circuit 130 includes a lock circuit 140. The kernels 138 are coupled to the lock circuit 140 and use the lock circuit 140 to issue atomic transactions over the bus interface 141 for acquiring locks to data in the system memory 116. The lock circuit 140 provides a single source of atomic transactions over the bus interface 141, rather than having all the individual kernels 138 issue atomic transactions over the bus interface 141. Operation of the lock circuit 140 is discussed further below.

[0031] In the example, the processing system 110 is shown separate from the hardware accelerator 122. In other examples discussed further below, the processing system 110 and the hardware accelerator 122 can be implemented within the same programmable device (e.g., a programmable device with an embedded processing system). In such case, the processing system 110 can utilize alternative interconnects with the PL 3 for communicating with the acceleration circuit 130, examples of which are described below. Further, in the examples discussed herein, the acceleration circuit 130 is implemented in a programmable device 128. In other examples, the programmable device 128 can be replaced by any integrated circuit (IC), including an application specific integrated circuit (ASIC) in which the acceleration circuit 130 comprises hardened circuitry formed therein. Thus, the lock circuit 140 and mutual exclusion scheme discussed herein apply to acceleration circuits in both programmable devices and ASICs.

[0032] Fig. 2 is a block diagram depicting a logical view of the computing system 102 according to an example. As shown in Fig. 2, the kernels 138 access data 202 in the system memory 116 through the bus interface 141 and the peripheral bus 115. The competing threads 139 execute on the microprocessor 112 and also access the data 202 in the system memory 116. The competing threads 139 use a lock array 204 to control access to the data 202. The lock array 204 is indexed by identifiers for certain portions of the data 202. Before accessing a portion of the data 202, a competing thread 139 checks the lock array 204 using an identifier for the data portion to see if a lock has been set by another thread. If not, the competing thread 139 sets the lock and accesses the data exclusive of the other competing threads 139. The competing threads 139 set and check locks in the lock array 204 using atomic instructions of the microprocessor 112.
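
As an illustration of the check-and-set just described (this code does not appear in the patent; the array size and helper names are assumptions), a competing thread 139 might implement it with a C11 atomic compare-and-swap:

```c
/* Minimal sketch, assuming a word-per-entry lock array 204 in system
 * memory (0 = free, 1 = held). Names and sizes are illustrative. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define LOCK_ARRAY_SIZE 1024            /* assumed number of lockable portions */

static _Atomic uint32_t lock_array[LOCK_ARRAY_SIZE];

/* Try to take the lock guarding the data portion identified by 'index'. */
static bool try_lock(uint32_t index)
{
    uint32_t expected = 0;
    /* Succeeds only if the slot was free; fails if another thread, or the
     * hardware accelerator via a bus atomic, already holds it. */
    return atomic_compare_exchange_strong(&lock_array[index], &expected, 1u);
}

/* Release the lock so other threads (or kernels) can take it. */
static void unlock(uint32_t index)
{
    atomic_store(&lock_array[index], 0u);
}
```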

[0033] The lock circuit 140 maintains a kernel lock array 206. The lock circuit 140 is the central contention handling block for all the kernels 138. Assume N kernels 138, where N is an integer greater than one. The lock circuit 140 maintains the kernel lock array 206 indexed by identifiers for the data 202. In an example, the identifiers are all or a portion of hash values generated from keys to the data 202. Each element in the kernel lock array 206 is (2*N) bits wide, where the lower N bits indicate lock status and the upper N bits indicate pending requests. All of the kernels 138 direct their requests for locks to the lock circuit 140, rather than directly over the peripheral bus 115 through the bus interface 141. The lock circuit 140 is the only circuit that requests locks through the bus interface 141.

[0034] Fig. 3 is a block diagram depicting an example of the kernel lock array 206. In the example, the kernel lock array 206 includes an array index 302. The array index 302 can be any set of identifiers for the data 202, such as all or a portion of hash values derived from keys to the data (e.g., memory addresses or some other keys associated with the data 202). Each entry in the kernel lock array 206 includes pending requests 304 and lock status 306. The pending requests 304 include N bits, one for each of the kernels 138. The lock status 306 includes N bits, one for each of the kernels 138.
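
To make the layout concrete, here is a minimal C model of one array element, assuming N = 8 kernels; the helper names are illustrative, not from the patent:

```c
/* One kernel lock array 206 element for N kernels: bits [N-1:0] are the
 * lock status 306, bits [2N-1:N] are the pending requests 304. */
#include <stdbool.h>
#include <stdint.h>

#define N_KERNELS 8                      /* assumed kernel count (N) */

typedef uint32_t lock_elem_t;            /* 2*N bits used; assumes N <= 16 */

static bool lock_held_by_any(lock_elem_t e)
{
    return (e & ((1u << N_KERNELS) - 1u)) != 0;      /* any status bit set */
}

static lock_elem_t set_lock(lock_elem_t e, unsigned kernel_id)
{
    return e | (1u << kernel_id);                    /* status bit for kernel */
}

static lock_elem_t set_pending(lock_elem_t e, unsigned kernel_id)
{
    return e | (1u << (N_KERNELS + kernel_id));      /* pending bit for kernel */
}
```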

[0035] Fig. 5 is a flow diagram depicting a method 500 of managing lock requests according to an example. Referring to Figs. 2 and 5, the lock circuit 140 operates as follows. On receiving a lock request from a kernel 138 (step 502), which includes the kernel ID and an index value, the lock circuit 140 checks the kernel lock array 206 indexed by the index value to determine if a lock is held by some other kernel 138 (step 504). If the lock is not held (step 506), the lock circuit 140 issues an atomic request through the bus interface 141 to the peripheral bus 115 to check the lock array 204 (step 508). If the requested data portion is not locked (step 510), the peripheral bus 115 returns the lock to the lock circuit 140. The lock circuit 140 then marks the status of the lock in the kernel lock array 206 as '1' in the bit position corresponding to the kernel ID (step 512). If the requested data portion cannot be locked (step 510), the lock circuit 140 can issue the atomic request again after some waiting period. If the status of the lock is non-zero (step 506), which indicates a lock is held by another kernel 138, the lock circuit 140 instead sets a bit for the kernel ID in the pending requests field (step 514). The kernel 138 must then wait to access the requested data portion.
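
A behavioral sketch of this flow (steps 502-514) in C, under the same assumed element layout as above; issue_atomic_lock_request() is a hypothetical stand-in for the atomic transaction issued through the bus interface 141:

```c
/* Behavioral model of method 500; not the patent's RTL. Pending bits
 * occupy [2N-1:N] of each element, status bits occupy [N-1:0]. */
#include <stdbool.h>
#include <stdint.h>

#define N 8                                   /* assumed number of kernels */

extern bool issue_atomic_lock_request(uint32_t index);  /* hypothetical bus op */

uint32_t kernel_lock_array[1024];             /* assumed array depth */

/* Returns true if the lock was granted to kernel_id, false if it is
 * pending (or the host holds it and a retry is needed). */
bool handle_lock_request(unsigned kernel_id, uint32_t index)
{
    uint32_t e = kernel_lock_array[index];
    if (e & ((1u << N) - 1u)) {               /* step 506: held by another kernel */
        kernel_lock_array[index] = e | (1u << (N + kernel_id)); /* step 514 */
        return false;                         /* kernel must wait */
    }
    if (issue_atomic_lock_request(index)) {   /* steps 508-510: bus atomic */
        kernel_lock_array[index] = e | (1u << kernel_id);       /* step 512 */
        return true;
    }
    return false;  /* host holds the lock; retry after a waiting period */
}
```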

[0036] Fig. 6 is a flow diagram depicting a method 600 of managing lock releases according to an example. Referring to Figs. 2 and 6, on receiving a lock release request from a kernel 138 (step 602), which includes the kernel ID and the index, the lock circuit 140 checks the pending request field of that lock index (step 604). If the pending request field is zero (step 606), then the lock circuit 140 releases the lock by sending an atomic transaction over the peripheral bus through the bus interface 141 (step 608). If the pending request field is non-zero (step 606), the lock circuit 140 instead grants the lock to another kernel 138 that had previously requested a lock and did not receive the lock, but instead had a pending request set (step 610).
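
The release flow (steps 602-610) can be sketched the same way; issue_atomic_lock_release() is again a hypothetical stand-in, and the choice of which pending kernel to grant next (here, lowest-numbered first via the GCC/Clang builtin __builtin_ctz) is an assumption, since the patent does not specify an arbitration order:

```c
/* Behavioral model of method 600, sharing the array from the previous sketch. */
#include <stdint.h>

#define N 8                                   /* assumed number of kernels */

extern void issue_atomic_lock_release(uint32_t index);  /* hypothetical bus op */
extern uint32_t kernel_lock_array[];

void handle_lock_release(unsigned kernel_id, uint32_t index)
{
    uint32_t e = kernel_lock_array[index] & ~(1u << kernel_id); /* clear status */
    uint32_t pending = e >> N;                /* step 604: pending request bits */
    if (pending == 0) {                       /* step 606: nobody waiting */
        issue_atomic_lock_release(index);     /* step 608: release over the bus */
        kernel_lock_array[index] = e;
    } else {
        unsigned next = (unsigned)__builtin_ctz(pending); /* pick one waiter */
        e &= ~(1u << (N + next));             /* clear its pending bit */
        e |= (1u << next);                    /* grant it the lock (step 610) */
        kernel_lock_array[index] = e;         /* no bus traffic needed */
    }
}
```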

[0037] The lock circuit 140 prevents unnecessary atomic traffic over the peripheral bus 115 and handles acceleration-circuit-related contention locally. A kernel 138 awaiting grant of a lock need not send repeated atomic transactions over the peripheral bus 115.

[0038] In cases where the kernels 138 need a lock only to read data (not modify it), multiple kernels 138 can be granted locks concurrently by the lock circuit 140. The pending requests field can be converted to a counter to keep track of how many kernels are currently granted a lock. Fig. 4 is a block diagram depicting an example of the kernel lock array 206 having the counter field 308 rather than the pending requests field 304. The counter field 308 includes one value for each entry indicating the number of kernels that have been granted locks. The lock circuit 140 decrements the counter field 308 on lock release requests from the kernels 138. When the counter field 308 reaches zero, the lock circuit 140 sends an atomic request over the peripheral bus 115 to release the lock to the data. This scheme is useful in applications such as memcached, where a lock is basically taken to prevent a host application from modifying data, as in SET operations, when GET operations are offloaded to multiple kernels 138. This has the benefit of enabling parallel processing that is not possible with software alone.
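
A sketch of this shared-lock variant, with assumed names and a plain struct in place of the hardware array; only the first reader and the last releaser generate traffic over the peripheral bus 115:

```c
/* Counter-based variant of Fig. 4; a behavioral sketch, not the patent's RTL. */
#include <stdbool.h>
#include <stdint.h>

extern bool issue_atomic_lock_request(uint32_t index);   /* hypothetical bus ops */
extern void issue_atomic_lock_release(uint32_t index);

typedef struct {
    uint32_t lock_status;  /* one bit per kernel (lock status field) */
    uint32_t counter;      /* number of kernels granted the lock (counter field 308) */
} shared_lock_elem_t;

static shared_lock_elem_t shared_lock_array[1024];       /* assumed depth */

bool shared_lock(unsigned kernel_id, uint32_t index)
{
    shared_lock_elem_t *e = &shared_lock_array[index];
    /* Only the first reader goes out over the peripheral bus; later
     * readers are granted the lock locally. */
    if (e->counter == 0 && !issue_atomic_lock_request(index))
        return false;                       /* host holds the lock; retry later */
    e->lock_status |= 1u << kernel_id;
    e->counter++;
    return true;
}

void shared_unlock(unsigned kernel_id, uint32_t index)
{
    shared_lock_elem_t *e = &shared_lock_array[index];
    e->lock_status &= ~(1u << kernel_id);
    if (e->counter > 0 && --e->counter == 0)
        issue_atomic_lock_release(index);   /* last reader releases over the bus */
}
```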

[0039] Based on application requirements, if a write (e.g., an increment of some field in a data structure) is also required when a kernel 138 takes a lock, that increment can be done via an atomic store-add operation, and multiple user kernels 138 can still operate in parallel, providing better overall performance. If use-case analytics imply that holding a lock for a certain period of time causes starvation of the competing threads 139, then a threshold counter can be implemented in the lock circuit 140 such that a lock is not granted to more than a set number of kernels 138 once the lock is taken by a first kernel 138.
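
One way such a threshold counter might look, as a sketch; GRANT_LIMIT and the field names are assumptions, not values from the patent:

```c
/* Starvation guard: once a first kernel takes a lock, at most GRANT_LIMIT
 * kernels are granted it locally before further grants are refused, giving
 * the competing threads 139 a chance to acquire the lock. */
#include <stdbool.h>
#include <stdint.h>

#define GRANT_LIMIT 4          /* assumed per-acquisition grant budget */

typedef struct {
    uint32_t counter;          /* kernels currently holding the lock */
    uint32_t grants;           /* grants since the lock was first taken */
} guarded_elem_t;

/* Decide whether one more kernel may join the current lock holders. */
bool may_grant_locally(guarded_elem_t *e)
{
    if (e->counter > 0 && e->grants >= GRANT_LIMIT)
        return false;          /* budget spent: wait for the lock to drain */
    e->grants = (e->counter == 0) ? 1u : e->grants + 1u;
    e->counter++;
    return true;
}
```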

[0040] Fig. 7A is a block diagram depicting a programmable device 54 according to an example. The programmable device 54 can be used to implement the programmable device 128 in the hardware accelerator 122. The programmable device 54 includes a plurality of programmable integrated circuits (ICs) 1, e.g., programmable ICs 1A, 1B, 1C, and 1D. In an example, each programmable IC 1 is an IC die disposed on an interposer 51. Each programmable IC 1 comprises a super logic region (SLR) 53 of the programmable device 54, e.g., SLRs 53A, 53B, 53C, and 53D. The programmable ICs 1 are interconnected through conductors on the interposer 51 (referred to as super long lines (SLLs) 52).

[0041] Fig. 7B is a block diagram depicting a programmable IC 1 according to an example. The programmable IC 1 can be used to implement the programmable device 128 or one of the programmable ICs 1A-1D in the programmable device 54. The programmable IC 1 includes programmable logic 3 (also referred to as a programmable fabric), configuration logic 25, and configuration memory 26. The programmable IC 1 can be coupled to external circuits, such as nonvolatile memory 27, DRAM 28, and other circuits 29. The programmable logic 3 includes logic cells 30, support circuits 31, and programmable interconnect 32. The logic cells 30 include circuits that can be configured to implement general logic functions of a plurality of inputs. The support circuits 31 include dedicated circuits, such as transceivers, input/output blocks, digital signal processors, memories, and the like. The logic cells 30 and the support circuits 31 can be interconnected using the programmable interconnect 32. Information for programming the logic cells 30, for setting parameters of the support circuits 31, and for programming the programmable interconnect 32 is stored in the configuration memory 26 by the configuration logic 25. The configuration logic 25 can obtain the configuration data from the nonvolatile memory 27 or any other source (e.g., the DRAM 28 or from the other circuits 29). In some examples, the programmable IC 1 includes a processing system 2. The processing system 2 can include microprocessor(s), memory, support circuits, IO circuits, and the like. In some examples, the programmable IC 1 includes a network-on-chip (NOC) 55 and data processing engine (DPE) array 56. The NOC 55 is configured to provide for communication between subsystems of the programmable IC 1, such as between the PS 2, the PL 3, and the DPE array 56. The DPE array 56 can include an array of DPEs configured to perform data processing, such as an array of vector processors.

[0042] Fig. 7C is a block diagram depicting an SOC implementation of the programmable IC 1 according to an example. In the example, the programmable IC 1 includes the processing system 2 and the programmable logic 3. The processing system 2 includes various processing units, such as a real-time processing unit (RPU) 4, an application processing unit (APU) 5, a graphics processing unit (GPU) 6, a configuration and security unit (CSU) 12, a platform management unit (PMU) 122, and the like. The processing system 2 also includes various support circuits, such as on-chip memory (OCM) 14, transceivers 7, peripherals 8, interconnect 16, DMA circuit 9, memory controller 10, peripherals 15, and multiplexed IO (MIO) circuit 13. The processing units and the support circuits are interconnected by the interconnect 16. The PL 3 is also coupled to the interconnect 16. The transceivers 7 are coupled to external pins 24. The PL 3 is coupled to external pins 23. The memory controller 10 is coupled to external pins 22. The MIO 13 is coupled to external pins 20. The PS 2 is generally coupled to external pins 21. The APU 5 can include a CPU 17, memory 18, and support circuits 19.

[0043] In the example of Fig. 7C, the programmable IC 1 can be used in the hardware accelerator 122 and can function as described above. The acceleration circuit 130 can be programmed in the PL 3 and function as described above. In another example, the functionality of the hardware 104 described above can be implemented using the PS 2, rather than through hardware of a computing system. In such case, the software 106 executes on the PS 2 and functions as described above.

[0044] Referring to the PS 2, each of the processing units includes one or more central processing units (CPUs) and associated circuits, such as memories, interrupt controllers, direct memory access (DMA) controllers, memory management units (MMUs), floating point units (FPUs), and the like. The interconnect 16 includes various switches, busses, communication links, and the like configured to interconnect the processing units, as well as interconnect the other components in the PS 2 to the processing units.

[0045] The OCM 14 includes one or more RAM modules, which can be distributed throughout the PS 2. For example, the OCM 14 can include battery backed RAM (BBRAM), tightly coupled memory (TCM), and the like. The memory controller 10 can include a DRAM interface for accessing external DRAM. The peripherals 8, 15 can include one or more components that provide an interface to the PS 2. For example, the peripherals 15 can include a graphics processing unit (GPU), a display interface (e.g., DisplayPort, high-definition multimedia interface (HDMI) port, etc.), universal serial bus (USB) ports, Ethernet ports, universal asynchronous receiver-transmitter (UART) ports, serial peripheral interface (SPI) ports, general purpose IO (GPIO) ports, serial advanced technology attachment (SATA) ports, PCIe ports, and the like. The peripherals 15 can be coupled to the MIO 13. The peripherals 8 can be coupled to the transceivers 7. The transceivers 7 can include serializer/deserializer (SERDES) circuits, multi-gigabit transceivers (MGTs), and the like.

[0046] Fig. 7D illustrates a field programmable gate array (FPGA) implementation of the programmable IC 1 that includes the PL 3. The PL 3 shown in Fig. 7D can be used in any example of the programmable devices described herein. The PL 3 includes a large number of different programmable tiles including transceivers 37, configurable logic blocks (“CLBs”) 33, random access memory blocks (“BRAMs”) 34, input/output blocks (“IOBs”) 36, configuration and clocking logic (“CONFIG/CLOCKS”) 42, digital signal processing blocks (“DSPs”) 35, specialized input/output blocks (“I/O”) 41 (e.g., configuration ports and clock ports), and other programmable logic 39 such as digital clock managers, analog-to-digital converters, system monitoring logic, and so forth. The PL 3 can also include PCIe interfaces 40, analog-to-digital converters (ADC) 38, and the like.

[0047] In some PLs, each programmable tile can include at least one programmable interconnect element (“INT”) 43 having connections to input and output terminals 48 of a programmable logic element within the same tile, as shown by examples included at the top of Fig. 7D. Each programmable interconnect element 43 can also include connections to interconnect segments 49 of adjacent programmable interconnect element(s) in the same tile or other tile(s). Each programmable interconnect element 43 can also include connections to interconnect segments 50 of general routing resources between logic blocks (not shown). The general routing resources can include routing channels between logic blocks (not shown) comprising tracks of interconnect segments (e.g., interconnect segments 50) and switch blocks (not shown) for connecting interconnect segments. The interconnect segments of the general routing resources (e.g., interconnect segments 50) can span one or more logic blocks. The programmable interconnect elements 43 taken together with the general routing resources implement a programmable interconnect structure (“programmable interconnect”) for the illustrated PL.

[0048] In an example implementation, a CLB 33 can include a configurable logic element (“CLE”) 44 that can be programmed to implement user logic plus a single programmable interconnect element (“INT”) 43. A BRAM 34 can include a BRAM logic element (“BRL”) 45 in addition to one or more programmable interconnect elements. Typically, the number of interconnect elements included in a tile depends on the height of the tile. In the pictured example, a BRAM tile has the same height as five CLBs, but other numbers (e.g., four) can also be used. A DSP tile 35 can include a DSP logic element (“DSPL”) 46 in addition to an appropriate number of programmable interconnect elements. An IOB 36 can include, for example, two instances of an input/output logic element (“IOL”) 47 in addition to one instance of the programmable interconnect element 43. As will be clear to those of skill in the art, the actual I/O pads connected, for example, to the I/O logic element 47 typically are not confined to the area of the input/output logic element 47.

[0049] In the pictured example, a horizontal area near the center of the die (shown in Fig. 7D) is used for configuration, clock, and other control logic. Vertical columns 51 extending from this horizontal area or column are used to distribute the clocks and configuration signals across the breadth of the PL.

[0050] Some PLs utilizing the architecture illustrated in Fig. 7D include additional logic blocks that disrupt the regular columnar structure making up a large part of the PL. The additional logic blocks can be programmable blocks and/or dedicated logic.

[0051] Note that Fig. 7D is intended to illustrate only an exemplary PL architecture. For example, the numbers of logic blocks in a row, the relative width of the rows, the number and order of rows, the types of logic blocks included in the rows, the relative sizes of the logic blocks, and the interconnect/logic implementations included at the top of Fig. 7D are purely exemplary. For example, in an actual PL more than one adjacent row of CLBs is typically included wherever the CLBs appear, to facilitate the efficient implementation of user logic, but the number of adjacent CLB rows varies with the overall size of the PL.

[0052] The disclosure also may be expressed, but not limited to, in one or more of the following examples:

[0053] Example 1 : A hardware accelerator in a computing system, comprising: a bus interface coupled to a peripheral bus of the computing system; a lock circuit coupled to the bus interface; and a plurality of kernel circuits coupled to the lock circuit and the bus interface; wherein the plurality of kernel circuits provide lock requests to the lock circuit, the lock requests for data stored in system memory of the computing system; wherein the lock circuit is configured to process the lock requests from the plurality of kernel circuits and to issue atomic transactions over the peripheral bus through the bus interface based on the lock requests.

[0054] Example 2: The hardware accelerator of example 1, wherein the lock circuit is configured to maintain a kernel lock array, and wherein the kernel lock array includes a plurality of elements each having a pending request field and a lock status field.

[0055] Example 3: The hardware accelerator of example 2, wherein the pending request field includes a plurality of entries corresponding to the plurality of kernel circuits, and wherein the lock status field includes a plurality of entries corresponding to the plurality of kernel circuits.

[0056] Example 4: The hardware accelerator of example 3, wherein the lock circuit is configured to, for each of the lock requests, check the lock status field of a corresponding element in the kernel lock array, set a lock in the lock status field of the corresponding element if available, and indicate a pending lock request in the pending request field if a lock is not available.

[0057] Example 5: The hardware accelerator of example 1, wherein the lock circuit is configured to maintain a kernel lock array, and wherein the kernel lock array includes a plurality of elements each having a counter field and a lock status field.

[0058] Example 6: The hardware accelerator of example 5, wherein the counter field includes a value, and wherein the lock status field includes a plurality of entries corresponding to the plurality of kernel circuits.

[0059] Example 7: The hardware accelerator of example 6, wherein the lock circuit is configured to, for each of the lock requests, check the lock status field of a corresponding element in the kernel lock array, set a lock in the lock status field of the corresponding element, and increment the value in the counter field of the corresponding element.

[0060] Example 8: A computing system, comprising: a system memory; a processor coupled to the system memory; a peripheral bus coupled to the system memory; and a hardware accelerator coupled to the peripheral bus, the hardware accelerator including: a bus interface coupled to the peripheral bus; a lock circuit coupled to the bus interface; and a plurality of kernel circuits coupled to the lock circuit and the bus interface; wherein the plurality of kernel circuits provide lock requests to the lock circuit, the lock requests for data stored in the system memory; wherein the lock circuit is configured to process the lock requests from the plurality of kernel circuits and to issue atomic transactions over the peripheral bus through the bus interface based on the lock requests.

[0061] Example 9: The computing system of example 8, wherein the lock circuit is configured to maintain a kernel lock array, and wherein the kernel lock array includes a plurality of elements each having a pending request field and a lock status field.

[0062] Example 10: The computing system of example 9, wherein the pending request field includes a plurality of entries corresponding to the plurality of kernel circuits, and wherein the lock status field includes a plurality of entries corresponding to the plurality of kernel circuits.

[0063] Example 11: The computing system of example 10, wherein the lock circuit is configured to, for each of the lock requests, check the lock status field of a corresponding element in the kernel lock array, set a lock in the lock status field of the corresponding element if available, and indicate a pending lock request in the pending request field if a lock is not available.

[0064] Example 12: The computing system of example 8, wherein the lock circuit is configured to maintain a kernel lock array, and wherein the kernel lock array includes a plurality of elements each having a counter field and a lock status field.

[0065] Example 13: The computing system of example 12, wherein the counter field includes a value, and wherein the lock status field includes a plurality of entries corresponding to the plurality of kernel circuits.

[0066] Example 14: The computing system of example 13, wherein the lock circuit is configured to, for each of the lock requests, check the lock status field of a corresponding element in the kernel lock array, set a lock in the lock status field of the corresponding element, and increment the value in the counter field of the corresponding element.

[0067] Example 15: A method of managing locks to data stored in memory among a plurality of kernels executing in a hardware accelerator of a computing system, the method comprising: receiving, at a lock circuit in the hardware accelerator, a lock request from a kernel of the plurality of kernels; determining whether a lock is held by another kernel of the plurality of kernels; indicating a pending request for the kernel in response to the lock being held by another kernel; and issuing, from the lock circuit, an atomic request for the lock over a bus interface of the computing system to obtain the lock in response to the lock not being held by another kernel.

[0068] Example 16: The method of example 15, further comprising: indicating that the kernel has the lock.

[0069] Example 17: The method of example 15, further comprising: receiving a lock release request from the kernel at the lock circuit; determining whether another kernel of the plurality of kernels has a pending lock request; issuing, from the lock circuit, another atomic request to release the lock over the bus interface of the computing system in response to absence of a pending lock request; and granting, by the lock circuit, the lock to another kernel of the plurality of kernels in response to presence of a pending lock request.

[0070] Example 18: The method of example 15, wherein the lock circuit is configured to maintain a kernel lock array, and wherein the kernel lock array includes a plurality of elements each having a pending request field and a lock status field.

[0071] Example 19: The method of example 18, wherein the pending request field includes a plurality of entries corresponding to the plurality of kernels, and wherein the lock status field includes a plurality of entries corresponding to the plurality of kernels.

[0072] Example 20: The method of example 15, wherein the lock circuit is configured to maintain a kernel lock array, and wherein the kernel lock array includes a plurality of elements each having a counter field and a lock status field.

[0073] While the foregoing is directed to specific examples, other and further examples may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow.