


Title:
METHOD AND APPARATUS FOR REDUCING READ/WRITE CONTENTION TO A CACHE
Document Type and Number:
WIPO Patent Application WO/2018/112373
Kind Code:
A1
Abstract:
A cache is presented. The cache comprises a tag array configured to store one or more tag addresses; a tag control buffer configured to store cache control information; a data array configured to store data acquired from a memory device; and a write buffer configured to store information related to a write request. The tag array is configured to be accessed independently from the tag control buffer, and the data array is configured to be accessed independently from the write buffer.

Inventors:
JIANG XIAOWEI (US)
Application Number:
PCT/US2017/066730
Publication Date:
June 21, 2018
Filing Date:
December 15, 2017
Assignee:
ALIBABA GROUP HOLDING LTD (US)
International Classes:
G06F12/00; G06F9/30; G06F11/07; G06F12/08
Foreign References:
US20090006756A1 (2009-01-01)
US5678020A (1997-10-14)
US20150036418A1 (2015-02-05)
US6993633B1 (2006-01-31)
US20120124327A1 (2012-05-17)
US5680572A (1997-10-21)
US20050235115A1 (2005-10-20)
US20110035616A1 (2011-02-10)
US20040243764A1 (2004-12-02)
Attorney, Agent or Firm:
CAPRON, Aaron, J. (US)
Claims:
WHAT IS CLAIMED IS:

1. A cache comprising:

a tag array configured to store one or more tag addresses;

a tag control buffer configured to store cache control information;

a data array configured to store data acquired from a memory device; and

a write buffer configured to store information related to a write request,

wherein the tag array is configured to be accessed independently from the tag control buffer, and wherein the data array is configured to be accessed independently from the write buffer.

2. The cache of claim 1, wherein the tag array comprises an array of single-ported static random-access memory (SRAM) devices.

3. The cache of claim 1 or 2, wherein the data array comprises an array of single-ported SRAM devices.

4. The cache of any one of claims 1 - 3, wherein the tag control buffer comprises an array of flip-flops.

5. The cache of any one of claims 1 - 4, wherein the cache control information includes at least one of: data coherency status, and least recently used (LRU) information.

6. The cache of any one of claims 1 - 5, wherein the information related to a write request includes the write data and information about a location within the data array to be updated with the write data.

7. The cache of claim 6, wherein the information about the location comprises a first pointer to a first data entry within the data array, and a second pointer to a location within the first data entry.

8. The cache of claim 7, further comprising a second data array, wherein the information about the location further comprises a third pointer to the second data array.

9. The cache of any one of claims 1 - 8, further comprising a controller configured to:

determine, based on the write request, whether there is a cache-hit;

responsive to determining that there is a cache-hit:

perform a write operation to the tag control buffer;

perform a write operation to the write buffer to store the information related to the write request;

process a subsequent data access request; and

perform a write operation to a first data entry of the data array based on the information stored in the write buffer.

10. The cache of claim 9, wherein the write operation to the first data entry of the data array is performed responsive to determining at least one of: the write buffer being full, the data array being idle, and the subsequent data access request being directed to the first data entry.

11. A method of operating a cache that comprises a tag array, a tag control buffer, a data array, and a write buffer, the method comprising:

receiving a first write request including write data and a memory address;

receiving a second data access request;

determining a tag address based on the memory address;

performing a first read operation to the tag array to determine if there is a cache-hit; and

responsive to determining that there is a cache-hit:

performing a write operation to the write buffer to store information related to the first write request and write data,

performing a write operation to the tag control buffer to update stored cache control information,

performing second read operations to the tag array and to the data array for the second data access request, and

performing a write operation to a first data entry of the data array based on the information related to the first write request and the write data stored in the write buffer.

12. The method of claim 11, wherein the second read operations to the tag array and to the data array for the second request are performed before the write operation to the tag control buffer is completed.

13. The method of claim 11 or 12, wherein the write operation to the data array based on the information related to the first write request stored in the write buffer is performed before the second read operations to the tag array and to the data array for the second request are performed; further comprising:

determining to perform the write operation to the first data entry of the data array based on the information related to the first write request stored in the write buffer responsive to determining at least one of: the write buffer being full, the data array being idle, and the second data access request being directed to the first data entry in the data array.

14. The method of any one of claims 11 - 13, wherein the cache control information includes at least one of: data coherency status, and least recently used (LRU) information.

15. The method of any one of claims 11 - 14, wherein the information related to a write request includes the write data and information about a location associated with the first data entry;

wherein the information about the location comprises a first pointer to the first data entry, and a second pointer to a location within the first data entry; and

wherein the cache further comprises a second data array, wherein the information about the location further comprises a third pointer to the second data array.

16. A computer system, comprising:

a hardware processor; and

a hierarchical memory system coupled with the hardware processor, comprising:

a dynamic random access memory device; and

a cache comprising:

a tag array configured to store one or more tag addresses,

a tag control buffer configured to store cache control information,

a data array configured to store data acquired from the dynamic random access memory device, and

a write buffer configured to store information related to a write request from the hardware processor;

wherein the tag array is configured to be accessed independently from the tag control buffer, and wherein the data array is configured to be accessed independently from the write buffer.

17. The computer system of claim 16, wherein the cache further comprises a controller configured to:

determine, based on the write request, whether there is a cache-hit;

responsive to determining that there is a cache-hit:

perform a write operation to the tag control buffer;

perform a write operation to the write buffer to store the information related to the write request;

process a subsequent data access request from the hardware processor; and

perform a write operation to a first data entry of the data array based on the information stored in the write buffer;

wherein the write operation to the data array is performed based on a determination of at least one of: the write buffer being full, the data array being idle, and the subsequent data access request being directed to the first data entry.

18. The computer system of claim 16 or 17, wherein each of the tag array and the data array comprises an array of single-ported static random-access memory (SRAM) devices; and wherein the tag control buffer comprises an array of flip-flops.

19. A cache comprising:

a tag array configured to store one or more tag addresses and cache control information;

a data array configured to store data acquired from a memory device; and

a write buffer configured to store information related to a write request,

wherein the data array is configured to be accessed independently from the write buffer.

20. A method of operating a cache that comprises a tag array, a data array, and a write buffer, the method comprising:

receiving a write request including a first data and a memory address;

determining a tag address based on the memory address;

performing a read operation to the tag array to determine if there is a cache-hit;

responsive to determining that there is a cache-hit, performing a write operation to the write buffer to store the first data, and performing a write operation to the tag array to update stored cache control information; and

responsive to determining a preset condition is satisfied, performing a write operation to the data array based on the first data stored in the write buffer.

Description:
METHOD AND APPARATUS FOR REDUCING READ/WRITE CONTENTION TO

A CACHE

TECHNICAL FIELD

[001] The present disclosure generally relates to the field of computer architecture, and more particularly, to a method and an apparatus for reducing read/write contention to a cache.

BACKGROUND

[002] Volatile memory, such as dynamic random-access memory (DRAM), provides temporary storage space that can be accessed by a computer processor at a reasonable speed. The volatile memory can be used to store instructions and data, which can then be fetched to the computer processor for processing. The computer processor can also store a result of the processing into the volatile memory for subsequent processing. Although DRAM provides reasonable access speed, memory access latency remains a bottleneck to the computer processor, as the processing speed of the computer processor, as well as the storage capacity of DRAM, keeps increasing thanks to Moore's Law. Accordingly, contemporary computer processors typically deploy a hierarchical memory system including a cache to speed up memory access.

SUMMARY

[003] Embodiments of the present disclosure provide a cache. The cache comprises a tag array configured to store one or more tag addresses, a tag control buffer configured to store cache control information, a data array configured to store data acquired from a memory device, and a write buffer configured to store information related to a write request. The tag array is configured to be accessed independently from the tag control buffer, and the data array is configured to be accessed independently from the write buffer.

[004] Embodiments of the present disclosure also provide a method of operating a cache that comprises a tag array, a tag control buffer, a data array, and a write buffer. The method comprises: receiving a first write request including write data and a memory address; receiving a second data access request; determining a tag address based on the memory address; performing a first read operation to the tag array to determine if there is a cache-hit; and responsive to determining that there is a cache-hit: performing a write operation to the write buffer to store information related to the first write request, performing a write operation to the tag control buffer to update stored cache control information, performing second read operations to the tag array and to the data array for the second data access request, and performing a write operation to a first data entry of the data array based on the information related to the first write request stored in the write buffer.

[005] Embodiments of the present disclosure also provide a computer system. The computer system comprises a hardware processor and a hierarchical memory system coupled with the hardware processor. The hierarchical memory system comprises a dynamic random access memory device and a cache. The cache comprises a tag array configured to store one or more tag addresses, a tag control buffer configured to store cache control information, a data array configured to store data acquired from the dynamic random access memory device, and a write buffer configured to store information related to a write request from the hardware processor. The tag array is configured to be accessed independently from the tag control buffer, and the data array is configured to be accessed independently from the write buffer.

[006] Embodiments of the present disclosure also provide a cache. The cache comprises a tag array configured to store one or more tag addresses and cache control information, a data array configured to store data acquired from a memory device, and a write buffer configured to store information related to a write request. The data array is configured to be accessed independently from the write buffer.

[007] Embodiments of the present disclosure also provide a method of operating a cache that comprises a tag array, a data array, and a write buffer. The method comprises: receiving a write request including a first data and a memory address; determining a tag address based on the memory address; performing a read operation to the tag array to determine if there is a cache-hit; responsive to determining that there is a cache-hit, performing a write operation to the write buffer to store the first data, and performing a write operation to the tag array to update stored cache control information; and responsive to determining that a preset condition is satisfied, performing a write operation to the data array based on the first data stored in the write buffer.

[008] Additional objects and advantages of the disclosed embodiments will be set forth in part in the following description, and in part will be apparent from the description, or may be learned by practice of the embodiments. The objects and advantages of the disclosed embodiments may be realized and attained by the elements and combinations set forth in the claims.

[009] It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosed embodiments, as claimed.

BRIEF DESCRIPTION OF THE DRAWINGS

[010] FIG. 1 is a diagram illustrating a computer system in which embodiments of the present disclosure can be used.

[011] FIGS. 2A-B are schematic diagrams illustrating single-ported and dual-ported memory devices that can be used in embodiments of the present disclosure.

[012] FIG. 3 is a schematic diagram illustrating an exemplary cache, consistent with embodiments of the present disclosure.

[013] FIG. 4 is a schematic diagram illustrating another exemplary cache, consistent with embodiments of the present disclosure.

[014] FIG. 5 is a flowchart illustrating an exemplary method of operating a cache, consistent with embodiments of the present disclosure.

DESCRIPTION OF THE EMBODIMENTS

[015] Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. The following description refers to the accompanying drawings in which the same numbers in different drawings represent the same or similar elements unless otherwise represented. The implementations set forth in the following description of exemplary embodiments do not represent all implementations consistent with the invention. Instead, they are merely examples of apparatuses and methods consistent with aspects related to the invention as recited in the appended claims.

[016] Embodiments of the present disclosure provide a cache that comprises a tag array configured to store one or more tag addresses, a tag control buffer configured to store cache control information, a data array configured to store data acquired from a memory device, and a write buffer configured to store information related to a write request. According to embodiments of the present disclosure, the tag array is configured to be accessed independently from the tag control buffer, and the data array is configured to be accessed independently from the write buffer. After receiving a write request and determining that the write request leads to a cache-hit, the cache may, instead of performing the write request, store information related to the write request in the write buffer. The cache can then process a subsequent access request, and later perform the delayed write request based on the information stored in the write buffer. With such an arrangement, even if the data array comprises single-ported SRAM devices that do not allow simultaneous read and write operations, the cache need not wait for the write request to be completed before processing the subsequent access request. As a result, the latency due to read/write contention at the single-ported SRAM devices can be reduced, and the performance of the cache can be substantially improved.

[017] Reference is now made to FIG. 1, which illustrates a computer system 100 in which embodiments of the present disclosure can be used. As shown in FIG. 1, computer system 100 includes a computer processor 102 and a hierarchical memory system 104. Memory system 104 includes a cache 122 and a DRAM 132. Although FIG. 1 illustrates that cache 122 is separate from computer processor 102, it is understood that cache 122 can also be a part of an integrated circuit chip that includes computer processor 102. Cache 122 can also be a part of an integrated circuit chip that includes DRAM 132, or can exist as a standalone chip.

[018] Cache 122 is typically built using static random-access memory (SRAM) devices, which provide higher access speed than a DRAM device but also include more transistors and consume more power than a DRAM device of the same storage capacity. Cache 122 can be used as an intermediate buffer to store a subset of the data stored in DRAM 132, such that the data stored in cache 122 correspond to at least a part of the data stored in DRAM 132. That subset is typically the data most recently accessed by computer processor 102, which can include data acquired from DRAM 132 according to a data read operation, or data modified by computer processor 102 and to be stored in DRAM 132 according to a data write operation. Due to temporal and spatial localities, such data (as well as other data stored in nearby memory locations) are likely to be accessed by computer processor 102 again. When computer processor 102 tries to access such data again, it can read them from cache 122 (which results in a cache-hit) instead of from DRAM 132. By increasing the access speed of these data, the effect of the access latency of DRAM 132 on the overall speed of computer system 100 can be reduced.

[019] As shown in FIG. 1, cache 122 includes a tag array 124, a data array 126, and a cache controller 128. Cache 122 can be a direct-mapped cache. Both tag array 124 and data array 126 include a plurality of entries, including tag entries 124a and 124b, and data entries 126a and 126b. Data array 126 stores the memory data acquired from DRAM 132 that was accessed (or will likely be accessed) by computer processor 102. The data stored in each data entry of data array 126 is associated with a tag address. The tag address can be derived from the memory address of that data in DRAM 132. Each tag entry of tag array 124 may store a tag address associated with a data entry in data array 126, such that the tag entry corresponds to the data entry. Tag array 124 also stores control information for cache operation with each tag address. The control information may include data coherency state and least recently used (LRU) information. The data coherency state may indicate a state of coherency between the data stored in the cache and the corresponding data stored in the memory, in the event that the data in the cache is overwritten by computer processor 102 in a cache write operation. The LRU information reflects a history of access (either read or write access) of a data entry in data array 126, and can include a flag that indicates whether a data entry is the least recently accessed among the data entries in data array 126.

[020] Cache controller 128 manages the read and write operations to cache 122. When cache controller 128 receives a request to access (read or write) cache 122 from computer processor 102, it determines a tag address from the request. Cache controller 128 then performs a read operation to tag array 124 to determine whether it stores a matching tag address. In a case where the cache is a direct-mapped cache, if a matching tag address is found in one of the tag entries in tag array 124, cache controller 128 can determine that there is a cache-hit. Cache controller 128 can then perform a read operation to the data entry that corresponds to the matching tag entry, and transmit the data stored in the data entry back to computer processor 102. Cache controller 128 may also update both the data coherency state and LRU information in tag array 124, as well as data stored in data array 126, in a cache write operation.
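The lookup flow described in the preceding paragraphs can be sketched as a minimal Python model (this is an illustration only, not part of the disclosed embodiments; the line size and entry count are illustrative assumptions):

```python
# Minimal model of a direct-mapped cache lookup: the memory address is
# split into a tag, an index into the tag/data arrays, and a byte offset
# within the cache line. All widths are illustrative assumptions
# (64-byte lines, 256 entries).

LINE_SIZE = 64      # bytes per cache line
NUM_ENTRIES = 256   # entries in the tag array (and data array)

def split_address(addr):
    """Split a memory address into (tag, index, offset)."""
    offset = addr % LINE_SIZE
    index = (addr // LINE_SIZE) % NUM_ENTRIES
    tag = addr // (LINE_SIZE * NUM_ENTRIES)
    return tag, index, offset

def lookup(tag_array, addr):
    """Return (hit, index, offset); a hit means the indexed tag matches."""
    tag, index, offset = split_address(addr)
    return tag_array[index] == tag, index, offset

tag_array = [None] * NUM_ENTRIES
tag, index, _ = split_address(0x12345)
tag_array[index] = tag               # simulate a prior fill of this line
hit, _, _ = lookup(tag_array, 0x12345)
```

On a hit, the controller would read the corresponding data entry; on a miss, it would fall back to the backing memory.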

[021] With the current technology, both tag array 124 and data array 126 are constructed using SRAM devices. Some SRAM devices are single-ported devices. Reference is now made to FIG. 2A, which illustrates a typical single-ported SRAM cache 200. As shown in FIG. 2A, SRAM cache 200 includes a single address (addr) for selecting one or more single-ported SRAM devices. The selected devices can serve a write operation to store input data (din) when the write enable signal (we) is asserted, or a read operation to provide stored data as output data (dout) when the read enable signal (re) is asserted. Typically, single-ported SRAM devices include shared bit lines for read and write operations, with the write enable and read enable signals controlling whether the bit lines are to be driven by the input data port (for a write operation) or to be driving the output data port (for a read operation). Therefore, the selected SRAM devices cannot process a read operation and a write operation simultaneously; otherwise there will be contention between the input data port and the output data port at the shared bit lines.

[022] On the other hand, some SRAM devices are dual-ported devices, which allow simultaneous read and write operations. Typically dual-ported SRAM devices include separate bit lines for read operation and for write operation. Reference is now made to FIG. 2B, which illustrates a typical dual-ported SRAM cache 220. As shown in FIG. 2B, SRAM cache 220 includes an address for read operation (rd_addr) and an address for write operation (wr_addr). SRAM cache 220 allows different sets of SRAM devices being selected for read and write operations simultaneously, or the same set of SRAM devices being selected for simultaneous read and write operations.

[023] Compared with caches constructed with dual-ported SRAM devices of the same storage capacity, single-ported SRAM caches generally include fewer transistor devices and signal routes, and can provide up to a 50% reduction in chip size and power consumption. A single-ported SRAM cache, however, can suffer serious performance degradation due to its inability to handle read and write operations simultaneously.

[024] To provide an illustrative example, referring to FIG. 1, assume that computer processor 102 transmits a write request to write data into data entry 126a, and then a read request to read data from data entry 126b, with both requests resulting in consecutive cache-hits. The write request may result in a write operation to the LRU information and data coherency status stored in tag entry 124a that corresponds to data entry 126a. Moreover, the read request may also result in a read operation to access the tag addresses stored in tag entries 124a and 124b, as cache controller 128 searches through tag array 124 to determine if there is a matching tag address.

[025] If tag array 124 is constructed using a dual-ported SRAM device, the write operation to tag array entry 124a and the read operations to tag array entries 124a and 124b may occur simultaneously. On the other hand, if tag array 124 is constructed using a single-ported SRAM device, the write operation to tag array entry 124a and the read operations to tag array entries 124a and 124b cannot occur simultaneously. For example, after the write operation to tag entry 124a starts (for processing the write request), the read operations to tag entries 124a and 124b (for processing the read request) cannot proceed and must wait until the write operation to tag array entry 124a completes. Similarly, if data array 126 is constructed using a single-ported SRAM device, after the write operation to data entry 126a starts (for processing the write request), the read operation to data entry 126b (for processing the read request) also cannot proceed and must wait until the write operation to data entry 126a finishes.
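The serialization cost in this example can be made concrete with a toy cycle count (an illustration only, assuming one cycle per array access and no pipelining; this does not model any real SRAM timing):

```python
# Toy comparison of single-ported vs. dual-ported array timing for the
# scenario above: a write to one entry immediately followed by a read of
# another entry. Assumes one cycle per array access and no pipelining.

def cycles(requests, dual_ported):
    """Count cycles needed to retire a list of ('read'/'write', entry) requests."""
    total, i = 0, 0
    while i < len(requests):
        if (dual_ported and i + 1 < len(requests)
                and requests[i][0] != requests[i + 1][0]):
            total += 1   # a read and a write can share a cycle
            i += 2
        else:
            total += 1   # a single-ported array serializes every access
            i += 1
    return total

workload = [("write", "126a"), ("read", "126b")]
single_ported_cycles = cycles(workload, dual_ported=False)
dual_ported_cycles = cycles(workload, dual_ported=True)
```

Here the single-ported array takes two cycles for the back-to-back write and read, while the dual-ported array retires both in one.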

[026] Despite the consecutive cache-hits, the wait time caused by the read/write contentions still significantly degrades the performance of single-ported cache 122, compared to situations where cache 122 is constructed using dual-ported SRAM devices and read/write contentions do not exist. With the current technology, it is common for caches to achieve a cache-hit rate of up to 90%. Therefore, removing the read/write contentions during cache-hits can significantly improve the performance of caches constructed with single-ported SRAM devices.

[027] Reference is now made to FIG. 3, which illustrates an exemplary cache 300 according to embodiments of the present disclosure. Cache 300 may be a part of a hierarchical memory system (e.g., memory system 104 of FIG. 1) that processes data access requests from a computer processor (e.g., computer processor 102), and may replace cache 122 of FIG. 1. It is understood that cache 300 can also be a part of an integrated circuit chip that includes computer processor 102. Cache 300 can also be a part of an integrated circuit chip that includes DRAM 132 of FIG. 1, or can exist as a standalone chip. Cache 300 can be a direct-mapped cache.

[028] As shown in FIG. 3, cache 300 may include a tag portion 310, a data portion 340, and a cache controller 360. Tag portion 310 may include a tag array 312 and a tag control buffer 314. Each entry of tag array 312 may store a tag address, and each entry of tag control buffer 314 may store control information including data coherency state and LRU information. Tag array 312 may be constructed using single-ported or dual-ported SRAM devices, while tag control buffer 314 may be constructed using sequential logic blocks such as flip-flops, latches, etc., or any other circuit blocks (e.g., dual-ported SRAM devices). Moreover, tag array 312 and tag control buffer 314 may each include their own read and write enable signals, entry selection signals, etc. With such arrangements, tag array 312 and tag control buffer 314 can be accessed (read or write) independently from each other. Although FIG. 3 illustrates that tag portion 310 includes both tag array 312 and tag control buffer 314, it is understood that tag control buffer 314 can be separated from tag portion 310, or that tag array 312 may store the cache control information as well.

[029] Moreover, data portion 340 includes a data array 342 and a write buffer 344. Each entry of data array 342 and write buffer 344 may store the actual data that was accessed (or will likely be accessed) by computer processor 102 of FIG. 1. Data array 342 may be constructed using single-ported or dual-ported SRAM devices. Write buffer 344 may also be constructed using single-ported or dual-ported SRAM devices, or with sequential logic blocks such as flip-flops and latches. Data array 342 and write buffer 344 may each include their own read and write enable signals, entry selection signals, etc. With such arrangements, data array 342 can be accessed (read or write) independently from write buffer 344.

[030] Entries of data array 342 and write buffer 344 may have different sizes. For example, each entry of data array 342 may correspond to a cache line, which is typically 64 bytes. Each entry of write buffer 344, on the other hand, may have a size of 4 bytes, which may correspond to the size of data requested by a typical write operation.
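Given these sizes, the pointers kept with a buffered write could be derived as in the following sketch (the pointer encoding and word alignment here are illustrative assumptions, not the disclosed encoding):

```python
# Sketch of deriving the two pointers for a buffered 4-byte write into a
# cache with 64-byte lines: which data-array entry is targeted, and the
# byte offset within that entry. Sizes are illustrative assumptions.

LINE_SIZE = 64   # bytes per data-array entry (one cache line)
WORD_SIZE = 4    # bytes per buffered write

def write_pointers(addr):
    """Return (entry_pointer, byte_offset_within_entry) for a write address."""
    entry_pointer = addr // LINE_SIZE                     # target data-array entry
    offset = (addr % LINE_SIZE) // WORD_SIZE * WORD_SIZE  # word-aligned offset
    return entry_pointer, offset
```

For example, a write to byte address 4660 would target entry 72 at offset 52 within the line.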

[031] In some embodiments, cache controller 360 can control the read and write operations at tag portion 310 and data portion 340 to implement a predetermined caching policy. Cache controller 360 can include digital logic circuitries, and can be a standalone application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a part of computer processor 102. Cache controller 360 may also be capable of executing instructions (e.g., firmware, micro-codes, etc.) to control the read and write operations. The instructions may be stored in a computer-readable medium that may be volatile or non-volatile, magnetic, semiconductor, tape, optical, removable, non-removable, etc., and is accessible by cache controller 360.

[032] In some embodiments, cache controller 360 can implement a caching policy to reduce the latency caused by write operations to tag portion 310 and data portion 340. After receiving (e.g., from computer processor 102) a request to write data into a certain entry of data array 342, and determining that the write request leads to a cache-hit (e.g., when the tag address derived from the request is found in tag array 312), cache controller 360 can withhold the write operation to data array 342 and store information related to the write operation in write buffer 344. The information stored in write buffer 344 may include, for example, the write data to be stored in data array 342, an identifier (e.g., a tag address, a pointer, or any other identifier) for identifying an entry in data array 342 to receive the write data, and a location within that entry, which typically can be derived from the bit offset included in the memory address. Cache controller 360 can then proceed to process the next data request (e.g., a read request) from computer processor 102.
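The write-hit policy just described can be sketched as follows (a hypothetical model; the entry layout and names are illustrative assumptions, not the disclosed structure):

```python
# Sketch of the write-hit policy above: on a cache-hit, the controller
# withholds the data-array write and stashes the write data, together
# with pointers to its destination, in the write buffer, leaving the
# controller free to serve the next request.

class WriteBufferEntry:
    def __init__(self, entry_id, offset, data):
        self.entry_id = entry_id   # identifies the target data-array entry
        self.offset = offset       # location within that entry
        self.data = data           # the write data itself
        self.drained = False       # set once written into the data array

def handle_write_hit(write_buffer, entry_id, offset, data):
    """Record a cache-hit write in the buffer instead of the data array."""
    write_buffer.append(WriteBufferEntry(entry_id, offset, data))

write_buffer = []
handle_write_hit(write_buffer, entry_id=7, offset=12, data=b"\xde\xad\xbe\xef")
```

The `drained` flag corresponds to the status flag mentioned in paragraph [035], which marks entries whose data has already been written into the data array.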

[033] Cache controller 360 can perform the delayed write request using the information stored in write buffer 344 in various scenarios. As an example, cache controller 360 can perform the delayed write request when the write buffer is full and cannot accept additional information. In this case, the number of entries of write buffer 344 can be designed based on, for example, an expected number of write requests from computer processor 102, a number of entries of data array 342 (which may determine a number of cache-hits within a given period), a performance target of the cache, etc.

[034] As another example, cache controller 360 may also perform the delayed write request when cache controller 360 has not received any request for accessing cache 300 within a predetermined period, or within a predetermined period for completion of a read request. The predetermined period can also be specific to the cache level (e.g., whether the cache is an L1 cache, an L2 cache, etc.), and can be in the range of nanoseconds.

[035] Moreover, cache controller 360 can also perform the delayed write request when a subsequent read or write request is directed to the same entry in data array 342 as the delayed write request. In such a case, to ensure that the subsequent read or write request operates on the most up-to-date data in cache 300, cache controller 360 can perform the delayed write request and merge the write data associated with the delayed write request with the data stored in the associated entry in data array 342. After the delayed write request is performed, the information stored in write buffer 344 can be removed, or overwritten by new data. In some embodiments, write buffer 344 can associate a status flag with each entry to indicate whether the data stored in the entry has been written into data array 342. If the flag indicates that the data has been written into data array 342, the data can be overwritten or removed.
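The drain conditions of paragraphs [033]-[035] can be combined into a single sketch (illustrative assumptions throughout: the buffer capacity, the dict-based entry layout, and the byte-level merge are hypothetical, not the disclosed implementation):

```python
# Sketch of the drain policy: the delayed write is performed when the
# write buffer is full, the data array is idle, or a later request
# targets the same data-array entry; pending write data is then merged
# into the stored line.

WRITE_BUFFER_CAPACITY = 4

def should_drain(write_buffer, data_array_idle, next_entry_id):
    """Decide whether the buffered writes should be drained now."""
    pending = [e for e in write_buffer if not e["drained"]]
    return (len(pending) >= WRITE_BUFFER_CAPACITY
            or data_array_idle
            or any(e["entry_id"] == next_entry_id for e in pending))

def drain(write_buffer, data_array):
    """Merge each pending buffered write into its data-array entry."""
    for e in write_buffer:
        if not e["drained"]:
            line = data_array[e["entry_id"]]
            data_array[e["entry_id"]] = (
                line[:e["offset"]] + e["data"]
                + line[e["offset"] + len(e["data"]):])
            e["drained"] = True   # entry can now be overwritten or removed
```

After `drain`, each merged buffer entry is flagged so that new write information can reuse its slot, mirroring the status flag described above.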

[036] With embodiments of the present disclosure, cache 300 does not always have to wait for a write request to complete before processing a subsequent read or write request. After receiving the write request and determining that it leads to a cache-hit, cache controller 360 can store information related to the write request into write buffer 344 for later processing. After (or concurrent with) storing the information in write buffer 344, cache controller 360 can process the subsequent read or write request.

[037] Moreover, as discussed above, tag array 312 and tag control buffer 314 can be configured to allow independent accesses, such that during a cache-hit, when tag control buffer 314 receives a write operation (e.g., to update the LRU information for an entry in data array 342 that has been accessed for a cache-hit), tag array 312 can simultaneously handle a read operation for a tag address search. For example, when tag control buffer entry 314a receives a write operation due to a write request, tag array 312 remains accessible to cache controller 360.

[038] With such an arrangement, even if tag array 312 and data array 342 are constructed with single-ported SRAM devices that cannot process read and write operations simultaneously, the latency that write operations add to the processing of subsequent requests can be reduced. For example, assume computer processor 102 transmits a write request that leads to a write operation to data entry 342a, followed by a read request that leads to a read operation to data entry 342b, both of which lead to cache-hits. The subsequent read operation to data entry 342b can proceed without waiting for the write operation to data entry 342a to complete. Similarly, if the write request also leads to a write operation to tag control buffer 314 (e.g., to update the LRU information and/or the data coherency status), the write operation can be performed independently of, and concurrently with, the read operations to tag array 312 as cache controller 360 searches for a matching tag address when processing the read request.

[039] As discussed above, with current technology, it is common for caches to achieve a cache-hit rate of up to 90%. Therefore, with embodiments of the present disclosure, the latency due to read/write contention during cache-hits can be reduced, and the performance of a cache with tag and data arrays constructed with single-ported SRAM devices can be improved significantly.

[040] In some embodiments, when the write request leads to a cache-miss (e.g., a tag address derived from the write request cannot be found in tag array 312), cache controller 360 may determine to process the write request instead of withholding it, and the controller may determine not to store information related to the write request in write buffer 344. Such a policy can be based on, for example, the facts that cache-misses occur much less frequently than cache-hits, and that a cache-miss typically requires a write operation to update an entire cache line of data, which would require a large write buffer to hold the write data. Therefore, reducing the latency caused by read/write contention for a cache-miss may be more costly and provide less benefit than reducing the same latency for a cache-hit.

[041] Embodiments of the present disclosure are also applicable to set-associative caches. Reference is now made to FIG. 4, which illustrates an exemplary set-associative cache 400 according to embodiments of the present disclosure. As shown in FIG. 4, cache 400 includes a tag portion 410, a data portion 440, and a cache controller 460. Tag portion 410 includes four tag arrays 412a, 412b, 412c, and 412d, and four tag control buffers 414a, 414b, 414c, and 414d. Data portion 440 also includes four data arrays 442a, 442b, 442c, and 442d, and a write buffer 444. Each tag entry in tag array 412a, 412b, 412c, and 412d may store a tag address that corresponds to a data entry in, respectively, data array 442a, 442b, 442c, or 442d. The tag address can be derived from a memory address 480 provided in a data request from computer processor 102. As an illustrative example, the tag address can be derived by extracting the first 33 bits of memory address 480. Moreover, each of tag arrays 412a, 412b, 412c, and 412d is associated with, respectively, tag control buffer 414a, 414b, 414c, or 414d, which store the LRU information and data coherency state that correspond to the tag addresses stored in the tag arrays and the data stored in the data arrays.

[042] With four tag and data arrays, cache 400 can be a four-way set associative cache, such that a particular tag address can be stored in any of the four tag arrays, and the data associated with that tag address can be stored in any of the four data arrays. The determination of which of the four tag arrays to store the tag address, and which of the four data arrays to store the data, can be based on the index field in memory address 480.
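
The way an address is split into the fields used above can be sketched as follows. The 33-bit tag width matches the illustrative example in paragraph [041]; the 9-bit index and 6-bit offset widths (implying a 48-bit address) are assumptions made here purely for illustration, as the patent does not fix those widths.

```python
# Assumed field widths: 33-bit tag (from the example above), 9-bit index,
# 6-bit offset. Only the tag width comes from the text; the rest are
# illustrative choices for a 48-bit memory address.
TAG_BITS, INDEX_BITS, OFFSET_BITS = 33, 9, 6

def split_address(addr):
    """Decompose a memory address into (tag, index, offset) fields."""
    offset = addr & ((1 << OFFSET_BITS) - 1)           # byte within the cache line
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)  # selects the set
    tag = addr >> (OFFSET_BITS + INDEX_BITS)           # compared against stored tags
    return tag, index, offset
```

In a four-way set-associative cache, the index selects one set, and the tag may then reside in any of the four ways of that set.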

[043] Tag arrays 412a-d may be constructed using single-ported or dual-ported SRAM devices, while tag control buffers 414a-d may be constructed using sequential logic blocks such as flip-flops, latches, etc., or any circuit blocks (e.g., dual-ported SRAM devices). Moreover, each of tag arrays 412a-d and control buffers 414a-d may include its own read and write enable signals, entry selection signals, etc. With such arrangements, corresponding pairs of tag array and control buffer (e.g., tag array 412a and control buffer 414a) can be accessed (read or write) independently.

[044] Moreover, data arrays 442a-d may be constructed using single-ported or dual-ported SRAM devices. Write buffer 444 may also be constructed using single-ported or dual-ported SRAM devices, or with sequential logic blocks such as flip-flops and latches. Each of data arrays 442a-d and write buffer 444 may include its own read and write enable signals, entry selection signals, etc. With such arrangements, each of data arrays 442a-d can be accessed (read or write) independently from write buffer 444.

[045] Cache controller 460 manages the read/write operations to cache 400. Cache controller 460 can include digital logic circuitries, and can be either a standalone application specific integrated circuit (ASIC), a field programmable gate array (FPGA), or a part of computer processor 102. Cache controller 460 may also be capable of executing instructions (e.g., firmware, micro-codes, etc.) to control the read and write operations. The instructions may be stored in a computer readable medium that may be volatile or non-volatile, magnetic, semiconductor, tape, optical, removable, non-removable, etc., and is accessible by cache controller 460. Cache controller 460 may include a set of comparators 462a-d, a write buffer manager 464, and a multiplexor 466. Each of these components can be either hardware circuitries, software components, or a mixture of both.

[046] After receiving a data access request (read or write) from computer processor 102 that includes memory address 480, cache controller 460 can perform read operations to each of tag arrays 412a-d to obtain the stored tag addresses, and compare them against the tag address derived from memory address 480 using comparators 462a-d. The read operations to each of tag arrays 412a-d can be concurrent with and independent from any write operations to tag control buffers 414a-d (e.g., to update LRU information and/or data coherency status). Based on the comparison result, write buffer manager 464 can determine whether there is a cache-hit. If there is a cache-hit, and if the data access request is a write request, write buffer manager 464 can instruct write buffer 444 to store information related to the write request. The information may include, for example, write data 484 associated with memory address 480, which of data arrays 442a-d is to be written with the data, an identifier (e.g., a tag address, a pointer, or any other identifier) for identifying an entry in that data array to receive the write data, and a location within that entry, which typically can be derived from the bit offset included in the memory address. After (or concurrently with) storing the information at the write buffer, cache controller 460 can then process the subsequent request without waiting for the write request to be completed. On the other hand, if the data access request is a read request, data stored in the data entry that corresponds to the matching tag address will be provided to multiplexor 466, which can then extract read data 486 and provide the read data as a response to the read request.
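
The four-way lookup described in this paragraph can be sketched in Python as follows. The per-way comparison plays the role of comparators 462a-d, and selecting the matching way's data plays the role of multiplexor 466; the data structures and function name are assumptions for illustration only.

```python
def lookup(tag_arrays, data_arrays, set_index, req_tag):
    """Four-way set-associative lookup.

    tag_arrays / data_arrays: lists of four per-way arrays (one per way)
    Returns (way, data) on a cache-hit, or (None, None) on a cache-miss.
    """
    for way in range(4):
        # Compare the stored tag in each way against the requested tag
        # (the role of comparators 462a-d, which operate in parallel in hardware).
        if tag_arrays[way][set_index] == req_tag:
            # On a hit, select the matching way's data entry
            # (the role of multiplexor 466).
            return way, data_arrays[way][set_index]
    return None, None  # cache-miss: no way holds the requested tag
```

On a write hit, a controller would instead record the winning way and entry in the write buffer, as described above, rather than returning the data.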

[047] Reference is now made to FIG. 5, which illustrates an exemplary method 500 of operating a cache. The method can be performed by, for example, cache controller 360 of FIG. 3 or cache controller 460 of FIG. 4. The cache can include a tag portion and a data portion, where the data portion includes a data array and a write buffer that are independently accessible. The cache may be, for example, cache 300 of FIG. 3 or cache 400 of FIG. 4.

[048] After an initial start, the controller proceeds to step 502 and receives a write request including write data and a memory address. The write request may come from a computer processor (e.g., computer processor 102 of FIG. 1).

[049] After receiving the write request, the controller performs step 504 and determines a tag address based on the memory address in the write request.

[050] After determining the tag address, the controller then determines whether there is a cache-hit, in step 506. To determine whether there is a cache-hit, the controller may perform read operations to the tag portions to read the stored tag addresses, compare the stored tag addresses against the tag address determined in step 504, and determine a matching tag entry.

[051] If the controller determines there is a cache-miss in step 506 (e.g., based on a determination that the tag address determined in step 504 cannot be found in the tag array), the controller can proceed to step 508 to obtain new data from the memory (e.g., DRAM 132), and proceed to step 509 to determine whether the data portion and the tag portion of the cache are idle. The controller can stay in step 509 (and withhold the write operation) until determining that the data portion and the tag portion of the cache are idle. After the controller determines that the data portion and the tag portion of the cache are idle, the controller can then proceed to step 510 to perform the write operations to the data portion and to the tag portion of the cache. After the write operations complete, the controller proceeds to step 511 to process the next request, and method 500 proceeds to end.
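
The cache-miss path of steps 508 through 511 can be sketched as a short control-flow fragment. The callables passed in (memory fetch, idle check, install, and next-request hooks) are hypothetical stand-ins for hardware behavior, used only to make the step ordering explicit.

```python
def handle_miss(fetch_line, portions_idle, install, process_next):
    """Miss path of method 500, steps 508-511 (hooks are illustrative stubs)."""
    line = fetch_line()          # step 508: obtain new data from memory (e.g., DRAM)
    while not portions_idle():   # step 509: withhold the write until both the
        pass                     #           data portion and tag portion are idle
    install(line)                # step 510: write the data portion and tag portion
    process_next()               # step 511: process the next request; method ends
```

The busy-wait loop models the controller staying in step 509; a hardware implementation would simply hold the write until the arrays are free.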

[052] On the other hand, if the controller determines there is a cache-hit in step 506 after determining that a tag entry stores a tag address that matches the tag address determined in step 504, the controller can determine the cache control information for the matching tag entry in step 512. The cache control information can include LRU information, data coherency status, etc. For example, if the LRU information of the matching tag entry indicates that the tag entry is associated with the least recently accessed data entry in the data array, the controller can update the LRU information to indicate that it is no longer the least recently accessed (rather, it is the most recently accessed). The controller can then proceed to step 514 and perform a write operation to the tag control buffer to store the cache control information.

[053] The controller can then perform a write operation to store the write request information in the write buffer, in step 516. The write request information may include, for example, the write data, and a location of the data array (and which of a plurality of data arrays in cache 400) to be written, etc. After (or concurrent with) performing step 516, the controller can then proceed to step 518 to process the next data request.

[054] The controller can then proceed to step 520 to perform the delayed write request. The delayed write request can be triggered by a preset condition. The preset condition may include, for example, the write buffer being full, the data array being idle, or a subsequent read or write request being directed to the same entry in the data array as the delayed write request.

[055] As is known to those having ordinary skill in the art, the functions of tag control buffer 314 in cache 300 can be implemented by tag array 312. For example, tag array 312 may store the cache control information as well.

[056] Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed here. This application is intended to cover any variations, uses, or adaptations of the invention following the general principles thereof and including such departures from the present disclosure as come within known or customary practice in the art. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.

[057] It will be appreciated that the present invention is not limited to the exact construction that has been described above and illustrated in the accompanying drawings, and that various modifications and changes can be made without departing from the scope thereof. It is intended that the scope of the invention should only be limited by the appended claims.