

Title:
MAIN MEMORY PREFETCH OPERATION AND MULTIPLE PREFETCH OPERATIONS
Document Type and Number:
WIPO Patent Application WO/2016/153545
Kind Code:
A1
Abstract:
Provided is an integrated circuit that includes a first prefetcher component communicatively coupled to a processor and a second prefetcher component communicatively coupled to a memory controller. The first prefetcher component is configured for sending prefetch requests to the memory controller. The second prefetcher component is configured for accessing prefetch data based on the prefetch requests and storing the prefetch data in a prefetch cache of the memory controller.

Inventors:
SVENDSEN KJELD (US)
Application Number:
PCT/US2015/042596
Publication Date:
September 29, 2016
Filing Date:
July 29, 2015
Assignee:
APPLIED MICRO CIRCUITS CORP (US)
International Classes:
G06F12/08; G06F9/30; G06F13/16
Foreign References:
US20110072218A1 2011-03-24
US20120066455A1 2012-03-15
US20140052927A1 2014-02-20
EP2778932A1 2014-09-17
US20080071954A1 2008-03-20
Attorney, Agent or Firm:
TUROCY, GREGORY (US)
Claims:
CLAIMS

What is claimed is:

1. A computer system, comprising:

a first prefetcher component communicatively coupled to a processor, the first prefetcher component configured for sending a prefetch request to a memory controller; and

a second prefetcher component communicatively coupled to the memory controller, the second prefetcher component configured for accessing prefetch data based on the prefetch request and storing the prefetch data in a prefetch cache of the memory controller.

2. The computer system of claim 1, wherein prior to receiving the prefetch request from the first prefetcher component, the second prefetcher component is further configured for retrieving the prefetch data and storing the prefetch data in a memory of the second prefetcher component, wherein the prefetch data is moved from the memory of the second prefetcher component to the prefetch cache of the memory controller based on a receipt of the prefetch request.

3. The computer system of claim 1, wherein the first prefetcher component is further configured for determining a stride prefetch and locking onto access patterns based on the stride prefetch.

4. The computer system of claim 1, wherein the first prefetcher component is further configured for determining a cost of performing a prefetch operation based on the prefetch request, independent of another determination performed by the second prefetcher component.

5. A prefetch method comprising:

receiving, by a slave prefetch component, a prefetch request from a master prefetch component;

accessing, by the slave prefetch component, prefetch data based on the prefetch request; and

storing, by the slave prefetch component, the prefetch data in a prefetch cache of a memory controller.

6. The prefetch method of claim 5, further comprising: prior to receiving the prefetch request from the master prefetch component, retrieving, by the slave prefetch component, the prefetch data based on a previous prefetch hint;

storing, by the slave prefetch component, the prefetch data in a memory of the slave prefetch component; and

after receiving the prefetch request from the master prefetch component, moving, by the slave prefetch component, the prefetch data from the memory of the slave prefetch component to the prefetch cache of the memory controller.

7. The prefetch method of claim 5, further comprising:

determining, by the slave prefetch component, a prefetch stride based on an address hit or an address miss; and

locking onto access patterns, by the slave prefetch component, based on the prefetch stride.

8. The prefetch method of claim 5, further comprising:

independently determining, by the slave prefetch component, a cost of performing a prefetch operation based on the prefetch request.

9. An integrated circuit, comprising:

a first prefetcher component communicatively coupled to a processor, the first prefetcher component configured for sending prefetch requests to a memory controller; and

a second prefetcher component communicatively coupled to the memory controller, the second prefetcher component configured for accessing prefetch data based on at least one prefetch request of the prefetch requests and storing the prefetch data in a prefetch cache of the memory controller.

10. The integrated circuit of claim 9, further comprising:

a third prefetcher component communicatively coupled to another memory controller, the third prefetcher component configured for accessing other prefetch data based on a second prefetch request of the prefetch requests, the third prefetcher component further configured for storing the other prefetch data in another prefetch cache of the other memory controller,

wherein the first prefetcher component is further configured for determining a cost of performing a prefetch operation based on the prefetch requests, independent of another determination performed by the second prefetcher component.

Description:
MAIN MEMORY PREFETCH OPERATION AND

MULTIPLE PREFETCH OPERATIONS

TECHNICAL FIELD

[0001] This disclosure relates to a main memory prefetch operation and multiple prefetch operations.

BACKGROUND

[0002] In electronic circuits, the processing speed of microprocessors tends to be faster than the processing speed of the memory where a program is stored. Thus, the instructions for the program are read at a speed that is slower than the microprocessor speed and the microprocessor has to wait for the instructions to be read.

[0003] In an attempt to alleviate the wait situation, an instruction prefetch (e.g., a prefetch operation) can be used to reduce the amount of time (e.g., a wait state) needed to perform various operations (e.g., load programs, execute a program, and so on). During the prefetch operation, files expected to be needed for the various operations are cached in advance of when each file is expected to be used for the various operations.

[0004] The prefetch operation can occur when a processor requests an instruction from the main memory before the processor needs the instruction. The prefetch operation can be performed because programs are usually executed sequentially and, thus, the instructions can be prefetched in program order. When the pre-requested instruction (e.g., prefetched instruction) is received from the memory, the instruction is placed in a cache. When the processor is ready for the instruction, the processor can access the instruction from the cache. This process is quicker than what would occur if the processor requested the instruction only when the instruction is needed, which would result in high latency while waiting for the memory to return the instruction.

[0005] Prefetching has been performed by a memory controller based prefetcher of a computer system. Thus, the memory controller has to track the address-streams from all the processors in the system. When a new processor is added to the system, the memory controller based prefetcher has to be updated. Further, if the memory controller based prefetcher is split into multiple instances, the address-stream tracking can be complicated and in some instances rendered unworkable.

SUMMARY

[0006] An aspect relates to a system that includes a first prefetcher component communicatively coupled to a processor. The first prefetcher component can be configured for sending a prefetch request to a memory controller. The system can also include a second prefetcher component communicatively coupled to the memory controller. The second prefetcher component can be configured for accessing prefetch data based on the prefetch request and storing the prefetch data in a prefetch cache of the memory controller.

[0007] Another aspect relates to a method that includes receiving, by a slave prefetch component, a prefetch request from a master prefetch component. The method also includes accessing, by the slave prefetch component, prefetch data based on the prefetch request. Further, the method includes storing, by the slave prefetch component, the prefetch data in a prefetch cache of a memory controller.

[0008] Another aspect relates to an integrated circuit that includes a first prefetcher component communicatively coupled to a processor. The first prefetcher component can be configured for sending prefetch requests to a memory controller. The integrated circuit can also include a second prefetcher component communicatively coupled to the memory controller. The second prefetcher component can be configured for accessing prefetch data based on at least one prefetch request of the prefetch requests and storing the prefetch data in a prefetch cache of the memory controller.

BRIEF DESCRIPTION OF THE DRAWINGS

[0009] FIG. 1 illustrates an example main memory prefetch system;

[0010] FIG. 2 illustrates another example system;

[0011] FIG. 3 illustrates an example system configured to conduct prefetches based on prefetch hints;

[0012] FIG. 4 illustrates another example main memory prefetch system;

[0013] FIG. 5 illustrates an example system for prefetch requests;

[0014] FIG. 6 illustrates an example method for a main memory prefetching scheme;

[0015] FIG. 7 illustrates an example method for pre-accessing prefetch data; and

[0016] FIG. 8 illustrates a block diagram of an example data communication network.

DETAILED DESCRIPTION

[0017] Various aspects discussed herein relate to main memory prefetching. The terms "prefetching," "prefetch," and variants thereof, as used herein, refer to an instruction prefetch or a prefetch operation. The various disclosed aspects can be utilized to overcome various challenges associated with prefetchers used with traditional systems and methods. For example, some methods employ a memory controller based prefetcher and, therefore, the prefetcher for a memory controller cache is solely located in the memory controller. There are a number of challenges associated with this type of prefetching scheme. For example, in a multi-processor system the memory controller prefetcher is required to monitor (e.g., keep track of) address-streams from all processors in the system. Such an approach to prefetch operations can encounter a severe design complication that can negatively impact logic complexity, design frequency of operation, area consumption, and/or power consumption.

[0018] Another challenge of a multi-processor system can be that if the number of processors is increased (by allocation or new design), the memory controller prefetcher requires a design update to accommodate the processor count increase. This is a design complication that can cause the design not to be readily scalable.

[0019] Yet another challenge of multi-processor systems can be that if the memory controller is split into multiple instances, the prefetcher address-pattern tracking can be complicated. In the worst case, the prefetcher address-pattern tracking can be rendered non-workable, which renders the prefetcher useless. Further, if the memory controller is split into multiple instances and address-hashing is used to determine the (memory controller) destination of the memory request, prefetcher address-tracking can be essentially rendered impossible.

[0020] An alternative approach of traditional systems for performing the prefetch operations is to place the memory controller prefetcher solely in the processor. This can alleviate the above-mentioned challenges associated with some prefetchers, but presents a different set of issues.

[0021] For example, a challenge with such systems can be that a processor based memory controller prefetcher might not be able to determine the constraints of the memory accessed. This can make it difficult (if not impossible) to determine whether to prefetch (or not) from the memory controller perspective, which can lead to overall performance degradation.

[0022] Further, without a prefetcher in the memory controller, the memory controller cannot be directed (e.g., instructed) to conduct prefetches beyond the requests from the processor. This can impact bus/interconnect bandwidth since all prefetch requests must be sent separately from regular loads over the processor-memory controller interconnect.

[0023] Therefore, an aspect disclosed herein relates to main memory prefetching that can be performed in a simple and scalable manner. Further, the disclosed aspects can simplify overall system and subcomponent design, which can enable a more efficient system. Such efficiency can be realized in terms of throughput, latency, frequency of operation, and power consumption.

[0024] According to an implementation, various aspects relate to attaching prefetch hints to prefetch requests. The hints can allow a memory subsystem prefetcher to conduct prefetches beyond a simple prefetch. For example, the prefetch hint information can be provided at substantially the same time as the prefetch request. The prefetch hint information can indicate a count and a stride for additional prefetches. A stride prefetch (or simply stride) can be used to eliminate a compulsory miss and/or a capacity miss. Thus, the stride prefetch can be used to predict a next address to prefetch. For example, if a memory address is missed, an address that is offset by a defined distance from the missed address is likely to be missed in the future. According to some implementations, when there is a hit to an address, an address that is offset by a distance from the hit address is also prefetched. These types of prediction can be useful for stride access. The count relates to the number of loads ahead to fetch (e.g., two loads ahead, three loads ahead, five loads ahead, six loads ahead, and so on). Thus, a determined number of loads, in addition to a current load, can be prefetched.
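By way of a non-limiting illustration, the following C sketch shows how a hint carrying a stride N and a count m expands into the predicted addresses A+N*k for k = 1..m. The structure layout, field names, and 64-byte line size are assumptions for illustration, not taken from the disclosure.

#include <stdint.h>
#include <stdio.h>

#define CACHE_LINE 64

typedef struct {
    uint64_t base;   /* address of the triggering load          */
    int64_t  stride; /* N: byte offset between successive loads */
    unsigned count;  /* m: how many loads ahead to prefetch     */
} prefetch_hint;

/* Compute the m addresses implied by the hint: base + N*k, k = 1..m. */
static void expand_hint(const prefetch_hint *h, uint64_t *out)
{
    for (unsigned k = 1; k <= h->count; k++)
        out[k - 1] = h->base + (uint64_t)(h->stride * (int64_t)k);
}

int main(void)
{
    prefetch_hint h = { 0x1000, CACHE_LINE, 3 };
    uint64_t addrs[3];
    expand_hint(&h, addrs);
    for (unsigned k = 0; k < 3; k++)
        printf("prefetch 0x%llx\n", (unsigned long long)addrs[k]);
    /* prints 0x1040, 0x1080, 0x10c0 */
    return 0;
}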

[0025] The utilization of prefetch hints can consume a lower amount of request bandwidth (as compared to not using prefetch hints) on a global interconnect system. Additionally, multiple prefetch operations can be specified in a single operation, further reducing the amount of request bandwidth that needs to be utilized.

[0026] Various aspects also relate to fusing loads (e.g., reads) to a memory subsystem with prefetches in a single operation, which can enable prefetch request generation to be performed in a request agent. Further, fusing loads can simplify system-wide prefetching in a multi-processor system and can utilize lower request bandwidth (as compared to not fusing loads) in distributed main memory subsystems. Thus, a lower request bandwidth is needed on a global interconnect system since load and multiple prefetch operations can be specified in a single operation. Further, this can allow hash based (or another type of) distributed memory controllers on multiple buses. This is because a local memory subsystem unit might not see stride patterns since the prefetcher can reside in a request agent.

[0027] FIG. 1 illustrates an example main memory prefetch system 100. The main memory prefetch system 100 can be configured to provide a simplified prefetch architecture for main memory as compared to traditional prefetcher systems. Further, the main memory prefetch system 100 can be scalable such that any number of processors can be added to (or removed from) the main memory prefetch system 100.

[0028] Loads (e.g., data fetches) from slower memories (e.g., double data rate (DDR) memory, disks, and so on) tend to be spatial (e.g., closely related in space) and temporal in nature. Prefetching attempts to take advantage of the spatial nature of loads (e.g., loads closely related in (address) space). Thus, for address streams there is a likelihood that a load to memory address A is followed by a load to address A+N, where N is a stride and can be any arbitrary integer and, thus, by induction, loads are likely to occur in strides to addresses A+N*m, where m represents the number of loads ahead and is an arbitrary signed integer.

[0029] Further, loads to slower memories tend to incur a significant latency from the requesting agent to the memory agent, and loads consume bandwidth on the request bus interconnect structure. Requesting agents (e.g., processors (CPUs)) may attempt to circumvent the long latencies incurred by the interconnect structure and the slowness (long latency access) of the memories by using prefetching (e.g., issuing loads speculatively in advance so the data is made available earlier than if the data were requested only when actually needed). However, this prefetch data is usually stored in the request agent (e.g., in a cache structure), which can lead to resource conflicts and upsets (e.g., cache thrashing), mainly due to limitations of such (e.g., CPU) cache structures in terms of capacity. This can effectively limit the amount of prefetching that can be performed, since too much prefetching can lead to unnecessary data being prefetched and, thus, agent cache thrashing.

[0030] Thus, a larger cache structure can be implemented in the memory subsystem (memory controller). With a prefetch cache in the memory controller, a challenge is how to design the prefetcher for this cache. As discussed above, several problems arise in this context.

[0031] To overcome the above as well as other issues associated with traditional systems, the main memory prefetch system 100 can be configured such that a prefetcher is divided into a master-slave arrangement. For example, a prefetcher can be split into two (or more) sections or components, namely a master prefetch component 102 and a slave prefetch component 104. The master prefetch component 102 can be communicatively coupled to a processor 106. Further, the slave prefetch component 104 can be communicatively coupled to a memory controller 108. Coupling can include various communications including, but not limited to, direct communications, indirect communications, wired communications, and/or wireless communications.

[0032] Thus, a first functionality associated with a prefetch operation can be performed by the master prefetch component 102 and a second functionality associated with the prefetch operation can be performed by the slave prefetch component 104 (and/or one or more other slave prefetch components).

[0033] The master prefetch component 102 can be configured to send prefetch requests to the memory controller 108 (e.g., to the slave prefetch component 104). According to some aspects, the master prefetch component 102 can be configured to determine a stride and lock onto access patterns for the prefetch operation.

[0034] The slave prefetch component 104 can use the request from the master prefetch component 102 to access a memory of the slave prefetch component 104 (or the memory controller 108, or another system component) to obtain the prefetch data. Once accessed, the prefetch data can be placed in a prefetch cache 110 of the memory controller 108.

[0035] It is noted that although a single master prefetch component 102 and a single slave prefetch component 104 are illustrated and described, the prefetcher can be divided into any number of slave prefetchers. For example, a single master prefetch component 102 can interface with multiple slave prefetchers (e.g., the slave prefetch component 104, a third slave prefetcher component, a fourth slave prefetcher component, a fifth slave prefetcher component, and so on). Each subsequent slave prefetcher component can be communicatively coupled to respective subsequent memory controllers. Further, for implementations that employ more than one slave prefetcher, respective functionalities associated with a prefetch operation can be performed by one or more of the subsequent slave prefetchers.
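As a rough, hypothetical model of this master/slave division (the disclosure does not prescribe any particular data layout; all names below are invented), request generation lives with the processor-side master while the cache fill lives with each memory-controller-side slave, so adding a processor adds only a master and adding a memory controller adds only a slave:

#include <stdint.h>
#include <stddef.h>

typedef struct { uint64_t addr; } prefetch_request;

typedef struct {
    uint64_t tags[256];       /* assumed direct-mapped prefetch cache */
    uint8_t  lines[256][64];  /* cached data lines                    */
} prefetch_cache;

typedef struct {
    prefetch_cache *cache;    /* the prefetch cache inside its MC     */
} slave_prefetcher;

/* Slave side: obtain the line (DRAM access stubbed out) and install
 * it in the memory controller's prefetch cache. */
static void slave_service(slave_prefetcher *s, prefetch_request req)
{
    size_t idx = (req.addr / 64) % 256;
    s->cache->tags[idx] = req.addr / 64;
    /* dram_read(req.addr, s->cache->lines[idx]); -- stubbed backend */
}

/* Master side: observe a processor load and emit one request toward
 * the slave that owns the address; the call models the interconnect. */
static void master_on_load(uint64_t load_addr, slave_prefetcher *s)
{
    prefetch_request req = { load_addr + 64 }; /* next sequential line */
    slave_service(s, req);
}

int main(void)
{
    static prefetch_cache pc;          /* zero-initialized             */
    slave_prefetcher slave = { &pc };
    master_on_load(0x2000, &slave);    /* installs the line for 0x2040 */
    return 0;
}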

[0036] FIG. 2 illustrates another example system 200. The system 200 can be configured to perform main memory prefetching in a simple and scalable manner. Further, the system 200 can be configured to simplify overall system and subcomponent design, which can enable a more efficient system in terms of throughput, latency, frequency of operation, power, and so forth.

[0037] Included in the system 200 can be a first prefetcher component 202 (e.g., the master prefetch component 102 of FIG. 1) and a second prefetcher component 204 (e.g., the slave prefetch component 104 of FIG. 1). The first prefetcher component 202 and at least the second prefetcher component 204 can be individual components that are formed based on separation (or division) of a prefetcher into two or more components. For example, the first prefetcher component 202 can be communicatively coupled to a processor 206 and the second prefetcher component 204 can be communicatively coupled to a memory controller 208. It is noted that although not shown or described, additional prefetcher components, communicatively coupled to respective additional memory controllers, can be included in the system 200. However, these additional prefetcher components and memory controllers are not shown or described for purposes of simplicity.

[0038] The first prefetcher component 202 can be configured to send one or more prefetch requests 210 to the memory controller 208. The one or more prefetch requests 210 can be instruction prefetches. The one or more prefetch requests 210 can be sent at different times, depending on the instructions that are expected to be needed in the future for processing.

[0039] The second prefetcher component 204 can be configured to access a first set of prefetch data 212 based, at least in part, on the one or more prefetch requests 210. The first set of prefetch data 212 retrieved by the second prefetcher component 204 can be stored in a prefetch cache 214 of the memory controller 208.

[0040] According to an implementation, the second prefetcher component 204 can be configured to retrieve the first set of prefetch data 212 prior to receiving a prefetch request from the first prefetcher component 202. Further to this implementation, the first set of prefetch data 212 can be stored in a memory 216 of the second prefetcher component 204. At about the same time as a prefetch request for the first set of prefetch data 212 is received from the first prefetcher component 202, the first set of prefetch data 212 can be moved from the memory 216 to the prefetch cache 214.

[0041] FIG. 3 illustrates an example system 300 configured to conduct prefetches based on prefetch hints. For example, system 300 can be configured to perform multiple operation prefetches. The system 300 can be configured such that a larger cache structure can be implemented in a memory subsystem. Further, the system 300 can be configured to overcome various challenges associated with traditional prefetch operations.

[0042] According to some methods, a requesting agent issues separate loads and prefetches. These separate loads and prefetches can consume extra (e.g., around twice as much) request bandwidth on the interconnect structure as compared to one or more of the aspects disclosed herein.

[0043] According to another method, the prefetcher is located in the memory subsystem, which complicates the prefetcher in multi-processor systems because it needs support for each request agent and can cause lack of immediate scalability. Further, the prefetcher located in the memory subsystem can effectively prevent a distributed memory subsystem approach, which can complicate the memory subsystem design in high-performance systems.

[0044] Therefore, in accordance with an aspect discussed herein, a prefetch component for a memory controller cache can be split between a central processing unit and one or more memory controllers, with the main prefetching capability located in the central processing unit. Thus, the system 300 can require lower request bandwidth on a global interconnect system, as compared to other systems. The lower request bandwidth can be facilitated by the system 300 based on specifying multiple prefetch operations in a single operation.

[0045] The system 300 can include a first prefetcher component 302 communicatively coupled to a processor 304. The first prefetcher component 302 can be configured to send a prefetch request 306 (or multiple prefetch requests) to a memory controller 308. A second prefetcher component 310 can be communicatively coupled to the memory controller 308. The second prefetcher component 310 can be configured to access prefetch data 312 based on the prefetch request 306. Further, the second prefetcher component 310 can be configured to store the prefetch data 312 in a prefetch cache 314 of the memory controller 308.

[0046] According to an implementation, the first prefetcher component 302 can include an indication component 316 that can be configured to provide hint information 318 that can be utilized by the second prefetcher component 310. The hint information 318 can comprise stride information and data related to how many loads ahead the second prefetcher component 310 should prefetch. According to some implementations, the hint information 318 is attached to the prefetch request 306.

[0047] The second prefetcher component 310 can be configured to conduct prefetches based on the hint information 318 (e.g., prefetch hints) attached to regular memory loads from the processor 304, according to an implementation. This can save separate prefetch requests from the CPU(s) (e.g., the processor 304), which can save request bandwidth on the CPU-MC (central processing unit - memory controller) interconnect leading to overall better system performance.

[0048] The pre-prefetched data retrieved based on the prefetch hint(s) can be stored in a memory 320 of the second prefetcher component 310. When a subsequent prefetch request is received from the first prefetcher component 302, the pre-prefetched data can be moved from the memory 320 to the prefetch cache 314.
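A minimal sketch of this park-then-promote behavior follows, under assumed names: a slave_buffer stands in for the memory 320, and a caller-supplied line stands in for an entry of the prefetch cache 314.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define LINE  64
#define SLOTS 32

/* Stand-in for the memory 320 of the second prefetcher component. */
typedef struct {
    bool     valid[SLOTS];
    uint64_t tag[SLOTS];
    uint8_t  data[SLOTS][LINE];
} slave_buffer;

/* On a matching prefetch request, move the parked line into the
 * memory controller's prefetch cache (modeled as a caller buffer). */
bool promote_on_request(slave_buffer *buf, uint64_t addr,
                        uint8_t cache_line[LINE])
{
    for (int i = 0; i < SLOTS; i++) {
        if (buf->valid[i] && buf->tag[i] == addr / LINE) {
            memcpy(cache_line, buf->data[i], LINE);
            buf->valid[i] = false; /* slot freed after the move */
            return true;
        }
    }
    return false; /* not pre-prefetched; access memory instead */
}

int main(void)
{
    static slave_buffer buf;
    uint8_t line[LINE];
    buf.valid[0] = true;
    buf.tag[0]   = 0x3000 / LINE;   /* parked earlier on a hint */
    printf("hit: %d\n", promote_on_request(&buf, 0x3000, line));
    return 0;
}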

[0049] The CPU can issue loads and specific prefetch requests with hint information to the memory controller prefetcher (e.g., the second prefetcher component 310). These hints can include stride (N), and for how many loads ahead (m). As discussed above, a stride prefetch (or simply stride) can be used to eliminate a compulsory miss and/or a capacity miss. Thus, the stride prefetch can be used to predict a next address to prefetch. For example, if a memory address is missed, an address that is offset by a defined distance from the missed address is likely to be missed in the future. According to some implementations, when there is a hit to an address, an address that is offset by a distance from the hit address is also prefetched. These types of prediction can be useful for stride access. The count relates to the number of loads ahead to fetch (e.g., two loads ahead, three loads ahead, five loads ahead, six loads ahead, and so on). Thus, a determined number of loads, in addition to a current load, can be prefetched. This can lower the number of loads and prefetches that need to be sent over the interconnect. Further, this can lower request bandwidth usage.

[0050] FIG. 4 illustrates another example main memory prefetch system 400. Main memory prefetch system 400 can be configured as a main memory prefetching scheme. The various aspects discussed herein can alleviate the complications of scaling to additional (e.g., added) processors and/or memory controllers, and the complication resulting from the need for the (memory controller) prefetcher to track address-streams from multiple processors. Further, the various aspects can also accommodate a reduction in the number of processors and/or memory controllers.

[0051] The main memory prefetch system 400 includes a first prefetch component 402 (e.g., master prefetch component 102 of FIG. 1) communicatively coupled to a processor 404. The main memory prefetch system 400 also includes a second prefetch component 406 (e.g., slave prefetch component 104 of FIG. 1) communicatively coupled to a first memory controller 408. Further, the main memory prefetch system 400 can include a third prefetch component 410 communicatively coupled to a second memory controller 412. In addition, any number of additional prefetch components (e.g., slave prefetch components) can be included in the main memory prefetch system 400, illustrated as an N prefetch component 414 communicatively coupled to an N memory controller 416, where N is an integer.

[0052] It is noted that although the following is described with respect to the first prefetch component 402 and the second prefetch component 406, such aspects can additionally or alternatively apply to the third prefetch component 410 and subsequent prefetch components (N prefetch component 414).

[0053] The first prefetch component 402 can be configured to send a first prefetch request 418 to the first memory controller 408 (e.g., the second prefetch component 406). As discussed, in other implementations, the first prefetch request 418 can be sent to the second memory controller 412 (e.g., the third prefetch component 410) and/or subsequent memory controllers (e.g., the second memory controller 412 and/or the N memory controller 416).

[0054] The second prefetch component 406, upon receiving the first prefetch request 418 from the first prefetch component 402, can be configured to access a first set of prefetch data 420. Further, the second prefetch component 406 can be configured to store the first set of prefetch data 420 in a cache 422 (e.g., prefetch cache 110) of the first memory controller 408.

[0055] According to some implementations, the second prefetch component 406 can be configured to retrieve the first set of prefetch data 420 prior to receiving the first prefetch request 418 from the first prefetch component 402. Further, the second prefetch component 406 can be configured to store the first set of prefetch data 420 in a memory 424 of the second prefetch component 406. Thereafter, when at least a portion of the first set of prefetch data 420 is requested in a subsequent prefetch request, the portion of the first set of prefetch data can be moved from the memory 424 to the cache 422, according to an aspect.

[0056] The first prefetch component 402 can include a stride component 426 that can be configured to determine a stride prefetch. For example, the stride prefetch can be a type of prediction wherein, if a memory address is missed, an address that is offset by a distance from the missed address is likely to be missed in the near future. Thus, the stride prefetch can be used to mitigate compulsory/capacity misses. Further, the stride prefetch can be utilized such that when there is a hit in the buffer, an address that is offset by a distance from the hit address is also prefetched. The first prefetch component 402 can also be configured to lock onto access patterns.
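One plausible shape for such a stride component, offered purely as an illustrative assumption rather than the disclosed design, is a detector that compares successive load addresses and locks onto the pattern once the same delta repeats a threshold number of times:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define LOCK_THRESHOLD 2   /* assumed: two repeated deltas lock on */

typedef struct {
    uint64_t last_addr;
    int64_t  stride;
    unsigned confidence;   /* consecutive matching deltas */
} stride_detector;

/* Feed one load address; returns true (and a prediction) once locked. */
bool observe_load(stride_detector *d, uint64_t addr, uint64_t *predicted)
{
    int64_t delta = (int64_t)(addr - d->last_addr);
    if (delta == d->stride) {
        d->confidence++;
    } else {               /* pattern broke: re-train on the new delta */
        d->stride = delta;
        d->confidence = 0;
    }
    d->last_addr = addr;
    if (d->confidence >= LOCK_THRESHOLD) {
        *predicted = addr + (uint64_t)d->stride;
        return true;
    }
    return false;
}

int main(void)
{
    stride_detector d = {0};
    uint64_t p;
    uint64_t loads[] = { 0x1000, 0x1040, 0x1080, 0x10c0 };
    for (int i = 0; i < 4; i++)
        if (observe_load(&d, loads[i], &p))
            printf("locked, prefetch 0x%llx\n", (unsigned long long)p);
    return 0; /* locks on the fourth load and predicts 0x1100 */
}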

[0057] According to some implementations, the third prefetch component 410 can be configured to receive a second prefetch request 428 from the first prefetch component 402. The third prefetch component 410 can be configured to retrieve a second set of prefetch data 430 and store the second set of prefetch data 430 in a cache 432 (e.g., prefetch cache) of the second memory controller 412.

[0058] In accordance with an implementation, the third prefetch component 410 can be configured to retrieve the second set of prefetch data 430 prior to receiving the second prefetch request 428. The second set of prefetch data 430 can be stored in a memory 433 of the third prefetch component 410.

[0059] At least a subset of the second set of prefetch data 430 stored in the memory 433 can be moved from the memory 433 to the cache 432 when a subsequent prefetch request is received from the first prefetch component 402.

[0060] Further, subsequent prefetch requests 434 can be received at the N memory controller 416 from the first prefetch component 402. The N prefetch component 414 can be configured to retrieve an N set of prefetch data 436 and store the N set of prefetch data 436 in an N cache 438 (e.g., prefetch cache) of the N memory controller 416.

[0061] In accordance with an implementation, the N prefetch component 414 can be configured to retrieve the N set of prefetch data 436 prior to receiving the subsequent prefetch requests 434 from the first prefetch component 402. The N set of prefetch data 436 can be stored in an N memory 440 of the N prefetch component 414. When a request for at least a subset of the N set of prefetch data 436 is received in another prefetch request, the subset can be moved from the N memory 440 to the N cache 438.

[0062] Thus, the main memory prefetch system 400, through utilization of prefetch hints, can consume a lower amount of request bandwidth (as compared to not using prefetch hints) on a global interconnect system. Additionally, multiple prefetch operations can be specified in a single operation, further reducing the amount of request bandwidth that needs to be utilized.

[0063] FIG. 5 illustrates an example system 500 for prefetch requests. The system 500 can be configured such that a larger cache structure can be implemented in the memory subsystem. A prefetcher could also be implemented at the memory subsystem level. However, if the memory subsystem is distributed, which can be done for numerous reasons (e.g., capacity, bandwidth, reliability, availability and serviceability (RAS)), a local memory subsystem prefetcher may not be able to determine the agent stride patterns. Thus, the prefetcher would have to reside in the request agent. With this implementation, regular loads, as well as prefetch loads, would have to traverse the interconnect. This consumes precious bandwidth, and can effectively consume about twice as much bandwidth as would be consumed if no prefetch was used.

[0064] To resolve the above noted deficiencies, system 500 can be configured to fuse regular loads with prefetch information. In this manner, only a single load might be issued on the interconnect. Further, the load can also hold prefetch information (e.g., whether prefetching should be initiated or not (hint)). If a prefetch should be initiated, information related to the stride (N) and the number of loads ahead (m) can be provided. Other hints could also be considered, according to an aspect.
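A hypothetical sketch of such a fused message follows: one record carries the demand load together with the optional hint (initiate, stride N, loads ahead m), and the memory-controller side expands the hint locally so the m prefetches generate no additional interconnect traffic. The mc_read/mc_prefetch stubs and all field names are inventions for illustration:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* One interconnect message: a demand load fused with a prefetch hint. */
typedef struct {
    uint64_t load_addr;   /* the regular load                  */
    bool     prefetch;    /* hint: initiate prefetching or not */
    int64_t  stride;      /* N                                 */
    unsigned loads_ahead; /* m                                 */
} fused_request;

/* Stubs standing in for the memory controller's actual accesses. */
static void mc_read(uint64_t a)     { printf("load     0x%llx\n", (unsigned long long)a); }
static void mc_prefetch(uint64_t a) { printf("prefetch 0x%llx\n", (unsigned long long)a); }

/* Slave side: one message in, the demand load plus m prefetches out,
 * with no further interconnect traffic for the prefetches. */
void service_fused(const fused_request *r)
{
    mc_read(r->load_addr);
    if (r->prefetch)
        for (unsigned k = 1; k <= r->loads_ahead; k++)
            mc_prefetch(r->load_addr + (uint64_t)(r->stride * (int64_t)k));
}

int main(void)
{
    fused_request r = { 0x1000, true, 64, 2 };
    service_fused(&r); /* load 0x1000, prefetch 0x1040 and 0x1080 */
    return 0;
}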

[0065] For example, when a prefetch operation is performed, it may be determined that data is needed for a first address and, therefore, data will most likely be needed for a next address. Therefore, a prefetch hint can be added that indicates to prefetch the load address and the next address (as well as subsequent addresses). This can lower the bandwidth consumption on an interconnect because, instead of separating the requests into separate operations, the prefetch requests (e.g., prefetch load address, next address, subsequent address) are fused into a single operation and, further, the prefetched data is stored in a low energy cache.

[0066] System 500 can include a first prefetcher component 502 (e.g., master prefetch component 102 of FIG. 1, first prefetcher component 202 of FIG. 2, and so on) and at least a second prefetcher component 504 (e.g., slave prefetch component 104 of FIG. 1, second prefetcher component 204 of FIG. 2, and so on).

[0067] The first prefetcher component 502 can be communicatively coupled to a processing component 506; the second prefetcher component 504 can be communicatively coupled to at least one memory controller 508. Further, as previously discussed, additional second prefetcher components (e.g., more than one slave prefetch component) can be included in the system 500. However, for purposes of simplicity, only one slave prefetch component (e.g., the second prefetcher component 504) is illustrated.

[0068] The first prefetcher component 502 can be configured to transmit a prefetch request to the memory controller 508 (e.g., the second prefetcher component 504). Since prefetches are speculative and it is not guaranteed that a set (or subset) of data will be needed, according to an aspect, both the first prefetcher component 502 and the second prefetcher component 504 can be configured to independently determine the cost of performing a prefetch operation.

[0069] For example, the first prefetcher component 502 (e.g., the CPU or the processing component 506) can attempt to optimize a prefetch request based on the CPU resources, the cost of sending requests over the CPU-MC interconnect, or combinations thereof.

[0070] The second prefetcher component 504 can be configured to determine the cost of prefetching to a local cache (e.g., a cache of the slave prefetch component). The determination as to the cost of the prefetch (as well as whether to perform the prefetch) can be based on memory controller cache capacity, cache fullness/use, and (DDR) memory constraints. For example, the DDR memory constraints can include the cost of accessing open pages and/or closed pages.

[0071] For example, the first prefetcher component 502 can include a first cost manager component 510 and the second prefetcher component 504 can include a second cost manager component 512. The first cost manager component 510 and the second cost manager component 512 can independently determine a cost of performing the prefetch.

[0072] For example, the first cost manager component 510 can be configured to determine a cost of performing the prefetch independent of another determination performed by the second cost manager component 512. In another example, the second cost manager component 512 can be configured to determine a cost of performing the prefetch independent of another determination performed by the first cost manager component 510.

[0073] According to an implementation, the first prefetcher component 502 can be configured to determine the prefetch request based on local resources and a cost of sending the prefetch request over an interconnect. For example, the first cost manager component 510 can determine the cost of sending the prefetch request over an interconnect.

[0074] According to another implementation, the second prefetcher component 504 (or the second cost manager component 512) can be configured to determine a cost of prefetching to a local cache associated with the memory controller. The cost can be determined by the second cost manager component 512 based on a cache capacity of the memory controller, a remaining capacity of the local cache, a memory constraint, or a combination thereof.
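As a purely hypothetical illustration of that slave-side decision, a scoring function might weigh prefetch-cache fullness against the DDR page state and decline a prefetch whose combined cost exceeds a budget; every weight and threshold below is invented:

#include <stdbool.h>
#include <stdio.h>

typedef struct {
    unsigned used_lines;   /* prefetch cache lines in use         */
    unsigned total_lines;  /* prefetch cache capacity             */
    bool     page_open;    /* is the target DDR row already open? */
} mc_state;

/* Accept the prefetch only while the estimated cost stays in budget. */
bool should_prefetch(const mc_state *s)
{
    unsigned cost = (100 * s->used_lines) / s->total_lines; /* fullness % */
    cost += s->page_open ? 10 : 40; /* closed row adds an activate cost   */
    return cost < 90;               /* assumed budget                     */
}

int main(void)
{
    mc_state nearly_full = { 60, 64, false };
    mc_state mostly_free = {  8, 64, true  };
    printf("%d %d\n", should_prefetch(&nearly_full),
                      should_prefetch(&mostly_free));
    /* prints "0 1": skip when full with a closed row, accept otherwise */
    return 0;
}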

[0075] In view of the example systems shown and described herein, methods that may be implemented in accordance with one or more of the disclosed aspects will be better understood with reference to the following flow charts. While, for purposes of simplicity of explanation, the methods are shown and described as a series of blocks, the disclosed aspects are not limited by the number or order of blocks, as some blocks may occur in different orders and/or at substantially the same time as other blocks from what is depicted and described herein. Moreover, not all illustrated blocks may be required to implement the methods described hereinafter. The functionality associated with the blocks may be implemented by software, hardware, a combination thereof, or any other suitable means (e.g., device, system, process, component). Additionally, it is also noted that the methods disclosed hereinafter and throughout this specification are capable of being stored on an article of manufacture to facilitate transporting and transferring such methodologies to various devices. Those skilled in the art will understand that a method could alternatively be represented as a series of interrelated states or events, such as in a state diagram. The various methods disclosed herein can be performed by a system comprising at least one processor.

[0076] FIG. 6 illustrates an example method 600 for a main memory prefetching scheme. The method 600 can be implemented by an integrated circuit, according to an embodiment. Method 600 starts, at 602, when a prefetch request is received. According to an implementation, the prefetch request can be received at a slave prefetch component. Further, the prefetch request could have been sent by a master prefetch component. The prefetch request can include a set of instruction prefetches. More than one prefetch request can be sent at substantially the same time, or at different times.

[0077] At 604, prefetch data is accessed based on the prefetch request. For example, a memory of the slave prefetch component may be accessed to obtain the prefetch data. According to an implementation, at least a set of the prefetch data is retrieved prior to receiving the prefetch request.

[0078] At 606, the prefetch data is stored in a prefetch cache of a memory controller. For example, the memory controller can be associated with the slave prefetch component. In the implementation where at least a set of the prefetch data is retrieved prior to receiving the prefetch request, the prefetch data can be stored in a memory of the slave prefetcher component.

[0079] In an implementation, the method 600 can include determining a prefetch stride and locking onto access patterns. For example, the prefetch stride can be determined based on an address hit or an address miss. In another implementation, hint information can be associated with the prefetch requests. For example, the hint information can include stride information and data related to the number of loads ahead to fetch.

[0080] According to some implementations, the method 600 can include independently determining a cost of performing the prefetch. For example, the method 600 can include determining the prefetch request based on local resources and a cost of sending the prefetch request over an interconnect. In another example, the method can include determining a cost of prefetching to a local cache associated with the memory controller based on a cache capacity of the memory controller, a remaining capacity of the local cache, a memory constraint, or a combination thereof.

[0081] FIG. 7 illustrates an example method 700 for pre-accessing prefetch data. At 702, prefetch data is retrieved. The prefetch data can be retrieved, in advance, based on a previously received prefetch instruction. For example, the previous prefetch instruction can include prefetch hint data or prefetch hint information. The prefetch hint information can comprise stride information and data related to a number of reads to fetch in addition to a current read. In another example, the previous prefetch instruction can include fused instructions (e.g., retrieve data at this address, retrieve data at a next address, and retrieve data at one or more subsequent addresses).

[0082] At 704, the prefetch data is stored in a memory of a prefetcher component. For example, the memory of the prefetcher component can be a temporary storage area for the prefetch data. If the prefetch data is not requested in a subsequent prefetch instruction, the prefetch data can be removed (e.g., deleted, evicted) from the memory of the prefetcher component.

[0083] A (subsequent) prefetch request can be received from a memory controller, at 706. Based on the prefetch request, the prefetch data can be retrieved from the memory of the prefetcher component, at 708, and moved from the memory to a prefetch cache of the memory controller, at 710.

[0084] Aspects can relate to an integrated circuit that can include a first prefetcher component and a second prefetcher component. The first prefetcher component can be communicatively coupled to a processor. Further, the first prefetcher component can be configured for sending prefetch requests to a memory controller. The second prefetcher component can be communicatively coupled to the memory controller. Further, the second prefetcher component can be configured for accessing prefetch data based on at least one prefetch request of the prefetch requests and storing the prefetch data in a prefetch cache of the memory controller.

[0085] Further to this aspect, the integrated circuit can include a third prefetcher component communicatively coupled to another memory controller. The third prefetcher component can be configured for accessing other prefetch data based on a second prefetch request of the prefetch requests. In addition, the third prefetcher component can be configured for storing the other prefetch data in another prefetch cache of the other memory controller.

[0086] The first prefetcher component can be further configured for determining a cost of performing a prefetch operation based on the prefetch requests independent of another determination performed by the second prefetcher component.

[0087] The aspects disclosed herein provide main memory prefetching that can be performed in a simple and scalable manner. Further, the disclosed aspects can simplify overall system and subcomponent design, which can enable a more efficient system. Such efficiency can be realized in terms of throughput, latency, frequency of operation, and power consumption.

[0088] The techniques described herein can be applied to any device and/or network that utilizes a processor, such as a central processing unit, and one or more memory controllers. Handheld, portable and other computing devices and computing objects of all kinds can be used in connection with the various embodiments, e.g., anywhere that a device may wish to implement a prefetch instruction or prefetch operation for a multiprocessor system. Accordingly, the general purpose remote computer described below in FIG. 8 is one example, and the disclosed subject matter can be implemented with any client having network/bus interoperability and interaction.

[0089] FIG. 8 illustrates a block diagram of an example electronic computing environment that can be implemented to facilitate a main memory prefetch operation and/or multiple prefetch operations. FIG. 8 illustrates an example of a computing system environment 800 in which some aspects of the disclosed subject matter can be implemented, although the computing system environment 800 is one example of a computing environment for a device and is not intended to suggest any limitation as to the scope of use or functionality of the disclosed subject matter.

[0090] An exemplary device for implementing the disclosed subject matter, as illustrated in FIG. 8, includes a general-purpose computing device in the form of a computer 802. Components of computer 802 may include a processing unit 804, a memory 806, and a system bus 808 that couples various system components including the system memory to the processing unit 804.

[0091] Computer 802 includes a variety of computer readable media. The memory 806 may include computer storage media in the form of volatile and/or nonvolatile memory such as ROM and/or RAM. The computer 802 may also include other removable/non-removable, volatile/nonvolatile computer storage media.

[0092] The computer 802 can operate in a networked or distributed environment using logical connections to one or more other remote computer(s), such as remote computer 870, which can in turn have media capabilities different from computer 802. The logical connections depicted in FIG. 8 include a network 816, such as a local area network (LAN) or a wide area network (WAN), but can also include other networks/buses, either wired or wireless. When used in a LAN networking environment, the computer 802 can be connected to the LAN through a network interface 818 or adapter. When used in a WAN networking environment, the computer 802 can typically include a communications component, such as a modem, or other means for establishing communications over the WAN, such as the Internet.

[0093] The disclosed subject matter can be implemented as a method, apparatus, or article of manufacture using typical manufacturing, programming or engineering techniques to produce hardware, firmware, software, or any suitable combination thereof to control an electronic device to implement the disclosed subject matter. Computer-readable media can include hardware media, software media, non-transitory media, or transport media.