

Title:
METHOD AND APPARATUS FOR MANAGING A RANDOM ARRAY OF INDEPENDENT DISKS (RAID)
Document Type and Number:
WIPO Patent Application WO/2011/040928
Kind Code:
A1
Abstract:
A Direct Memory Access (DMA) controller that supports both RAID operations and non-RAID operations is provided. A chain of descriptors includes RAID descriptors describing RAID logical operations and non-RAID descriptors describing non-RAID operations. A data transfer of a data block having a transfer size greater than the transfer size supported by the DMA controller is processed by recycling an initial descriptor and generating a single interrupt at the completion of the transfer of the data block.

Inventors:
PARTHASARATHY BALAJI (US)
SMILEY DAVID A (US)
COOPER FURROKH R (US)
HALARI KEYUR N (US)
VAVRO DAVID K (US)
AFROZE SYEDA M (US)
VED SANDEEP P (US)
Application Number:
PCT/US2009/059362
Publication Date:
April 07, 2011
Filing Date:
October 02, 2009
Assignee:
INTEL CORP (US)
PARTHASARATHY BALAJI (US)
SMILEY DAVID A (US)
COOPER FURROKH R (US)
HALARI KEYUR N (US)
VAVRO DAVID K (US)
AFROZE SYEDA M (US)
VED SANDEEP P (US)
International Classes:
G06F13/16; G06F3/06; G06F12/00
Foreign References:
US20070088864A12007-04-19
US20080077750A12008-03-27
US20070073922A12007-03-29
US7219169B22007-05-15
US6145043A2000-11-07
Attorney, Agent or Firm:
VINCENT, Lester J. et al. (1279 Oakmead Parkway, Sunnyvale, California, US)
Claims:
CLAIMS

1. An apparatus comprising:

first logic to perform a Direct Memory Access (DMA) operation to transfer a first block of data from a first source to a first destination;

second logic to perform a Redundant Array of Independent Disks (RAID) operation on second blocks of data retrieved from a plurality of second sources; and

fetch logic shared by the first logic and the second logic, the fetch logic to retrieve a first descriptor from memory shared by the first logic and the second logic, the memory storing a first descriptor identifying the DMA operation to be performed by the first logic and a second descriptor identifying the RAID operation to be performed by the second logic, the operations to be performed concurrently by the first logic and the second logic.

2. The apparatus of claim 1, further comprising:

a descriptor decoder First In First Out (FIFO) to store the first descriptor and the second descriptor and to maintain required order of the descriptors irrespective of the order in which the descriptors are processed by the first logic and the second logic.

3. The apparatus of claim 1, wherein the second logic to concurrently generate a P syndrome and a Q syndrome in a single pass through the second blocks of data.

4. The apparatus of claim 1, further comprising:

upstream logic coupled to first and second logic to provide access to the memory.

5. The apparatus of claim 1, further comprising:

control logic coupled to the first logic and the second logic, the control logic to track a number of descriptors to be processed in order to determine whether there is any work to be performed by the first logic and/or the second logic.

6. The apparatus of claim 1, wherein the first descriptor and the second descriptor are retrieved from a chain of descriptors stored in the memory.

7. The apparatus of claim 6, wherein the first logic and the second logic to concurrently determine whether each of the first descriptor and the second descriptor is a RAID descriptor or a non-RAID descriptor, the first logic to process the non-RAID descriptor and the second logic to process the RAID descriptor.

8. The apparatus of claim 7, further comprising:

an arbiter coupled to the first logic and the second logic, the arbiter to provide exclusive access to the plurality of second sources to the second logic for a dynamically expandable time period.

9. The apparatus of claim 8, wherein the arbiter to determine from system resources whether to allocate the dynamically expandable time period to the second sources.

10. The apparatus of claim 1, wherein the fetch logic to break up an initial data transfer size greater than a data transfer size supported by a DMA channel into smaller transfer lengths supported by the DMA channel by recycling the initial descriptor for use by subsequent data transfer operations by modifying address fields in the initial descriptor until the initial block size has been transferred using a plurality of separate DMA operations with each DMA operation having a data transfer size less than or equal to the data transfer size supported by the DMA channel.

11. The apparatus of claim 10, wherein a single interrupt is generated after the initial block size has been transferred.

12. The apparatus of claim 1, wherein the first logic to map a plurality of variable length RAID streams generated by the RAID operation to a single output logical stream.

13. The apparatus of claim 12, wherein the plurality of variable length streams include a P syndrome, a Q syndrome and a completion stream.

14. The apparatus of claim 13, wherein the second logic to manage data misalignment of data read from the plurality of second sources.

15. The apparatus of claim 14, wherein the number of second sources is 10 for a validate operation.

16. The apparatus of claim 14, wherein the number of second sources is 8 for a generation operation.

17. The apparatus of claim 1, wherein the second logic to manage alignment of data to be written to a destination to store a result of the RAID operation.

18. The apparatus of claim 17, wherein the result includes a P syndrome and a Q syndrome for a RAID 6 system.

19. The apparatus of claim 1, wherein the second logic to manage out of order completions for the RAID operation.

20. The apparatus of claim 19, wherein the plurality of sources is 2 to 10.

21. The apparatus of claim 1, wherein the second logic to handle an extended descriptor to identify other sources in addition to the plurality of sources identified in a base descriptor.

22. The apparatus of claim 21, wherein the second logic to handle both extended descriptors and base descriptors during normal operation.

23. The apparatus of claim 21, wherein the second logic to handle extended descriptors upon detecting a command to suspend processing of descriptors or a command to reset processing of descriptors.

24. A method comprising:

performing, by a first logic, a Direct Memory Access (DMA) operation to transfer a first block of data from a first source to a first destination;

performing, by a second logic, a Redundant Array of Independent Disks (RAID) operation on a plurality of second sources; and

retrieving, by fetch logic shared by the first logic and the second logic, a descriptor from memory shared by the first logic and the second logic, the memory storing a first descriptor identifying the DMA operation to be performed by the first logic and a second descriptor identifying the RAID operation to be performed by the second logic, the operations to be performed concurrently by the first logic and the second logic.

25. The method of claim 24, further comprising:

storing the first descriptor and the second descriptor in a descriptor decoder First In First Out (FIFO); and

maintaining required order of the descriptors in the descriptor decoder FIFO irrespective of the order in which the descriptors are processed by the first logic and the second logic.

26. The method of claim 24, wherein the second logic to concurrently generate a P syndrome and a Q syndrome in a single pass through the second blocks of data.

27. The method of claim 24, further comprising:

providing, by upstream logic coupled to first and second logic, access to the memory.

28. The method of claim 24, further comprising:

tracking, by control logic coupled to the first logic and the second logic, a number of descriptors to be processed in order to determine whether there is any work to be performed by the first logic and/or the second logic.

29. The method of claim 24, further comprising:

retrieving the first descriptor and the second descriptor from a chain of descriptors stored in the memory.

30. The method of claim 29, wherein the first logic and the second logic to concurrently determine whether each of the first descriptor and the second descriptor is a RAID descriptor or a non-RAID descriptor, the first logic to process the non-RAID descriptor and the second logic to process the RAID descriptor.

31. The method of claim 24, further comprising:

providing exclusive access to the plurality of second sources to the second logic for a dynamically expandable time period.

32. The method of claim 31, further comprising:

determining from system resources whether to allocate the dynamically expandable time period to the second sources.

33. The method of claim 24, further comprising:

upon detecting a request for an initial data transfer size greater than a data transfer size supported by the DMA operation, breaking the initial data transfer size into smaller transfer lengths supported by the DMA operation by recycling the initial descriptor for use by subsequent data transfer operations by modifying address fields in the initial descriptor until the initial block size has been transferred using a plurality of separate DMA operations with each DMA operation having a data transfer size less than or equal to the data transfer size supported by the DMA operation.

34. The method of claim 33, wherein a single interrupt is generated after the initial block size has been transferred.

35. The method of claim 24, further comprising:

mapping, by the first logic, a plurality of variable length RAID streams generated by the RAID operation to a single output logical stream.

36. The method of claim 35, wherein the plurality of variable length streams include a P syndrome, a Q syndrome and a completion stream.

37. The method of claim 24, further comprising:

managing, by the second logic, data misalignment of data read from the plurality of second sources.

38. The method of claim 37, wherein the number of second sources is 10 for a validate operation.

39. The method of claim 37, wherein the number of second sources is 8 for a generation operation.

40. The method of claim 24, further comprising:

managing, by the second logic, alignment of data to be written to a destination to store a result of the RAID operation.

41. The method of claim 40, wherein the result includes a P syndrome and a Q syndrome for a RAID 6 system.

42. The method of claim 24, further comprising:

managing, by the second logic, out of order completions for the RAID operation.

43. The method of claim 42, wherein the plurality of sources is 2 to 10.

44. The method of claim 24, further comprising:

upon detecting a suspend or reset state, handling, by the second logic, an extended descriptor.

45. The method of claim 44, wherein the second logic to handle both extended descriptors and base descriptors during normal operation.

46. The method of claim 44, wherein the second logic to handle extended descriptors upon detecting a command to suspend processing of descriptors or a command to reset processing of descriptors.

47. An article including a machine-accessible medium having associated information, wherein the information, when accessed, results in a machine performing:

performing, by a first logic, a Direct Memory Access (DMA) operation to transfer a first block of data from a first source to a first destination;

performing, by a second logic, a Redundant Array of Independent Disks (RAID) operation on a plurality of second sources; and

retrieving, by fetch logic shared by the first logic and the second logic, a descriptor from memory shared by the first logic and the second logic, the memory storing a first descriptor identifying the DMA operation to be performed by the first logic and a second descriptor identifying the RAID operation to be performed by the second logic, the operations to be performed concurrently by the first logic and the second logic.

48. A system comprising:

a Redundant Array of Independent Disks (RAID); and

a processor, the processor comprising:

first logic to perform a Direct Memory Access (DMA) operation to transfer a first block of data from a first source to a first destination in the RAID;

second logic to perform a Redundant Array of Independent Disks (RAID) operation on a plurality of second sources; and

fetch logic shared by the first logic and the second logic, the fetch logic to retrieve a descriptor from memory shared by the first logic and the second logic, the memory storing a first descriptor identifying the DMA operation to be performed by the first logic and a second descriptor identifying the RAID operation to be performed by the second logic, the operations to be performed concurrently by the first logic and the second logic.

Description:
METHOD AND APPARATUS FOR MANAGING A RANDOM ARRAY OF INDEPENDENT DISKS (RAID)

FIELD

This disclosure relates to managing a Redundant Array of Independent Disks (RAID) and in particular to fair bandwidth sharing between Direct Memory Access (DMA) channels in a DMA controller used for both RAID and non-RAID operations.

BACKGROUND

A Redundant Array of Independent Disks (RAID) combines a plurality of physical hard disk drives into a logical drive for purposes of reliability, capacity, or performance. Thus, instead of multiple physical hard disk drives, an operating system sees the single logical drive. As is well known to those skilled in the art, there are many standard methods referred to as RAID levels for distributing data across the physical hard disk drives in a RAID system.

For example, in a level 0 RAID system the data is striped across a physical array of hard disk drives by breaking the data into blocks and writing each block to a separate hard disk drive. Input/Output (I/O) performance is improved by spreading the load across many hard disk drives. Although a level 0 RAID improves I/O performance, it does not provide redundancy because if one hard disk drive fails, all of the data is lost.

A level 5 RAID system provides a high level of redundancy by striping both data and parity information across at least three hard disk drives. Data striping is combined with distributed parity to provide a recovery path in case of failure. A level 6 RAID system provides an even higher level of redundancy than a level 5 RAID system by allowing recovery from double disk failures.

In a level 6 RAID system, two syndromes referred to as the P syndrome and the Q syndrome are generated for the data and stored on hard disk drives in the RAID system. The P syndrome is generated by simply computing parity information for the data in a stripe (data blocks (strips), P syndrome block and Q syndrome block). The generation of the Q syndrome requires Galois Field (GF) multiplications and is complex in the event of a disk drive failure. The regeneration scheme to recover data and/or P syndrome block and/or Q syndrome block performed during disk recovery operations requires both GF and inverse operations. The generation and recovery of the P and Q syndrome blocks for RAID 6 and parity for RAID 5 requires the movement of large blocks of data between system memory and a storage device (I/O device).

Typically, computer systems include Direct Memory Access (DMA) controllers (engines) to perform transfers of data between memory and I/O devices. A DMA controller allows a computer system to access memory independently of the processor (core). The processor initiates a transfer of data from a source (memory or I/O device (controller)) to a destination (memory or I/O device (controller)) by issuing a data transfer request to the DMA controller. The DMA controller performs the transfer while the processor performs other tasks. The DMA controller notifies the processor, for example, through an interrupt when the transfer is complete. Typically, a DMA controller manages a plurality of independent DMA channels, each of which can concurrently perform one or more data transfers between a source and a destination.

Typically, a data transfer from a source to a destination is specified through the use of a descriptor, that is, a data structure stored in memory that stores variables that define the DMA data transfer. For example, the variables can include a source address (where the data to be transferred is stored in the source (memory (or I/O device)); size (how much data to transfer) and a destination address (where the transferred data is to be stored in the destination (memory (or I/O device)). The use of descriptors instead of having the processor write the variables directly to registers in the DMA controller prior to each DMA data transfer operation allows chaining of multiple DMA requests using a chain of descriptors. The chain of descriptors allows the DMA controller to automatically set up and start another DMA data transfer defined by a next descriptor in the chain of descriptors after the current DMA data transfer is complete.

BRIEF DESCRIPTION OF THE DRAWINGS

Features of embodiments of the claimed subject matter will become apparent as the following detailed description proceeds, and upon reference to the drawings, in which like numerals depict like parts, and in which:

Fig. 1 is a block diagram illustrating an embodiment of a RAID-6 array showing a plurality of stripes with each stripe including data blocks (strips) and P and Q syndromes striped across an array of hard disks;

Fig. 2 is a block diagram of an embodiment of a system that includes a Direct Memory Access (DMA) controller to support both RAID (RAID 5/RAID 6) operations and non-RAID operations according to the principles of the present invention;

Fig. 3 is a block diagram of an embodiment of the DMA controller shown in Fig. 2 that includes a RAID engine, a non-RAID engine, and a Descriptor Fetch Engine according to the principles of the present invention.

Fig. 4 illustrates the format of an embodiment of an XOR Descriptor used for an XOR Generate or an XOR Validate operation;

Fig. 5 illustrates the format of an embodiment of a Galois Field (GF) Descriptor used for an XOR with Galois Field Multiply Generate or Galois Field Multiply Validate (GFM) operation;

Fig. 6 illustrates a method for performing an Input/Output (I/O) write with RAID 6 (or RAID 5) write back cache according to the principles of the present invention;

Fig. 7 is a block diagram of an embodiment of a system that includes a non-RAID engine;

Fig. 8 is a block diagram of an embodiment of a system that includes a RAID engine;

Fig. 9 is a flowgraph illustrating an embodiment of a method for processing descriptors stored in the descriptor FIFO;

Fig. 10 is a flowgraph illustrating an embodiment of a method to handle descriptors in the DMA controller;

Fig. 11 is a flowgraph illustrating an embodiment of a method to handle large transfer sizes using a single descriptor in the DMA engine;

Fig. 12 is a block diagram illustrating an embodiment of a system to perform mapping of a plurality of variable length RAID streams to a single logical traffic stream;

Fig. 13 is a block diagram illustrating source alignment functions of the RAID Engine;

Figs. 14A and 14B illustrate dataflow for an unaligned P and Q destination, with a transfer length of four cache lines.

Fig. 15 is a block diagram of an embodiment of a system that includes a non-posted engine coupled between the RAID engine and a coherent protocol logic; and

Fig. 16 is a block diagram illustrating an embodiment of an extended descriptor.

Although the following Detailed Description will proceed with reference being made to illustrative embodiments of the claimed subject matter, many alternatives, modifications, and variations thereof will be apparent to those skilled in the art.

Accordingly, it is intended that the claimed subject matter be viewed broadly, and be defined only as set forth in the accompanying claims.

DETAILED DESCRIPTION

Typically, there is a maximum DMA data transfer size per channel in order to provide fair arbitration between DMA channels in the DMA controller. Thus, a plurality of descriptors are required for a data transfer size greater than the maximum DMA data transfer size per channel, for example, the data transfer size for RAID operations. The need for a plurality of descriptors for a data transfer size greater than the maximum DMA transfer size per channel increases the amount of system memory required to store the plurality of descriptors. The plurality of descriptors also increases the CPU processing time due to the additional completion interrupts, one for each of the plurality of descriptors.

However, increasing the maximum DMA data transfer size per channel does not provide fair arbitration between DMA channels because it may result in consumption of all of the transfer bandwidth for extended amounts of time by one DMA channel. An embodiment of the present invention provides fair bandwidth sharing between DMA channels for descriptors that specify a data transfer size greater than the maximum DMA data transfer size per channel.

Fig. 1 is a block diagram illustrating an embodiment of a RAID-6 array 100 showing a plurality of stripes with each stripe including data blocks (strips) and P and Q syndromes striped across an array of hard disks (storage devices) 150. In the embodiment shown, the RAID array has five hard disks 150: three data disks and two syndrome (P, Q) disks. Data is written to the RAID-6 array 100 using block-level striping with P and Q syndromes distributed across the member hard disks in a round robin fashion. Sequential data, for example, a file segmented into blocks may be distributed across a stripe, for example, horizontal stripe 0, with one of the blocks stored in data blocks 102, 104, 106 on three of the data disks 102. A P syndrome and a Q syndrome computed for the data blocks 102, 104, 106 in horizontal stripe 0 are stored in a respective P block 130 and Q block 132. As shown, the P syndrome blocks and Q syndrome blocks are stored on different hard disks 150 in each stripe. In one embodiment, there are 512 bytes in each block in a stripe.

The P syndrome may be generated by performing an Exclusive OR (XOR) operation. XOR is a logical operation on two operands that results in a logical value of '1' if only one of the operands has a logical value of '1'. For example, the XOR of a first operand having a value '11001010' and a second operand having a value '10000011' provides a result having a value '01001001'. If the hard drive that stores the first operand fails, the first operand may be recovered by performing an XOR operation on the second operand and the result: XORing '01001001' with '10000011' regenerates the lost operand '11001010'.

The P syndrome is the simple parity of data (D) computed across a stripe using XOR operations. In a system with n data disks, the generation of the P syndrome is represented by equation 1 below:

P = D0 ⊕ D1 ⊕ D2 ⊕ ... ⊕ Dn-1 (Equation 1)

The computation of the Q syndrome requires multiplication (*) using a Galois Field polynomial (g). Arithmetic operations are performed on 8-bit (byte) Galois Field polynomials at very high performance. A polynomial is an expression in which a finite number of constants and variables are combined using only addition, subtraction, multiplication and non-negative whole number exponents. One primitive polynomial is x^8 + x^4 + x^3 + x^2 + 1, which may be denoted in hexadecimal notation by 1D. The Galois Field (GF) operations on polynomials are also referred to as GF(2^8) arithmetic. In a system with n data disks, the generation of the Q syndrome is represented by equation 2 below:

Q = g^0*D0 ⊕ g^1*D1 ⊕ g^2*D2 ⊕ ... ⊕ g^(n-1)*Dn-1 (Equation 2)

Byte-wise Galois-Field operations are performed on a stripe basis, where each byte in the block is computationally independent from the other bytes. Byte-wise Galois-Field operations can accommodate as many as 255 (2^8 - 1) data disks.
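As an illustration only, the following minimal C sketch shows how Equation 1 and Equation 2 could be evaluated in software for one stripe, assuming the conventional RAID-6 generator g = {02} and the primitive polynomial 0x1D named above; the function and variable names are illustrative and do not describe the patent's hardware.

    #include <stdint.h>
    #include <stddef.h>

    /* GF(2^8) multiply, reducing by the primitive polynomial
     * x^8 + x^4 + x^3 + x^2 + 1 (0x1D once the x^8 term is folded out). */
    static uint8_t gf_mul(uint8_t a, uint8_t b)
    {
        uint8_t p = 0;
        for (int i = 0; i < 8; i++) {
            if (b & 1)
                p ^= a;
            uint8_t carry = a & 0x80;
            a <<= 1;
            if (carry)
                a ^= 0x1D;          /* reduce modulo the primitive polynomial */
            b >>= 1;
        }
        return p;
    }

    /* Compute P (Equation 1) and Q (Equation 2) for one stripe of n data blocks. */
    static void raid6_gen_pq(const uint8_t *const *data, size_t n, size_t block_len,
                             uint8_t *p, uint8_t *q)
    {
        for (size_t byte = 0; byte < block_len; byte++) {
            uint8_t pv = 0, qv = 0, g = 1;             /* g starts at g^0 = 1   */
            for (size_t disk = 0; disk < n; disk++) {
                pv ^= data[disk][byte];                /* P = D0 xor D1 xor ... */
                qv ^= gf_mul(g, data[disk][byte]);     /* Q accumulates g^i * Di */
                g = gf_mul(g, 2);                      /* advance to g^(i+1)    */
            }
            p[byte] = pv;
            q[byte] = qv;
        }
    }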

Fig. 2 is a block diagram of an embodiment of a system 200 that includes a direct memory access (DMA) controller 214 to support both RAID (RAID 5/RAID 6) operations and non-RAID operations according to the principles of the present invention. The system 200 includes a processor 202, system memory 218 and Input Output Controllers (IOCs) 206, 208. The processor 202 includes a memory controller 204, one or more processor cores 216 and the DMA controller (DMAC) 214. In an embodiment, the processor 202 is a system-on-a-chip (SOC). The first Input/Output Controller (IOC) 206 coupled to the processor 202 provides access to storage devices (not shown) accessible via a Storage Area Network (SAN) 210. A second IOC 208 provides access to storage devices 212 directly coupled to the second IOC 208 that may be configured as a Redundant Array of Independent Disks (RAID) system. For example, in an embodiment, the storage devices 212 are configured as a RAID 6 system 100 as described in conjunction with Fig. 1.

The DMA controller 214 includes a plurality of DMA channels. The operation of each DMA channel is independent from the other DMA channels, which allows for different operations to be processed concurrently by each respective DMA channel.

The operations of a DMA channel include memory-to-memory data transfers and memory-to-memory mapped I/O (MMIO) data transfers. Each DMA channel moves data on command of its controlling process (the DMA client). A descriptor describes each data transfer and enables the DMA controller 214 to perform the data transfers. The descriptor is a data structure stored in memory that stores variables that define the DMA data transfer. Upon completion of the data transfer, the DMA controller 214 can notify the processor core 216 of the completion via either an interrupt to the processor core 216, a memory write to a programmed location, or both. Each DMA channel in the DMA controller 214 provides optimal block data movement by supporting scatter/gather operation specified by a linked list (chain) of descriptors. The DMA controller 214 executes the scatter/gather list of data transfers. At the completion of each operation, the DMA controller 214 can update the respective DMA channel's status register.

The DMA controller 214 provides support for both non-RAID operations and RAID operations. A non-RAID operation includes a Direct Memory Access (DMA) transfer used to transfer data blocks directly between the IOCs 206, 208 and system memory 218. A non-RAID operation can also be used to transfer data blocks directly between the system memory 218 in system 200 and a system memory in a mirror system (not shown) accessible through the CPU 202 via a communications link 220. The mirror system includes the same logic as discussed in conjunction with system 200. The DMA controller 214 also provides support for RAID operations as defined by a RAID descriptor. A RAID operation includes at least one logical operation that is performed on a plurality of data blocks stored in system memory 218. The logical operation can be one of the logical operations described earlier for computing P and Q syndromes for a RAID 6 system 100 in conjunction with Equation 1 and Equation 2. A non-RAID operation is performed to fetch the data blocks from N different sources which can be aligned differently with respect to each other.

Both RAID and non-RAID operations are defined by one or more descriptors 222. In an embodiment, to initiate a RAID or non-RAID operation, a chain (linked list) of descriptors can be generated and stored in system memory 218. The address of the first descriptor in the chain is provided to the DMA controller 214. In an embodiment, the address of the first descriptor in the chain is written to a descriptor chain address register in the DMA controller 214. The RAID or non-RAID operation is initiated for a DMA channel in the DMA controller 214, for example, via a write to a DMA channel command register in the DMA controller 214.
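As an illustration only, a hypothetical software view of such a chained descriptor follows; the struct layout, field names and the virt_to_phys()/write_channel_reg() helpers are assumptions made for this sketch and are not the actual hardware descriptor or register layout.

    #include <stdint.h>

    #define XOR_MAX_SOURCES 5   /* the XOR descriptor of Fig. 4 carries up to five source addresses */

    struct xor_descriptor {
        uint32_t control;                    /* operation type: DMA copy, XOR generate/validate, ... */
        uint32_t block_size;                 /* bytes to process per source block                    */
        uint64_t src_addr[XOR_MAX_SOURCES];  /* source data blocks in system memory                  */
        uint64_t parity_addr;                /* where the P result is written (or validated)         */
        uint64_t next_descriptor;            /* physical address of the next descriptor in the chain */
    };

    /* Link descriptors into a chain and hand the head address to a DMA channel. */
    static void submit_chain(struct xor_descriptor *chain, int count,
                             uint64_t (*virt_to_phys)(void *),
                             void (*write_channel_reg)(uint64_t head_phys))
    {
        for (int i = 0; i < count - 1; i++)
            chain[i].next_descriptor = virt_to_phys(&chain[i + 1]);
        chain[count - 1].next_descriptor = 0;            /* 0 terminates the chain */

        /* Stands in for writing the descriptor chain address register and the
         * DMA channel command register to start the operation. */
        write_channel_reg(virt_to_phys(&chain[0]));
    }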

Fig. 3 is a block diagram of an embodiment of the DMA controller 214 shown in Fig. 2 that includes a RAID engine 304, a non-RAID engine 306 and a Descriptor Fetch Engine 301 according to the principles of the present invention. The RAID engine 304 and the non-RAID engine 306 share a descriptor FIFO 302, upstream logic (switch) 308 and the descriptor fetch engine 301. The upstream logic 308 is the conduit to system memory 218 (Fig. 2). The system memory 218 can store descriptors 222 and source data blocks. The non-RAID engine 306 and the RAID engine 304 use a time splicing mechanism to share the access to the shared resources including the descriptor FIFO 302 and the upstream logic 308 based on an arbitration scheme.

The non-RAID engine 306 supports standard DMA operations such as Memory-to-Memory copy and Memory-to-Input/Output (IO) copy. In an embodiment, the non-RAID engine 306 can perform a Memory-to-Memory copy or Memory-to-IO copy at the rate of 6.4 Gigabytes (GB) per second and process 16 Bytes per clock period for a 400 Megahertz (MHz) clock.

The RAID engine 304 performs RAID operations. In an embodiment, the RAID Engine 304 can read source data from system memory 218 as defined by one or more descriptors 222 (Fig. 2) at the rate of 2.8 GB/s and write the destination data as defined by the descriptors at 700 Megabytes (MB) per second. In an embodiment, the RAID Engine 304 supports four logical operations/functions: (a) XOR Generate, (b) XOR Validate, (c) XOR with Galois Field Multiply Generate (GFM), and (d) Galois Field Multiply Validate (GFM).

In an embodiment, a dual descriptor processing pipeline provided by the RAID engine 304 and the non-RAID engine 306 in the DMA controller 214 allows RAID operations to be processed concurrently while performing non-RAID operations (Memory-to-Memory copy or Memory-to-IO copy). In order to meet bandwidth requirements without blocking between sources and/or between destinations, only the minimal amount of data is fetched per source in system memory 218 (Fig. 2).

The system 200 supports RAID operations using the RAID engine 304 without affecting the performance of the non-RAID engine 306 and meeting the performance requirements of the RAID engine 304. The RAID engine 304 supports RAID5/6 operations through logical operations (functions).

RAID logical operations (functions) include Exclusive OR (XOR) operations and XOR with Galois Field Multiply operations used to generate redundant data used to recover from one or more disk failures in a RAID 5/RAID 6 system 100. These RAID logical functions may involve concurrently fetching data from up to N different sources. In an embodiment, the size of the data to be fetched from each of the sources in system memory 218 prior to performing the XOR operation may be up to 1 Megabyte (MB), that is, greater than the maximum DMA data transfer size per channel.

In an embodiment, an arbiter 303 includes a timeslot manager to manage exclusive access by the RAID engine 304 or the non-RAID engine 306 to shared system resources including the descriptor FIFO 302 and the upstream logic 308. The timeslot manager provides exclusive access to the RAID engine 304 for a dynamically expandable time period in order to meet the performance requirements for a RAID5/RAID6 system 100.

To process a Generate function (XOR Generate or XOR with Galois Field Multiply Generate (GFM)), the RAID Engine 304 uses a DMA read operation to read up to N-2 different sources from the memory as specified by the up to N-2 source addresses stored in the respective descriptor 222 (Fig. 2). To process a Validate function (XOR Validate or Galois Field Multiply Validate (GFM) function), the RAID Engine 304 uses a DMA operation to read up to N different sources from the system memory 218 as specified by the up to N source addresses stored in the respective descriptor 222 (Fig. 2). The DMA read operation is similar to an operation that the non-RAID engine 306 performs.

Fig. 4 illustrates the format of an embodiment of an XOR descriptor 400 (a RAID descriptor) used to specify an XOR Generate or an XOR Validate operation.

Referring to Fig. 4, the XOR descriptor 400 includes a descriptor control field 402, a block size field 404 (to store the size of the data block to be transferred), and source address fields 1-5 406, 410, 412, 414, 416, 418 (to store source addresses for data blocks stored in system memory 218). The XOR Descriptor 400 also includes a next descriptor address 410 to store an address of the next descriptor in a chain of descriptors and a parity address field 408 to store the address in system memory 218 at which the result of the XOR operation (Generate or Validate) is stored.

The XOR Generation operation performs a byte-wise XOR of the data in each of a plurality of source blocks with each other. As discussed earlier, the address in system memory 218 (Fig. 2) of each of the source blocks is stored in the XOR Descriptor 400. The XOR Descriptor 400 shown in Fig. 4 specifies up to five different source addresses. Byte 1 of the first source block is XOR'd with byte 1 of the second source block, and that result is XOR'd with byte 1 of the third source block and so on with the result written to byte 1 of the destination (specified by the parity address stored in the parity address field 408). Likewise, byte 2 of each block is XOR'd and written to byte 2 of the destination (specified by the parity address stored in the parity address field 408).

The RAID engine 304 (Fig. 3) may perform a series of partial operations such as XOR the first 'n' bytes of block 1 with the first 'n' bytes of block 2, and then XOR that result with the first 'n' bytes of block 3, write the final result to the first 'n' bytes of the destination, and then repeat the process for the second group of 'n' bytes, and so on.

The XOR Validate operation is similar to the XOR Generate, except that instead of writing the result to a location in system memory 218 specified by the Parity Address stored in the parity address field 408, the result is compared with the block in system memory 218 identified by the system memory address stored in the parity address field 408 of the descriptor 400.
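For illustration, a minimal C sketch of the byte-wise XOR Generate and XOR Validate operations described above follows; it assumes the source blocks are already resident in memory and ignores the cache-line-sized partial operations the hardware actually performs.

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    /* XOR Generate: XOR corresponding bytes of every source block and write the
     * result to the destination identified by the parity address. */
    static void xor_generate(const uint8_t *const *src, size_t nsrc,
                             uint8_t *parity, size_t block_size)
    {
        memset(parity, 0, block_size);
        for (size_t s = 0; s < nsrc; s++)
            for (size_t b = 0; b < block_size; b++)
                parity[b] ^= src[s][b];
    }

    /* XOR Validate: recompute the parity and compare it with the block already
     * stored at the parity address; returns 0 when the stripe is consistent. */
    static int xor_validate(const uint8_t *const *src, size_t nsrc,
                            const uint8_t *expected_parity, size_t block_size)
    {
        for (size_t b = 0; b < block_size; b++) {
            uint8_t acc = 0;
            for (size_t s = 0; s < nsrc; s++)
                acc ^= src[s][b];
            if (acc != expected_parity[b])
                return -1;            /* mismatch: data or parity is corrupt */
        }
        return 0;
    }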

Returning to Fig. 3, functions for data-fetch operations for the RAID engine 304 are spliced-in with standard DMA functions for the non-RAID engine 306 using the descriptor FIFO 302 and upstream logic 308. Thus, the additional RAID engine logic functions do not impact the performance of the non-RAID engine 306. At a very high level, this is achieved by using unused time-slots of the non-RAID engine 306 and capitalizing on portions of commonality of the infrastructure between the non-RAID engine 306 and the RAID engine 304. The RAID engine 304 splices RAID data transfers into available timeslots, thereby maintaining the performance levels of the non-RAID engine 306.

As shown in Fig. 3, the non-RAID Engine 306 and the RAID Engine 304 share a single descriptor FIFO 302 and both the RAID engine 304 and the non-RAID engine 306 use the same upstream logic (also referred to as the switch) 308 to access the system memory 218 (that is, memory read and memory write operations). In an embodiment, the non-RAID Engine 306 operates on page boundaries which are 4 Kilobytes (KB) in length.

In contrast to the non-RAID engine 306, the RAID engine 304 is optimized to perform RAID logical functions (operations). Large transfers are subdivided into smaller transfers such that the RAID engine 304 accesses single cache-lines (CL) in system memory 218 (Fig. 2) from all of the sources (identified by the source addresses in the RAID descriptor 400). In an embodiment, data is fetched from all sources one 64-byte cache line at a time. Furthermore, an arbitration scheme performed by arbiter 303 uses the concept of timeslot assignment for requesting access to system memory 218 through the upstream logic (switch) 308.

The optimization is such that the RAID engine 304 arbitrates through Request (REQ)/Grant (GNT) signals to the arbiter 303 for a DMA transfer to/from the system memory 218 only when the shared resources are available for use by the RAID engine 304. This allows the descriptor FIFO 302 and the access to the upstream logic (switch) 308 to be shared by the non-RAID engine 306 and the RAID engine 304 in a fair and optimized manner.

In an embodiment, the RAID engine 304 requests a single cache line from all of the data sources (up to N sources) via the arbiter 303 using the REQ/GNT signal sequence when the RAID engine 304 has satisfied the condition that system resources are available to process at least one of the N sources. In one embodiment N is 10. However, N is not limited to 10. While fetching a transfer length for a given source, subsequent fetches for the other sources are blocked.

The arbiter 303 grants the RAID engine 304 a timeslot for access to the system memory 218 based on a dynamically expandable timeslot. A data-fetch request is issued for at least a single cache line from each of the N sources in system memory 218, as long as the system resources are available. If at any point of performing the data fetches from the N sources, system resources are no longer available, the RAID engine 304 gives up the time-slot. The current status of the data fetch requests is saved allowing the RAID engine 304 to resume from where it stopped during the current timeslot during the next available timeslot. Thus, the timeslot for access to the system memory 218 by the RAID engine 304 is not fixed, but is dependent on the length of time that system resources are available allowing the RAID engine 304 to access system memory, that is, the RAID engine's timeslot is dynamically expandable.

The dynamically expandable timeslot allows efficient arbitration for upstream access to system memory 218, allowing the non-RAID engine 306 and the RAID engine 304 to optimize performance. As the current information is stored by the RAID engine 304 when releasing shared system resources to the non-RAID engine 306, the RAID engine 304 can resume from where it had left off when the system resources become available again. The RAID engine 304 can start and stop at different word boundaries whenever the system resources indicate the availability and non-availability of the timeslot. Furthermore, the RAID engine 304 can start from the memory address at which it was stopped by using saved context information.
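The following C sketch illustrates, in software terms, the "stop and resume" behavior described above; the context structure and callback names are invented for this example and do not correspond to actual hardware state.

    #include <stdint.h>
    #include <stdbool.h>

    /* Saved context: how far the RAID engine got before giving up its timeslot. */
    struct raid_fetch_ctx {
        unsigned next_source;       /* which of the N sources to fetch next     */
        uint64_t byte_offset;       /* progress within the current transfer     */
        uint64_t bytes_remaining;   /* bytes still to fetch for this descriptor */
    };

    /* Issue one cache line per source per round while resources last; returns
     * true when the descriptor is fully fetched, false if the timeslot was
     * released (ctx then holds the resume point). */
    static bool raid_fetch_step(struct raid_fetch_ctx *ctx, unsigned nsources,
                                uint64_t cacheline,
                                bool (*resources_available)(void),
                                void (*issue_read)(unsigned source,
                                                   uint64_t offset, uint64_t len))
    {
        while (ctx->bytes_remaining > 0) {
            if (!resources_available())
                return false;                    /* give up the dynamically expandable timeslot */

            uint64_t len = cacheline < ctx->bytes_remaining ? cacheline
                                                            : ctx->bytes_remaining;
            issue_read(ctx->next_source, ctx->byte_offset, len);

            if (++ctx->next_source == nsources) {/* finished one round of all sources */
                ctx->next_source = 0;
                ctx->byte_offset += len;
                ctx->bytes_remaining -= len;
            }
        }
        return true;                             /* descriptor fully fetched */
    }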

Efficiently managing the shared system resources through the use of the "Stop and Resume" scheme discussed above also contributes to improving the performance of the RAID engine 304. Furthermore, the RAID engine 304 is optimized to be highly efficient while gaining access to the timeslot.

Fig. 5 illustrates the format of an embodiment of a Galois Field (GF) Descriptor 500 (RAID descriptor) used to request an XOR with Galois Field Multiply Generate or Galois Field Multiply Validate (GFM) operation. Referring to Fig. 5, the GF Descriptor 500 includes a descriptor control field 502, a block size field 504, four source address fields 506, 510, 512, and 514, a parity address field 508, a Q parity address field 518 and a Galois Field register 516. The Galois Field register 516 stores a Galois Field (GF) value, that is, an 8-bit constant value to be multiplied with the data block referenced by the respective source address. For example, GF1 stores an 8-bit constant value to be multiplied with the data block referenced by source address 1. In the embodiment shown, the 8-bit GF register 516 can store a GF value for up to eight different sources. In an embodiment, the RAID engine 304 performs RAID logical operations including RAID logical operations to compute the Q syndrome and the P syndrome, that is,

P = D0 ⊕ D1 ⊕ D2 ⊕ ... ⊕ Dn-1 (Equation 1)

Q = g^0*D0 ⊕ g^1*D1 ⊕ g^2*D2 ⊕ ... ⊕ g^(n-1)*Dn-1 (Equation 2)

RAID engine 304 performs a first RAID logical operation, that is, an XOR operation on the block of data (bytes) located at Source Address 1 (D0) 506 with the block of data (bytes) at Source Address 2 (D1) 512 and writes the result of the XOR operation into the buffer specified by the Parity Address field 508. Next, the RAID engine 304 performs a second RAID logical operation, that is, multiplies each byte of data in the block of data (bytes) located at Source Address 1 (D0) 506 with GF1 stored in the GF register 516, multiplies each byte in the block of data (bytes) at Source Address 2 (D1) with GF2 stored in the GF register 516, performs an XOR operation on the results of each GF multiply operation, and writes the result into or validates it with the buffer specified by the Q Parity Address field 518.
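As a software illustration of this two-source GFM flow, the sketch below assumes the same GF(2^8) multiply described earlier; the function names and parameters are invented for the example and are not the hardware's interfaces.

    #include <stdint.h>
    #include <stddef.h>

    static uint8_t gf_mul8(uint8_t a, uint8_t b)      /* GF(2^8) multiply, poly 0x1D */
    {
        uint8_t p = 0;
        while (b) {
            if (b & 1) p ^= a;
            a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1D : 0));
            b >>= 1;
        }
        return p;
    }

    /* Two-source GFM generate: P is the XOR of the sources, Q is the XOR of each
     * source scaled by its per-source constant (GF1, GF2) from the GF register. */
    static void gfm_generate_2src(const uint8_t *d0, const uint8_t *d1, size_t len,
                                  uint8_t gf1, uint8_t gf2,
                                  uint8_t *p_out, uint8_t *q_out)
    {
        for (size_t i = 0; i < len; i++) {
            p_out[i] = d0[i] ^ d1[i];                             /* parity address buffer   */
            q_out[i] = gf_mul8(gf1, d0[i]) ^ gf_mul8(gf2, d1[i]); /* Q parity address buffer */
        }
    }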

Fig. 6 illustrates a method for performing an Input/Output (IO) write with RAID 6 (or RAID 5) write back cache according to the principles of the present invention. Fig. 6 will be described in conjunction with Figs. 2-5.

The IO write operation is performed using both non-RAID operations and RAID operations. The non-RAID operations and RAID operations are defined using descriptors 222 (Fig. 2) that are stored in system memory 218 and accessed through the descriptor FIFO 302 (Fig. 3). The IO write operation reads data stored on a storage device (not shown) accessible via the SAN 210, performs an operation (based on the operation requested in the descriptor 222 (Fig. 2)) on the data, forwards the result of the operation to be mirrored to the mirror system via communications link 220 (Fig. 2) and writes data and parity to the RAID system 100. All of the descriptors 222 (Fig. 2) describing the RAID logical operation and DMA operations can be stored in a single chain (linked list). The descriptors 222 are initially stored in system memory 218 and are transferred to the descriptor FIFO 302 for processing by the engines 304, 306. In an embodiment, the descriptor FIFO 302 can store two descriptors 222 per DMA channel.

At block 600, in response to an Input Output (I/O) write request from the operating system or application, data stored on a storage device (not shown) accessible via the Storage Area Network 210 is transferred by the non-RAID engine 306 via the Storage Area Network (SAN) IOC 206 and written to system memory 218. Processing continues with block 602.

At block 602, the RAID engine 304 performs a RAID acceleration calculation (RAID logical operation) on the transferred data that is stored in system memory 218. The RAID logical operation performed is dependent on the type of descriptor stored in the descriptor control field 402, 502 of the respective descriptor 400, 500. Processing continues with block 604.

At block 604, a Memory-to-IO copy is performed by the non-RAID engine 306 to mirror write data through to a failover system via communications link 220 (Fig. 2). Processing continues with block 606.

At block 606, an I/O copy to memory is performed by the non-RAID engine 306 from the failover controller (not shown) in the failover system (not shown) via communications link 220 (Fig. 2). Processing continues with block 608.

At block 608, the data and parity computed by the RAID engine 304 (Fig. 3) are written by the non-RAID engine 306 to the RAID system 100.

As discussed earlier and shown in Figs. 4 and 5, a descriptor (for example, RAID descriptor 400, 500) is a data structure that is stored in system memory 218. A plurality of descriptors can be chained (linked) together in order to initiate multiple sequential transfers on a single DMA channel of the DMA controller 214. These descriptors are used to transfer large contiguous chunks of data (data blocks, also referred to as payload) based on the directives set up in the descriptor. Each DMA channel in the DMA controller has its own respective chain of descriptors.

Returning to Fig. 3, the non-RAID engine 306 processes non-RAID descriptors for non-RAID operations stored in the descriptor FIFO 302. Non-RAID descriptors include descriptors describing (specifying) Memory to Memory copy, Memory to I/O copy, Cyclic Redundancy Check (CRC) generate/store and test operations. The two Engines (RAID engine 304 and non-RAID engine 306) together handle processing of a chain of descriptors stored in the descriptor FIFO 302 that can include both RAID descriptors and non-RAID descriptors.

RAID descriptors include XOR Generate, XOR Validate, XOR with Galois Field Multiply Generate, and Galois Field Multiply Validate function discussed earlier in conjunction with RAID descriptors 400 (Fig. 4) and 500 (Fig. 5).

As discussed earlier, the RAID Engine 304 and the non-RAID Engine 306 share common system resources including the descriptor First In First Out (FIFO) 302 and upstream logic (Switch) 308. The upstream logic (switch) 308 is the conduit to the system memory 218 to fetch source data and write destination data.

A chain of descriptors stored in the descriptor FIFO 302 can include both RAID (XOR or GF) descriptors and non-RAID descriptors allowing descriptor processing to occur in an overlapping fashion. This ensures that descriptor processing is not serialized and/or blocked by availability of one set of resources (RAID engine 304 or non-RAID engine 306) through the management of descriptor processing between the RAID engine 304 and the non-RAID Engine 306.

Each descriptor 222 (Fig. 2) in the chain of descriptors stored in the descriptor FIFO 302 describes a particular task (DMA data transfer and/or DMA logical operation) to be performed. The descriptors are organized in a chain (linked list) in the descriptor FIFO 302 in order to perform sequential processing of the tasks for a particular RAID or non-RAID operation. The chain of descriptors can include both RAID descriptors and non-RAID descriptors intertwined in the chain of descriptors. Each chain of descriptors is associated with a particular DMA channel in the DMA controller 214. In an embodiment, there are eight DMA channels in the DMA controller 214.

The determination of which engine (RAID engine 304 or non-RAID engine 306) processes a particular descriptor in the chain of descriptors is dependent on the operation type stored in the descriptor control field 402, 502 of the descriptor. If the operation type is a "DMA operation", the descriptor is a "non-RAID descriptor" and processed by the non-RAID engine 306. If the operation type is "RAID logical operation", for example, XOR operation, the descriptor is a "RAID logical operation" and processed by the RAID engine 304.
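A condensed, hypothetical view of that dispatch decision in C follows; the opcode values are invented, and only the split between DMA-type and RAID-type operations mirrors the text above.

    /* Operation types carried in the descriptor control field (values illustrative). */
    enum op_type { OP_MEM_COPY, OP_MEM_TO_IO, OP_CRC,
                   OP_XOR_GEN, OP_XOR_VAL, OP_GFM_GEN, OP_GFM_VAL };

    static int is_raid_descriptor(enum op_type op)
    {
        switch (op) {
        case OP_XOR_GEN: case OP_XOR_VAL:
        case OP_GFM_GEN: case OP_GFM_VAL:
            return 1;       /* RAID logical operation: processed by the RAID engine 304 */
        default:
            return 0;       /* DMA operation: processed by the non-RAID engine 306 */
        }
    }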

For example, referring to Fig. 6, the RAID acceleration calculation (block 602 of Fig. 6) is performed by the RAID Engine 304 and the Data and Parity (Memory-to-IO) write (block 604 of Fig. 6) is performed by the non-RAID Engine 306.

Fig. 7 is a block diagram of an embodiment of a system 700 that includes a non-RAID engine 306. The non-RAID engine 306 includes a datafetch engine (non-RAID datafetch engine) 702 and a datapath engine (non-RAID datapath engine) 704. The descriptors are processed in a pipelined manner through three distinct stages - the descriptor fetch engine 301 which is shared by the RAID engine 304 and the non-RAID engine 306, the datafetch engine 702 and the datapath engine 704.

To begin processing descriptors (RAID and non-RAID) stored in the descriptor FIFO 302, the descriptor fetch engine 301 first retrieves the descriptor from the head of the chain of descriptors for a particular DMA channel in the DMA controller 214 (Fig. 2). The descriptor may be a RAID descriptor or a non-RAID descriptor dependent on the operation type stored in the descriptor control field 402, 502. The datafetch engine 702 decodes the operation type stored in the descriptor control field 402, 502 of the descriptor and fetches the data (called the source data) from system memory 218 that is needed to process the operation defined by the operation type stored in the descriptor.

Fig. 8 is a block diagram of an embodiment of a system 800 that includes a RAID engine 304. The RAID engine 304 includes a RAID datafetch engine 802 and a RAID datapath engine 804. All of the other logic (descriptor FIFO 302, source data FIFO 706, request header FIFO 710 and write data FIFO 708) is shared with the non-RAID engine 306 discussed earlier in conjunction with Fig. 7. The source data FIFO 706 provides temporary storage for the source data blocks that are fetched from system memory 218. The RAID datafetch engine 802 and RAID datapath engine 804 are dedicated to processing RAID descriptors. The descriptor fetch engine 301 fetches descriptors from system memory 218 to be stored in the descriptor FIFO 302 for processing by the RAID engine 304 and the non-RAID engine 306.

Fig. 9 is a flowgraph illustrating an embodiment of a method for processing descriptors 222 (Fig. 2) stored in the descriptor FIFO 302. Fig. 9 will be discussed in conjunction with Figs. 3, 7 and 8. As discussed in conjunction with Figs 7 and 8, both the RAID engine 304 and the non-RAID engine 306 include a respective datafetch engine 702, 802.

At block 900, the arbiter 303 (Fig. 3) determines if either of the engines 304, 306 is free, that is, available to process a descriptor 222 stored in the descriptor FIFO 302. If so, processing continues with block 902. If not, the arbiter 303 waits until one of the engines 304, 306 is free.

At block 902, if the RAID engine 304 is free, processing continues with block 904. If not, processing continues with block 920 to determine if the non-RAID engine 306 is free.

At block 904, if the non-RAID engine 306 is free, processing continues with block 906. If not, processing continues with block 908.

At block 906, if the operation type stored in the descriptor at the head of the descriptor FIFO 302 indicates the descriptor is a RAID descriptor, processing continues with block 914. If not, processing continues with block 916.

At block 908, if the operation type stored in the descriptor at the head of the descriptor FIFO 302 indicates that the descriptor is a non-RAID descriptor, processing continues with block 910. If not, processing continues with block 912.

At block 910, the RAID engine 304 is free and the descriptor 222 is a non-RAID descriptor. The arbiter 303 schedules the RAID datafetch engine 802 to start processing the non-RAID descriptor. The RAID datafetch engine 802 decodes the operation type in the descriptor and marks the descriptor as a non-RAID descriptor. This indicates to the arbiter 303 to not schedule the RAID Datafetch engine 802 again to process this non-RAID descriptor. The arbiter 303 schedules the non-RAID datafetch engine 702 to process the non-RAID descriptor as soon as the non-RAID datafetch engine 702 becomes eligible based on other system constraints. Processing continues with block 900.

At block 912, the RAID datafetch engine 802 is free and the descriptor is a RAID descriptor. The arbiter 303 schedules the RAID Datafetch engine 802 to start processing the descriptor. The RAID datafetch engine 802 decodes the operation type stored in the descriptor. As the descriptor is a RAID descriptor, the RAID datafetch engine 802 continues processing the descriptor. The RAID engine 304 completes issuing all the source reads until the RAID operation in the RAID descriptor is complete. Processing continues with block 900.

At block 914, both engines 304, 306 are free; the arbiter 303 schedules both the RAID datafetch engine 802 and the non-RAID datafetch engine 702 to start processing the RAID descriptor. Both engines 304, 306 decode the RAID descriptor and work in lock-step to ensure that the Descriptor FIFO 302 is read only once. Both the RAID engine 304 and the non-RAID engine 306 decode the operation type field in the RAID descriptor. Having determined that the descriptor is a RAID descriptor, the non-RAID datafetch engine 702 quits processing the RAID descriptor. The RAID datafetch engine 802 continues processing the RAID descriptor and completes issuing all the source reads until complete. Processing continues with block 900.

At block 916, as both the RAID engine 304 and the non-RAID engine 306 are free, the arbiter 303 schedules both the RAID datafetch engine 802 and the non-RAID datafetch engine 702 to start processing the non-RAID descriptor. Both engines 304, 306 decode the operation type stored in the non-RAID descriptor and work in lock-step to ensure that the descriptor FIFO 302 is read only once. Both the RAID engine 304 and the non-RAID engine 306 decode the operation type field in the non-RAID descriptor. Having determined that the descriptor is a non-RAID descriptor, the RAID datafetch engine 802 quits processing the non-RAID descriptor. The non-RAID datafetch engine 702 continues processing the non-RAID descriptor and completes issuing all the source reads until complete. Processing continues with block 900.

At block 920, if the non-RAID datafetch engine 702 is free, processing continues with block 922. If not, both datafetch engines 702, 802 are busy and processing continues with block 900 to wait for at least one of the datafetch engines 702, 802 to be free.

At block 922, if the descriptor type is non-RAID, processing continues with block 924. If not, processing continues with block 926.

At block 924, the non-RAID datafetch engine 702 is free and the descriptor is a non-RAID data descriptor. The non-RAID Datafetch Engine 702 continues processing the non-RAID descriptor and completes issuing all the source reads until complete.

Processing continues with block 900.

At block 926, the non-RAID datafetch engine 702 is free and the descriptor is a RAID descriptor. The arbiter 303 schedules the non-RAID datafetch engine 702 to start processing the RAID descriptor. The non-RAID datafetch engine 702 decodes the descriptor and marks it as a RAID descriptor. This indicates to the arbiter 303 to not schedule the non-RAID datafetch engine 702 again to process the RAID descriptor. The arbiter 303 schedules the RAID datafetch engine 802 to process the RAID descriptor as soon as the RAID datafetch engine 802 becomes eligible based on other system constraints.

The respective datapath engine 704, 804 for each engine 304, 306 processes the source data that is returned from the system memory 218 (Fig. 2). Dependent on the descriptor type, the source data is processed by the appropriate datapath engine 704, 804.
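The scheduling decisions of Fig. 9 can be summarized in the following C sketch; the engine structure and start() callback are placeholders invented for the example, not actual arbiter or datafetch interfaces.

    enum desc_kind { DESC_RAID, DESC_NON_RAID };

    struct engine { int free; };

    /* Both free engines may be started on the same head descriptor; each decodes
     * the operation type and the one that does not own it marks the descriptor
     * and drops out, so the descriptor FIFO is read only once. */
    static void schedule_next(struct engine *raid_eng, struct engine *non_raid_eng,
                              enum desc_kind head,
                              void (*start)(struct engine *, enum desc_kind))
    {
        if (raid_eng->free && non_raid_eng->free) {
            start(raid_eng, head);          /* blocks 914/916: both decode in lock-step */
            start(non_raid_eng, head);
        } else if (raid_eng->free) {
            start(raid_eng, head);          /* blocks 910/912: mark or process */
        } else if (non_raid_eng->free) {
            start(non_raid_eng, head);      /* blocks 924/926: mark or process */
        }
        /* otherwise both engines are busy; wait and re-evaluate (block 900) */
    }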

If a RAID descriptor is followed by a non-RAID descriptor in a chain of descriptors stored in the descriptor FIFO 302, the RAID descriptor processing need not be completed before the non-RAID descriptor processing starts. There is overlap in the descriptor processing for source reads to reduce the system latency when there is no conflict between sequential descriptors in a chain of descriptors. Thus, RAID and non- RAID descriptors can be processed concurrently by the DMA controller 214. Transactions from the RAID datafetch engine 802 and the non-RAID Datafetch Engine 702 can be pending waiting for access to system memory 218.

However, because system memory accesses can complete out of order, the read completions for the non-RAID descriptor can be returned before the read completions for the RAID descriptor. To process the read completions for the non-RAID descriptor, the non-RAID datapath engine 704 reads the descriptor from the descriptor FIFO 302.

However, as the previous RAID descriptor is not complete, the non-RAID datapath engine 704 is held off until the RAID datapath engine 804 completes reading the descriptor FIFO 302 and indicates completion of the operation defined by the descriptor.

If a non-RAID descriptor is followed by a RAID descriptor in the chain of descriptors stored in the descriptor FIFO 302, the RAID datapath engine 804 is held off. If a RAID descriptor is followed by a RAID descriptor in the chain of descriptors stored in the descriptor FIFO 302, no action is taken because the RAID datapath engine 804 processes the back-to-back RAID descriptors and the processing is serialized. If a non-RAID descriptor is followed by a non-RAID descriptor in the chain of descriptors stored in the descriptor FIFO 302, no action is taken because the non-RAID datapath engine processes the back-to-back non-RAID descriptors and the processing is serialized.

In an embodiment, the processing of descriptors based on ordering of descriptors in the chain of descriptors is handled by a descriptor decoder FIFO (DDF) 307 (Fig. 3). The entries in the descriptor decoder FIFO 307 are controlled by read and write pointers. An entry in the descriptor decoder FIFO 307 (corresponding to a valid descriptor in the descriptor FIFO) is marked appropriately as a RAID descriptor or a non-RAID descriptor after the RAID datafetch engine 802 or the non-RAID datafetch engine 702 decodes the operation type in the descriptor. The read pointer is incremented after one of the datafetch engines 702, 802 completes processing the descriptor. The write pointer is incremented after one of the datapath engines 704, 804 completes processing the descriptor. The descriptor decoder FIFO 307 has one entry for each descriptor stored in the descriptor FIFO 302.

In an embodiment in which the DMA controller 214 supports N DMA channels, there is a descriptor decoder FIFO 307 per DMA channel. The information in the descriptor decoder FIFO 307 is used to determine when to hold off the respective datapath engine 704, 804 until the completions for either one of the datapath engines 704, 804 are available.

If the entry in the descriptor decoder FIFO 307 that is indexed by the read pointer indicates that the current descriptor is a RAID descriptor, the non-RAID datapath engine 704 is held off (even if the completions for non-RAID descriptor source reads are available) until the RAID datapath engine 804 completes processing the RAID descriptor. Similarly, if the entry in the descriptor decoder FIFO 307 that is indexed by the read pointer indicates that the current descriptor is a non-RAID descriptor, the RAID datapath engine 804 is held off (even if the completions for RAID descriptor source reads are available) until the non-RAID datapath engine 704 completes processing the non-RAID descriptor.
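
The hold-off decision based on the descriptor decoder FIFO entry can be illustrated with a minimal C sketch. This is a simplified software model rather than the hardware logic described above: the entry layout, the modeling of the FIFO as an array, and the names ddf_entry_t and may_consume_completions are assumptions made for illustration only.

    #include <stdbool.h>

    #define DDF_DEPTH 16

    typedef enum { DESC_NON_RAID = 0, DESC_RAID = 1 } desc_type_t;

    /* One entry per descriptor in the descriptor FIFO; the entry is marked
     * by the datafetch engine that decoded the operation type. */
    typedef struct {
        desc_type_t type;
        bool        valid;
    } ddf_entry_t;

    typedef struct {
        ddf_entry_t entry[DDF_DEPTH];
        unsigned    rd;   /* index of the oldest in-flight descriptor */
    } ddf_t;

    /* A datapath engine may consume read completions only when the oldest
     * valid entry matches its own descriptor type; otherwise it is held off
     * until the other datapath engine finishes that descriptor. */
    static bool may_consume_completions(const ddf_t *ddf, desc_type_t engine)
    {
        const ddf_entry_t *oldest = &ddf->entry[ddf->rd % DDF_DEPTH];
        if (!oldest->valid)
            return false;   /* nothing outstanding on this channel */
        return oldest->type == engine;
    }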

This ensures correct operation on back-to-back descriptors in the chain of descriptors for a particular DMA channel in the DMA controller 214 when the chain of descriptors has a mix of RAID descriptors and non-RAID descriptors. A counter in control logic 305 (Fig. 3) tracks the number of descriptors to be processed in order to determine whether there is any work to be performed for a particular DMA channel in the DMA controller 214; the counter is incremented by the respective datapath engine 704, 804. The DMA controller 214 handles all errors that occur during the processing of the RAID descriptors and non-RAID descriptors, so that an error condition in one of the engines (RAID or non-RAID) 304, 306 does not impact processing of descriptors by the other engine 304, 306.

In an embodiment, the DMA controller 214 supports up to an 8 Kilobyte (KB) transfer length per DMA channel. The DMA controller 214 allows a transfer size that is greater than the maximum DMA transfer size per channel to be requested in a single descriptor while maintaining balanced performance among different DMA channels according to the principles of the present invention. In an embodiment, the DMA controller 214 includes eight DMA channels. Each DMA channel has a respective descriptor FIFO 302 and three stages of processing, that is, a descriptor fetch engine 301, a datafetch engine 802, 702, and a datapath engine 804, 704. In an embodiment, the transfer size supported for a DMA operation defined by a non-RAID descriptor is from 0 Bytes to 1 Megabyte (MB) with bandwidth fairly allocated between all DMA channels independent of the transfer size. A descriptor recycling mechanism is used for non-RAID descriptors having a transfer size greater than 8 Kilobytes (KB) while providing fair allocation of bandwidth between the plurality of DMA channels.

As a descriptor (initial descriptor) is initially fetched from the descriptor FIFO 302 and pre-decoded for processing, the datafetch engine 802, 702 reads the block size field 404, 504 in the initial descriptor. If the block size stored in the block size field 404, 504 in the initial descriptor is greater than the maximum DMA transfer size supported per channel by the DMA controller 214, the datafetch engine 802, 702 generates a modified descriptor. The modified descriptor has a transfer size stored in the block size field 504, 404 that is supported by the channel in the DMA controller 214. The modified descriptor is written in place of the initial descriptor (the initial descriptor is overwritten with the modified descriptor) in the descriptor FIFO 302. Portions of the initial descriptor are stored in the datafetch engine 802, 702 and used to continue to modify (re-cycle) the initial descriptor until the initial block size has been transferred using a plurality of separate DMA transfers with each DMA transfer having a block size supported by the DMA controller 214.

The initial descriptor is modified multiple times in order to break up the original transfer length into smaller transfer lengths supported by the DMA controller 214. In addition, other fields in the initial descriptor are modified in order to hide the breaking up of the initial large transfer size into smaller transfer sizes from the requester. For example, an interrupt enable bit in the descriptor control field 502, 402 is modified to suppress interrupt generation for completion of the operation defined by each of the plurality of modified descriptors until the operation defined by the last modified descriptor is performed. Thus, only one descriptor is stored in the descriptor FIFO 302 for the initial requested transfer size and one interrupt is generated at the completion of the operation defined by the initial descriptor. The "descriptor re-cycling" scheme keeps the overhead of multiple descriptors and multiple interrupts to a minimum, without completely starving progress on other DMA channels when large data transfers are initiated on any one DMA channel.
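
The descriptor re-cycling scheme can be sketched in C as follows. This is a minimal sketch assuming an 8 KB per-channel transfer limit; the descriptor layout is reduced to the fields needed for the example, and dma_transfer is a hypothetical stand-in for issuing one hardware transfer.

    #include <stdbool.h>
    #include <stdint.h>

    #define MAX_XFER (8u * 1024u)   /* assumed per-channel transfer limit (8 KB) */

    typedef struct {
        uint64_t src_addr;
        uint64_t dst_addr;
        uint32_t block_size;   /* bytes remaining for this descriptor */
        bool     int_en;       /* interrupt enable bit in the control field */
    } descriptor_t;

    /* Hypothetical stand-in for issuing one hardware DMA transfer. */
    static void dma_transfer(uint64_t src, uint64_t dst, uint32_t len)
    {
        (void)src; (void)dst; (void)len;
    }

    /* Consume one chunk of the requested transfer.  The descriptor is
     * rewritten in place (recycled) until the original block size is
     * exhausted; the interrupt is suppressed on every chunk except the
     * last one.  Returns true when the descriptor is finished. */
    static bool recycle_descriptor(descriptor_t *d, bool requester_wants_irq)
    {
        uint32_t chunk = d->block_size > MAX_XFER ? MAX_XFER : d->block_size;

        dma_transfer(d->src_addr, d->dst_addr, chunk);

        d->src_addr   += chunk;
        d->dst_addr   += chunk;
        d->block_size -= chunk;
        d->int_en      = (d->block_size == 0) && requester_wants_irq;

        return d->block_size == 0;   /* one interrupt after the final chunk */
    }

Because only one descriptor slot is consumed and the interrupt fires once, the requester sees a single large transfer while the hardware actually performs a series of bounded transfers interleaved with other channels.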

The DMA controller 214 includes a state machine per DMA channel. The state machine has five states: HALTED, ARMED, ACTIVE, SUSPEND/PENDING, and DMA IDLE. The DMA controller 214 initializes each DMA channel in the HALTED state. The DMA channel is enabled when an address of the start of the chain of descriptors residing in system memory 218 (prior to writing to the descriptor FIFO 302) is written to a descriptor chain address register for the respective DMA channel in the DMA controller 214. Writing the descriptor chain address register causes the DMA channel to transition to the ARMED state. That transition automatically sets both a descriptor count register and internal descriptor counter to zero.

Each time the DMA channel completes processing a descriptor, the internal descriptor counter is incremented and compared to the descriptor count register. In the ARMED state the data transfer is started by writing the descriptor count register (to a non-zero value), causing the DMA channel to transition to the ACTIVE state, where the DMA controller 214 processes the chained descriptors for the DMA channel until (a) the DMA controller 214 encounters an error that results in an abort, (b) the operation is suspended, or (c) the DMA controller 214 completes processing all indicated descriptors (as determined by the value in the descriptor count register being equal to the internal descriptor counter).

When the DMA channel completes processing all indicated descriptors, that is, the descriptor count register equals the internal descriptor counter, the DMA channel transitions to the DMA IDLE state, where it waits for the descriptor count register to be updated or the operation to be suspended.

Writing the DMA count register is an implicit command to transition from the IDLE to the ACTIVE state. The write command can be issued at any time without regard to the current state of the DMA channel, so that the chain can be extended without pausing the DMA operation. Overwriting the DMA count register is not critical because it stores an absolute, rather than relative, value.

If the DMA controller 214 encounters an error that causes an abort in any state, the DMA channel transitions back to the HALTED state and waits for the error register to be cleared and a new chain of descriptors to be posted for the DMA channel to process. While the DMA channel is ACTIVE (or IDLE), operation can be suspended by setting the Suspend DMA bit in the DMA Channel Command Register.

In the SUSPEND/PENDING state, the DMA controller 214 finishes processing the current operation and then transitions to the HALTED state. In the HALTED state, descriptors can be modified and operation can be restarted by writing registers for the DMA channel in the DMA controller.

While the DMA channel is in the HALTED state or the ARMED state, the DMA channel holds the suspend DMA bit reset to zero. Therefore, the Suspend DMA command only causes a state change when the DMA channel is in a state other than HALTED or ARMED. The Suspend DMA command has latency associated with it because the DMA channel has to finish processing the current operation prior to transitioning the DMA channel to the SUSPEND state.

When a DMA channel is in the ACTIVE state, commands may be processed on descriptor boundaries. Thus, state changes do not occur immediately. In this case, the DMA channel can transition directly to the HALTED state (without going through the SUSPEND/PENDING state). In any case, the DMA channel stops on the next 'convenient' descriptor boundary. That is, the DMA channel can have multiple descriptors in progress at the time it receives the Suspend DMA command; thus, it may continue DMA transfers until it can efficiently stop.
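
The per-channel state machine can be modeled with the minimal C sketch below. The function names and the event interface are hypothetical; only the transitions described above (arm, implicit start on a count write, completion, abort, and suspend on a descriptor boundary) are represented.

    #include <stdbool.h>
    #include <stdint.h>

    typedef enum {
        CH_HALTED, CH_ARMED, CH_ACTIVE, CH_SUSPEND_PENDING, CH_DMA_IDLE
    } chan_state_t;

    typedef struct {
        chan_state_t state;
        uint32_t     desc_count;   /* descriptor count register (absolute) */
        uint32_t     desc_done;    /* internal descriptor counter */
    } dma_channel_t;

    /* Writing the descriptor chain address register arms the channel and
     * zeroes both counters. */
    static void write_chain_address(dma_channel_t *ch, uint64_t addr)
    {
        (void)addr;
        ch->desc_count = ch->desc_done = 0;
        ch->state = CH_ARMED;
    }

    /* Writing the count register is an implicit ARMED/IDLE -> ACTIVE command. */
    static void write_desc_count(dma_channel_t *ch, uint32_t count)
    {
        ch->desc_count = count;   /* absolute value, so overwriting is safe */
        if (ch->state == CH_ARMED || ch->state == CH_DMA_IDLE)
            ch->state = CH_ACTIVE;
    }

    /* Suspend takes effect only outside HALTED/ARMED and completes on a
     * descriptor boundary. */
    static void request_suspend(dma_channel_t *ch)
    {
        if (ch->state == CH_ACTIVE || ch->state == CH_DMA_IDLE)
            ch->state = CH_SUSPEND_PENDING;
    }

    /* Called each time the channel finishes processing a descriptor. */
    static void descriptor_done(dma_channel_t *ch, bool error)
    {
        ch->desc_done++;
        if (error)
            ch->state = CH_HALTED;                 /* abort */
        else if (ch->state == CH_SUSPEND_PENDING)
            ch->state = CH_HALTED;                 /* suspend completed */
        else if (ch->desc_done == ch->desc_count)
            ch->state = CH_DMA_IDLE;               /* wait for more work */
    }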

Fig. 10 is a flowgraph illustrating an embodiment of a method to handle descriptors in the DMA controller 214. Fig. 10 will be discussed in conjunction with Fig. 8.

At block 1000, the descriptor fetch engine 301 fetches a descriptor from system memory 218 into the descriptor FIFO 302. Processing continues with block 1002.

At block 1002, the RAID datafetch engine 802 analyzes the received descriptor and issues a READ request for the source data stored in system memory 218 as indicated by the source address stored in the RAID descriptor. The source data is read from single cache lines (CL) from system memory 218 in a cache coherent system. Thus the source cache line must be owned by the DMA controller 214 before the cache line can be modified. Processing continues with block 1004.

At block 1004, the RAID datafetch engine 802 issues a Request for Ownership (RFO) for as many destination writes as possible. This look-ahead scheme ensures that the required ownership of the destination cache line is obtained before the actual writes occur. Processing continues with block 1006. At block 1006, the RAID datapath engine 804 processes the fetched source data and issues writes to the cache lines which were requested through the RFO mechanism.

As discussed earlier, the RAID engine 304 shown in Fig. 3 supports RAID 5/6 logical operations. In an embodiment, the RAID engine 304 supports four RAID 5/6 logical operations/functions: (a) XOR Generate, (b) XOR Validate, (c) XOR with Galois Field Multiply (GFM) Generate, and (d) Galois Field Multiply (GFM) Validate and Update function.

In the GFM Generate operation, the RAID engine 304 performs two operations concurrently on data blocks read from the source addresses, that is, an XOR function and a GFM function. The result of each operation is stored in temporary buffers, P Parity (results of the XOR operation) and Q Parity (results of the GFM operation). In addition, completion data is written to a completion buffer. These three results (RAID streams) are written upstream to the system memory 218. Access to the system memory 218 is through upstream logic, also referred to as the switch 308 (Fig. 3). However, the switch 308 can only handle a single result. In an embodiment of the present invention, the plurality of variable length results (RAID streams) are mapped into a single logical traffic stream prior to being sent to the upstream device 308.
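
The concurrent P and Q calculations can be illustrated by the following C sketch of XOR and Galois Field multiply generation over GF(2^8), using the polynomial commonly associated with RAID 6 P/Q syndromes. This is an algorithmic illustration under assumed coefficients (g^i per source), not the datapath implementation of the RAID engine 304.

    #include <stddef.h>
    #include <stdint.h>

    /* Multiply by 2 in GF(2^8) with the RAID 6 polynomial x^8+x^4+x^3+x^2+1. */
    static uint8_t gf_mul2(uint8_t v)
    {
        return (uint8_t)((v << 1) ^ ((v & 0x80) ? 0x1D : 0x00));
    }

    /* Generate P (XOR of all sources) and Q (XOR of g^i * source_i) for one
     * stripe of 'len' bytes across 'nsrc' source blocks. */
    static void pq_generate(uint8_t *p, uint8_t *q,
                            const uint8_t *const *src, size_t nsrc, size_t len)
    {
        for (size_t b = 0; b < len; b++) {
            uint8_t pv = 0, qv = 0;
            /* Horner's rule: walk the sources from highest index down so each
             * accumulated Q value is multiplied by g (= 2) exactly i times. */
            for (size_t i = nsrc; i-- > 0; ) {
                qv = (uint8_t)(gf_mul2(qv) ^ src[i][b]);
                pv ^= src[i][b];
            }
            p[b] = pv;
            q[b] = qv;
        }
    }

Both syndromes are produced in the same pass over the source bytes, which mirrors the single-pass generation of P and Q described above.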

Descriptors that have been fetched from system memory 218 are stored in the descriptor FIFO 302. All three engines (descriptor fetch, data fetch, data path) use information stored in the descriptor. Thus, there are pointers to the descriptor FIFO 302 that are specific to each engine in order to read and discard the descriptor.

Fig. 11 is a flowgraph illustrating an embodiment of a method to handle large transfer sizes using a single descriptor in the DMA engine.

At block 1152, the descriptor fetch engine 301 fetches a descriptor from memory 218. Processing continues with block 1154.

At block 1154, the descriptor fetch engine 301 checks the requested block size in the fetched descriptor. If greater than the size supported by a DMA channel, processing continues with block 1156. If not, processing continues with block 1158.

At block 1154, the non-RAID datafetch engine 702 stores locally all the intermediate context information necessary to handle the larger sizes for a given DMA channel. The non-RAID datafetch engine 702 retrieves the intermediate context when resuming the next portion of the original payload for the DMA channel. Processing continues with block 1156. At block 1156, referring to Fig. 7, the non-RAID datafetch engine 702 performs two passes of descriptor write backs to the descriptor FIFO 302. The first pass writes a modified descriptor, that is, modifies the transfer size from the original payload size to a portion of the original payload size, back to the descriptor FIFO 302.

To ensure that the bandwidth is fairly distributed across all of the DMA channels, the DMA controller 214 breaks a large block size request into a plurality of requests to transfer smaller block sizes. In one embodiment, an initial transfer length of 1 Megabyte (MB) is subdivided into a plurality of 8 Kilobyte (KB) transfer lengths. The initial descriptor is modified to emulate a chain of descriptors with each intermediate modified descriptor storing the smaller transfer size. The engines release the resources for a given DMA channel after at least 8 KB is processed for that DMA channel. Processing continues with block 1158.

At block 1158, the non-RAID datafetch engine 702 uses the source address received in the recycled (modified) descriptor and calculates the destination write address by adding the portion of the original payload size to the destination address in the descriptor. After the RAID datafetch engine 802 completes issuing the source reads, the RAID datafetch engine 802 updates the descriptor FIFO 302 to modify the stored descriptor source address. Processing continues with block 1160.

If the initial requested transfer length has been transferred, processing continues with block 1150 to fetch another descriptor. If not, processing continues with block 1152 to reuse the initial descriptor.

This method for "re-cycling" (reusing) the initial descriptor in system memory 218, that is, the emulation of a chain of descriptors by recycling a single initial descriptor, reduces the number of interrupts and the amount of memory required for storing a chain of descriptors. The source address fields in the single recycled descriptor are updated each time a portion of the data payload has been processed. The descriptor is not discarded when a portion of the payload is processed but instead is recycled/re-used for that DMA channel by modifying source address fields in the single descriptor stored in the descriptor FIFO 302.

Furthermore, the completions and interrupts to the processor (core) 216 for the DMA channel working on the original payload are suppressed until all portions of the original payload have been processed. By dividing the original payload (block size) into smaller portions (chunks), only one DMA channel is using all of the system resources for a small period of time, that is, while the smaller portion is processed. Thus, one DMA channel does not consume all of the resources for the original payload.

The RAID datafetch engine 802 also handles page breaks (at the portion of the original payload boundary) for larger sizes. The RAID datafetch engine 802 generates the indication for descriptor completion for a given DMA channel at the boundary. The completion is filtered until the entire original payload length is complete.

Fig. 12 is a block diagram illustrating an embodiment of a system to perform mapping of a plurality of variable length RAID streams to a single logical traffic stream. The RAID datafetch engine 802 accesses system memory 218 and presents the source data to the RAID datapath engine 804 for processing.

The Write Data FIFO 708 stores the data for the P/Q/Completion streams. There is a single physical Write Data FIFO 708 which is divided into three Logical FIFOs: the P Logical Destination Data FIFO 708A, the Q Logical Destination Data FIFO 708B, and the Completion Data Logical FIFO 708C. There are three write pointers 1122, 1124, 1126, controlled by the RAID datapath engine 804. There are three read pointers 1116, 1118, 1120, controlled by Logical Stream Decoder Logic 1102. Logical FIFOs 708A-C are used in the Write Data FIFO 708 instead of providing three separate physical FIFOs to reduce overhead associated with having three separate physical FIFOs.

Multiple variable length (for example, 1 Byte up to 64 Cache Lines (4KB)) RAID streams are mapped into a single logical traffic stream by maintaining a Logical Stream Decoder FIFO 1104 to decode three different streams while presenting a single stream to the Switch 308. One entry in the Request header FIFO 710 can be used to perform transfers of from 1 Byte up to 64 Cache Lines (4KB) worth of data. The operations of the Header Stream Decoder FIFO pointers and the Data FIFO pointers are independent of each other, that is, the P, Q and completion data flows are independent. Ordering of the data in the plurality of variable length RAID streams is maintained through the use of a single logical stream.

The RAID datapath engine 804 processes data and uses the P, Q or completion write address counters 1122, 1124, 1126 to write the appropriate data to the next available location in the respective logical FIFO 708A-C. A request is recorded into the logical stream decoder FIFO 1104, which preserves the order of the outgoing transactions.

The switch 308 selects the single physical switch port to return data stored in the write data FIFO 708 to system memory 218. The Logical Stream Decoder logic 1102 determines a type of transaction to send and reads the appropriate data from the Write Data FIFO 708, using the respective P, Q and Completion Read Address counters 1116, 1118, 1120.
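
The mapping of the three RAID streams onto one physical write data FIFO and one ordered output stream can be modeled with the C sketch below. The structures and names (stream_t, lsd_push, lsd_pop) are hypothetical simplifications; the point illustrated is that one ordering FIFO records which logical stream each outgoing request belongs to while three independent read pointers index the shared data FIFO.

    #include <stdint.h>

    typedef enum { STREAM_P = 0, STREAM_Q = 1, STREAM_COMPLETION = 2 } stream_t;

    #define LSD_DEPTH   32
    #define DATA_DEPTH  64

    typedef struct {
        /* One physical write data FIFO carved into three logical FIFOs. */
        uint8_t  data[3][DATA_DEPTH][64];   /* 64-byte cache lines per entry */
        unsigned wr[3];                     /* per-stream write pointers (datapath) */
        unsigned rd[3];                     /* per-stream read pointers (decoder)  */

        /* Logical stream decoder FIFO: records the order of outgoing requests. */
        stream_t order[LSD_DEPTH];
        unsigned order_wr, order_rd;
    } raid_write_path_t;

    /* Datapath side: record that the next outgoing transaction belongs to 's'. */
    static void lsd_push(raid_write_path_t *w, stream_t s)
    {
        w->order[w->order_wr++ % LSD_DEPTH] = s;
    }

    /* Switch side: pick the next transaction in order and return a pointer to
     * its data, advancing only that stream's read pointer. */
    static const uint8_t *lsd_pop(raid_write_path_t *w)
    {
        stream_t s = w->order[w->order_rd++ % LSD_DEPTH];
        return w->data[s][w->rd[s]++ % DATA_DEPTH];
    }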

The RAID datapath engine 804 processes the source data and generates the P Parity destination data (parity) 708A and the Q destination data (parity) 708B in parallel. Although the P and Q destination data 708A, 708B is generated concurrently, the P and Q destination data is written in a staggered fashion into the shared physical write data FIFO 708. The P destination data and the Q destination data are written into the shared physical write data FIFO 708 at locations indicated by the 'P write address' 1122 and the 'Q write address' 1124, respectively.

When the RAID datapath engine 804 has sufficient processed data (stored in the P and Q destination data logical buffers 708A, 708B) to issue a cache line request, the RAID datapath engine 804 signals the switch 308 that it is ready to transfer data to the system memory 218. The processing continues until all bytes from the source data are consumed and results are stored in the P and Q destination buffers 708A, 708B.

The RAID datapath engine 804 performs a last data write to write a 'Completion Status' register which includes information about errors that may have occurred during processing. As with the P and Q destination data, the 'Completion Status' data is written to an address generated from a separate Completion Write Address 1126. The RAID datapath Engine 804 also ensures that Completion Status writes are issued only after all of the P and Q writes have been issued.

The Request header FIFO 710 stores control information for the P/Q/Completion streams. The headers for all three streams are stored in the single Request header FIFO 710. The Request header FIFO 710 is written by the RAID datafetch engine 802.

The logical stream decoder FIFO 1104 keeps track of which logical stream is at the top of the queue from the RAID datapath engine 804 to the logical stream decoder FIFO 1104. There is a one-to-one correspondence between the Request header FIFO 710 and the logical stream decoder FIFO 1104. An entry in the logical stream decoder FIFO 1104 is updated when an entry in the Request header FIFO 710 is updated. There are read and write pointers to control the logical stream decoder FIFO 1104. The write pointer is controlled by the RAID datapath engine 804 and the read pointer is controlled by the switch 308.

As discussed earlier, there is a respective write pointer (address) 1122, 1124, 1126 to control each of the logical FIFOs 708A-C in the Write Data FIFO 708. Each of these write pointers 1122, 1124, 1126 is controlled by the RAID datapath engine 804. The RAID datapath engine 804 provides an indication to the logical stream decoder FIFO 1104 as to which Logical FIFO 708A-C is currently being written. There is also a read pointer 1116, 1118, 1120 to control reading from each of the Logical FIFOs 708A-C.

The switch 308 controls a single read pointer for a single logical RAID FIFO. However, as there are three logical FIFOs 708A-C corresponding to the P Destination Data FIFO 708A, the Q Destination Data FIFO 708B, and the Completion Data FIFO 708C, each of the respective read pointers 1116, 1118, 1120 is controlled independently, dependent on which stream (P/Q/Completions) is currently being processed using the entry corresponding to the read pointer of the logical stream decoder FIFO 1104.

Using the decoded information from the logical stream decoder FIFO 1104, the Logical Stream Decoder logic 1102 selects the logical stream (P/Q/Completions) data corresponding to the current RAID stream data to be provided to the Switch 308. A read pointer output by the logical stream decoder FIFO 1104 is incremented when the write transaction is considered complete as indicated by the switch 308, and a control signal from the logical stream decoder FIFO 1104 selects the output of multiplexer 1106.

As discussed earlier, a RAID descriptor 400, 500 shown in Figs 4 and 5 that is fetched by the descriptor fetch engine 301 from the descriptor FIFO 302 includes the starting addresses for all the sources in system memory 218 that are to be accessed for the operation specified in the RAID descriptor 400, 500. The RAID descriptor 400, 500 also includes the number of bytes for the operation. Upstream access to the memory is through the switch 308.

To process the XOR Generate function, the RAID engine 304 reads a plurality of sources from the memory as specified in the RAID descriptors 400, 500 shown in Figs. 4 and 5. To process the XOR Validate function, the RAID engine 304 also reads a plurality of different sources from the memory as specified in the RAID descriptor 400, 500. Each of the sources can have a different byte alignment.

The starting byte address for a read from a source in system memory 218 may be offset within a cache line and each source may have a different starting address, so the RAID engine 304 manages different source alignments. The same performance is provided irrespective of whether the sources are unaligned or aligned and irrespective of the number of sources. Fig. 13 is a block diagram illustrating source alignment functions of the RAID engine 304. The RAID Read Data FIFO 1202 is a single physical FIFO, separated into N logical FIFOs, one logical FIFO per source. In an embodiment, N is 10.

In an embodiment, the downstream datapath from the Switch 308 is smaller than a cache line. All reads from memory are aligned to 64-byte Cache lines; thus, the RAID datafetch engine 802 translates the number of bytes specified in the RAID descriptor 400, 500 into the number of 64-byte Cache lines (CLs) to be read from system memory 218 for each of the sources. In an embodiment, the RAID engine 304 handles source alignment while meeting a 3GB/s RAID processing requirement for the system.

Starting addresses of the source blocks to be fetched can be offset within a cache line; thus, the RAID datapath engine 804 compensates by adjusting the number of cache lines to be read. The RAID datafetch engine 802 issues source reads so that the RAID datapath engine 804 can process the available data.
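
The translation from a requested byte count and a starting offset into a number of 64-byte cache line reads can be expressed as a small helper, sketched here in C (the function name is a hypothetical illustration):

    #include <stdint.h>

    #define CACHE_LINE 64u

    /* Number of 64-byte cache lines that must be read to cover 'nbytes' of
     * source data that starts 'offset' bytes into its first cache line. */
    static uint32_t cache_lines_to_read(uint32_t nbytes, uint32_t offset)
    {
        return (offset + nbytes + CACHE_LINE - 1u) / CACHE_LINE;
    }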

The RAID datapath engine 804 works closely with the RAID read data FIFO 1202, which stores the received data from the read requests generated by the RAID datafetch engine 802.

Each of the N logical FIFOs in the RAID read data FIFO 1202 supports one RAID Source and has a respective position counter in the Read Data FIFO counters 1204 that is decoded to indicate an 'empty' and 'full' condition. A source available condition output from the read data FIFO counters 1204 is used to indicate to the RAID datafetch engine 802 that it may issue more source read requests, if required. The inverse of this signal ('not empty') is used to signal the RAID datapath engine 804 that there is source data available to be processed.

The RAID datapath engine 804 begins processing by reading a common descriptor 1212 that is used by both the RAID datafetch engine 802 and the RAID datapath engine 804. The common descriptor 1212 stores the length of the descriptor (descriptor byte count) and also includes individual source addresses relative to the 64 byte cache line for each source. The Read Data FIFO 1202 is updated as the operation for each respective source is completed. Upon completion of an operation for a respective source, the RAID datapath engine 804 begins processing the source data.

In an embodiment, the RAID datapath engine 804 processes 32 Bytes of Source Data at a time, or less if there is less than 32 Bytes (B) remaining to be processed. If the source address offset is cache line aligned, the RAID datapath engine 804 reads the first data (data N) from the Read Data FIFO 1202, processes the P and Q data, then stores the result in the respective P or Q accumulator 1206, 1208.

If the source address offset is non-cache line aligned, the RAID datapath engine 804 reads the first data (data N) and the second data (data N+1) from the read data FIFO 1202 and extracts the relevant 32B to be processed from the combination of the two data entries. All sources are processed in this manner, using either one 32B Source Data entry (N) or two 32B Source Data entries (N and N+1), dependent on the respective source offset.

After 32B of Source Data has been processed for each source, the contents of the local accumulator (P accumulator 1206 or Q accumulator 1208) are written to an upstream write data FIFO 708 so that the accumulator may be used to process the next set of source data. If a source has an unaligned starting address, each entry (except the first entry) is read twice from the Read Data FIFO 1202 during processing. In one embodiment, the second read is hidden during one of the processing states. In another embodiment, the data is registered (cached) locally.

The RAID datapath engine 804 continues to process the data payload, 32 Bytes at a time from each source, in the manner described above until all of the data has been consumed. The 32 Bytes of source data from each source is accessed in a round robin fashion across all the sources. A single shared down counter (DMA remain count) 1210 tracks the remaining data to be accessed in units of 32 Bytes. This scheme requires only one down counter (DMA remain count 1210) for processing the data, irrespective of the number of sources or the mixture of aligned and unaligned source data. As each 32B entry of Source Data is processed, the RAID datapath engine 804 pops off the appropriate entries in the Read Data FIFO 1202, adjusting the number of entries based on the starting address offset. When the data is popped off of the Read Data FIFO 1202 for a source, this signals the RAID datafetch engine 802 to fetch more data for the respective source.
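
A simplified C model of this inner loop is shown below. It walks the sources round robin in 32-byte chunks, reads one or two FIFO entries per source depending on the source byte offset, and decrements a single shared remain counter. The data structures are hypothetical simplifications of the read data FIFO, and only the P accumulation (an XOR) is shown; Q accumulation would follow the same structure.

    #include <stdint.h>
    #include <string.h>
    #include <stddef.h>

    #define CHUNK 32u   /* bytes processed per source per pass */

    typedef struct {
        const uint8_t *fifo;     /* received source data, 32-byte entries */
        size_t         next;     /* index of the next 32-byte entry to use */
        uint32_t       offset;   /* source byte offset within a cache line */
    } source_t;

    /* Process 'remain' bytes from 'nsrc' sources, 32 bytes per source per
     * round-robin pass.  An unaligned source needs entries N and N+1 to
     * extract one aligned 32-byte chunk. */
    static void datapath_loop(source_t *src, size_t nsrc,
                              uint8_t *p_out, uint32_t remain)
    {
        while (remain > 0) {
            uint8_t p_acc[CHUNK] = {0};
            uint32_t n = remain < CHUNK ? remain : CHUNK;

            for (size_t s = 0; s < nsrc; s++) {
                uint8_t chunk[CHUNK];
                const uint8_t *entry = &src[s].fifo[src[s].next * CHUNK];
                uint32_t off = src[s].offset % CHUNK;

                /* Aligned: one entry.  Unaligned: straddle entries N and N+1. */
                memcpy(chunk, entry + off, CHUNK - off);
                if (off)
                    memcpy(chunk + (CHUNK - off), entry + CHUNK, off);

                for (uint32_t b = 0; b < n; b++)
                    p_acc[b] ^= chunk[b];

                src[s].next++;          /* pop the consumed entry */
            }

            memcpy(p_out, p_acc, n);
            p_out  += n;
            remain -= n;                /* single shared remain counter */
        }
    }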

The RAID datafetch engine 802 issues source read requests for all N sources, using the address and request length information stored in a common descriptor 1212. When the source completion data arrives, it is written into the read data FIFO 1202, according to the source number for the read completion as well as the next address specified by the read data FIFO counters 1204.

Referring to Fig. 13, when the RAID datapath engine 804 detects that source data is available in the read data FIFO 1202, it first reads the source address offsets and descriptor length from the Common Descriptor 1212. The Source Address offsets are registered locally and the Descriptor length is loaded into the DMA remain counter 1210. The RAID Datapath Engine 804 asserts a Source Select to the Read Data FIFO 1202, and Read_Loc_N and Read_Loc_N+1 (if the source address is unaligned). The RAID Datapath Engine 804 then performs the P and Q Calculation on the 32 Bytes of data read from the Read Data FIFO 1202 and stores the accumulated result into the P and Q accumulators 1206, 1208. As the data from each Source is read, the RAID Datapath Engine 804 POPs the data off of the Read Data FIFO 1202, which signals the RAID Datafetch engine 802 that an entry is available in the Read Data FIFO 1202 to accept more source data. After all sources have been read and processed from the Read Data FIFO 1202, the contents of the P and Q Accumulators 1206, 1208 are written to the write data FIFO 708. The DMA remain counter 1210 is decremented by 32 Bytes to track the number of bytes processed. This process is repeated until all the Source Data has been processed and operation(s) defined by the descriptor 400, 500 are complete.

As discussed, one counter (DMA remain counter 1210) is used to track transfer byte counts for all N sources while providing the same system performance irrespective of the alignment of the different sources. The initial source byte offsets are handled on the first transaction so that for subsequent transactions, the RAID datafetch engine 802 processes the read data using fixed-size processing, issuing multiple reads to the read data FIFO 1202 to read misaligned data, if necessary. The RAID datapath engine 804 handles the byte offsets and ensures that all the completions for different sources are popped off and the source data FIFO is ready for use by the next descriptor and/or DMA channel. The Read Data FIFO 1202 provides the flow mechanism between the RAID Datafetch engine 802 and the RAID Datapath Engine 804, signaling when more data is available for consumption or more data is required for the RAID Datapath engine 804.

In order to process the XOR Generate function, the RAID Engine 304 reads N different sources, that is, data stored at a source address in the memory as specified by the source addresses in the descriptor 400, 500. The descriptor defines the size of the data to be transferred which in one embodiment can range from 1 Byte to 1 MByte. The RAID Engine 304 generates the Exclusive OR (XOR) value of the combination of the data read from the different sources in the case of XOR Generate.

In the case of XOR with GF Generate, the RAID Engine 304 performs two operations concurrently on the source blocks. The final result is written to two buffers, that is, the P destination data and the Q destination data. The P destination data and the Q destination data can have a different starting address and alignment in memory. Both the P and Q destination data is updated concurrently in the memory. The different P and Q starting addresses (anywhere in the supported memory range) and sizes of the transfer (1 Byte to 1 MByte) result in different alignment combinations that must be handled to ensure that the P and Q destination data is written correctly to the memory. In addition to the alignment issue and concurrent updating requirements, there is also a RAID performance requirement. In an embodiment, the RAID performance requirement is 3GB/s linked to a 6.4 GT/s coherent protocol interface.

Returning to Fig. 8, the RAID engine 304 includes a RAID DataFetch Engine 802 and a RAID Datapath Engine 804. The RAID Datafetch Engine 802 and the RAID Datapath engine 804 operate in synchronization with each other. The RAID Datafetch Engine 802 issues a Request for Ownership (RFO) for one or more cachelines in system memory 218 and the RAID datapath engine 804 issues the writes to the cache lines after the requested ownership has been obtained. The addresses and the alignment of the issued destination transfers are the same in order to hide the fact that the write operations are performed by two different engines.

In an embodiment, the RAID datafetch engine 802 and the RAID datapath engine 804 work in a lock-step manner to handle all different combinations of P and Q alignment for RAID 6. In an embodiment, a cache line is defined as 64 Bytes. When the RAID datafetch engine 802 detects (from the common descriptor information 1212 (Fig. 13)) that the P destination is not aligned to a Cache Line (CL), the RAID datafetch engine 802 issues a Request for Ownership (RFO) request. An RFO request is issued to 'prefetch' a write Cache line from the system memory 218 that is up to the end of the current cache line. The RAID datafetch engine 802 also issues an RFO request for the Q destination.

If the P or the Q requests are aligned to the cache line, each P and Q request is for a cache line. If the P and the Q requests issued are not at least a cache line, the RAID datafetch engine 802 issues a request for the next cache line for both P and Q. This is to ensure that the RAID datafetch engine 802 issues the RFOs in advance of when the actual writes occur for that cache line from the RAID datapath engine 804.

As the P and Q destination alignments are different, there may be only P bytes or only Q bytes or both P and Q bytes at the completion of the current descriptor processing. The RAID datafetch engine 802 and the RAID datapath engine 804 detect different data alignments and skip issuing P or Q when no writes are to be issued.

A single cache line request does not straddle page boundaries and thus there is alignment to the nearest cache line. The RAID datapath engine 804 processes a half-cache line (32B) of P and Q input data for each pass through the RAID datapath engine 804, regardless of destination byte alignment. The RAID datapath engine 804 calculates the size of the P and Q requests using the same common descriptor information. The P and Q destination sizes are calculated so that the request size is aligned to the next cache line. For example, if the Address is 0x04, the request size is adjusted to 0x40 - 0x04 = 0x3C, even if there is more P or Q data.
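
The request size calculation in this example can be written as a small helper, sketched here in C assuming a 64-byte cache line (the function name is hypothetical):

    #include <stdint.h>

    #define CL_SIZE 0x40u   /* 64-byte cache line */

    /* Size of the first destination write request: from the destination
     * offset up to the next cache line boundary.  For address 0x04 this
     * returns 0x40 - 0x04 = 0x3C, matching the example above; an aligned
     * destination gets a full cache line. */
    static uint32_t first_request_size(uint64_t dest_addr)
    {
        uint32_t offset = (uint32_t)(dest_addr & (CL_SIZE - 1u));
        return offset ? (CL_SIZE - offset) : CL_SIZE;
    }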

Figures 14A and 14B illustrate dataflow for an unaligned P and Q destination, with a transfer length of four cache lines (eight half-cache lines (HC)) in the RAID datapath engine 804 shown in Fig. 13. In the embodiment shown, each entry 1320 in the P Accumulator 1206 represents a half-cache (HC) line, so the pipeline fills two half-cache (HC) line entries 1320 to write a full cache line. The P destination offset is 4 Bytes. Thus, the first request size is for 60 Bytes (that is, 64 (full cache line) - 4 (offset) bytes). The Q destination offset is 60 Bytes, so the first request size is 4 Bytes (that is, 64 (full cache line) - 60 (offset) bytes). The first P and Q calculation result 1306, 1310 of exactly 32 Bytes (half-cache line) is present in the first pass of calculation. For this embodiment, the P destination's first 60 Bytes are written into the first cache line location in system memory 218.

The destination write requests and data are issued when there is enough data to satisfy the size of the next P and Q request size. If the P and/or Q request size up to the next cache line boundary is less than the amount of P and Q data actually calculated, remainder P and Q data is maintained in separate P and Q residue registers 1308, 1314 and used for the subsequent P and Q data write. P and Q requests and data writes are issued up to the end of the transaction.

The request address and size are calculated from internal counters. The request data uses a combination of the currently calculated data and the residue data 1308, 1314, using a multiplexer (mux) 1304, 1312 with the multiplexer select value determined by the destination offset. The residue bytes are updated for each P and Q write request, storing the bytes that did not fit in the request size. Dependent on the P and Q destination address offsets, P and/or Q residue data is written as additional requests issued after all calculation of input data has been completed.

The Q destination's first 4 bytes are written into the first cache line location. The P request data is not sufficient to satisfy the size of the first P request, so no P request is issued. The Q request data is sufficient to satisfy the size of the first Q request, so a Q request is issued and the unused data is saved into the Q Residue register 1314. The second P and Q calculation result of exactly 32 Bytes is present in the second pass of calculation. The P request data is now sufficient to satisfy the size of the first P request, so the P request is issued and the unused data is saved in the P residue register 1308. The Q request data is not sufficient to satisfy the size of the second Q request, so the data is multiplexed in multiplexer 1312 with the Q residue data, stored in the Write Data FIFO 708 and the Q residue updated with new data.

The subsequent P and Q calculations continue to store data in this manner, updating the write data FIFO 708, P accumulator 1206 in the datapath RAID engine 804 and Q accumulator 1208 in the datapath RAID engine 804 appropriately, until all input bytes have been processed. After all the input bytes have been processed, one additional P transaction of 4 Bytes and one additional Q transaction of 60 Bytes are performed to flush the residue bytes into the Write Data FIFO 708.

P and Q destination alignments are supported to any address alignment and any length. In an embodiment, the P and Q destination alignments are performed while maintaining a combined performance of 3 GigaBytes/second through the simultaneous calculation of P and Q data, requiring only one set of reads through all the multiple input sources. The destination data is aligned up to the next cache line boundary, so that the write requests and data are efficiently transported through the system. This eliminates the need to also align the data up to the next 4 Kilobyte (page) boundary because the page boundary is a multiple of the Cache line boundary. The independent RAID datafetch and RAID datapath engines each calculate the size of the requests based on the descriptor information common to both the datafetch engine and the datapath engine, following the same rules, not requiring additional logic to coordinate the calculation.

An embodiment has been described for RAID 6 that includes generation of P parity and Q parity. In an alternate embodiment, RAID support is provided by using only the P path to perform the XOR operation for RAID 5 while the Q path is disabled. Fig. 15 is a block diagram of an embodiment of a system that includes a switch 308 coupled between the RAID engine 304 and a coherent protocol interface 1404. In an embodiment, the coherent protocol interface is QuickPath Interconnect (QPI).

The RAID engine 304 issues read requests to the System Memory 218 through the coherent protocol interface. The coherent protocol interface supports out-of-order completions. Out-of-order completions are tracked and associated with multiple transactions even within a single source for a given descriptor 400, 500. The RAID Engine 304 handles read completions that come back out-of-order even though the requests to the system memory 218 are issued in order.

In an embodiment, there are multiple outstanding transactions for a given source, that is, multiple sources have transactions pending. A source does not block the forward progress of other sources. Out-of-order completion returns are tracked to generate the destination data. In an embodiment, the tracking is performed using a source number 1406 and a transaction number 1408 within the source in order to correctly generate the correct destination data for RAID.

The combination of the RAID Engine 304 and the Switch 308 simplifies the handling of data from up to N sources to meet the bandwidth requirements for RAID5/RAID6. In an embodiment, N is 10.

The RAID datafetch engine 802 in the RAID engine 304 translates the number of bytes transferred into a number of cache lines to be issued per source. In an embodiment, the cache line size is 64 Bytes. However, the cache line is not limited to 64 bytes; it may be any power of two (2^N) bytes. Next, the RAID datafetch engine 802 in the RAID engine 304 presents the transaction to the switch 308 and identifies the source.

The switch 308 assigns a tag to the transaction based on the source number 1406. The switch 308 also assigns a tag to the transaction based on the transaction number 1408 for the source because multiple transactions can be pending and in flight for the same source.

The switch 308 divides a single physical FIFO into a plurality of logical FIFOs with each logical FIFO associated with a different source as shown in Fig. 13. For example, in an embodiment, the single physical completion FIFO is divided into N different logical FIFOs with each logical FIFO corresponding to one of the N different sources. Also, each of the logical FIFOs is further divided into multiple transaction FIFOs per source. For example, in an embodiment with four pending transactions per source, the single logical source FIFO is divided into four different logical partitions and tracked separately.

Next, the switch 308 presents the transaction to the coherent protocol interface 1404. The coherent protocol interface 1404 fetches the data from System Memory 218. As the data read from system memory 218 is returned out of order, the Switch 308 decodes which source the returned data is associated with based on the returned tag for that completion. The switch 308 looks at the transaction tag to determine which logical partition in the source logical FIFO the completed transaction corresponds to.

The switch 308 decodes a tag field to first determine the source number (1-N) 1406 and then to determine the transaction 1408 within that source number. The switch 308 puts the completion into the correct partition and into the respective source logical FIFO. A 'source_available' vector, that is, source data 1-N from 1204 to 802 (Fig. 13), is updated to indicate which source and transaction within the source is now available. The RAID datapath Engine 804 is notified that the particular source data is available. The RAID Datapath Engine 804 then proceeds to assemble the data in a serial manner and process the source data to generate the destination data. The data may return out of order from the coherent protocol interface; thus, the RAID Datapath engine 804 includes optimizations to maximize throughput despite this inefficiency.
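
The tag handling can be illustrated with a short C sketch. The tag layout (source number combined with a transaction number) is an assumption made for illustration; the switch 308 in the embodiment may encode these fields differently.

    #include <stdint.h>

    #define MAX_SOURCES       10u   /* N sources per descriptor */
    #define TXNS_PER_SOURCE    4u   /* pending transactions per source */

    /* Hypothetical tag layout: source number and transaction number packed
     * into one completion tag so out-of-order returns can be steered into
     * the right logical FIFO partition. */
    static uint8_t make_tag(uint8_t source, uint8_t txn)
    {
        return (uint8_t)((source * TXNS_PER_SOURCE) + (txn % TXNS_PER_SOURCE));
    }

    static void decode_tag(uint8_t tag, uint8_t *source, uint8_t *txn)
    {
        *source = tag / TXNS_PER_SOURCE;   /* which source logical FIFO */
        *txn    = tag % TXNS_PER_SOURCE;   /* which partition within it */
    }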

The RAID Datapath engine 804 determines which source to calculate next, based on the source_available vector, not necessarily in a numerical order. If a given source has sufficient data available, determined by the source address offset and indicated in the 'source_available' vector, then that source is selected to be processed next. By selecting sources in this manner, the overall bandwidth is improved as data is processed as it arrives with minimal stalling. This is possible because the data can be calculated in any source order without compromising the RAID calculations.

The switch 308 has mechanisms to optimize the use of the partition that just freed up for a given source to ensure that there are sufficient transactions pending in the pipeline to minimize system read latency. The switch 308 uses the availability of the partition for a given source to throttle subsequent transactions from making forward progress.

For example, in an embodiment with eight sources, a cache line request can be issued per source, while waiting for the cache line to be returned for each source, another cache line request can be issued per source. As the data can be returned out of order, each request includes a source number 1406 and a transaction number (first or second cache line request) 1408.

Also, to ensure that the availability of one source does not block another, the switch 308 has mechanisms to make forward progress for another source if the completion space corresponding to a given source is not empty.

Two RAID descriptors 400, 500 have been discussed in conjunction with Figs. 4 and 5. Each of these RAID descriptors 400, 500 are base descriptors and are one cache line in size. Fig. 16 is a block diagram illustrating an embodiment of an extended descriptor 1500.

An extended descriptor 1500 is an extension to a base descriptor 400, 500 for RAID operations that require more than 64B (cache line) of information. The next descriptor address 410 in the base descriptor 400 points to the first extended descriptor 1500. If there is more than one extended descriptor, the next descriptor address 1508 of each intermediate extended descriptor 1500 points to the next extended descriptor 1500, with the next descriptor address 1508 in the last extended descriptor 1500 pointing to the next base descriptor 500. The RAID Engine 304 supports processing both extended descriptors and base descriptors.
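
The chaining of base and extended descriptors can be modeled in C as follows. The field names and the number of source address slots are hypothetical; the sketch only illustrates that the next descriptor address field links a base descriptor to zero or more extended descriptors, with the last extended descriptor pointing at the next base descriptor.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define SRC_SLOTS 8u   /* assumed source address slots per 64B descriptor */

    typedef struct descriptor {
        bool     extended;                  /* base or extended descriptor */
        uint64_t src_addr[SRC_SLOTS];       /* source address slots (0 = unused) */
        struct descriptor *next;            /* next descriptor address field */
    } descriptor_t;

    /* Walk one base descriptor and any extended descriptors that follow it,
     * collecting all source addresses for the operation into 'out' (the
     * caller provides sufficient space).  Returns a pointer to the next
     * base descriptor in the chain, or NULL at the end of the chain. */
    static const descriptor_t *collect_sources(const descriptor_t *base,
                                               uint64_t *out, size_t *count)
    {
        *count = 0;
        const descriptor_t *d = base;
        do {
            for (size_t i = 0; i < SRC_SLOTS && d->src_addr[i] != 0; i++)
                out[(*count)++] = d->src_addr[i];
            d = d->next;
        } while (d != NULL && d->extended);
        return d;
    }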

The extended descriptor 1500 shown in Fig. 16 is used when all of the source addresses cannot be specified within the source address fields in a base descriptor 400, 500. Extended descriptors need special handling during commands such as Suspend (suspend current descriptor processing) and Reset (reset the DMA). The Reset DMA command is issued to the DMA controller 214 to recover when a DMA channel has hung, that is, the DMA channel has stopped processing descriptors. The Reset DMA command can be issued at any time and executes immediately, that is, the processing of the current descriptor is not completed. The Reset DMA command causes the DMA channel to return to a known state (HALTED). A Suspend DMA command is issued to the DMA controller 214 to suspend the current DMA transfer. The DMA channel halts at the current descriptor boundary.

Returning to Fig. 4, the descriptor control field 402 in the base descriptor 400 includes an indication as to whether descriptor processing is to be stalled until processing of the current base descriptor is completed. However, even if processing of the current base descriptor is stalled, the descriptor fetch engine 301 fetches the extended descriptor because the source addresses stored in the extended descriptor 1500 are needed to complete the operation in the base descriptor 400.

The RAID datafetch engine 802 decodes the base descriptor 400 and determines whether an extended descriptor 1500 is needed to complete processing of the operation in the base descriptor. If so, the RAID datafetch engine 802 gives up the timeslot and stalls until the descriptor fetch engine 301 fetches the extended Descriptor 1500. The RAID datafetch engine 802 also decodes the relevant fields in the extended descriptor 1500 and issues the fetches for all the sources.

Upon completion of the extended or base descriptor, the RAID datafetch engine 802 ensures that the descriptor FIFO 302 is updated appropriately.

The RAID datapath engine 804 processes the source data and writes the generated destination data. The RAID datapath engine 804 decodes the descriptor (base and extended) and uses the source data completions to complete processing the descriptor. The RAID datapath engine 804 determines whether the descriptor is a base descriptor 400 or an extended descriptor 1500. For example, the RAID datapath engine 804 can determine the type of descriptor (base or extended) based on the number of source address pointers included in the descriptor. The RAID datapath engine 804 decodes the source and destination addresses according to the unique mapping of either the extended or base Descriptor. After reading the extended descriptor 1500 or base descriptor 400, the RAID datapath engine 804 ensures that the descriptor FIFO 302 is updated correctly. The RAID datapath engine 804 also decrements an internal DMA count that tracks the number of processed descriptors correctly when the extended descriptors are processed in addition to handling the base descriptors.

At the end of descriptor processing, the RAID datapath engine 804 provides a completion response irrespective of whether the descriptor completion is for an extended descriptor or a base descriptor. The arbiter 303 detects when the RAID datafetch engine 802 is stalled while waiting for the extended descriptor 1500 and ensures that the extended descriptor 1500 is fetched before scheduling the RAID datafetch engine 802 for that DMA channel.

The descriptor fetch engine 301 has two windows of exposure, that is, (a) after the base descriptor 400 is fetched but before the descriptor fetch engine 301 is scheduled to fetch the extended descriptor, and (b) after the base descriptor has been fetched and after the descriptor fetch engine 301 is scheduled to fetch the extended descriptor. If the RAID datafetch engine 802 has decoded the base descriptor 400 and there is a request to suspend or reset, instead of waiting to fetch the extended descriptor 1500, the RAID datafetch engine 802 processes the base descriptor and updates the descriptor FIFO 302.

If the RAID datafetch engine 802 has decoded the base descriptor 400 and there are no requests to suspend or reset the DMA channel, the RAID datafetch engine 802 waits for the extended descriptor prior to processing the base descriptor 400. If a request to Suspend or Reset of the DMA channel occurs while the RAID datafetch engine 802 is waiting for the extended descriptor, the RAID datafetch engine 802 processes the base descriptor and pops the base descriptor from the descriptor FIFO 302.

If the RAID datafetch engine 802 has decoded the base descriptor 400, there is a pending request to suspend or reset, and the extended descriptor is available, the RAID datafetch engine 802 processes the base descriptor and the extended descriptor and pops both descriptors from the descriptor FIFO 302.

It will be apparent to those of ordinary skill in the art that methods involved in embodiments of the present invention may be embodied in a computer program product that includes a computer usable medium. For example, such a computer usable medium may consist of a read only memory device, such as a Compact Disk Read Only Memory (CD ROM) disk or conventional ROM devices, or a computer diskette, having a computer readable program code stored thereon.

An apparatus comprising first logic to perform a Direct Memory Access (DMA) operation to transfer a first block of data from a first source to a first destination, second logic to perform a Redundant Array of Independent Disks (RAID) operation on second blocks of data retrieved from a plurality of second sources and fetch logic shared by the first logic and the second logic, the fetch logic to retrieve a first descriptor from memory shared by the first logic and the second logic, the memory storing a first descriptor identifying the DMA operation to be performed by the first logic and a second descriptor identifying the RAID operation to be performed by the second logic, the operations to be performed concurrently by the first logic and the second logic.

The apparatus comprising a descriptor decoder First In First Out (FIFO) to store the first descriptor and the second descriptor and to maintain the required order of the descriptors irrespective of the order in which the descriptors are processed by the first logic and the second logic. The second logic concurrently generates a P syndrome and a Q syndrome in a single pass through the second blocks of data.

The apparatus comprising upstream logic coupled to first and second logic to provide access to the memory.

The apparatus comprising control logic coupled to the first logic and the second logic, the control logic to track a number of descriptors to be processed in order to determine whether there is any work to be performed by the first logic and/or the second logic.

The first descriptor and the second descriptor are retrieved from a chain of descriptors stored in the memory.

The first logic and the second logic to concurrently determine whether each of the first descriptor and the second descriptor is a RAID descriptor or a non-RAID descriptor, the first logic to process the non-RAID descriptor and the second logic to process the RAID descriptor.

The apparatus further comprising an arbiter coupled to the first logic and the second logic, the arbiter to provide exclusive access to the plurality of second sources to the second logic for a dynamically expandable time period.

The arbiter determines from system resources whether to allocate the dynamically expandable time period to the second sources.

The fetch logic to break up an initial data transfer size greater than a data transfer size supported by a DMA channel into smaller transfer lengths supported by the DMA channel by recycling the initial descriptor for use by subsequent data transfer operations by modifying address fields in the initial descriptor until the initial block size has been transferred using a plurality of separate DMA operations with each DMA operation having a data transfer size less than or equal to the data transfer size supported by the DMA channel. A single interrupt is generated after the initial block size has been transferred.

The first logic maps a plurality of variable length RAID streams generated by the RAID operation to a single output logical stream. The plurality of variable length streams can include a P syndrome, a Q syndrome and a completion stream.

The second logic manages data misalignment of data read from the plurality of second sources. The number of second sources can be 10 for a validate operation and 8 for a generation operation. The second logic manages alignment of data to be written to a destination to store a result of the RAID operation. The result can include a P syndrome and a Q syndrome for a RAID 6 system.

The second logic manages out of order completions for the RAID operation.

In an embodiment, the plurality of sources is 2 to 10.

The second logic to handle an extended descriptor to identify other sources in addition to the plurality of sources identified in a base descriptor. The second logic to handle both extended descriptors and base descriptors during normal operation and to handle extended descriptors upon detecting a command to suspend processing of descriptors or a command to reset processing of descriptors.

A method comprising performing, by a first logic, a Direct Memory Access (DMA) operation to transfer a first block of data from a first source to a first destination, performing, by a second logic, a Redundant Array of Independent Disks (RAID) operation on a plurality of second sources and retrieving, by fetch logic shared by the first logic and the second logic, a descriptor from memory shared by the first logic and the second logic, the memory storing a first descriptor identifying the DMA operation to be performed by the first logic and a second descriptor identifying the RAID operation to be performed by the second logic, the operations to be performed concurrently by the first logic and the second logic.

Storing the first descriptor and the second descriptor in a descriptor decoder First In First Out (FIFO) and maintaining the required order of the descriptors in the descriptor decoder FIFO irrespective of the order in which the descriptors are processed by the first logic and the second logic. The second logic concurrently generates a P syndrome and a Q syndrome in a single pass through the second blocks of data.

Providing access to the memory by upstream logic coupled to the first and second logic.

Tracking a number of descriptors to be processed in order to determine whether there is any work to be performed by the first logic and/or the second logic, the tracking performed by control logic coupled to the first logic and the second logic.

Retrieving the first descriptor and the second descriptor from a chain of descriptors stored in the memory. The first logic and the second logic to concurrently determine whether each of the first descriptor and the second descriptor is a RAID descriptor or a non-RAID descriptor, the first logic to process the non-RAID descriptor and the second logic to process the RAID descriptor.

Providing exclusive access to the plurality of second sources to the second logic for a dynamically expandable time period and determining from system resources whether to allocate the dynamically expandable time period to the second sources.

Upon detecting a request for an initial data transfer size greater than data transfer size supported by the DMA operation, breaking the initial data transfer size into smaller transfer lengths supported by the DMA operation by recycling the initial descriptor for use by subsequent data transfer operations by modifying address fields in the initial descriptor until the initial block size has been transferred using a plurality of separate DMA operations with each DMA operation having a data transfer size less than or equal to the data transfer size supported by the DMA operation. A single interrupt can be generated after the initial block size has been transferred.

Mapping a plurality of variable length RAID streams generated by the RAID operation to a single output logical stream by the first logic. The plurality of variable length streams can include a P syndrome, a Q syndrome and a completion stream.

Managing, by the second logic, data misalignment of data read from the plurality of second sources. The number of second sources can be 10 for a validate operation and 8 for a generation operation.

Managing, by the second logic, alignment of data to be written to a destination to store a result of the RAID operation. The result can include a P syndrome and a Q syndrome for a RAID 6 system.

Managing, by the second logic, out of order completions for the RAID operation. The plurality of sources can be 2 to 10.

In an embodiment, upon detecting a suspend or reset state, handling, by the second logic, an extended descriptor. The second logic can handle both extended descriptors and base descriptors during normal operation. The second logic can also handle extended descriptors upon detecting a command to suspend processing of descriptors or a command to reset processing of descriptors.

An article including a machine-accessible medium having associated information, wherein the information, when accessed, results in a machine performing (1) a Direct Memory Access (DMA) operation to transfer a first block of data from a first source to a first destination by a first logic (2) a Redundant Array of Independent Disks (RAID) operation on a plurality of second sources by a second logic and (3) retrieving, by fetch logic shared by the first logic and the second logic, a descriptor from memory shared by the first logic and the second logic, the memory storing a first descriptor identifying the DMA operation to be performed by the first logic and a second descriptor identifying the RAID operation to be performed by the second logic, the operations to be performed concurrently by the first logic and the second logic.

A system includes a Redundant Array of Independent Disks (RAID) and a processor. The processor comprising first logic to perform a Direct Memory Access (DMA) operation to transfer a first block of data from a first source to a first destination in the RAID, second logic to perform a RAID operation on a plurality of second sources and fetch logic shared by the first logic and the second logic, the fetch logic to retrieve a descriptor from memory shared by the first logic and the second logic, the memory storing a first descriptor identifying the DMA operation to be performed by the first logic and a second descriptor identifying the RAID operation to be performed by the second logic, the operations to be performed concurrently by the first logic and the second logic.

While embodiments of the invention have been particularly shown and described with references to embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the scope of embodiments of the invention encompassed by the appended claims.