

Title:
PACKET PROCESSING SYSTEM ON CHIP DEVICE
Document Type and Number:
WIPO Patent Application WO/2010/076649
Kind Code:
A2
Abstract:
The present invention relates to a packet processing system on chip device that estimates the queue fill level for a particular type of data stored in the data storage device, enables sharing of the data storage unit between different units that have different traffic characteristics as per their traffic requirements while ensuring that traffic from any one does not adversely impact the others, enhances the performance of the memory by modifying the memory access pattern, and maintains a certain rate of reading the packet fragments from the DDR at the transmission end.

Inventors:
MALIK RAKESH KUMAR (IN)
BERT KLAPS (BE)
HANSPAL JAGMEET SINGH (IN)
KUNAL PRASAD (IN)
GUJRAL AMANDEEP SINGH (IN)
SHANLEY TIMOTHY M (US)
SURI KAPIL (IN)
GUPTA DINESH (IN)
BARMAN ARUN KUMAR (IN)
ANAND PRASHANT (IN)
PUROHIT MILAN VINODRAI (IN)
GUPTA NEERAJ (IN)
BANERJEE PRADEEPT KUMAR (IN)
DIXIT PRIYADARSHINI (IN)
Application Number:
PCT/IB2009/007914
Publication Date:
July 08, 2010
Filing Date:
December 31, 2009
Assignee:
TRANSWITCH INDIA PVT LTD (IN)
MALIK RAKESH KUMAR (IN)
BERT KLAPS (BE)
HANSPAL JAGMEET SINGH (IN)
KUNAL PRASAD (IN)
GUJRAL AMANDEEP SINGH (IN)
SHANLEY TIMOTHY M (US)
SURI KAPIL (IN)
GUPTA DINESH (IN)
BARMAN ARUN KUMAR (IN)
ANAND PRASHANT (IN)
PUROHIT MILAN VINODRAI (IN)
GUPTA NEERAJ (IN)
BANERJEE PRADEEPT KUMAR (IN)
DIXIT PRIYADARSHINI (IN)
International Classes:
H04L12/46; G06F15/78; G06F17/50; G06F21/00; H04L12/42; H04L12/56
Domestic Patent References:
WO2008084179A12008-07-17
WO1996013922A21996-05-09
Foreign References:
US20080183884A12008-07-31
US20080181246A12008-07-31
EP1482402A22004-12-01
US20030021269A12003-01-30
US20080165795A12008-07-10
Attorney, Agent or Firm:
GOPALAN, Deepak, Sriniwas et al. (B.K. House Plot No. 109,Sector-44,Gurgaon 2, Haryana, IN)
Claims:
We Claim:

1. A packet processing system on chip device, comprising: a buffer manager coupled with the storage device configured to determine storage device fill level; and a packet processing unit in communication with the buffer manager and being configured to obtain at least a first data type contribution factor or an approximated first data type contribution factor and determine a dynamic queue fill level for at least a first data type stored in said storage device.

2. The system as claimed in claim 1, wherein the buffer manager comprises: a plurality of memory locations grouped together to form at least one 1st block; and a plurality of memory locations grouped together to form a 2nd block.

3. The system as claimed in claim 1, wherein the packet processing unit is communicatively coupled to an input interface for receiving incoming packets of data.

4. A packet processing system on chip device, comprising: a buffer manager configured to enable dequeuing of data stored on storage device, said storage device comprising: a plurality of memory locations grouped together to form at least one 1st block; a plurality of memory locations grouped together to form a 2nd block; and characterized in that the 2nd block comprises at least one 3rd block; at least one of the 1st block comprises a plurality of 5th blocks linked to each other by at least one explicit linkage; and each of the 5th blocks comprising a plurality of 6th blocks which are linked to each other by at least one implicit linkage and at least one of the 6th block contained in the at least one of the 1st block comprises at least one explicit linkage to the said at least one 3rd block.

5. The system as claimed in claim 4, wherein the said 3rd block comprises at least one 4th block.

6. The system as claimed in claim 4, wherein when the said 3rd block comprises "n" number of 4th blocks, at most "n-1" explicit linkages are present between the said "n" number of 4th blocks.

7. The system as claimed in claim 6, wherein if the number "n-1" is equal to 2, a first of the said two explicit linkages links a first 4th block to a second 4th block and a second of the said two explicit linkages links:

(a) a second 4th block to a third 4th block; or (b) a first 4th block to a third 4th block.

8. The system as claimed in claim 4, wherein when the said 1st block comprises "y" number of 5th blocks, at most "y-1" explicit linkages are present between the said "y" number of 5th blocks.

9. The system as claimed in claim 8, wherein if the number "y-1" is equal to 2, a first of the said two explicit linkages links a first 5th block to a second 5th block and a second of the said two explicit linkages links:

(a) a second 5th block to a third 5th block; or (b) a first 5th block to a third 5th block.

10. The system as claimed in claim 5, wherein the 4th block comprising a plurality of 7th blocks which are linked to each other by at least one implicit linkage.

11. The system as claimed in claim 4, wherein in respect of a non-self-contained packet, the data stored in the at least one of the 1st block comprises header-related information.

12. The system as claimed in claim 4, wherein in respect of a self contained packet (SCP), the data stored in the at least one of the 1st block comprises the entire packet.

13. The system as claimed in claim 4, wherein the data stored in the 2nd block represents tail fragments and/or body fragments.

14. The system as claimed in claim 4, wherein each of the 6th blocks contains either a header fragment of a packet or a SCP packet.

15. The system as claimed in claim 4, wherein the buffer manager is communicatively coupled to an output interface.

16. The system as claimed in claim 4, wherein the buffer manager is communicatively coupled to an arbitration unit for accessing shared real time resource.

17. A packet processing system on chip device, comprising: an arbitration unit comprising: a real time resource; a plurality of clients connected to the real time resource and sharing the same; and characterized in that the plurality of clients are connected to the real time resource through an arbitration unit which is configured to receive requests from the said plurality of clients, automatically allocate a predetermined initial priority level (IPL) for each of the requests thus received, wherein the priority values thus allocated to the requests from a particular client are selected from a priority level group (PLG) allocated to that client based on a predetermined priority level allocation method; and change the priority level of a particular request from the initial priority level (IPL) to a final priority level (FPL), if the said request is not serviced within a predetermined time period (Tth), a value of the said final priority level (FPL) being equal to or greater than a value of the said initial priority level (IPL).

18. The system as claimed in claim 17, wherein the arbitration unit is communicatively coupled to at least one or more processing units and/or DMA channels and/or buffer manager for allowing access to the shared resource.

19. The system as claimed in claim 17, wherein the arbitration unit is operatively coupled to a DDR interface unit.

20. A packet processing system on chip device, comprising: a DDR interface unit comprising: a) a means for receiving a plurality of read and write commands; b) a means for storing the read commands in a read command queue (RCQ) and storing the write commands in a write command queue (WCQ); c) a means for scheduling and reordering the read commands thus stored in the RCQ and/or the write commands thus stored in the WCQ; and d) a means for cyclically accessing at least one write command thus contained in the WCQ and at least one read command thus contained in the RCQ; wherein the means for scheduling and reordering is configured to: i) prioritize issuance of successive commands on different banks of the memory; and ii) prioritize issuance of successive commands to a same row of a particular bank of the memory; and wherein the means for cyclically accessing is configured to cyclically access the WCQ and the RCQ in such a manner so as to minimize bus turnovers with respect to a predetermined number of read and write commands.

21. The system as claimed in claim 20, wherein the said DDR interface unit being operatively coupled to a shared resource and an arbitration unit.

22. A method comprising: a) obtaining either a first data type contribution factor or an approximated first data type contribution factor; b) obtaining a system fill level; and c) calculating the dynamic queue fill level for said first data type stored in the system based on the obtained first data type contribution factor or an approximated first data type contribution factor and the system fill level.

23. The method as claimed in claim 22, wherein the first data type contribution factor is obtained by configuring a predetermined constant value.

24. The method as claimed in claim 22, wherein the first data type contribution factor is obtained by the user repeatedly setting and resetting the contribution factor at user defined intervals.

25. The method as claimed in claim 22, wherein the first data type contribution factor is obtained by: a) determining a volume of the first data type being sent for storage onto the system over a period of time; b) determining a total volume of all other data types being sent for storage onto the system over the aforesaid period of time; and c) calculating the first data type contribution factor as: contribution factor = (volume of first data type / volume of all other data types).

26. The method as claimed in any of claims 22 to 25, wherein the dynamic queue fill level for said first data type stored in the system is calculated as:

Dynamic queue fill level = system fill level x contribution factor

27. The method as claimed in claim 22, wherein calculating the approximated first data type contribution factor comprises the step of determining the volume of the first data type and the total volume of all other data types.

28. The method as claimed in claim 27, wherein if the volume of the first data type is 0, the approximated first data type contribution factor is assigned a value of 0, the system fill level is completely assigned to dynamic fill level for the other data types and the dynamic fill level for the first data type is assigned a value of 0.

29. The method as claimed in claim 27, wherein if the volume of the first data type is a nonzero value and is less than the total volume of all other data types, a digital value representative of the volume of the first data type is left shifted repeatedly till its value is equal to or greater than a digital value representative of the volume of all other data types, wherein a number of times the left shifting is performed is used to calculate the approximated first data type contribution factor and the system fill level is right shifted for a number of times equal to that used for calculating the approximated first data type contribution factor to obtain the dynamic queue fill level for said first data type.

30. The method as claimed in claim 29, further comprising calculating dynamic queue fill level for all other data types by subtracting the dynamic queue fill level for said first data type from the system fill level.

31. The method as claimed in claim 27, wherein if the volume of all other data types is a non-zero value and is less than the total volume of the first data type, a digital value representative of the volume of all other data types is left shifted repeatedly till its value is equal to or greater than a digital value representative of the volume of the first data type, wherein a number of times the left shifting is performed is used to calculate the approximated all other data types contribution factor and the system fill level is right shifted for a number of times equal to that used for calculating the approximated all other data types contribution factor to obtain the dynamic queue fill level for all other data types.

32. The method as claimed in claim 30, further comprising calculating dynamic queue fill level for the first data type by subtracting the dynamic queue fill level of all other data types from the system fill level.

33. The method as claimed in claim 22, further comprising calculating the dynamic queue fill level for the other data types based on the values of the system fill level and the calculated queue fill level for the said first data type.

34. A method comprising:

(a) receiving a plurality of header requests, wherein the headers are stored in at least one 1st block;

(b) reading header data corresponding to each of the said plurality of header requests, wherein each of the header data is stored in a separate 6th block in the said at least one 1st block;

(c) transmitting each of the header data thus read; and

(d) based on the data thus contained in each of the "n" number of 6th blocks, receiving further requests for reading data.

35. The method as claimed in claim 34, wherein a second of the header requests is receivable prior to transmission of the header data in respect of the first header request.

36. The method as claimed in claim 34, wherein the plurality of the header requests thus received correspond to headers, all of which are stored in a single 1st block.

37. The method as claimed in claim 34, wherein the plurality of the header requests thus received correspond to headers, no two of which are stored in a single 1st block.

38. The method as claimed in claim 34, wherein the plurality of the header requests thus received correspond to headers, some of which are stored in a single 1st block.

39. The method as claimed in claim 34, wherein the data thus contained in a particular 6th block thus read indicates presence of only a tail fragment in a 3rd block of a 2nd block.

40. The method as claimed in claim 34, wherein the data thus contained in a particular 6th block thus read indicates presence of the tail fragment and at least one body fragment in a 4th block of the 3rd block.

41. The method as claimed in claim 34, wherein the data thus contained in a particular 6th block thus read indicates presence of the tail fragment and a plurality of body fragments in at least two 4th blocks of the 3rd block.

42. The method as claimed in claim 34, wherein the data thus contained in a particular 6th block thus read indicates absence of both of the tail fragment and a body fragment in the 2nd block.

43. The method as claimed in any of claims 38 to 41, optionally comprising receiving request(s) for reading data thus contained in the 3rd block of the 2nd block based on the data thus contained in the 6th block.

44. The method as claimed in claim 43, wherein each of the 6th block contains either a header fragment of a packet or a SCP packet; and the 2nd block comprises tail fragment and/or headless body fragments.

45. The method as claimed in claim 39, wherein if the data thus contained in the 6th block indicates presence of only the tail fragment in the 3rd block of the 2nd block, the method further comprises: (a) receiving a request for reading the tail fragment, said request comprising an identification (address) of the 4th block contained in the 3rd block where the tail fragment is stored;

(b) reading the tail fragment thus contained in the identified 4th block of the said 3rd block; and (c) transmitting the tail fragment thus read.

46. The method as claimed in claim 40, wherein if the data thus contained in the 6th block indicates presence of the tail fragment and at least one body fragment in the 4th block of the 3rd block, the method further comprises: (a) receiving a plurality of requests for reading the at least one body fragment and the tail fragment; wherein a first of the said plurality of requests comprises an address of the 4th block in the 3rd block where the at least one body fragment and the tail fragment are stored and the number of requests thus received is less than or equal to a number of 7th blocks thus contained in the 4th block;

(b) reading the at least one body fragment and the tail fragment thus contained in the identified 4th block; and

(c) transmitting the data thus read.

47. The method as claimed in claim 41, wherein if the data thus contained in the 6th block indicates presence of a tail fragment and a plurality of body fragments, that are contained in a plurality of 4th blocks in the 3rd block, the method comprises:

(a) receiving a plurality of requests for reading the plurality of body fragments in a first of the plurality of 4th blocks; wherein a first of the said plurality of requests comprises an address of the first 4th block and the number of requests thus received is equal to a number of 7th blocks thus contained in the 4th block;

(b) reading the plurality of body fragments thus contained in the first 4th block and transmitting the data thus read; and

(c) based on the data thus read, receiving at least one further request for reading data contained in at least one further block of the plurality of 4th blocks.

48. The method as claimed in claim 47, wherein in step (c), the maximum number of requests thus received is given by: [((x-1) * n) + y], wherein, 'x' is the number of explicit linkages thus read in step (b); 'n' is the number of 7th blocks in a 4th block; and 'y' is the number of 7th blocks containing either tail or body fragment in the last 4th block.

49. A method comprising: a. receiving requests from the said plurality of clients, b. automatically allocating a predetermined initial priority level (IPL) for each of the requests thus received, wherein the priority values thus allocated to the requests from a particular client are selected from a priority level group (PLG) allocated to that client based on a predetermined priority level allocation method; and c. changing the priority level of a particular request from the initial priority level (IPL) to a final priority level (FPL), if the said request is not serviced within a predetermined time period (Tth), a value of the said final priority level (FPL) being equal to or greater than a value of the said initial priority level (IPL).

50. The method as claimed in claim 49, wherein the priority level group (PLG) comprises at least one initial priority level (IPL), the value of which can vary from a lowest priority level to a highest priority level.

51. The method as claimed in claim 49, wherein the priority level group (PLG) comprises two initial priority levels (IPL), the values of the said two initial priority levels (IPL) are not equal to each other.

52. The method as claimed in claim 49, wherein the priority level group (PLG) allocated to a particular client is selected independent of the priority level group(s) (PLG) allocated to other client(s).

53. The method as claimed in any of claims 49 to 52, wherein the value of the at least one initial priority level (IPL) contained in the priority level group (PLG) allocated to a particular client is selected independent of the value of the at least one initial priority level (IPL) contained in the priority level group(s) (PLG) allocated to other client(s).

54. The method as claimed in any of claims 49 to 53, wherein the values of the two initial priority levels (IPL) contained in the priority level group (PLG) allocated to a particular client are selected independent of the values of the two initial priority levels (IPL) contained in the priority level group(s) (PLG) allocated to other client(s).

55. The method as claimed in claim 49, wherein the final priority level (FPL) allocated to a particular client is selected independent of the final priority level(s) allocated to other client(s).

56. The method as claimed in claim 49, wherein the predetermined time period (Tth) allocated to a particular client is selected independent of the predetermined time period(s) allocated to other client(s).

57. The method as claimed in claim 49, wherein the predetermined time period (Tth) for a particular client is calculated starting from a last time a command from that particular client is serviced.

58. The method as claimed in any of claims 49 to 57, wherein if the priority level group (PLG) thus allocated to a particular client comprises two distinct initial priority levels (IPL), for the predetermined time period (Tth) immediately after a last command from that client is serviced, the request(s) thus received from that particular client is allocated a first initial priority level.

59. The method as claimed in any of claims 49 to 58, wherein if the priority level group (PLG) thus allocated to a particular client comprises two distinct initial priority levels (IPL) and if no request is received for the predetermined time period (Tth) immediately after a last command from that client is serviced, the request thus received thereafter is allocated a second initial priority level.

60. The method as claimed in claim 49, wherein the predetermined time period (Tth) represents the maximum time period taken to change the priority level of an un-serviced request from the initial priority level (IPL) to the final priority level (FPL).

61. The method as claimed in any of claims 49 to 60, wherein the time period taken to change the priority level of an un-serviced request from the initial priority level (IPL) to the final priority level (FPL) can vary from a zero value to the predetermined time period (Tth).

62. A method comprising:

(a) receiving a plurality of read and write commands;

(b) storing the read commands in a read command queue (RCQ) and storing the write commands in a write command queue (WCQ);

(c) scheduling and reordering the read commands thus stored in the RCQ and/or the write commands thus stored in the WCQ; and

(d) cyclically accessing at least one write command thus contained in the WCQ and at least one read command thus contained in the RCQ; wherein in step (c) scheduling and reordering is performed so as to: (i) prioritize issuance of successive commands on different banks of the memory; and (ii) prioritize issuance of successive commands to a same row of a particular bank of the memory; and wherein in step (d) the cyclic access of the WCQ and the RCQ is performed in such a manner so as to minimize bus turnovers with respect to a predetermined number of read and write commands.

63. The method as claimed in claim 62, wherein the read command queue (RCQ) comprises a plurality of sub read command queues (RCQ1 to RCQn), each of the sub read command queues being mapped to a particular bank on the memory and the read commands thus received are stored in step (b) in the said plurality of sub read command queues (RCQ1 to RCQn), wherein the read commands thus stored in a particular sub read command queue (RCQ1 or RCQ2 or ... RCQn) correspond to a particular bank (B1 or B2 or ... Bn) on the memory.

64. The method as claimed in claim 62, wherein the write command queue (WCQ) comprises a plurality of sub write command queues (WCQ1 to WCQn), each of the sub write command queues being mapped to a particular bank on the memory and the write commands thus received are stored in step (b) in the said plurality of sub write command queues (WCQ1 to WCQn), wherein the write commands thus stored in a particular sub write command queue (WCQ1 or WCQ2 or ... WCQn) correspond to a particular bank (B1 or B2 or ... Bn) on the memory.

65. The method as claimed in any of claims 63 or 64, wherein the read commands thus stored in a particular sub read command queue are stored on the basis of their time of receipt (first come first in) and similarly the write commands thus stored in a particular sub write command queue are stored on the basis of their time of receipt (first come first in).

66. The method as claimed in claim 62, wherein in step (c), the commands thus stored in a particular command queue (i.e. the read command queue or the write command queue) are accessed such that they are from different sub command queues so as to correspond to interleaved banks on the memory.

67. The method as claimed in claim 66, wherein a particular sub write command queue (WCQ1 or WCQ2 or ... WCQn) to be accessed for servicing within the write command queue (WCQ) is determined using a round robin methodology and similarly, a particular sub read command queue (RCQ1 or RCQ2 or ... RCQn) to be accessed for servicing within the read command queue (RCQ) is determined using a round robin methodology.

68. The method as claimed in any of claims 62 to 67, wherein the reordering comprises reordering the commands thus stored in a particular sub command queue (i.e. a particular sub read command queue or a particular sub write command queue).

69. The method as claimed in claim 68, wherein the reordering of the commands thus stored in a particular sub command queue is performed to maximize the number of transactions performed on a particular row of the bank on the memory.

70. The method as claimed in claim 68, wherein reordering the commands thus stored in a particular sub-queue comprises:

(e) determining for the first command stored in the particular sub-command queue a row identification number;

(f) determining for a predetermined number of subsequent commands stored in the same sub-queue, their respective row identification numbers; and (g) increasing a priority value of servicing of at least one subsequent command thus stored in the particular sub-queue having the same row identification number as that of the first command.

71. The method as claimed in claim 70, wherein in step (b), the predetermined number is user definable.

72. The method as claimed in claim 62, wherein a switch from accessing the write command queue to accessing a read command queue occurs in step (d) when:

(h) a predetermined number of write commands have been serviced; and (i) there is a read command available in the read command queue.

73. The method as claimed in claim 62, wherein a switch from accessing the write command queue to accessing a read command queue occurs in step (d) when:

(j) a time gap with respect to a previous write command exceeds a predetermined threshold value; AND

(k) there is a read command available in the read command queue.

74. The method as claimed in claim 62, wherein a switch from accessing the read command queue to accessing a write command queue occurs in step (d) when: (l) a predetermined number of read commands have been serviced; and

(m) there is a write command available in the write command queue.

75. The method as claimed in claim 62, wherein a switch from accessing the read command queue to accessing a write command queue occurs in step (d) when: (n) a time gap with respect to a previous read command exceeds a predetermined threshold value; AND (o) there is a write command available in the write command queue.

76. The method as claimed in claim 62, further comprising providing a method that enables avoiding the rise of a Read-after-Write hazard due to the scheduling and reordering.

77. The method as claimed in claim 76, wherein the method that enables avoiding the rise of a Read-after-Write hazard comprises:

(p) receiving a write command, the said write command including a corresponding tag; and (q) broadcasting the tag thus contained in the write command after the write operation has been performed.

78. The method as claimed in claim 77, wherein the broadcasting comprises:

(r) pushing the tag in respect of which the write operation has been performed into a FIFO memory block at a memory clock interval; and

(s) reading and transmitting the data thus contained in the FIFO memory block to a system at a system clock.

79. The method as claimed in claim 77, wherein the method that enables avoiding the rise of a Read-after-Write hazard comprises: in respect of a write command performable on a particular memory location, broadcasting an acknowledgement message (AM1) which is indicative of placement of the said write command in a corresponding sub write command queue (WCQ1 or WCQ2 or ... WCQn); and broadcasting "x" acknowledgement messages (AMx) after the broadcasting of the acknowledgement message of the said write command (AM1), wherein each of the acknowledgement messages (AMx) indicates placement of each of "x" write commands in the said sub write command queue; wherein transmission of a read command in respect of the particular memory location after receipt of the "x" acknowledgement messages enables avoiding the rise of a Read-after-Write hazard.

80. The method as claimed in claim 79, wherein if there is no write command received for a sub write command queue for a predefined interval of time, an acknowledgement message (AM) is issued to enable avoidance of a lock-up condition while waiting for AMx in the absence of other write commands to a given sub write command queue.

Description:
FIELD OF THE INVENTION:

The present invention relates to the area of packet processing system on chip device.

BACKGROUND OF THE INVENTION:

A typical packet processing system on chip (SoC) device is shown in figure 1. Interface processing, packet processing, buffer and traffic management functions are architecturally partitioned as shown in the figure. A control processor is provided to manage the device, while signal processing functions (e.g. voice processing) are performed in the Digital Signal Processor (DSP). Storage requirements of this system are met by having one or more memories (e.g. DDRs) that are shared across different subsystems. Direct movement of data between the storage memory and either the interfaces or some internal blocks is accomplished by having DMA engines.

In the exemplary system shown below for a Passive Optical Network (PON) device that implements an ONT, packets are received from different interfaces (or internal blocks) and are classified into different flows. Flows can be classified per priority, source, destination, port or any other packet characteristics. Packets belonging to different flows are stored in different queues in the DDR; and the Traffic Manager implements the scheduling algorithm that decides the scheduling of the packets for read out. The classification function to decide how to queue a packet is implemented in the Packet Processing engine, which in the exemplary SoC is implemented in firmware. The actual mechanism to manage the queue and enable enqueue and dequeue of data is implemented in the Buffer Manager. The queuing process involves creating a linked list of fragments belonging to a given packet.

The packet processor also performs some control functions based on the fill levels of the queues. The knowledge of a packet entering or leaving a queue, however, is with the Buffer Manager. Therefore, there exists a block that tracks the Enqueue/Dequeue events to maintain a queue-fill status and sends it periodically to the packet processor, which takes further action on it to, for example, generate DBA reports.

There are multiple issues that need to be tackled in such a packet processing system on chip device: (a) The packet processor receives the aggregate queue fill levels from the buffer manager. From the aggregate, it needs to estimate the fraction of bytes that belong to PIR (Peak Information Rate) and the fraction that fall within CIR (Committed Information Rate);

(b) DDR is used as a shared resource that is accessed by different units, all of which have different traffic characteristics. We must enable them to share the DDR as per their traffic requirements and ensure that traffic from any one does not adversely impact the others; and

(c) DDR memory is a critical shared resource. Memory performance is strongly impacted by the memory access pattern due to the critical DDR timing parameters. It is therefore important, in a system that allows so many different types of units to share the same DDR, that a mechanism be provided to push the memory interface to as high a utilization efficiency as possible.

(d) At the transmission end, the packet must not under-run and the required data rate must be maintained. Therefore, we must have a mechanism that, in spite of the latencies involved in reading data from the DDR, still enables us to maintain a certain rate of reading the packet fragments from the DDR.

OBJECT OF THE INVENTION:

Thus the object of the present invention is to provide a packet processing system on chip (SoC) device that addresses at least one of the above described issues faced by the currently available packet processing system on chip (SoC) devices.

BRIEF DESCRIPTION OF THE DRAWINGS:

The features of the present invention are set forth with particularity in the appended claims. The invention itself, together with further features and attendant advantages, will become apparent from consideration of the following detailed description, taken in conjunction with the accompanying drawings. One or more embodiments of the present invention are now described, by way of example only, with reference to the accompanying drawings wherein like reference numerals represent like elements and in which:

Figure 1 illustrates the overall block diagram of the packet processing system on chip device in accordance with an embodiment of the present invention.

Figure 2 illustrates a flowchart of a method for determining a dynamic queue fill level for at least a first data type stored in a storage device.

Figure 3 illustrates a flowchart of a method for determining a dynamic queue fill level for at least a first data type stored in a storage device without using the division operation.

Figure 4 illustrates an algorithm for determining a dynamic queue fill level for at least a first data type stored in a storage device without using the division operation.

Figure 5 is a graph illustrating error in PIR (Peak Information Rate) as a fraction of the Queue Fill level.

Figure 6 is a graph illustrating error in PIR as a fraction of PIR reported.

Figure 7 is a graph illustrating error in PIR as a fraction of PIR reported for a CIR (Committed Information Rate) to PIR ratio <= 1.

Figure 8 illustrates an apparatus for determining a dynamic queue fill level for at least a first data type in accordance with an embodiment of the present invention.

Figure 9 illustrates the overall flow chart of the arbitration method in accordance with the teachings of the present invention.

Figure 10 illustrates the priority levels including the initial and the final priority levels allocated to multiple clients in accordance with a first example.

Figure 11 illustrates the priority levels including the initial and the final priority levels allocated to multiple clients in accordance with a second example.

Figure 12 illustrates the method for allocating the initial priority level and the method of changing the same to the final priority level in respect of a two-client based system.

Figure 13 illustrates the arbitration unit performing the arbitration method in accordance with the teachings of the present invention.

Figure 14 illustrates a simplified diagram of an application environment for maximizing utilization of memory bus bandwidth in accordance with one embodiment of the present invention.

Figure 15 illustrates a block diagram of an interface for maximizing utilization of memory bus bandwidth in accordance with one embodiment of the present invention.

Figure 16 illustrates a flowchart of a method for maximizing utilization of memory bus bandwidth in accordance with one embodiment of the present invention.

Figure 17 illustrates a flowchart of a method for scheduling the commands in accordance with one embodiment of the present invention.

Figure 18 illustrates a flowchart of a method for reordering the commands in accordance with one embodiment of the present invention.

Figure 19 illustrates a functional block diagram of a system to enable avoiding the rise of a Read-after-Write hazard in accordance with one embodiment of the present invention.

Figure 20 illustrates a functional block diagram of a system to enable avoiding the rise of a Read-after-Write hazard in accordance with another embodiment of the present invention.

Figure 21 illustrates a simplified structural diagram of a storage device to enable dequeuing of data in accordance with an embodiment of the present invention.

Figure 22 illustrates a flowchart of the method to enable dequeuing of data stored in the storage device in accordance with one embodiment of the present invention.

Figure 23 illustrates a flowchart of the method of dequeuing data of a packet in accordance with an embodiment of the present invention.

SUMMARY OF THE INVENTION:

Accordingly, the present invention provides a packet processing system on chip device for efficiently and effectively accessing shared resources, comprising a buffer manager coupled with the storage device configured to determine storage device fill level; and a packet processing unit in communication with the buffer manager and being configured to obtain at least a first data type contribution factor or an approximated first data type contribution factor and determine a dynamic queue fill level for at least a first data type stored in said storage device.

The present invention further provides a packet processing system on chip device, comprising a buffer manager configured to enable dequeuing of data stored on storage device, said storage device comprising: a plurality of memory locations grouped together to form at least one 1st block; a plurality of memory locations grouped together to form a 2nd block; and characterized in that the 2nd block comprises at least one 3rd block; at least one of the 1st block comprises a plurality of 5th blocks linked to each other by at least one explicit linkage; and each of the 5th blocks comprising a plurality of 6th blocks which are linked to each other by at least one implicit linkage and at least one of the 6th block contained in the at least one of the 1st block comprises at least one explicit linkage to the said at least one 3rd block.
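To make the block terminology easier to follow, the following Python sketch gives one plausible reading of the hierarchy, assuming that an "implicit linkage" means adjacency within a contiguous array and an "explicit linkage" means a stored pointer or index; the class names and fields are illustrative assumptions and do not come from the specification.

```python
# One plausible reading of the block hierarchy, sketched with dataclasses.
# Implicit linkage = adjacency within a list; explicit linkage = a stored
# reference. All names are illustrative.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class FourthBlock:
    fragments: List[bytes] = field(default_factory=list)  # 7th blocks, implicitly linked
    next_block: Optional["FourthBlock"] = None             # explicit link within the 3rd block

@dataclass
class ThirdBlock:
    head: Optional[FourthBlock] = None   # chain of 4th blocks inside the 2nd block

@dataclass
class SixthBlock:
    data: bytes                                            # header fragment or whole SCP
    fragment_link: Optional[ThirdBlock] = None             # explicit link into the 2nd block

@dataclass
class FifthBlock:
    slots: List[SixthBlock] = field(default_factory=list)  # 6th blocks, implicitly linked
    next_block: Optional["FifthBlock"] = None              # explicit link to the next 5th block
```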

The present invention furthermore provides a packet processing system on chip device, comprising: an arbitration unit comprising: a real time resource; a plurality of clients connected to the real time resource and sharing the same; and characterized in that the plurality of clients are connected to the real time resource through an arbitration unit which is configured to receive requests from the said plurality of clients, automatically allocate a predetermined initial priority level (IPL) for each of the requests thus received, wherein the priority values thus allocated to the requests from a particular client are selected from a priority level group (PLG) allocated to that client based on a predetermined priority level allocation method; and change the priority level of a particular request from the initial priority level (IPL) to a final priority level (FPL), if the said request is not serviced within a predetermined time period (Tth), a value of the said final priority level (FPL) being equal to or greater than a value of the said initial priority level (IPL).
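A minimal sketch of the priority-aging arbitration just described, assuming integer priority levels where a larger value means higher priority; the `Request` and `Arbiter` names, the oldest-wins tie-break, and the measurement of waiting time from request arrival are illustrative assumptions, not the patented implementation.

```python
# Minimal sketch of priority-aging arbitration: each request starts at its
# IPL and is escalated to its FPL if not serviced within Tth time units.
class Request:
    def __init__(self, client_id, ipl, fpl, tth, arrival):
        self.client_id = client_id
        self.priority = ipl   # starts at the initial priority level (IPL)
        self.fpl = fpl        # final priority level (FPL), FPL >= IPL
        self.tth = tth        # per-client aging threshold (Tth)
        self.arrival = arrival

class Arbiter:
    def __init__(self):
        self.pending = []

    def submit(self, request):
        self.pending.append(request)

    def grant(self, now):
        """Escalate requests that waited past Tth to their FPL, then
        service the highest-priority pending request (oldest wins ties)."""
        for r in self.pending:
            if now - r.arrival >= r.tth:
                r.priority = max(r.priority, r.fpl)
        if not self.pending:
            return None
        winner = max(self.pending, key=lambda r: (r.priority, -r.arrival))
        self.pending.remove(winner)
        return winner

# Example: client 0 starts at a lower priority but ages to FPL=7 after Tth=5.
arb = Arbiter()
arb.submit(Request(client_id=0, ipl=2, fpl=7, tth=5, arrival=0))
arb.submit(Request(client_id=1, ipl=4, fpl=4, tth=5, arrival=3))
print(arb.grant(now=6).client_id)  # 0: its request aged past Tth and won
```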

The present invention furthermore provides a packet processing system on chip device, comprising: a DDR interface unit comprising: a) a means for receiving a plurality of read and write commands; b) a means for storing the read commands in a read command queue (RCQ) and storing the write commands in a write command queue (WCQ); c) a means for scheduling and reordering the read commands thus stored in the RCQ and/or the write commands thus stored in the WCQ; and d) a means for cyclically accessing at least one write command thus contained in the WCQ and at least one read command thus contained in the RCQ; wherein the means for scheduling and reordering is configured to: i) prioritize issuance of successive commands on different banks of the memory; and ii) prioritize issuance of successive commands to a same row of a particular bank of the memory; and wherein the means for cyclically accessing is configured to cyclically access the WCQ and the RCQ in such a manner so as to minimize bus turnovers with respect to a predetermined number of read and write commands.

The present invention further provides a method comprising: obtaining either a first data type contribution factor or an approximated first data type contribution factor; obtaining a system fill level; and calculating the dynamic queue fill level for said first data type stored in the system based on the obtained first data type contribution factor or an approximated first data type contribution factor and the system fill level.

The present invention further provides a method comprising:

(a) receiving a plurality of header requests, wherein the headers are stored in at least one 1st block;

(b) reading header data corresponding to each of the said plurality of header requests, wherein each of the header data is stored in a separate 6th block in the said at least one 1st block;

(c) transmitting each of the header data thus read; and

(d) based on the data thus contained in each of the "n" number of 6th blocks, receiving further requests for reading data.

The present invention further provides a method comprising:

• receiving requests from the said plurality of clients,

• automatically allocating a predetermined initial priority level (IPL) for each of the requests thus received, wherein the priority values thus allocated to the requests from a particular client are selected from a priority level group (PLG) allocated to that client based on a predetermined priority level allocation method; and

• changing the priority level of a particular request from the initial priority level (IPL) to a final priority level (FPL), if the said request is not serviced within a predetermined time period (Tth), a value of the said final priority level (FPL) being equal to or greater than a value of the said initial priority level (IPL).

The present invention further provides a method comprising:

(a) receiving a plurality of read and write commands;

(b) storing the read commands in a read command queue (RCQ) and storing the write commands in a write command queue (WCQ); (c) scheduling and reordering the read commands thus stored in the RCQ and/or the write commands thus stored in the WCQ; and (d) cyclically accessing at least one write command thus contained in the WCQ and at least one read command thus contained in the RCQ; wherein in step (c) scheduling and reordering is performed so as to: (i) prioritize issuance of successive commands on different banks of the memory; and (ii) prioritize issuance of successive commands to a same row of a particular bank of the memory; and wherein in step (d) the cyclic access of the WCQ and the RCQ is performed in such a manner so as to minimize bus turnovers with respect to a predetermined number of read and write commands.
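The following sketch illustrates the queueing and scheduling scheme described above, assuming per-bank sub-queues and a fixed burst count before a direction switch is considered; row-hit reordering within a sub-queue is omitted for brevity, and all names, the command tuple layout, and the policy constants are illustrative assumptions.

```python
# Sketch of bank-interleaved command scheduling with deferred read/write
# switching to minimize bus turnarounds. Illustrative, not the patented design.
from collections import deque

NUM_BANKS = 4
BURST = 8  # commands serviced before considering a read/write switch

class CommandScheduler:
    def __init__(self):
        self.rcq = [deque() for _ in range(NUM_BANKS)]  # sub read queues, one per bank
        self.wcq = [deque() for _ in range(NUM_BANKS)]  # sub write queues, one per bank
        self.bank_rr = 0        # round-robin pointer over banks
        self.reading = False    # current bus direction
        self.serviced = 0       # commands issued since the last switch

    def enqueue(self, cmd):
        # cmd = (is_read, bank, row); stored FIFO within its bank's sub-queue
        queues = self.rcq if cmd[0] else self.wcq
        queues[cmd[1]].append(cmd)

    def _other_has_work(self):
        queues = self.wcq if self.reading else self.rcq
        return any(q for q in queues)

    def next_command(self):
        # Switch direction only after BURST commands, and only if the other
        # queue actually has work, so bus turnarounds stay rare.
        if self.serviced >= BURST and self._other_has_work():
            self.reading = not self.reading
            self.serviced = 0
        queues = self.rcq if self.reading else self.wcq
        # Round-robin over banks so successive commands hit different banks.
        for i in range(NUM_BANKS):
            bank = (self.bank_rr + i) % NUM_BANKS
            if queues[bank]:
                self.bank_rr = (bank + 1) % NUM_BANKS
                self.serviced += 1
                return queues[bank].popleft()
        # Current direction is empty: fall back to the other direction.
        if self._other_has_work():
            self.reading = not self.reading
            self.serviced = 0
            return self.next_command()
        return None

sched = CommandScheduler()
sched.enqueue((False, 0, 12))  # write, bank 0, row 12
sched.enqueue((True, 1, 3))    # read, bank 1, row 3
print(sched.next_command())    # (False, 0, 12): writes are serviced first here
```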

DETAILED DESCRIPTION

While the invention is susceptible to various modifications and alternative forms, specific embodiments thereof have been shown by way of example in the drawings and will be described in detail below. It should be understood, however, that it is not intended to limit the invention to the particular forms disclosed; on the contrary, the invention is to cover all modifications, equivalents, and alternatives falling within the spirit and the scope of the invention as defined by the appended claims.

The method steps have been represented where appropriate by conventional symbols in the drawings, showing only those specific details that are pertinent to understanding the embodiments of the present invention so as not to obscure the disclosure with details that will be readily apparent to those of ordinary skill in the art having benefit of the description herein.

The terms "comprises", "comprising", or any other variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method that comprises a list of steps does not include only those steps but may include other steps not expressly listed or inherent to such process, method. Similarly, one or more elements in a system or apparatus proceeded by "comprises... a" does not, without more constraints, preclude the existence of other elements or additional elements in the system or apparatus.

Figure 1 illustrates the overall block diagram of the packet processing system on chip device in accordance with the teachings of the present invention. As can be noticed, the said packet processing system on chip device (500) comprises a packet processing unit (506) coupled to an input interface (505) for receiving incoming packets of data. The packet processing unit (506) operatively communicates with the buffer manager (507). The buffer manager (507), in turn, operatively communicates with a number of components which form part of the packet processing system on chip device. More particularly, the buffer manager (507) operatively communicates with an output interface (508) and with an external storage device (501) such as but not limited to a DRAM or a DDR or a SDDR or any other similar type of memory. The buffer manager (507) operatively communicates with the external storage device (501) via an arbitration unit (504) and a memory interface (502). The arbitration unit (504) may additionally be operatively communicating with one or more processors or applications (503) forming part of the packet processing system on chip device (500) or which are outside the logical boundaries of the packet processing system on chip device.

In an embodiment the packet processing system on chip device can form part of a switch or a gateway or a Gigabit Passive Optical Network (GPON) system on chip device or similar devices.

As is well known, the GPON technology is being considered a promising solution for the next-generation broadband access network. Since the network topology of the GPON is point-to-multipoint, a media access control mechanism called Dynamic Bandwidth Allocation (DBA) is an important factor for determining the performance of GPON.

The Dynamic Bandwidth Allocation (DBA) is a process whereby an optical network unit (ONU), alternatively referred to as an optical network terminal (ONT), requests upstream bandwidth from an optical line terminal (OLT), and the OLT reassigns the upstream bandwidth dynamically amongst the different ONUs. In GPON there are two forms of DBA reporting, namely status-reporting (SR) and non-status reporting (NSR).

In NSR DBA, the OLT uses service level agreement (SLA) configuration data and monitored idle GPON Encapsulation Method (GEM) frames in each traffic container (T-CONT) as inputs. NSR DBA has the disadvantage that there is no way for the OLT to know how best to assign bandwidth across several ONUs that need more bandwidth.

In SR DBA, the OLT polls ONUs for their backlogs. A given ONU may have several so-called traffic containers (T-CONTs), each with its own priority or traffic class. The ONU reports each T-CONT separately to the OLT. The report message contains a logarithmic measure of the backlog in the T-CONT queue. By knowledge of the SLA for each T-CONT across the entire passive optical network (PON), as well as the size of each T-CONT's backlog, the OLT can optimize allocation of the spare bandwidth on the PON. In SR DBA, the OLT uses SLA configuration data, monitored idle GEM frames in each T-CONT (traffic container) and DBA reporting information from all ONUs as inputs.

The process of the ONU providing its queue status information, such as the dynamic queue fill level of data types, is called DBA reporting.

There has been felt a need to provide an efficient method to generate a DBA report without much investment of system resources, one which can assist the OLT in arriving at the dynamic bandwidth allocation effectively. More particularly, the packet processor receiving the aggregate queue fill levels from the buffer manager is required to estimate the fraction of bytes that belong to PIR (Peak Information Rate) and the fraction that fall within CIR (Committed Information Rate).

Accordingly, the present invention provides a method performed by the packet processing system on chip device for determining a dynamic queue fill level for at least a first data type stored in a storage device, the said storage device comprising a plurality of data types stored thereupon, said method comprising: a) obtaining either a first data type contribution factor or an approximated first data type contribution factor; b) obtaining a storage device fill level; and c) calculating the dynamic queue fill level for said first data type stored in the storage device based on the obtained first data type contribution factor or approximated first data type contribution factor and the storage device fill level.

In an embodiment of the present invention, the first data type contribution factor is obtained by configuring a predetermined constant value.

In another embodiment of the present invention, the first data type contribution factor is obtained by the user repeatedly setting and resetting the contribution factor at user defined intervals.

In yet another embodiment of the present invention, the first data type contribution factor is obtained by: a) determining a volume of the first data type being sent for storage onto the storage device over a period of time; b) determining a total volume of all other data types being sent for storage onto the storage device over the aforesaid period of time; and c) calculating the first data type contribution factor as: contribution factor = (volume of first data type / volume of all other data types).

In still another embodiment of the present invention, the dynamic queue fill level for said first data type stored in the storage device is calculated as: Dynamic queue fill level = storage device fill level x contribution factor

In an embodiment of the present invention, calculating the approximated first data type contribution factor comprises the step of determining the volume of the first data type and the total volume of all other data types.

In another embodiment of the present invention, if the volume of the first data type is 0, the approximated first data type contribution factor is assigned a value of 0, the storage device fill level is completely assigned to dynamic fill level for the other data types and the dynamic fill level for the first data type is assigned a value of 0.

In still another embodiment of the present invention, if the volume of the first data type is a non-zero value and is less than the total volume of all other data types, a digital value representative of the volume of the first data type is left shifted repeatedly till its value is equal to or greater than a digital value representative of the volume of all other data types, wherein a number of times the left shifting is performed is used to calculate the approximated first data type contribution factor and the storage device fill level is right shifted for a number of times equal to that used for calculating the approximated first data type contribution factor to obtain the dynamic queue fill level for said first data type.

In yet another embodiment, the present invention further comprises calculating dynamic queue fill level for all other data types by subtracting the dynamic queue fill level for said first data type from the storage device fill level.

In one more embodiment of the present invention, if the volume of all other data types is a non-zero value and is less than the total volume of the first data type, a digital value representative of the volume of all other data types is left shifted repeatedly till its value is equal to or greater than a digital value representative of the volume of the first data type, wherein a number of times the left shifting is performed is used to calculate the approximated all other data types contribution factor and the storage device fill level is right shifted for a number of times equal to that used for calculating the approximated all other data types contribution factor to obtain the dynamic queue fill level for all other data types.

In a further embodiment, the present invention further comprises calculating dynamic queue fill level for the first data type by subtracting the dynamic queue fill level of all other data types from the storage device fill level.

In a further more embodiment, the present invention further comprises calculating the dynamic queue fill level for the other data types based on the values of the storage device fill level and the calculated queue fill level for the said first data type.

In order to enable the packet processing system on chip device to determine the storage device queue fill level for at least a first data type: the buffer manager is coupled with the storage device configured to determine storage device fill level; and a packet processing unit is arranged to be in communication with the buffer manager and is configured to obtain at least a first data type contribution factor or an approximated first data type contribution factor and determine a dynamic queue fill level for at least a first data type stored in said storage device.

Now referring to figure 2, there is illustrated a flowchart of the method for determining a dynamic queue fill level for at least a first data type stored in a storage device (501) implemented by a system on chip device (500).

At step 110, a plurality of data types is received by the packet processing unit (506).

In one embodiment, the data received by the packet processor is of two types, namely Peak Information Rate (PIR) data and Committed Information Rate (CIR) data. Committed Information Rate (CIR) is the average bandwidth guaranteed by an ISP (Internet Service Provider) under normal conditions. At any given time the bandwidth should not fall below this committed figure. Above the Committed Information Rate (CIR), an allowance of burstable bandwidth is known as the Peak Information Rate (PIR).

At step 120, the packet processor (506) obtains a contribution factor of at least a first data type stored in the storage device (501). In the above mentioned embodiment, the first data type is the PIR data type.

In one embodiment, the contribution factor of the PIR data type is obtained by configuring a predetermined constant value. By way of example, the contribution factor of the PIR data type or that of the CIR data type can be configured as part of device bring-up by appropriate means. For example, the value of PIR data can be set to 30% of the storage fill level.

In another embodiment, the contribution factor of PIR data type is given by the user. The user can repeatedly set and reset this value.

In yet another embodiment, the contribution factor of the PIR data type is obtained by calculating the contribution factor of PIR data based on a determination of the volume of the PIR data type being sent for storage on the storage device (501) over a period of time and a determination of the total volume of PIR and CIR data types being sent for storage onto the storage device over the aforesaid period of time.

For example, the contribution factor (CF) of PIR data can be calculated as:

CF = PIR / (PIR + CIR)

At step 130, the packet processor (506) obtains a storage device fill level. The buffer manager (507) has the information of the storage device fill level which is conveyed to the packet processor (506) at certain intervals or on request made to the buffer manager (507).

At step 140, the packet processor (506) calculates a dynamic queue fill level of PIR data.

In a particular implementation, the dynamic queue fill level of PIR data is calculated by using the storage fill level and the contribution factor of PIR data, which can be obtained by any of the above mentioned three methods.

In yet another embodiment, the dynamic queue fill level of PIR data is calculated by using the storage device fill level and the contribution factor of PIR data, which in turn is calculated by using the volume of the PIR data type being sent for storage on the storage device (501) over a period of time and the total volume of PIR and CIR data types being sent for storage onto the storage device (501) over the aforesaid period of time.

In all the above mentioned embodiments, the dynamic queue fill level of PIR data can be calculated as: Dynamic queue fill level of PIR = storage device fill level x Contribution factor of PIR.
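As a concrete illustration of this formula, here is a minimal sketch, assuming the volumes are byte counts observed over the same time window; the function and variable names are illustrative.

```python
# Minimal sketch of the dynamic queue fill level computation above.
# pir_volume and cir_volume are byte counts observed over the same window.
def dynamic_fill_level(pir_volume, cir_volume, storage_fill_level):
    total = pir_volume + cir_volume
    if total == 0:
        return 0
    contribution_factor = pir_volume / total  # CF = PIR / (PIR + CIR)
    return storage_fill_level * contribution_factor

# Example: 300 of 1000 observed bytes were PIR traffic and the queue holds
# 4096 bytes, so roughly 1228.8 bytes of the fill level are attributed to PIR.
print(dynamic_fill_level(300, 700, 4096))
```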

As can be noticed, the above methodology uses operations that involve: (a) divisions; and (b) multiplications.

As implementation of methods involving multiplication and/or division operations may be costly and/or involve high resource utilization, it is preferred to provide a method which does not require performing one or more of the division and/or multiplication operations while calculating the dynamic queue fill level for a particular type of data flow (for example PIR / CIR).

Thus, figure 3 illustrates a flowchart of a method for determining a dynamic queue fill level (DQL) for at least a first data type stored in a storage device without using the division operation.

At step 210, a plurality of data types is received by the packet processor (506). In one embodiment, the data received by the packet processor is Peak Information Rate (PIR) data and Committed Information Rate (CIR) data.

At step 220, the volume of each of the data type received by the packet processor (506) is determined. More particularly, the volume of each data type is separately counted.

At step 230, the packet processor (506) obtains a storage device fill level. The buffer manager (507) has the information of the storage device fill level which is conveyed to the packet processor (506) at certain intervals or on a request made to the buffer manager (507).

At step 240, it is determined whether the volume count of PIR is equal to the volume count of CIR. If the response is YES, then in step 250 the storage device fill level is assigned equally to the dynamic fill level for PIR and the dynamic fill level for CIR; in other words, the dynamic fill level for PIR is set to be equal to the dynamic fill level for CIR.

Although not explicitly shown in figure 3, it is also possible to determine whether the volume count of PIR is "0". If it is determined to be "0", the approximated PIR contribution factor is assigned a value of 0, the storage device fill level is assigned entirely to the dynamic fill level of the other data types (i.e. to CIR), and the dynamic fill level for PIR is assigned a value of 0. The same procedure can be performed vice versa in respect of CIR.

As can be seen from step 260, it can also be checked whether the volume of the first data type (PIR in the present case) is less than the total volume of all other data types (CIR). If the response is YES, the approximated dynamic queue level for PIR is calculated in step 270, which comprises repeatedly left-shifting a digital value representative of the volume of the first data type until its value is equal to or greater than a digital value representative of the volume of all other data types, wherein the number of times the left shift is performed is used to calculate the approximated first data type contribution factor, and the storage device fill level is right-shifted the same number of times to obtain the dynamic queue fill level for the said first data type. Step 270 may further comprise calculating the dynamic queue fill level for all other data types by subtracting the dynamic queue fill level for the said first data type from the storage device fill level.
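By way of illustration only, the following C sketch captures the shift-based approximation of figure 3, including the zero-volume and equal-volume special cases; the identifiers and widths are assumptions of the sketch:

    #include <stdint.h>

    /* Illustrative sketch of the approximation of figure 3: the
     * contribution factor is approximated by a power of two, so only
     * shifts and a subtraction are needed. */
    static void dql_approx(uint32_t qfill, uint32_t pir, uint32_t cir,
                           uint32_t *dql_pir, uint32_t *dql_cir)
    {
        if (pir == 0) {                    /* fill level entirely to CIR */
            *dql_pir = 0; *dql_cir = qfill; return;
        }
        if (cir == 0) {                    /* vice versa for CIR */
            *dql_cir = 0; *dql_pir = qfill; return;
        }
        if (pir == cir) {                  /* step 250: split equally */
            *dql_pir = qfill >> 1; *dql_cir = qfill - *dql_pir; return;
        }
        if (pir < cir) {                   /* steps 260-270 */
            unsigned k = 0;
            uint64_t v = pir;
            while (v < cir) { v <<= 1; k++; }       /* shift until >= CIR */
            *dql_pir = (k < 32) ? (qfill >> k) : 0; /* right shift k times */
            *dql_cir = qfill - *dql_pir;
        } else {                           /* step 280, roles reversed */
            unsigned k = 0;
            uint64_t v = cir;
            while (v < pir) { v <<= 1; k++; }
            *dql_cir = (k < 32) ? (qfill >> k) : 0;
            *dql_pir = qfill - *dql_cir;
        }
    }

For example, with PIR = 100 bytes and CIR = 700 bytes, three left shifts are needed (100 -> 200 -> 400 -> 800 >= 700), so the PIR queue fill level is approximated as qfill >> 3, i.e. 12.5% of the fill level, against a true contribution factor of 100/800 = 12.5%.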

On the other hand, if the volume of all other data types is less than the volume of the first data type, the approximated dynamic queue level for CIR is calculated in step 280, which comprises repeatedly left-shifting a digital value representative of the volume of all other data types until its value is equal to or greater than a digital value representative of the volume of the first data type, wherein the number of times the left shift is performed is used to calculate the approximated contribution factor of all other data types, and the storage device fill level is right-shifted the same number of times to obtain the dynamic queue fill level for all other data types. Step 280 in this case may further comprise calculating the dynamic queue fill level for the first data type by subtracting the dynamic queue fill level of all other data types from the storage device fill level.

Thus, in general, in addition to calculating the dynamic queue fill level for the first data type (PIR), it is also possible to additionally or alternatively calculate the dynamic queue fill level for the other data types based on the values of the storage device fill level and the calculated queue fill level for the said first data type. One of the methods for performing the flowchart of figure 3 is illustrated in figure 4.

Figure 5 is a graph illustrating the error in PIR as a fraction of the queue fill level. The graph is plotted as a function of queue fill level for different ratios of CIR/PIR ("cp_x" indicates a CIR/PIR ratio of x).

Figure 6 is a graph illustrating the error in PIR as a fraction of the PIR reported. In this graph, when the total amount of data available in the storage device is large, the error in PIR stays bounded within -10% of the data available.

Figure 7 is a graph illustrating the error in PIR as a fraction of the PIR reported for a CIR/PIR ratio <= 1. In this graph, when the CIR to PIR ratio is large, there is more uncertainty in the PIR value sent in the DBA report.

Figure 8 illustrates the components of the packet processing system on chip device that are essential for determining a dynamic queue fill level for at least a first data type in accordance with an embodiment of the present invention.

The apparatus (500) may be connected to a storage device (501) configured to store a plurality of data types. The apparatus comprises a buffer manager (507) coupled with the storage device to determine the storage device fill level. The apparatus further comprises a packet processor (506) in communication with the buffer manager which is configured to determine a dynamic queue fill level for at least a first data type stored in the storage device (501). The packet processor (506) receives a plurality of data types and obtains or calculates the volume of each data type received. The buffer manager (507) conveys the storage device fill level to the packet processor at certain intervals. The packet processor (506), using the volumes of all data types and the storage device fill level, determines a dynamic queue fill level for at least a first data type stored in the storage device (501). The apparatus shown in figure 8 is by way of illustration only; the packet processor shown processes headers and generates the DBA report. All packets are processed in the packet processor (506) and handed over to the buffer manager (507) for storage onto the storage device (501). The buffer manager also maintains the status of the Qfill levels of all the queues and communicates these to the packet processor. The packet processor counts the number of PIR and CIR bytes passing through it (called PIR_fw and CIR_fw), which it uses, as described above, to derive the PIR and CIR byte contributions to the Qfill (which is obtained from the buffer manager).

Coming to yet another aspect of the present invention: since the DDR memory is accessed by the packet processor, both for performing the main function of data transmission and for storing data generated during its functioning, and also by one or more processors or applications (503_1 to 503_n) which form part of the packet processing system on chip device (500) or which are located outside its logical boundaries, the memory can be considered a shared resource. Whenever a resource is required to be shared between multiple users, a proper method must be available for performing transactions on the real-time shared resource.

The problem of selecting which processor or application (commonly referred to as a client) gains access to a memory device (sometimes referred to as a shared resource), such as a DRAM, a DDR, an SDDR or any other similar type of memory, is the same basic problem as is solved in real-time systems commonly used for scheduling tasks on a CPU, or for scheduling access to a shared hardware resource. The theories in this area have been under development since the early 70s and are reasonably advanced. While there are a number of approaches to scheduling, the simplest and possibly most robust is a static priority based schedule based on Rate Monotonic Scheduling.

One such document that describes an arbitration method is EP 1 398 696, the contents of which are incorporated herein by reference.

The limitations and disadvantages of conventional and traditional arbitration approaches are apparent to one of skill in the art, including, for example, the inability of such arbitration methods to effectively handle requests from multiple clients with disparate traffic characteristics. Hence, there exists a strong need for arbitration methods that are effective and, at the same time, simple to implement.

The Inventors have felt that, despite the methods available to date, there is still a need for an arbitration method that effectively handles requests from multiple clients with disparate traffic characteristics, and more particularly for a system in which such an arbitration method is implemented.

More particularly, the DDR being a shared resource that is accessed by different units, all of which may have different traffic characteristics, it is required to enable them to share the DDR as per their traffic requirements and to ensure that traffic from any one does not adversely impact the others.

Accordingly, the present invention provides a request arbitration method for arbitrating requests from a plurality of clients requesting access to a shared real-time resource, said method comprising the steps of receiving requests from the said plurality of clients, automatically allocating a predetermined initial priority level (IPL) for each of the requests thus received, wherein the priority values thus allocated to the requests from a particular client are selected from a priority level group (PLG) allocated to that client based on a predetermined priority level allocation method; and changing the priority level of a particular request from the initial priority level (IPL) to a final priority level (FPL), if the said request is not serviced within a predetermined time period (Tth), a value of the said final priority level (FPL) being equal to or greater than a value of the said initial priority level (IPL).

The present invention also provides a device (500) that acts as an interface between a real time resource (501) and a plurality of clients (503_1 to 503_n) connected to the real time resource and sharing the same; wherein the plurality of clients (which may form part of the device (500) or which may be located outside the logical boundaries of the device (500)) are connected to the real time resource (501) through an arbitration unit (504) which is configured to: receive requests from the said plurality of clients; automatically allocate a predetermined initial priority level (IPL) for each of the requests thus received, wherein the priority values thus allocated to the requests from a particular client are selected from a priority level group (PLG) allocated to that client based on a predetermined priority level allocation method; and change the priority level of a particular request from the initial priority level (IPL) to a final priority level (FPL) if the said request is not serviced within a predetermined time period (Tth), a value of the said final priority level (FPL) being equal to or greater than a value of the said initial priority level (IPL).

With reference to figure 9, it can be noticed that, in accordance with an embodiment of the present invention, the present invention provides a request arbitration method for arbitrating requests from a plurality of clients requesting access to a shared real-time resource, said method comprising the steps of:

a. receiving requests from the said plurality of clients (900);

b. automatically allocating (901) a predetermined initial priority level (IPL) for each of the requests thus received, wherein the priority values thus allocated to the requests from a particular client are selected from a priority level group (PLG) allocated to that client based on a predetermined priority level allocation method; and

c. changing (904) the priority level of a particular request from the initial priority level (IPL) to a final priority level (FPL) if the said request is not serviced within a predetermined time period (Tth), a value of the said final priority level (FPL) being equal to or greater than a value of the said initial priority level (IPL).
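By way of illustration only, the following C sketch shows per-client state that such an arbiter might keep; the field names, widths and the free-running time base are assumptions of the sketch, not requirements of the method:

    #include <stdint.h>

    /* Illustrative per-client configuration for the method of figure 9. */
    struct client_cfg {
        uint8_t  ipl_first;   /* first initial priority level in the PLG  */
        uint8_t  ipl_second;  /* second IPL (may equal ipl_first)         */
        uint8_t  fpl;         /* final priority level, >= both IPLs       */
        uint32_t tth;         /* per-client predetermined time period Tth */
    };

    struct request {
        uint8_t prio;         /* current priority level of the request    */
    };

    /* Step b: allocate the IPL when a request arrives. Tth is measured
     * from the last time a command from this client was serviced. */
    static void allocate_ipl(struct request *r, const struct client_cfg *c,
                             uint32_t now, uint32_t t_last_serviced)
    {
        r->prio = (now - t_last_serviced < c->tth) ? c->ipl_first
                                                   : c->ipl_second;
    }

    /* Step c: promote a still-pending request to the FPL once Tth has
     * expired (cf. request R4 of figure 12). */
    static void promote_if_stale(struct request *r, const struct client_cfg *c,
                                 uint32_t now, uint32_t t_last_serviced)
    {
        if (now - t_last_serviced >= c->tth)
            r->prio = c->fpl;
    }

On every arbitration cycle the unit would then grant the pending request with the highest prio value, ties being broken by any fixed rule.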

As can be seen from figure 10, in an embodiment of the present invention, the priority level group (PLG) comprises at least one initial priority level (IPL), the value of which can vary from a lowest priority level to a highest priority level. In figure 10, the lowest priority level is represented by a value equal to 1 and the highest priority level by a value equal to 7; however, other values can be allocated to each of the priority levels. In the present example under consideration, figure 10 illustrates providing 7 distinct priority levels (including the lowest and the highest priority level); in actual practice, however, fewer or more distinct priority levels can be defined.

By way of example, in respect of a client identified in figure 10 by the name "ENQ", the priority level group comprises one initial priority level (IPL), which is allocated a value equal to "3". Similarly, in respect of a client identified by the name "DSP_Cache", the priority level group comprises one initial priority level (IPL), which is allocated a value equal to "1". Depending upon the nature of the tasks performed by a client, the user can choose the value of the priority level to be given to that particular client.

As can be observed from figure 10, in respect of one or more clients the priority level group (PLG) comprises two initial priority levels (IPL), the values of the said two initial priority levels (IPL) not being equal to each other. By way of example, in respect of the clients identified by the names "DQ", "CoppenQ", "FLM", etc., the priority level group (PLG) comprises two initial priority levels (IPL), namely a first initial priority level (whose value varies depending upon the client) and a second initial priority level (whose value likewise varies depending upon the client).

As can be observed from figure 11, which provides a second example of the priority level groups allocated to the multiple clients, the priority level group (PLG) allocated to a particular client may be selected independently of the priority level group(s) (PLG) allocated to other client(s). Particularly, in another embodiment of the present invention, the value of the at least one initial priority level (IPL) contained in the priority level group (PLG) allocated to a particular client is selected independently of the value of the at least one initial priority level (IPL) contained in the priority level group(s) (PLG) allocated to other client(s). Alternatively, the values of the two initial priority levels (IPL) contained in the priority level group (PLG) allocated to a particular client are selected independently of the values of the two initial priority levels (IPL) contained in the priority level group(s) (PLG) allocated to other client(s).

Similar to the selection of the PLG and/or the IPL, the final priority level (FPL) allocated to a particular client is selected independently of the final priority level(s) allocated to other client(s).

Although in each of figure 10 and figure 11, for the sake of simplicity, the final priority level allocated to a particular client has been selected to be equal to the maximum value among the priority level group already allocated to that client (i.e. explicitly showing support for the criterion that a value of the said final priority level (FPL) be equal to a value of an initial priority level (IPL) contained in the PLG), it is possible to define a final priority level (FPL) which is distinct from the initial priority levels (IPL) contained in the PLG. In a further embodiment of the present invention, the predetermined time period (Tth) allocated to a particular client is selected independently of the predetermined time period(s) allocated to other client(s). As can be seen from figure 12, in a case wherein arbitration is required to be performed between two clients, namely R and B, sending requests R1, R2, R3 and R4 and B1, B2, B3, B4, B5, B6, B7, B8 and B9 respectively, the predetermined time period (Tth) allocated to the client R (i.e. T1) can be set independently of the predetermined time period allocated to the client B (i.e. T2). In an embodiment of the present invention, the predetermined time period (Tth) for a particular client is calculated starting from the last time a command from that particular client was serviced.

In another embodiment of the present invention, if the priority level group (PLG) thus allocated to a particular client comprises two distinct initial priority levels (IPL), then, for the predetermined time period (Tth) immediately after a last command from that client is serviced, the request(s) thus received from that particular client is allocated the first initial priority level. As can be observed from figure 12, the request R3, which is received prior to expiry of the time period T1 (as calculated from the time when the request R2 is serviced), is allocated the first initial priority level in accordance with the aforesaid embodiment.

In yet another embodiment of the present invention, if the priority level group (PLG) thus allocated to a particular client comprises two distinct initial priority levels (IPL) and if no request is received for the predetermined time period (Tth) immediately after a last command from that client is serviced, the request thus received thereafter is allocated the second initial priority level. As can be once again observed from figure 12, the request R2, which is received after expiry of the time period T1 (as calculated from the time when the request R1 is serviced), is allocated the second initial priority level in accordance with the aforesaid embodiment.

As can be further observed from figure 12, if a request is not serviced within the predetermined time period (Tth), the priority level of that particular request is changed from the initial priority level (IPL) to the final priority level (FPL). In this regard, it can be noticed that the request R4 from the client R is not serviced prior to expiry of the time period T1 (calculated from the time when the request R3 from the said client R is serviced). Hence, immediately after expiry of the time period T1, the priority level of the request R4 is changed (in the present case from 1 to 6).

Similarly, with respect to the client identified as "B", it can be observed that the first request B1 arrives and is allocated the first initial priority level (because the said request arrives before expiry of the time period (T2) calculated from the time when a last command from that client is serviced). Thereafter, it can be observed that the request B2 arrives after expiry of the time period (T2) and hence is automatically allocated the second initial priority level in accordance with the aforesaid embodiment. Similarly, the request B3 arrives after expiry of the time period (T2) and hence is automatically allocated the second initial priority level. Thereafter, each of the requests B4 to B9 is received before expiry of the time period (T2) calculated from the time when a last command from that client is serviced, and hence each of the said requests is allocated the first initial priority level.

Thus, it can be observed that the system automatically allocates a predetermined initial priority level based on a predetermined priority level allocation method (one example of which has been described above) and thereafter monitors whether a particular request from a particular client has been pending for a time period equal to the predetermined time period (Tth). If it is determined that a particular request has not been serviced for a period equal to the predetermined time period (Tth), the arbitration method changes the priority level of that request from the initial priority level (IPL) to the final priority level (FPL).

It may be observed that the predetermined time period (Tth) represents the maximum time period taken to change the priority level of an un-serviced request from the initial priority level (IPL) to the final priority level (FPL). In a further embodiment of the present invention, the time period taken to change the priority level of an un-serviced request from the initial priority level (IPL) to the final priority level (FPL) can vary from a zero value to the predetermined time period (Tth).

In the present invention, the arbitration method does not debar a particular request from being serviced merely because the request has a low priority level. By way of example, each of the requests B3 to B9 is serviced first even though it has a first initial priority level (the first initial priority level for the client B being greater than the first initial priority level for the client R). At the same time, the method keeps track of the request R4 from the client R and, once it is determined that the said request R4 has not been serviced for a time period equal to Tth (which in the present case corresponds to T1), changes the priority value of the said request R4 from the initial priority level to the final priority level (FPL), which in the present case has been set higher than both the initial and the final priority levels for the client B. This ensures that, once resources become available immediately after the priority value of the request R4 has been changed, the said request can be serviced, i.e. the request R4 is serviced before the request B9.

Although the arbitration method has been described above with a two-client example situation, the example has been chosen to enable easy understanding of the concept of the arbitration being proposed in the present invention. The arbitration method is applicable irrespective of the number of clients which share the real time resource.

With reference to figure 1, there is shown an illustrative system wherein the request arbitration method has been implemented, the system comprising: a real time resource (501); and a plurality of clients (503_1 to 503_n) connected to the real time resource (501) and sharing the same; characterized in that the plurality of clients are connected to the real time resource through an arbitration unit (505) which is configured to: receive requests from the said plurality of clients; automatically allocate a predetermined initial priority level (IPL) for each of the requests thus received, wherein the priority values thus allocated to the requests from a particular client are selected from a priority level group (PLG) allocated to that client based on a predetermined priority level allocation method; and change the priority level of a particular request from the initial priority level (IPL) to a final priority level (FPL) if the said request is not serviced within a predetermined time period (Tth), a value of the said final priority level (FPL) being equal to or greater than a value of the said initial priority level (IPL).

With reference to figure 13, there is shown an illustrative block diagram of the arbitration unit, wherein the arbitration unit may comprise a multiplexer (601), a command reorder (602) and a DDR controller (603). The above-mentioned components are operationally connected to each other to perform the arbitration method described in the present invention.

Coming to yet another aspect of the present invention: as the packet processing system on chip device closely interacts with memory devices which are external thereto, a sufficient level of throughput is required, where the throughput depends upon a number of factors such as the nature of the storage medium, the bandwidth of the interface which connects the system (in other words, the on-chip part) to the off-chip storage medium, the characteristics of the system (i.e. clock frequency, latency, etc.), the method adopted to access the off-chip storage medium, and so on.

Generally, in a system employing an off-chip storage medium for storage purposes, increasing the throughput can be attained in one or more of the following manners:

(a) modifying the system configuration and, more particularly, increasing the clock speed of the system;

(b) increasing the bandwidth of the interface which connects the on-chip part to the off-chip storage medium; or

(c) adopting a different nature of the storage medium.

Performance of the memory also varies depending upon the method adopted to access the off-chip storage medium. By way of example, in a DDR memory based system, the performance of the DDR can degrade to less than 30% of the raw DDR interface bus bandwidth depending upon the transaction pattern (where raw DDR interface bus bandwidth = interface frequency x interface width x 2 edges per clock). By way of example, for a system that uses DDR memory, which is one of the cost-effective storage devices used as the off-chip storage medium, it is possible to increase the throughput of the system in one or more of the following manners: (a) increasing the clock speed of the system; (b) increasing the width of the interface which connects the on-chip part to the DDR.
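By way of illustration only (the figures below are assumed and not taken from the invention): a 32-bit DDR interface clocked at 400 MHz offers a raw bus bandwidth of 400 x 10^6 x 32 x 2 = 25.6 Gbit/s, of which an unfavourable transaction pattern could realize less than 30%, i.e. under about 7.7 Gbit/s.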

Another possibility for improving the performance of the off-chip memory is adopting a specialized memory technology, for example, by using RLDRAM.

However, increasing the clock speed impacts timing closure and increases power consumption. Moreover, there is always an upper limit to which the clock frequency of the system can be increased for a given DDR technology.

Further, increasing the size of the interface that connects the on-chip part to the DDR results in an increase in the number of pins. This leads to an increase in package cost for the device manufacturer and an increase in the BOM cost for the users.

With the option of adopting a specialized memory technology (for example, an RLDRAM), the cost advantages provided by DDR are lost. Also, there can be procurement issues for the users due to the limited number of vendors available for such specialized memories.

It is well known to those familiar with the art that DDR bus utilization degrades because:

(i) a certain minimum time is needed for a bus turnover to complete; and

(ii) transactions to the same memory bank must be separated by a certain minimum duration specified for that memory.

Therefore, a large number of bus turnovers or successive transactions to the same bank result in wasted cycles and thus lead to lower DDR bus utilization.

Typically, for a system that requires storage of a large amount of data, it is advantageous to use a DDR; and in order to obtain high memory bandwidth, the above-mentioned issues need to be addressed. For this, the system requires a memory interface which provides data management to eliminate the memory access bandwidth bottlenecks mentioned above.

Further efficiency can be gained by reordering the transactions such that successive transactions to a given bank have the same row address, thus avoiding the cycles spent in pre-charge with each transaction.

Hence, there exists a need to provide a method and a system for maximizing utilization of memory bus bandwidth so that the throughput of the system can be maximized. As the memory's performance is strongly impacted by the memory access pattern due to the critical DDR timing parameters, it is therefore important, in a system that allows so many different types of units to share the same DDR, that a mechanism be provided to push the memory interface to as high a utilization efficiency as possible.

Further, scheduling and reordering the data to improve the bus utilization may result in read-after-write hazards. Such a hazard occurs when the write side generates an event upon "posting" a write to the DDR system, the read side initiates a transaction to the same address on seeing this event, and the read gets executed earlier than the write at the DDR.

Thus, in the present application, the Inventors have provided a method and a system (502) for maximizing efficient utilization of memory bus bandwidth. More particularly, the Inventors have provided a method that enables avoiding the Read-after-Write hazard.

Accordingly, the present invention provides a method of maximizing utilization of memory bus bandwidth. In one embodiment, the said method comprises the steps of: receiving a plurality of read and write commands; storing the read commands in a read command queue (RCQ) and the write commands in a write command queue (WCQ); scheduling and reordering the read commands thus stored in the RCQ and/or the write commands thus stored in the WCQ; and cyclically accessing at least one write command thus contained in the WCQ and at least one read command thus contained in the RCQ; wherein the scheduling and reordering is performed so as to prioritize issuance of successive commands on different banks of the memory and to prioritize issuance of successive commands to a same row of a particular bank of the memory; and wherein the cyclic access of the WCQ and the RCQ is performed in such a manner as to minimize bus turnovers with respect to a predetermined number of read and write commands.

In one embodiment of the present invention, the read command queue (RCQ) comprises a plurality of sub read command queues (RCQ1 to RCQn), each sub read command queue being mapped to a particular bank of the memory, and the read commands thus received are stored in the said plurality of sub read command queues (RCQ1 to RCQn), wherein the read commands thus stored in a particular sub read command queue (RCQ1 or RCQ2 or ... RCQn) correspond to a particular bank (B1 or B2 or ... Bn) of the memory.

In one embodiment of the present invention, the write command queue (WCQ) comprises a plurality of sub write command queues (WCQ1 to WCQn), each sub write command queue being mapped to a particular bank of the memory, and the write commands thus received are stored in the said plurality of sub write command queues (WCQ1 to WCQn), wherein the write commands thus stored in a particular sub write command queue (WCQ1 or WCQ2 or ... WCQn) correspond to a particular bank (B1 or B2 or ... Bn) of the memory.

In one embodiment of the present invention, the read commands thus stored in a particular sub read command queue are stored on the basis of their time of receipt (first come first in) and similarly the write commands thus stored in a particular sub write command queue are stored on the basis of their time of receipt (first come first in).

In an embodiment of the present invention, while scheduling and reordering the read/write commands stored in a particular command queue (i.e. the read command queue or the write command queue), the commands are accessed from different sub command queues so as to interleave accesses to different banks of the memory.

In one embodiment of the present invention, the particular sub write command queue (WCQ1 or WCQ2 or ... WCQn) to be accessed for servicing within the write command queue (WCQ) is determined using a round robin methodology and, similarly, the particular sub read command queue (RCQ1 or RCQ2 or ... RCQn) to be accessed for servicing within the read command queue (RCQ) is determined using a round robin methodology.

In one embodiment of the present invention, the reordering comprises reordering the commands thus stored in a particular sub command queue (i.e. a particular sub read command queue or a particular sub write command queue).

In one embodiment of the present invention, the reordering of the commands thus stored in a particular sub command queue is performed to maximize the number of successive transactions performed in a particular row of the same bank of the memory, thus avoiding pre-charges.

In one embodiment of the present invention, reordering the commands thus stored in a particular sub-queue comprises: determining a row identification number for the first command stored in the particular sub-command queue; determining the respective row identification numbers of a predetermined number of subsequent commands stored in the same sub-queue; and increasing the priority of servicing of at least one subsequent command thus stored in the particular sub-queue having the same row identification number as that of the previously executed command.

In one embodiment of the present invention, the predetermined number of subsequent commands stored in the same sub-queue is set to be configurable by a user.

In one embodiment of the present invention, a switch from accessing the write command queue to accessing the read command queue occurs when a predetermined number of write commands have been serviced and there is a read command available in the read command queue.

In one embodiment of the present invention, a switch from accessing the write command queue to accessing a read command queue occurs when a time gap with respect to a previous write command exceeds a predetermined threshold value and there is a read command available in the read command queue. In one embodiment of the present invention, a switch from accessing the read command queue to accessing a write command queue occurs when a predetermined number of read commands have been serviced and there is a write command available in the write command queue.

In one embodiment of the present invention, a switch from accessing the read command queue to accessing a write command queue occurs when a time gap with respect to a previous read command exceeds a predetermined threshold value and there is a write command available in the write command queue.

In one embodiment of the present invention, the method avoids the Read-after-Write hazard that could otherwise arise from the scheduling and re-ordering.

In an embodiment of the present invention, the method for avoiding the Read-after-Write hazard comprises receiving a write command, the said write command including a corresponding tag, and broadcasting the tag thus contained in the write command after the write operation has been performed.

In another embodiment of the present invention, the broadcasting comprises pushing the tag in respect of which the write operation has been performed into a FIFO memory block at a memory clock interval, and reading and transmitting the data thus contained in the FIFO memory block to a system at a system clock.

In one embodiment of the present invention, in respect of a write command performable on a particular memory location, the method comprises broadcasting an acknowledgement message (AM1) which is indicative of placement of the said write command in a corresponding sub write command queue (WCQ1 or WCQ2 or ... WCQn), and broadcasting "x" acknowledgement messages (AMx) after the broadcasting of the acknowledgement message (AM1) of the said write command, wherein each of the acknowledgement messages (AMx) indicates placement of one of "x" subsequent write commands in the said sub write command queue; wherein receiving a certain minimum user-defined number of acknowledgement messages after a system has issued a write command for a particular sub write command queue guarantees to the said system that the said write command has been completed on the memory interface.

In one embodiment of the present invention, if no write command is received for a sub write command queue for a predefined interval of time, an acknowledgement message (AM) is issued, thereby avoiding a lock-up condition while waiting for AMx in the absence of other write commands to the given sub write command queue.
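By way of illustration only, the following C sketch shows the acknowledgement-counting rule described above; the structure and names are assumptions of the sketch:

    #include <stdint.h>
    #include <stdbool.h>

    /* Illustrative RAW-avoidance bookkeeping for one sub write command
     * queue; "x" is the user-defined minimum number of acknowledgement
     * messages that guarantees completion of an earlier write. */
    struct raw_guard {
        uint32_t acks_seen;    /* AMs seen since our write was enqueued */
        uint32_t acks_needed;  /* the user-defined minimum "x"          */
    };

    static void on_write_issued(struct raw_guard *g, uint32_t x)
    {
        g->acks_seen = 0;
        g->acks_needed = x;
    }

    /* Called for every AM of this sub-queue, including the timeout AMs
     * issued when the queue is idle (this avoids the lock-up condition). */
    static void on_ack(struct raw_guard *g)
    {
        g->acks_seen++;
    }

    static bool safe_to_read_same_address(const struct raw_guard *g)
    {
        return g->acks_seen >= g->acks_needed;
    }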

The present invention also provides a system (502) for maximizing utilization of memory bus bandwidth. In one embodiment, the said system comprises: a means for receiving a plurality of read and write commands; a means for storing the read commands in a read command queue (RCQ) and the write commands in a write command queue (WCQ); a means for scheduling and reordering the read commands thus stored in the RCQ and/or the write commands thus stored in the WCQ; and a means for cyclically accessing at least one write command thus contained in the WCQ and at least one read command thus contained in the RCQ; wherein the means for scheduling and reordering is configured to prioritize issuance of successive commands on different banks of the memory and to prioritize issuance of successive commands to a same row of a particular bank of the memory without incurring a pre-charge; and wherein the means for cyclically accessing is configured to cyclically access the WCQ and the RCQ in such a manner as to minimize bus turnovers with respect to a predetermined number of read and write commands.

It has been generally observed that operations such as read and write operations are reordered and scheduled before execution to increase the utilization of memory bus bandwidth. The present subject matter achieves this increase by prioritizing issuance of successive read/write operations on different banks of a memory and by prioritizing issuance of successive read/write operations to the same row of a particular bank of the memory. Also, hazards such as read-after-write (RAW), caused by the reordering and rescheduling of the operations, are avoided by broadcasting a tag or an acknowledgement message indicative of complete execution of a write command.

Figure 14 illustrates a simplified diagram of an application environment showing layer-1 and layer-2 of the DDR interface (502).

In one embodiment, the DDR interface (502) comprises Layer-1 (1402), a Layer-2 protocol (1404) and a DDR memory (501). The Layer-1 protocol (1402) comprises a Layer-1 DDR 2/3 controller (1406) and a command re-ordering and scheduling (CRS) interface block (1408) operatively connected to each other. The Layer-1 DDR 2/3 controller (1406) is in turn connected to the DDR memory (1412). Further, the CRS (1408) is interfaced with a Layer-2 Translation and Arbitration (TA) component (1410) in Layer-2 (1404). The CRS (1408) receives read/write commands from the Layer-2 TA component (1410), stores and schedules/reorders the received commands, and reads from/writes to the DDR memory (501) with the help of the DDR 2/3 controller (1406) in accordance with the present invention. The CRS (1408) carries out the transactions with the Layer-1 DDR 2/3 controller (1406) in a manner that maximizes the DDR interface throughput.

The commands received from Layer-2 (1404) are stored in a read command queue and a write command queue, each of which in turn comprises a plurality of sub-command queues. For example, the read command queue comprises a plurality of sub read command queues RCQ1, RCQ2, ..., RCQn for storing read commands. Similarly, the write command queue comprises a plurality of sub write command queues WCQ1, WCQ2, ..., WCQn for storing write commands. Each sub read command queue is mapped to a particular bank of the memory. The read commands thus stored in a particular sub read command queue (RCQ1 or RCQ2 or ... RCQn) correspond to a particular bank such as B1 or B2 or Bn of the memory, and the write commands thus stored in a particular sub write command queue (WCQ1 or WCQ2 or ... WCQn) likewise correspond to a particular bank. Each bank (B1 or B2 or ... Bn) comprises one or more rows on which a read/write transaction is performed.

Figure 15 illustrates a functional block diagram of an interface for maximizing utilization of memory bus bandwidth in accordance with one embodiment of the present invention. In one embodiment, the CRS interface (1408) comprises a command grouping component (1502), command reordering and scheduling components (1504-1, 1504-2), hereinafter commonly referred to as the command reordering and scheduling component (1504), and a command accessing component (1506).

The interface (1408) receives a plurality of read and write commands (1508) from the Layer-2 protocol (1404). The command grouping component (1502) of the interface (1408) receives and stores the read and write commands (1508) in the read command queue (RCQ) and the write command queue (WCQ) respectively. In particular, the command grouping component (1502) groups together all read commands and stores them in the sub read command queues of the read command queue on a per-bank basis. Similarly, the command grouping component (1502) groups together all write commands and stores them in the sub write command queues of the write command queue on a per-bank basis.
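By way of illustration only, the following C sketch shows how the command grouping component (1502) might file commands into per-bank sub-queues; the bank decoding, queue depths and identifiers are assumptions of the sketch:

    #include <stdint.h>

    #define NBANKS 8   /* illustrative bank count      */
    #define QDEPTH 16  /* illustrative sub-queue depth */

    struct ddr_cmd { uint32_t addr; uint32_t row; int is_write; };
    struct sub_queue { struct ddr_cmd cmd[QDEPTH]; unsigned len; };

    static struct sub_queue rcq[NBANKS];  /* RCQ1..RCQn, one per bank */
    static struct sub_queue wcq[NBANKS];  /* WCQ1..WCQn, one per bank */

    static unsigned bank_of(uint32_t addr)
    {
        return (addr >> 3) & (NBANKS - 1);  /* assumed bank-bit position */
    }

    /* First come, first in: append at the tail of the bank's queue. */
    static int group_cmd(const struct ddr_cmd *c)
    {
        struct sub_queue *q = (c->is_write ? wcq : rcq) + bank_of(c->addr);
        if (q->len == QDEPTH)
            return -1;                      /* back-pressure layer-2 */
        q->cmd[q->len++] = *c;
        return 0;
    }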

The command reordering and scheduling component (1504) then performs scheduling of the sub write command queues and the sub read command queues. Scheduling of the sub-command queues is performed such that the commands thus stored in a particular command queue are accessed from different sub-command queues; accessing commands from different sub-command queues enables interleaving of the banks residing in the memory. In one embodiment, the scheduling of the sub-command queues follows a round robin methodology. The scheduling of the sub write command queues (WCQ1 or WCQ2 or ... WCQn) is performed to access the said queues for servicing within the write command queue (WCQ). Similarly, the scheduling of the sub read command queues (RCQ1 or RCQ2 or ... RCQn) is performed to access the said queues for servicing within the read command queue (RCQ).

In addition to scheduling, the command reordering and scheduling component (1504) performs reordering of the commands thus stored in a particular sub-command queue (i.e. a particular sub read command queue or a particular sub write command queue) to maximize the number of transactions performed on a particular row of a bank of the memory. On completion of the reordering, the command accessing component (1506) cyclically accesses at least one write command stored in the WCQ and at least one read command stored in the RCQ and sends the commands to Layer-1 for execution (210), in order to minimize bus turnovers with respect to a predetermined number of read and write commands.

Figure 16 illustrates a flowchart of the method for maximizing utilization of memory bus bandwidth in accordance with one embodiment of the present invention.

The method 1600 for maximizing utilization of memory bus bandwidth begins at block 1602.

In one embodiment, a plurality of read and write commands is received (1602). The received commands are then stored in their respective queues (1604), i.e. the read commands thus received are stored in a read command queue (RCQ) and the write commands thus received are stored in a write command queue (WCQ). The read command queue (RCQ) comprises a plurality of sub read command queues (RCQ1 to RCQn). Similarly, the write command queue (WCQ) comprises a plurality of sub write command queues (WCQ1 to WCQn). The read commands thus stored in a particular sub read command queue are stored on the basis of their time of receipt (first come, first in), and similarly the write commands thus stored in a particular sub write command queue are stored on the basis of their time of receipt. The read/write commands are stored on a per-bank basis: there are separate read and write command queues for each bank, and the read/write commands are stored in the command queues of the corresponding banks.

The read commands thus stored in the RCQ and the write commands thus stored in the WCQ are scheduled and reordered (1606). The scheduling and reordering is performed to prioritize issuance of successive commands on different banks of the memory and to prioritize issuance of successive commands to a same row of a particular bank of the memory. The particular sub write command queue (WCQ1 or WCQ2 or ... WCQn) to be accessed for servicing within the write command queue (WCQ) is determined using a round robin methodology and, similarly, the particular sub read command queue (RCQ1 or RCQ2 or ... RCQn) to be accessed for servicing within the read command queue (RCQ) is determined using the round robin methodology. In one embodiment, a switch from accessing the write command queue to accessing the read command queue occurs when a predetermined number of write commands have been serviced and there is a read command available for servicing in the read command queue. In another embodiment, the switch occurs when the time gap with respect to a previous write command exceeds a predetermined threshold value and there is a read command available in the read command queue.

In one embodiment, a switch from accessing the read command queue to accessing a write command queue occurs when a predetermined number of read commands have been serviced and there is a write command available in the write command queue. In another embodiment, a switch from accessing the read command queue to accessing a write command queue occurs when a time gap with respect to a previous read command exceeds a predetermined threshold value and there is a write command available in the write command queue.

The commands thus stored in a particular sub-command queue (i.e. a particular sub read command queue or a particular sub write command queue) are reordered. The reordering of the commands thus stored in a particular sub-command queue is performed to maximize the number of transactions performed in a particular row of a bank of the memory. For reordering, a row identification number is determined for the first command stored in the particular sub-command queue. Further, the respective row identification numbers of a predetermined number of subsequent commands stored in the same sub-queue are also determined. If the row identification number of at least one subsequent command thus stored in the particular sub-command queue matches the row identification number of the first command, then the priority of servicing the at least one subsequent command is increased. The number of subsequent commands for which the respective row identification numbers are determined is a predetermined or user-definable value.

The commands (at least one write command thus contained in the WCQ and at least one read command thus contained in the RCQ) are then cyclically accessed (1608) to minimize bus turnovers with respect to a predetermined number of read and write commands. The scheduling and re-ordering of read/write commands also enables avoiding the Read-after-Write (RAW) hazard, which is explained with reference to figures 19 and 20.

Figure 17 illustrates a flowchart of the method for scheduling the commands in accordance with one embodiment of the present invention.

In an exemplary embodiment of the present invention, the command type is set to write and the status of the variables is initialized (1701). The initialization of the variables comprises setting the bank to the first bank corresponding to the write command queue in the memory, setting CmdCount (the number of write commands executed) and BanksExamined (the number of banks corresponding to the write command queue examined) to 0, and setting the row number of the last transaction in the bank to a null (invalid) value.

If the number of successive executed write commands does not exceed a threshold value set by the user (1702), control moves to the next bank corresponding to the next write command queue in the memory (1703) and the bank is checked for the availability of a write command (1704). If a write command is available in the bank (i.e. in the corresponding sub-command queue), reordering is performed (1711) as illustrated in figure 18. The command is then issued (1712) and executed, followed by updating the status of the variables (1713). Updating the variables includes incrementing CmdCount, recording the current bank as the old bank, updating the row number of the last transaction in the bank, and setting the gap from the last command executed and BanksExamined to 0. After updating the variables, the method proceeds to step 1702.

If a write command is not available for the bank, BanksExamined is incremented (1705). If BanksExamined exceeds the maximum number of banks corresponding to the write command queue (1706), i.e. if the banks examined have completed a cycle, the method waits for one clock cycle and increments the gap from the last command (1707). Whenever there is an opportunity to service a command and no command is available, the internal gap count is incremented. Then, if the gap from the last write command has exceeded a threshold value (1708), the read command queue is checked for the availability of read commands (1709). If a read command is available in the read command queue, the command type is changed to read and the status of the variables is updated (1710); the update includes setting CmdCount and BanksExamined to 0 and setting the row number of the last transaction in the bank to a null (invalid) value. After the update, the method proceeds to step 1703. If a read command is not available, the method likewise proceeds to step 1703. Otherwise, if the gap from the last write command executed has not exceeded the threshold value, the method proceeds directly to step 1703 and continues processing the write commands.

If the number of successive executed write commands exceeds the threshold value set by the user (1702) and a read command is available (1709), a switch is made from write to read commands and the status of the variables is initialized again (1710). This includes setting the number of commands executed (CmdCount) and the number of banks examined (BanksExamined) to 0, and setting the row number of the last transaction in the bank to a null value. After initialization of the variables, the method proceeds to step 1703.

Otherwise, if the number of successive executed write commands exceeds the threshold value set by the user (1702) but a read command is not available (1709) in the read command queue (RCQ), the method proceeds to step 1703.

The scheduling method described above for write commands applies equally to read commands.
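By way of illustration only, the flow of figure 17 can be condensed into a C sketch of one scheduling step; the queue plumbing (cmd_available, any_cmd, reorder_and_issue) is assumed rather than shown, and the structure is an assumption of the sketch:

    #include <stdbool.h>

    #define NBANKS 8  /* illustrative bank count */

    extern bool cmd_available(bool write, unsigned bank);
    extern bool any_cmd(bool write);
    extern void reorder_and_issue(bool write, unsigned bank); /* figure 18 */

    struct sched {
        bool     writing;     /* current command type                   */
        unsigned bank;        /* round-robin position                   */
        unsigned cmd_count;   /* successive commands of current type    */
        unsigned gap;         /* cycles since a command was serviceable */
        unsigned cmd_thresh;  /* user threshold on successive commands  */
        unsigned gap_thresh;  /* user threshold on the idle gap         */
    };

    static void sched_step(struct sched *s)
    {
        /* Switch command type on the count or gap threshold, but only
         * if the other queue actually has a command available. */
        if ((s->cmd_count >= s->cmd_thresh || s->gap >= s->gap_thresh)
            && any_cmd(!s->writing)) {
            s->writing = !s->writing;
            s->cmd_count = 0;
        }
        /* Round-robin over the banks to interleave bank accesses. */
        for (unsigned i = 0; i < NBANKS; i++) {
            s->bank = (s->bank + 1) % NBANKS;
            if (cmd_available(s->writing, s->bank)) {
                reorder_and_issue(s->writing, s->bank);
                s->cmd_count++;
                s->gap = 0;
                return;
            }
        }
        s->gap++;  /* no serviceable command this cycle */
    }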

Figure 18 illustrates a flowchart of the method for reordering the commands in accordance with one embodiment of the present invention.

In one embodiment, CmdPos is set to the first command in the queue (1801). If the row number of the last transaction is not valid, the command at the determined CmdPos is used (1807), which in this case happens to be the first command in the queue. If the row number of the last transaction is valid, it is checked whether the last row (i.e. the row address of the previous transaction) of the bank is still open (1803). If the last row of the bank is open, the row number of CmdPos is compared with the row number of the last valid transaction. If the row number of CmdPos is equal to the row number of the last valid transaction, the command at the determined CmdPos is used (1807), i.e. the commands are reordered and the command at CmdPos is serviced at a higher priority compared to the other commands in the given queue. If the row number of CmdPos is not equal to the row number of the last valid transaction, the method moves to the next command in the queue (1805) and continues the comparison up to a specified number of commands set by the user (1806). If no match is found within the specified number of commands, the value of CmdPos is set to the first command in the queue and the command at that CmdPos is used (1807).
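By way of illustration only, the row-match selection of figure 18 can be sketched in C as follows; the queue representation and the user-set search depth are assumptions of the sketch:

    #include <stdint.h>
    #include <stdbool.h>

    struct cmd { uint32_t row; /* row identification number */ };

    /* Returns the index of the command to issue from this sub-queue:
     * the first of up to search_depth entries whose row matches the
     * last valid transaction's row (if that row is still open), else
     * the head of the queue. */
    static unsigned pick_cmd(const struct cmd *q, unsigned depth,
                             unsigned search_depth,
                             bool last_row_valid, bool last_row_open,
                             uint32_t last_row)
    {
        if (last_row_valid && last_row_open) {
            unsigned limit = depth < search_depth ? depth : search_depth;
            for (unsigned i = 0; i < limit; i++)
                if (q[i].row == last_row)
                    return i;   /* reorder: service this command first */
        }
        return 0;               /* no match: use the oldest command */
    }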

Figure 19 illustrates a functional block diagram of a method of avoiding the Read-after-Write hazard in accordance with one embodiment of the present invention.

In one embodiment, the command grouping component (1502) receives the read/write commands (1508) and stores the write commands into the sub write command queues (WCQ1, WCQ2, ..., WCQn) of the write command queue (WCQ). After the command grouping component (1502) stores a write command into a sub write command queue (for example WCQx), a pulse generator (1902) generates a pulse that is fed into a multiplexer (MUX) (1904). The MUX (1904) then generates an acknowledgement message (AMx), such as a write event (WrEvent) (1905), based on the generated pulse.

In one embodiment, an event "WrEvent" (1905) is generated when no write command is received for a bank for a duration greater than the user-defined threshold. In particular, a time counter (1906) determines that no write command is received for a bank for a duration greater than the user-defined threshold and triggers the MUX (1904). A WrEvent (1905) is then issued when triggered by the time counter (1906).

Whenever a WrEvent (1905) is generated for a bank, the time counter counting the time elapsed since the last write for that bank resets the elapsed time to 0. There is an upper limit to the size of the write command queue (WCQ); therefore, the Layer-2 protocol keeps a count of the WrEvents (1905). After issuance of a write command (WRx), once the count of subsequently received WrEvents (1905) is greater than a certain predefined number, the particular write command (WRx) is guaranteed to have been executed and a read command can be issued to the same address as that of WRx, thereby avoiding the read-after-write hazard.

Figure 20 illustrates a functional block diagram of a method of avoiding the Read-after-Write hazard in accordance with another embodiment of the present invention.

In one embodiment, the command grouping component (1502) receives the read/write commands (1508), each received write command containing a corresponding write tag. The command grouping component (1502) stores the write commands into the sub write command queues (WCQ1, WCQ2, ..., WCQn) of the write command queue (WCQ). For example, the command grouping component receives a write command (2001) with a corresponding write tag (2002); the write command (2001) is stored in the sub write command queue along with the corresponding write tag (2002). When the write command (2001) is issued to the DDR controller, the write tag (2002) is stored in a FIFO memory block (2003) at a memory clock interval. A broadcaster (2004) then reads the WrTag (2002) from the FIFO (2003) and forwards the write tag (2002) to Layer-2 (1404). Layer-2 (1404) thus receives a WrTag (2002) indicating that the corresponding write command has been executed.
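By way of illustration only, the tag path of figure 20 can be sketched in C as a simple FIFO; the depth is assumed, and the clock-domain-crossing detail of a real FIFO (2003) is omitted:

    #include <stdint.h>
    #include <stdbool.h>

    #define TAG_FIFO_DEPTH 16  /* illustrative depth */

    struct tag_fifo {
        uint32_t tag[TAG_FIFO_DEPTH];
        unsigned head, tail, count;
    };

    /* Memory-clock side: a write command has been executed on the DDR. */
    static bool push_wrtag(struct tag_fifo *f, uint32_t wrtag)
    {
        if (f->count == TAG_FIFO_DEPTH)
            return false;
        f->tag[f->tail] = wrtag;
        f->tail = (f->tail + 1) % TAG_FIFO_DEPTH;
        f->count++;
        return true;
    }

    /* System-clock side: broadcast the tag to layer-2, which may then
     * safely read the address covered by that write. */
    static bool pop_wrtag(struct tag_fifo *f, uint32_t *wrtag)
    {
        if (f->count == 0)
            return false;
        *wrtag = f->tag[f->head];
        f->head = (f->head + 1) % TAG_FIFO_DEPTH;
        f->count--;
        return true;
    }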

The method minimizes bus turnarounds by grouping reads and writes separately. It reduces bank conflicts and thus takes advantage of the efficiencies due to striping. It avoids the read-after-write (RAW) hazard, and the bandwidth efficiency of the memory is improved.

Coming to yet another requirement generally felt in the industry, and highlighted in the background section: at the transmission end, the packet must not under-run and the required data rate must be maintained. Therefore, a mechanism is needed that, in spite of the latencies involved in reading data from the DDR, still enables a certain rate of reading the packet fragments from the DDR to be maintained.

Typical implementations of data storage on an Integrated Circuit (IC) chip use a queue structure that stores the Head and Tail pointers in memory residing on the chip, called on-chip memory, while the Next pointers are embedded in the data. However, access to data stored in such queue structures may be delayed because the Next pointers reside in the off-chip memory. For example, a delay is incurred in accessing a second block of data since the first block of data has to be read before the second block can be accessed.

Latency of access may occur due to the characteristics of the memory (e.g. DDRs have higher access latencies than SRAM) and/or due to the system architecture (e.g. multiple clients sharing a given memory interface can lead to larger latencies). This latency limits the bandwidth of a port from which the data has to be fetched. To overcome the bandwidth limitations, existing solutions provide extra memory interfaces for pointers or extra on-chip memory, which makes the implementation of a system-on-a-chip more expensive in terms of silicon area, interfaces and/or power consumption.

Therefore, there is a need for a system and method that enables pipelining of data dequeue requests to the memory, such that the pipelining reduces the impact of memory access latency and thus enables the required bandwidth for a port to be sustained.

Accordingly, the present invention provides a method of enabling dequeuing of data stored in a storage device, said method comprising the steps of: receiving a plurality of header requests, wherein the headers are stored in at least one 1st block; reading header data corresponding to each of the said plurality of header requests, wherein each of the header data is stored in a separate 6th block in the said at least one 1st block; transmitting each of the header data thus read; and, based on the data thus transmitted from each of the "n" number of 6th blocks, receiving further requests for reading data.

In one embodiment of the present invention, the method receives a second of the header requests prior to transmission of the header data in respect of the first header request.

In one embodiment of the present invention, each of the 6th blocks contains either a header fragment of a packet or a SCP packet.

In one embodiment of the present invention, the 2nd block comprises a tail fragment and optionally one or more body fragments.

In one embodiment of the present invention, the plurality of header requests thus received corresponds to headers, all of which are stored in a single 1st block.

In one embodiment of the present invention, the plurality of header requests thus received corresponds to headers, no two of which are stored in a single 1st block.

In one embodiment of the present invention, the plurality of header requests thus received corresponds to headers, some of which are stored in a single 1st block.

In one embodiment of the present invention, the data thus contained in a particular 6th block thus read indicates the presence of only a tail fragment in a 3rd block of a 2nd block.

In one embodiment of the present invention, the data thus contained in a particular 6th block thus read indicates the presence of the tail fragment and at least one body fragment in a 4th block of the 3rd block.

In one embodiment of the present invention, the data thus contained in a particular 6th block thus read indicates the presence of the tail fragment and a plurality of body fragments in at least two 4th blocks of the 3rd block.

In one embodiment of the present invention, the data thus contained in a particular 6th block thus read indicates the absence of both the tail fragment and a body fragment in the 2nd block.

In another embodiment of the present invention, the method optionally comprises receiving request(s) for reading data thus contained in the 3rd block of the 2nd block, based on the data thus transmitted from the 6th block.

In one embodiment of the present invention, where the data thus contained in the 6th block indicates the presence of only the tail fragment in the 3rd block of the 2nd block, the method further comprises: receiving a request for reading the tail fragment, said request comprising an identification (address) of the 4th block contained in the 3rd block where the tail fragment is stored; reading the tail fragment thus contained in the identified 4th block of the said 3rd block; and transmitting the tail fragment thus read.

In one embodiment of the present invention, where the data thus contained in the 6th block indicates the presence of the tail fragment and at least one body fragment in the 4th block of the 3rd block, the method further comprises: receiving a plurality of requests for reading the at least one body fragment and the tail fragment, wherein a first of the said plurality of requests comprises an address of the 4th block in the 3rd block where the at least one body fragment and the tail fragment are stored, and the number of requests thus received is less than or equal to the number of 7th blocks thus contained in the 4th block; reading the at least one body fragment and the tail fragment thus contained in the identified 4th block; and transmitting the data thus read.

In yet another embodiment of the present invention, where the data thus contained in the 6th block indicates the presence of a tail fragment and a plurality of body fragments contained in a plurality of 4th blocks in the 3rd block, the method comprises: (a) receiving a plurality of requests for reading the plurality of body fragments in a first of the plurality of 4th blocks, wherein a first of the said plurality of requests comprises an address of the first 4th block and the number of requests thus received is equal to the number of 7th blocks thus contained in the 4th block; (b) reading the plurality of body fragments thus contained in the first 4th block and transmitting the data thus read; and (c) based on the data thus read, receiving at least one further request for reading data contained in at least one further block of the plurality of 4th blocks.

In one embodiment of the present invention, in step (c) the maximum number of requests thus received is given by [(x-1) * n] + y, wherein 'x' is the number of explicit linkages thus read in step (b), 'n' is the number of 7th blocks in a 4th block, and 'y' is the number of 7th blocks containing either a tail or a body fragment in the last 4th block.
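As a worked illustration of this bound (with hypothetical values): if x = 3 explicit linkages are read in step (b), each 4th block holds n = 4 7th blocks, and the last 4th block holds y = 2 occupied 7th blocks, the maximum is (3 - 1) * 4 + 2 = 10 requests:

```c
/* Maximum further read requests in step (c): (x - 1) * n + y.
 * x: explicit linkages read in step (b); n: 7th blocks per 4th block;
 * y: 7th blocks holding a tail or body fragment in the last 4th block. */
static unsigned max_requests(unsigned x, unsigned n, unsigned y)
{
    return (x - 1) * n + y;   /* e.g. max_requests(3, 4, 2) == 10 */
}
```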

Accordingly, the present invention provides a storage device configured to enable dequeuing of data. In one embodiment, the said storage device comprises: a plurality of memory locations grouped together to form at least one 1st block; and a plurality of memory locations grouped together to form a 2nd block; characterized in that the 2nd block comprises at least one 3rd block; at least one of the 1st blocks comprises a plurality of 5th blocks linked to each other by at least one explicit linkage; and each of the 5th blocks comprises a plurality of 6th blocks which are linked to each other by at least one implicit linkage, and at least one of the 6th blocks contained in the at least one of the 1st blocks comprises at least one explicit linkage to the said at least one 3rd block.

In one embodiment of the present invention, the at least one 3rd block of the storage device comprises at least one 4th block.

In one embodiment of the present invention, the storage device comprises "n" number of 4 th blocks in the said 3 rd block, at most of "n-1" explicit linkages are present between the said "n" number of 4 th blocks.

In still another embodiment of the present invention, where the number "n-1" is equal to 2, a first of the said two explicit linkages links a first 4th block to a second 4th block, and a second of the said two explicit linkages links either the second 4th block to a third 4th block or the first 4th block to the third 4th block.

In one embodiment of the present invention, where the said 1st block comprises "y" number of 5th blocks, at most "y-1" explicit linkages are present between the said "y" number of 5th blocks.

In one embodiment of the present invention, the storage device comprises the number "y-1" is equal to 2, a first of the said two explicit linkages links a first 5 th block to a second 5 th block and a second of the said two explicit linkages links: a second 5 th block to a third 5 th block; or a first 5 th block to a third 5 th block.

In one embodiment of the present invention, the 4th block comprises a plurality of 7th blocks which are linked to each other by at least one implicit linkage.

In one embodiment of the present invention, in respect of a non-self-contained packet, the storage device stores header related information in the at least one of the 1st blocks.

In one embodiment of the present invention, in respect of a self-contained packet (SCP), the storage device stores the entire packet in the at least one of the 1st blocks.

In one embodiment of the present invention, the data stored in the 2nd block of the storage device represents the tail fragment and optionally one or more body fragments of a packet.

In another embodiment of the present invention, the storage device comprises 6th blocks, wherein each of the 6th blocks contains either a header fragment of a packet or a SCP packet.

As is well understood, the data in the memory is interchangeably referred to as a packet comprising one or more fragments. A fragment can be a head fragment, a body fragment, or a tail fragment. A packet can be a self-contained packet, containing a single fragment that holds the entire data of the packet, or a non-self-contained packet, containing multiple fragments comprising one head fragment, optionally one or more body fragments, and a tail fragment.

To sustain a dequeue rate in the presence of latencies of fetching the data or packet from the memory, in accordance with a preferred aspect of the present invention, a subsequent dequeue request is issued without waiting for the data of the first request to be received completely. Issuing at least one subsequent dequeue request without waiting for the receipt of data in respect of at least one previously issued request thus reduces the impact of the latency on the data throughput of the overall system.
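A software analogue of this pipelining is sketched below, assuming hypothetical asynchronous primitives issue_read() and poll_done() (stubbed here so the sketch compiles); in the actual device the requests are issued in hardware, so this shows only the ordering of issue versus completion.

```c
#include <string.h>

#define MAX_OUTSTANDING 4     /* assumed pipelining depth */
#define FRAG_BYTES      64    /* assumed fragment size    */

/* Hypothetical async read primitives, stubbed so the sketch compiles:
 * issue_read() starts a read and returns a request id; poll_done()
 * returns nonzero once that request's data has arrived in buf. */
static int issue_read(unsigned address) { return (int)address; }
static int poll_done(int req_id, void *buf)
{
    (void)req_id;
    memset(buf, 0, FRAG_BYTES);
    return 1;
}

/* Dequeue n fragments, keeping up to MAX_OUTSTANDING reads in flight so
 * that the latency of one read overlaps the issue of the next, instead
 * of waiting for each read to complete before issuing the next one. */
void pipelined_dequeue(const unsigned *addrs, int n)
{
    int ids[MAX_OUTSTANDING];
    int issued = 0, completed = 0;
    unsigned char buf[FRAG_BYTES];

    while (completed < n) {
        /* Issue further requests without waiting for earlier data. */
        while (issued < n && issued - completed < MAX_OUTSTANDING) {
            ids[issued % MAX_OUTSTANDING] = issue_read(addrs[issued]);
            issued++;
        }
        /* Retire the oldest outstanding request once its data arrives. */
        if (poll_done(ids[completed % MAX_OUTSTANDING], buf))
            completed++;   /* buf now holds the dequeued fragment */
    }
}
```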

Figure 21 illustrates a simplified structural diagram of a storage device 2100 in accordance with an embodiment of the present invention.

In one embodiment, the storage device (interchangeably referred to as the device) 2100 is configured to enable dequeuing of data stored in the device 2100. In one embodiment, the device 2100 comprises a group of one or more 1st blocks of memory such as I₁, I₂, ..., Iₙ (commonly referred to as block Iₓ), wherein each of the 1st blocks comprises a plurality of memory locations grouped together. In one embodiment, the block Iₓ is referred to as a Traffic Queue (TQ)-x, where 'x' denotes the identification number of the traffic queue. A plurality of traffic queues are mapped to a single transmitting communication connection or transmitting port for dequeuing the packets stored in the device 2100. The port can be, for example, a network port such as a Gigabit Ethernet port. A scheduler selects the order in which the packets from different traffic queues are to be dequeued for a given port. The scheduling is performed by the scheduler via any scheduling method known in the art.

In one embodiment, each of the 1st blocks Iₓ comprises a plurality of 5th blocks such as 5₁, 5₂, ..., 5ₙ (commonly referred to as block 5ₓ) linked to each other by an explicit linkage indicated as 8 in Figure 21. Each of the 5th blocks is a temporary buffer, and a linkage from one buffer to another buffer is referred to as an explicit linkage 8 or a next pointer. Each of the 5th blocks comprises one or more 6th blocks such as 6₁, 6₂, ..., 6ₙ (commonly referred to as block 6ₓ); each of the 6th blocks is referred to as a fragment, and the fragments are linked to each other by an implicit linkage indicated as 9 in Figure 21. Fragments are stored in a buffer, and the buffers can be stored anywhere in the memory space of the device 2100. The fragments in a buffer are linked implicitly in a sequential manner so that, if the address of the buffer is known, the addresses of all the fragments in that particular buffer can be derived using the fragment offsets. In one embodiment, the fragment offset is the size of the fragment. Each fragment or block 6ₓ stored in the at least one buffer (block 5ₓ) is either a head fragment or a SCP fragment. The head fragment stores packet related information such as the size of the packet, whereas the SCP fragment stores the entire data belonging to a packet.
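The implicit linkage 9 is thus plain address arithmetic rather than a stored pointer. A minimal sketch, assuming the fragment offset equals a fixed fragment size:

```c
#include <stdint.h>

#define FRAG_SIZE 64u   /* assumed fragment size; the actual offset is device-specific */

/* Implicit linkage: fragments are laid out sequentially in a buffer, so
 * the address of any fragment follows from the buffer base address and
 * the fragment index, with no pointer fetch from memory. */
static uint32_t fragment_address(uint32_t buffer_base, unsigned index)
{
    return buffer_base + index * FRAG_SIZE;
}
```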

The device 2100 further comprises a plurality of memory locations grouped together to form a block 2. In one embodiment, the block 2 or 2nd block is a temporary buffer comprising one or more 3rd blocks such as 3₁, 3₂, ..., 3ₙ (commonly referred to as block 3ₓ); the block 3ₓ is configured to store all body/tail fragments, excluding the head or SCP fragment of a packet.

Each block 3ₓ is interchangeably referred to as a headless floating packet storing a tail fragment and/or body fragments of the packet. Each of the 3rd blocks comprises one or more 4th blocks such as 4₁, 4₂, ..., 4ₙ (commonly referred to as block 4ₓ) for storing fragments of the headless floating packet. Each of the blocks 4ₓ comprises one or more 7th blocks such as 7₁, 7₂, ..., 7ₙ (commonly referred to as block 7ₓ). Each block 7ₓ is either a body fragment or a tail fragment. Fragments indicated as 'E' denote an empty fragment which is neither a body fragment nor a tail fragment.

Two of the 7th blocks contained in the block 4ₓ are linked to each other by an implicit linkage 10 indicated in Figure 21. For example, two 7th blocks are linked via an implicit linkage 10 if they are stored successively in the same 4th block. In one embodiment, the block 4ₓ comprises explicit linkages linking two 4th blocks. By way of example, consider four blocks 4₁, 4₂, 4₃ and 4₄ in a 2nd block. The number of explicit linkages is equal to 3, and the first explicit linkage is a next pointer linking 4₁ and 4₂. In one embodiment, the second explicit linkage is a next pointer linking blocks 4₂ and 4₃, and the third explicit linkage is a next pointer linking blocks 4₃ and 4₄. In another embodiment, the second explicit linkage is a next-to-next pointer linking blocks 4₁ and 4₃, and the third explicit linkage is a next-to-next-to-next pointer linking blocks 4₁ and 4₄. In yet another embodiment, the second explicit linkage is a next-to-next pointer linking blocks 4₁ and 4₃, and the third explicit linkage is a next pointer linking blocks 4₃ and 4₄. In still another embodiment, the second explicit linkage is a next pointer linking blocks 4₂ and 4₃, and the third explicit linkage is a next-to-next pointer linking blocks 4₂ and 4₄. In another embodiment, the block 4ₓ comprises no explicit linkages, indicating that there is only one 4th block in the 3rd block. Similarly, the block 5ₓ comprises explicit linkages linking any two 5th blocks in a manner as explained for the 4th blocks.

In one embodiment, the one or more 6th blocks in the at least one 1st block are linked to at least one 3rd block via an explicit linkage, indicated as 13 in Figure 21. The implicit linkages 9 and 10 and the explicit linkages 8, 11, 12, and 13 are set by write operations. The write operation enables storing of received data or packets in the device 2100 such that dequeuing of the stored data sustains a dequeue rate in the presence of latencies of fetching the data or packet from the memory. In one embodiment, the enqueuing process is any known enqueuing process existing in the art. In another embodiment, the write process is a process as described in the following paragraphs. In one embodiment, the write process is configured to store data or packets in the appropriate memory locations in the device 2100. In particular, the write process is configured to store the packets in the at least one traffic queue or at least one 1st block. As already described in the foregoing paragraphs, a packet is a collection of one or more fragments. A packet is either a SCP or a non-SCP packet. The non-SCP packet comprises at least one head fragment, optionally one or more body fragments, and one tail fragment. When a packet to be written is received, the packet is analyzed to determine whether it is a SCP or a non-SCP packet.

Enqueuing of a non-SCP packet into a traffic queue depends on the storage device architecture. In one embodiment, the order of receiving the fragments is, for example, Head-Is-Last (HIL) order, in which one or more body fragments arrive first, followed by the tail fragment, and finally the head fragment. In HIL order, the head fragment is received only after the body fragments and the tail fragment have been received. When a body or tail fragment is received during the write process, a reassembly queue ID or the port number on which the packet was received is stored in that body or tail fragment, identifying the corresponding reassembly queue or buffer for storing the fragments. If the fragment received is a tail fragment, the process of storing the fragments in the reassembly queue is complete once the tail fragment is stored.

In one embodiment, a plurality of buffers are linked by an explicit linkage such as a next pointer (indicated as 8 and 11 in Figure 21). In one embodiment, a plurality of buffers are linked by an explicit linkage such as a next-to-next pointer (indicated as 12 and 14 in Figure 21), based on the application requirement. In one embodiment, a plurality of buffers are linked by an explicit linkage such as a next-to-next-to-next pointer, based on the application requirement. The explicit linkages such as the next pointer, next-to-next pointer, and next-to-next-to-next pointer are stored in a single fragment or in different fragments of the buffer.
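These pointer variants might be represented as in the sketch below; the struct layout, field names, and the use of 0 as a null address are assumptions, chosen only to show that several linkages can be held for a buffer and read together, so that reads for buffers k+1, k+2, and k+3 can all be issued before the read of buffer k+1 returns.

```c
#include <stdint.h>

/* Hypothetical record of a buffer's explicit linkages. Holding a next,
 * next-to-next, and next-to-next-to-next pointer lets a reader issue
 * reads for several buffers ahead while earlier reads are still in
 * flight; 0 denotes "no linkage" in this sketch. */
typedef struct {
    uint32_t next;                  /* address of buffer k+1 */
    uint32_t next_to_next;          /* address of buffer k+2 */
    uint32_t next_to_next_to_next;  /* address of buffer k+3 */
} buffer_links_t;
```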

The reassembly queue storing the body fragments and the tail fragment is a headless floating packet (indicated as the 3rd block in Figure 21), as the reassembly queue does not store the corresponding head fragment. Finally, in accordance with the HIL order of receiving fragments, the head fragment (6th block) arrives after the arrival of the body and tail fragments and after the completion of packet processing by a packet processor. An explicit pointer (such as the explicit linkage 13) to the first body fragment and the size of the packet stored in the reassembly queue are inserted into the head fragment. Following this, the head fragment is appended to the corresponding traffic queue Iₓ specified by the packet processor. The head fragment thus appended includes an explicit linkage 13 to the headless floating packet (3rd block) as shown in Figure 21.

In another embodiment, the order of receiving the fragments of the packet is the head fragment, one or more body fragments, and then the tail fragment, referred to as Head-Is-First (HIF) order. When the head fragment is received, a single buffer is allocated to store only the head fragment. When all the fragments have been received, the head fragment is processed by the packet processor to identify the traffic queue to which the head fragment is to be enqueued. Based on the identification, the head fragment is appended to Iₓ with an explicit linkage 13 to the corresponding headless floating packet, as in the case of HIL. Enqueuing of a SCP packet into a traffic queue is similar to enqueuing of the head fragment; however, there is no explicit pointer to the 3rd block or headless floating packet, as the SCP packet does not have any body or tail fragment.
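A condensed sketch of the append step common to HIL and HIF (SCP packets simply omit the linkage) follows; all names and the in-memory pointer representation are illustrative assumptions, not the device's actual layout.

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch of appending a head fragment to a traffic queue. For a non-SCP
 * packet the head carries an explicit linkage (13) to the headless
 * floating packet; for an SCP packet the linkage is absent (0 here). */
typedef struct head_fragment {
    uint32_t packet_size;
    uint32_t body_link;            /* explicit linkage 13; 0 for SCP */
    struct head_fragment *next;    /* linkage within the traffic queue */
} head_fragment_t;

typedef struct {
    head_fragment_t *head, *tail;
} traffic_queue_t;

static void enqueue_head(traffic_queue_t *tq, head_fragment_t *h,
                         uint32_t floating_packet_addr)
{
    h->body_link = floating_packet_addr;   /* 0 when the packet is an SCP */
    h->next = NULL;
    if (tq->tail)
        tq->tail->next = h;
    else
        tq->head = h;
    tq->tail = h;
}
```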

Figure 22 illustrates a flowchart of the method 2200 to enable dequeuing of data stored in the storage device in accordance with one embodiment of the present invention.

The method 2200 enables the dequeuing of data stored in the storage device 2100 following a write process. The method 2200 for dequeuing of packets stored in the device 2100 begins at block 2210.

At block 2210, a plurality of header requests for reading data contained in at least one 1st block is received. For example, a plurality of header requests issued by a scheduler is received for dequeuing packets stored in the device 2100. The plurality of header requests thus received corresponds to headers that are stored in the 6th blocks of the 1st block. In one embodiment, the plurality of header requests thus received corresponds to headers that are stored in a single 1st block. In another embodiment, the plurality of header requests thus received corresponds to headers, no two of which are stored in a single 1st block. In yet another embodiment, the plurality of header requests thus received corresponds to headers, some of which are stored in a single 1st block.

At block 2220, upon receiving the header requests, the corresponding header data is read. In one embodiment, in response to receiving the request for reading header data contained in a 6th block, a data reader or any data reading device reads the head fragment/SCP packet stored in the 6th block. The head fragment stores packet related information such as the size of the packet, whereas the SCP packet stores the entire packet. In one embodiment, the data contained in a particular 6th block thus read indicates the presence or absence of an explicit linkage to a 3rd block. In particular, in one embodiment, the data contained in a particular 6th block thus read indicates the presence of only a tail fragment in a 3rd block of a 2nd block. In one embodiment, the data contained in a particular 6th block thus read indicates the presence of the tail fragment and at least one body fragment in a 4th block of the 3rd block. In one embodiment, the data contained in a particular 6th block thus read indicates the presence of the tail fragment and a plurality of body fragments in at least two 4th blocks of the 3rd block. In another embodiment, the data contained in a particular 6th block thus read indicates the absence of both the tail fragment and one or more body fragments in the 2nd block.

At block 2230, each of the header data thus read is transmitted. In an embodiment, upon receipt of the header requests, the transmission of the header data takes place. The transmitted header data indicates the presence or absence of a tail fragment and/or one or more body fragments, and the number of body and/or tail fragments.

At block 2240, further requests for reading data are received. In one embodiment, the further requests for reading data comprise requests for reading header data stored in a plurality of 6th blocks. The further requests for reading data also comprise requests for reading the tail fragment and/or one or more body fragments stored in the 3rd block.

Figure 23 illustrates a flowchart of the method of dequeuing data of a packet in accordance with an embodiment of the present invention. The method 2300 for dequeuing a packet in accordance with one embodiment of the present invention begins at block 2310.

At block 2310, request(s) for reading data contained in the 6th block of a 1st block are received. In one embodiment, one or more requests for reading data contained in a 6th block are received. At block 2320, the data thus contained in the 6th block is read. In one embodiment, the data contained in the 6th block is read, and the data thus read indicates the presence of a tail fragment and/or one or more body fragments and the size of the packet. In another embodiment, the data thus read indicates the presence of a SCP packet and the size of the packet.

At block 2330, the presence of an explicit linkage to a 3rd block is determined. In one embodiment, the data thus read from the 6th block indicates the presence or absence of an explicit linkage to a 3rd block. If the presence of the explicit linkage to the 3rd block is determined (such as 13, shown in Figure 21), then block 2330 proceeds to block 2340 via the "YES" path. The explicit linkage represents the address of a 4th block in a 3rd block that has to be dequeued. The explicit linkage enables the data reader to read the data thus contained in the 4th block of the 3rd block (such as 4₁, shown in Figure 21). If the presence of the explicit linkage to the 3rd block is not determined, then block 2330 proceeds to block 2350 via the "NO" path.

At block 2340, further requests for reading data contained in the 3rd block are received. In an embodiment, based on the determination of the explicit linkage to the 3rd block, the method 2300 proceeds to receive requests for reading data thus contained in the 3rd block. As already explained, a 3rd block comprises one or more 4th blocks, wherein each of the 4th blocks comprises one or more fragments or 7th blocks. Each of the 7th blocks is either a body or a tail fragment. Reading a packet generally depends on the header data thus read from the 6th block. In one embodiment, if the data contained in the 6th block indicates the presence of only the tail fragment in the 3rd block, then a request for reading the tail fragment is received. The request comprises an identification (address) of the 4th block contained in the 3rd block where the tail fragment is stored. In response to receiving the request, the method 2300 reads the tail fragment thus contained in the identified 4th block of the said 3rd block and transmits the tail fragment thus read. In another embodiment, if the data contained in the 6th block indicates the presence of the tail fragment and at least one body fragment in the 4th block of the 3rd block, the method 2300 proceeds to receive a plurality of requests for reading the at least one body fragment and the tail fragment. The first of the said plurality of requests comprises an address of the 4th block in the 3rd block where the at least one body fragment and the tail fragment are stored. The number of requests thus received is less than or equal to the number of 7th blocks thus contained in the 4th block. In one embodiment, the maximum number of requests thus received is given by [(x-1) * n] + y, wherein 'x' is the number of explicit linkages obtained when the data is read, 'n' is the number of 7th blocks in a 4th block, and 'y' is the number of 7th blocks containing either a tail or a body fragment in the last 4th block. The method 2300 proceeds to read all of the 7th blocks thus stored in the at least one 3rd block till the completion of dequeuing of a packet. In response to receiving the requests, the method 2300 proceeds to read the at least one body fragment and the tail fragment thus contained in the identified 4th block and to transmit the data thus read.

In one embodiment, if the data contained in the 6th block indicates the presence of a tail fragment and a plurality of body fragments that are contained in a plurality of 4th blocks in the 3rd block, the method 2300 proceeds to receive a plurality of requests for reading the plurality of body fragments in a first of the plurality of 4th blocks. The first of the said plurality of requests comprises an address of the first 4th block, and the number of requests thus received is equal to the number of 7th blocks thus contained in the 4th block. In response to receiving the requests, the method 2300 proceeds to read the plurality of body fragments thus contained in the first 4th block and to transmit the data thus read. Based on the data thus read, the method proceeds to receive at least one further request for reading data contained in at least one further block of the plurality of 4th blocks. Explicit linkages such as the next pointer and next-to-next pointers linking the plurality of 4th blocks enable reading data from further 4th blocks with reduced impact of latencies, since more requests can be issued while the first request has still not returned its data.

At block 2350, a request for reading data thus contained in the next 6th block is received. Based on the determination of the absence of the explicit linkage to the 3rd block, the method 2300 proceeds to read the data thus contained in the next 6th block, thereby repeating the method 2300 for dequeuing further packets required by the scheduler.
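The decision at block 2330 and the fan-out of requests at block 2340 can be summarized as below; read_header() and request_fragment() are hypothetical stand-ins for the data-reader interactions described above, and the address derivation is illustrative only.

```c
#include <stdint.h>
#include <stdio.h>

/* Condensed, illustrative sketch of method 2300. The helper stubs below
 * stand in for reading a 6th block and issuing a read request to the
 * 3rd block; they are not the device's actual interface. */
typedef struct {
    uint32_t link;    /* explicit linkage 13 to the 3rd block; 0 if absent */
    unsigned x, n, y; /* linkages read, 7th blocks per 4th block, occupied
                         7th blocks in the last 4th block */
} header_info_t;

static header_info_t read_header(uint32_t sixth_block_addr)
{
    /* Stub: pretend the header points at a 3rd block and reports x=3,
     * n=4, y=2, giving at most (3-1)*4 + 2 = 10 further requests. */
    header_info_t h = { sixth_block_addr + 0x100u, 3, 4, 2 };
    return h;
}

static void request_fragment(uint32_t addr)
{
    printf("request fragment at 0x%x\n", (unsigned)addr); /* would issue a DDR read */
}

void dequeue_packet(uint32_t sixth_block_addr)
{
    header_info_t h = read_header(sixth_block_addr);  /* blocks 2310/2320 */

    if (h.link == 0)
        return;      /* "NO" path to block 2350: move to the next 6th block */

    /* "YES" path to block 2340: issue up to (x-1)*n + y further requests
     * while earlier reads are still outstanding. */
    unsigned max = (h.x - 1) * h.n + h.y;
    for (unsigned i = 0; i < max; i++)
        request_fragment(h.link + i * 64u);   /* 64-byte fragments assumed */
}
```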

As mentioned above, the explicit linkage to the third 4th block (the next-to-next pointer) and the explicit linkage to the second 4th block (the next pointer) enable dequeuing of 7th blocks from the third and second 4th blocks respectively. Therefore, the waiting time for the completion of dequeuing of 7th blocks from the second block is utilized for dequeuing of 7th blocks from the third block. This reduces the impact of the latencies of fetching data from memory and therefore enables sustaining the dequeue rate at high bandwidth for a given port.

It may be noted that the memory device 2100 may be available at two different locations; in particular, a first portion of the memory device 2100 may be available within the buffer manager (507) and the remaining part of the memory device 2100 may be available within the DDR memory (501). Typically, each of the 1st blocks is available as an on-chip component, particularly accessible by the buffer manager, and each of the 2nd blocks forms part of the DDR memory (or the external storage device).

It will be appreciated that embodiments of the invention described herein (for example, each of the packet processor and the buffer manager) may be comprised of one or more conventional processors and unique stored program instructions that control the one or more processors to implement, in conjunction with certain non-processor circuits, some, most, or all of the arbitration functions described herein. Alternatively, some or all of the arbitration functions could be implemented by a state machine that has no stored program instructions, or in one or more application specific integrated circuits (ASICs), in which each function, or some combinations of certain of the functions, are implemented as custom logic.

Of course, a combination of the two approaches could be used. Thus, method and means for these functions have been described herein. Further, it is expected that one of ordinary skill, notwithstanding possibly significant effort and many design choices motivated by, for example, available time, current technology, and economic considerations, when guided by the concepts and principles disclosed herein will be readily capable of generating such software instructions and programs and ICs with minimal experimentation.

While the particular preferred embodiments of the present invention have been shown and described, it will be obvious to those skilled in the art that changes and modifications may be made without departing from the teachings of the invention. It is therefore contemplated that the present invention cover any and all modifications, variations or equivalents that fall within the scope of the basic underlying principles disclosed above and claimed herein.