

Title:
TRANSMISSION AND RECEPTION DEVICES FOR REDUCING THE DELAY IN END-TO-END DELIVERY OF NETWORK PACKETS
Document Type and Number:
WIPO Patent Application WO/2015/039687
Kind Code:
A1
Abstract:
The invention relates to a transmission device (110) comprising a processor, configured: (1) to submit multiple read request messages (ReadA, ReadB) over a host interface corresponding to buffers of one or more network packets, (a) to assign for each network packet a unique packet identifier, (b) to calculate for each read request message a start byte offset indicating a relative location of a first completion byte of the read request message within the original network packet, (c) to store the unique packet identifier and the start byte offset as an entry within a transmission database (113); and (3) upon arrival of completion data from the host interface for the submitted read request messages, for each read response message (CPL A1): (a) to associate the read response message (CPL A1) with an entry of the transmission database (113) and extract the packet identifier, (b) to transform the read response message (CPL A1) into a fabric cell by the following operations: (c) to mark the fabric cell with the packet identifier, (d) to mark the fabric cell with a byte offset indicating a relative start byte offset of the fabric cell within the associated network packet, wherein the byte offset is calculated by summing a relative start offset of the corresponding read request message and a location of the read response message (CPL A1) within an entire stream of completion bytes of that read request message, (e) to mark the fabric cell with a first flag if the fabric cell represents a first fabric cell of the network packet and a last flag if the fabric cell represents a last fabric cell of the network packet, (f) to release the transmission database (113) entry if the fabric cell is marked with the last flag, and (g) to submit the fabric cell over a fabric interface.

Inventors:
ELAD YUVAL (DE)
TAL ALEX (DE)
ZECHARIA RAMI (DE)
UMANSKY ALEX (DE)
Application Number:
PCT/EP2013/069418
Publication Date:
March 26, 2015
Filing Date:
September 19, 2013
Assignee:
HUAWEI TECH CO LTD (CN)
ELAD YUVAL (DE)
TAL ALEX (DE)
ZECHARIA RAMI (DE)
UMANSKY ALEX (DE)
International Classes:
H04L45/121; H04L49/901
Domestic Patent References:
WO2007064116A1 (2007-06-07)
Foreign References:
US6714985B1 (2004-03-30)
US20050144310A1 (2005-06-30)
Other References:
None
Attorney, Agent or Firm:
KREUZ, Georg M. (Messerschmittstr. 4, Munich, DE)
Claims:
CLAIMS:

1. A transmission device (110), comprising: a processor configured to submit multiple read request messages (ReadA, ReadB) over a host interface corresponding to buffers of one or more network packets, to assign for each network packet a unique packet identifier, to calculate for each read request message a start byte offset indicating a relative location of a first completion byte of the read request message within the original network packet, to store the unique packet identifier and the start byte offset as an entry within a transmission database (113); and upon arrival of completion data from the host interface for the submitted read request messages, for each read response message (CPL A1), the processor is further configured: to associate the read response message (CPL A1) with an entry of the transmission database (113) and extract the packet identifier, to transform the read response message (CPL A1) into a fabric cell by the following operations: to mark the fabric cell with the packet identifier, to mark the fabric cell with a byte offset indicating a relative start byte offset of the fabric cell within the associated network packet, wherein the byte offset is calculated by summing a relative start offset of the corresponding read request message and a location of the read response message (CPL A1) within an entire stream of completion bytes of that read request message, to mark the fabric cell with a first flag if the fabric cell represents a first fabric cell of the network packet and a last flag if the fabric cell represents a last fabric cell of the network packet, to release the transmission database (113) entry if the fabric cell is marked with the last flag, and to submit the fabric cell over a fabric interface.

2. The transmission device (110) according to claim 1, wherein read response messages (CPL A1) that belong to different read request messages arrive out-of-order with respect to a read request submission order.

3. The transmission device (110) according to claim 1 or claim 2, configured to submit multiple read request messages (ReadA, ReadB) over the host interface before processing a read response message.

4. The transmission device (110) according to one of the preceding claims, wherein a single read request message is responded to with multiple read response messages.

5. The transmission device (110) according to one of the preceding claims, wherein the read response messages of different read request messages are interleaved with one another upon arrival.

6. The transmission device (110) according to one of the preceding claims, wherein the host interface comprises a Peripheral Component Interconnect Express interface or a Quick Path Interconnect interface.

7. The transmission device (110) according to one of the preceding claims, wherein the read request messages comprise PCIe read requests and the read response messages comprise completion Transaction Layer Packets.

8. The transmission device (110) according to one of the preceding claims, wherein the fabric cell comprises a payload field configured to store payload data and a header field configured to store configuration data; wherein the packet identifier is set in the header field of the fabric cell and wherein the byte offset is set in the header field of the fabric cell.

9. The transmission device (110) according to one of the preceding claims, comprising: a DMA engine configured to process the submission of the multiple read request messages and the read response messages; and an open transmission database (113) configured to store the unique packet identifier and the start byte offset.

10. A transmission system, comprising: a TX device according to one of claims 1 to 9; and a host memory coupled to the TX device by the host interface, wherein the host memory is configured to process the multiple read request messages (ReadA, ReadB) submitted from the TX device and to respond to them with read response messages.

11. A reception device (130) comprising a processor configured to perform the following operations upon reception of a fabric cell (Cell A1): if the fabric cell (Cell A1) is marked with a first flag: extract a packet identifier from the fabric cell, allocate a new RX buffer from an RX ring buffer of a host memory obtaining an RX buffer address, associate the packet identifier and the RX buffer address by adding them as an entry in a reassembly database, write a payload of the fabric cell to a host memory address corresponding to the RX buffer address incremented by a byte offset extracted from the fabric cell; if the fabric cell (Cell A1) is not marked with a first flag: extract a packet identifier from the fabric cell, look up the packet identifier in the reassembly database and extract the RX buffer address therefrom, write a payload of the fabric cell to a host memory address corresponding to the RX buffer address incremented by a byte offset extracted from the fabric cell; if the fabric cell (Cell A1) is marked with a last flag, perform on top of the above operations the following: delete the entry in the reassembly database after the payload of the fabric cell is written to the host memory address, notify a driver that a new network packet has arrived, notify the driver of any error conditions that were encountered.

12. The reception device (130) according to one of the preceding claims, comprising: a DMA engine configured to process the reception of the fabric cells; and an open reassembly database configured to store the packet identifier and the RX buffer address.

13. A reception system, comprising: an RX device according to claim 11 or claim 12; and a host memory coupled to the RX device by a host interface, wherein the host memory comprises at least one RX buffer and an RX buffer ring holding addresses of the at least one RX buffer.

14. A transmission method (300), comprising: submitting (1.) multiple read request messages (ReadA, ReadB) over a host interface corresponding to buffers of one or more network packets, assigning (1a.) for each network packet a unique packet identifier, calculating (1b.) for each read request message a start byte offset indicating a relative location of a first completion byte of the read request message within the original network packet, storing (1c.) the unique packet identifier and the start byte offset as an entry within a transmission database (113); and upon arrival of completion data (503) from the host interface for the submitted read request messages, performing (3.) for each read response message (CPL A1): associating (3a.) the read response message with an entry of the database and extracting the packet identifier, transforming (3b.) the read response message into a fabric cell by the following operations: marking (3c.) the fabric cell with the packet identifier, marking (3d.) the fabric cell with a byte offset indicating a relative start byte offset of the fabric cell within the associated network packet, wherein the byte offset is calculated by summing a relative start offset of the corresponding read request message and a location of the read response message within an entire stream of completion bytes of that read request message, marking (3e.) the fabric cell with a first flag if the fabric cell represents a first fabric cell of the network packet and a last flag if the fabric cell represents a last fabric cell of the network packet, releasing (3f.) the transmission database (113) entry if the fabric cell is marked with the last flag, and submitting (3g.) the fabric cell over a fabric interface.

15. A reception method (400), comprising performing the following operations upon reception of a fabric cell (Cell A1): if the fabric cell (Cell A1) is marked with a first flag, perform (4a.): extracting (4a.i.) a packet identifier from the fabric cell, allocating (4a.ii.) a new RX buffer from an RX ring buffer of a host memory obtaining an RX buffer address, associating (4a.iii.) the packet identifier and the RX buffer address by adding them as an entry in a reassembly database, and writing (4a.iv.) a payload of the fabric cell to a host memory address corresponding to the RX buffer address incremented by a byte offset extracted from the fabric cell; if the fabric cell (Cell A1) is not marked with a first flag, perform (4b.): extracting (4b.i.) a packet identifier from the fabric cell, looking up (4b.ii.) the packet identifier in the reassembly database and extracting the RX buffer address therefrom, and writing (4b.iii.) a payload of the fabric cell to a host memory address corresponding to the RX buffer address incremented by a byte offset extracted from the fabric cell; if the fabric cell (Cell A1) is marked with a last flag, perform (4c.) on top of the above operations (4a., 4b.) the following: deleting (4c.i.) the entry in the reassembly database after the payload of the fabric cell is written to the host memory address, notifying (4c.ii.) a driver that a new network packet has arrived, and notifying (4c.iii.) the driver of any error conditions that were encountered.

Description:
TRANSMISSION AND RECEPTION DEVICES FOR REDUCING THE

DELAY IN END-TO-END DELIVERY OF NETWORK PACKETS

TECHNICAL FIELD

The present invention relates to a transmission (TX) device configured for transmitting multiple read requests with respect to network packets and transmitting fabric cells based on a stream of completion packets in response to the read requests. The invention further relates to a reception (RX) device configured for receiving fabric cells and constructing network packets based on a payload and location information in the fabric cells.

Aspects of the invention relate to modern high-speed switching systems that interconnect multiple servers which are co-located under a shared physical enclosure through a shared fabric interconnect. The physical plane or backplane that interconnects these servers can be PCI Express or any proprietary fabric that supports cell-based switching. The networking protocol that runs over the physical plane and logically interconnects the servers can be (but is not limited to) standard IP over Layer 2 Ethernet, FCoE, RoCE or Infiniband. The invention defines a method to deliver network packets between the memory subsystems of two hosts across the physical interconnect in a pure end-to-end cut-through fashion. The invention may be applied in conjunction with a variety of software networking stacks, both sockets-based and verbs-based (e.g. RDMA).

BACKGROUND

Recent technologies allow implementing networking protocols, such as Ethernet or Infiniband, between multiple hosts over a variety of physical interfaces which are not necessarily the protocol's native physical layer. For example, Ethernet networking within a blade or a rack enclosure may use PCI Express or some vendor-proprietary physical interconnect as the physical layer instead of the native 802.3 MAC and PHY interconnects.

The communication model in such systems is typically based on the well-known logical operations send(), i.e. the logical operation of sending information such as messages or data packets, and receive(), i.e. the logical operation of receiving such information. These operations source and sink the logical units of communication (messages or data packets) and are performed by the transmitting and receiving sides, respectively. The data flow begins with a send() operation in which the application posts a data buffer to the host kernel to be transmitted over the fabric. The buffer is passed through the transport and routing stacks of the kernel and will eventually reach the device driver of a network device (e.g. a NIC or HBA) that serves as an entry point into the fabric. At this point the actual packet data may span multiple physical buffers, for example because the transport and routing stacks may add their headers as part of the process of turning the buffer into a transmittable network packet. The device driver will provide the device with pointers to the locations of the different buffers that compose the network packet. The device, equipped with its own DMA engine, will then read the different buffers from host memory, compose a network packet and send it across the network fabric. The packet will traverse the fabric and will eventually reach the device of the destination host. The destination device will use its DMA engine to write the packet into a host buffer that was pre-allocated by the software stack by means of a receive() operation. When the DMA is completed, the device driver will receive a corresponding indication and will forward the packet up the relevant networking stack.

While modern fabric interconnects attempt to support cut-through delivery of packets across such a fabric interconnect, the support for real end-to-end cut-through is constrained by the host's behavior. The presented invention allows the packet to be sent from the source buffers at the source host into the destination buffer at the destination host using a "pure" cut-through method that allows immediate injection of data into the fabric even if it has arrived out-of-order or interleaved with data of other packets from the memory of the source host.

A data packet should be sent from one or more host memory locations of a source host into a host memory location at a destination host. Each host has an attached "device" and all devices are connected to an interconnect fabric. Each device is responsible for the operations of reading a packet from host memory and sending it into the network fabric, and of receiving packets from the network fabric and placing them in the host's memory. The device is connected to the memory subsystem of its host through a PCI Express interface. The device at the source reads the packet by submitting PCIe read requests to the multiple buffers that span the packet, building the packet as the payload is received on the corresponding read responses and sending the packet over the interconnect fabric. It is a common practice for high-speed devices to have multiple outstanding read requests submitted on the PCIe fabric in order to allow for saturation of the downstream PCIe link. The PCIe specification allows the completion data of multiple outstanding read requests from the same source to arrive out-of-order with respect to the original read request submission order. Furthermore, the PCIe specification allows the completer of a PCIe read request to split the read response over multiple completion packets. These two relaxations create a completion stream that may be both out-of-order and interleaved, as can be seen from Fig. 5 illustrating the case of two read requests "Read A" and "Read B" 501.

For example, if read requests A and B 501 are submitted one after the other on the PCIe bus, then the corresponding read completions 503 may arrive in the following order (starting from the left): B1, A1, A2, B2, B3, A3. A standard device interface would need to store and forward the completion data 503 before composing it into packets 505 and submitting the packets 505 into the fabric. This buffering would be needed by typical devices for two main reasons: a) if A and B are read requests for buffers that account for a single network packet, then data re-ordering is needed since the buffers should be composed into a packet in an order which preserves the original payload order of the packet; b) if A and B are read requests where each request represents a separate network packet, then re-ordering is needed since the different packets cannot be sent in an interleaved way over the fabric (note, however, that in our example the read responses for these packets were interleaved by the host). In Fig. 5, reference sign 501 denotes a read request message, reference sign 503 a read completion message and reference sign 505 the packets of a fabric message.
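The store-and-forward behaviour of such a conventional device can be sketched as follows. This is only an illustrative model: the fragment payloads and per-request fragment counts are invented for the example, not taken from the figure.

```python
# Sketch of conventional store-and-forward buffering when PCIe completions
# arrive out-of-order and interleaved, as in the "Read A"/"Read B" example.
# Fragment contents and counts are hypothetical.

def store_and_forward(completions):
    """Buffer completion fragments until a whole request is satisfied,
    then emit the reassembled payload in its original byte order."""
    buffers = {}                  # request id -> {fragment index: payload}
    expected = {"A": 3, "B": 3}   # fragments per request (assumed known)
    packets = []
    for req, idx, payload in completions:
        buffers.setdefault(req, {})[idx] = payload
        if len(buffers[req]) == expected[req]:
            # All fragments present: restore original order, then forward.
            data = b"".join(buffers[req][i] for i in sorted(buffers[req]))
            packets.append((req, data))
            del buffers[req]
    return packets

# Completion arrival order from the example: B1, A1, A2, B2, B3, A3
stream = [("B", 1, b"b1"), ("A", 1, b"a1"), ("A", 2, b"a2"),
          ("B", 2, b"b2"), ("B", 3, b"b3"), ("A", 3, b"a3")]
print(store_and_forward(stream))
# Nothing can be injected into the fabric until a packet's last
# fragment has arrived; packet B completes before packet A here.
```

The point of the sketch is the buffering itself: every fragment sits in device memory until its request completes, which is precisely the delay the invention removes.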

The store-and-forward buffering that was just described introduces a delay Δt that contributes to the end-to-end latency of packets routed through the fabric.

SUMMARY

It is the object of the invention to provide a technique for reducing a delay in end-to-end delivery of network packets.

This object is achieved by the features of the independent claims. Further implementation forms are apparent from the dependent claims, the description and the figures.

The invention is based on the finding that a technique for reducing the delay in end-to-end delivery of network packets can be achieved by having the TX side of the device tag each cell with a unique packet identifier and with a byte offset parameter. That tagging allows the RX side of the destination device to perform on-the-fly assembly of cells into packets by directly placing them at the corresponding host buffers. This can be done for multiple packets concurrently. This way no store-and-forward buffering is needed in either the source or the destination devices, and the lowest possible end-to-end cut-through latency is achieved.

The invention thus provides a technique for devices that are directly connected to a cell-based switched fabric to achieve pure cut-through delivery by transmitting network packets as a stream of cells that are carried over that fabric, such that cells that belong to the same network packet can be injected in any arbitrary order, including the actual order in which they arrived from the host's memory (which may differ from the packet's original byte order), and cells that belong to different network packets may be interleaved with one another as they are injected into the fabric (again while preserving the original order in which they arrived from the host memory). The receiving side will be able to immediately place the data of arriving cells at the correct location within the destination host memory even if cells that belong to the same packet have arrived out-of-order or interleaved with cells of other packets.

In order to describe the invention in detail, the following terms, abbreviations and notations will be used:

PCIe, PCI Express: Peripheral Component Interconnect Express according to the PCIe specification.
TX: Transmit.
RX: Receive.
IP: Internet Protocol.
FCoE: Fibre Channel over Ethernet.
DMA: Direct Memory Access.
RDMA: Remote Direct Memory Access.
RoCE: RDMA over Converged Ethernet.
Infiniband: a switched fabric communications link used in high-performance computing.
802.3 MAC: IEEE specification for the Ethernet protocol.
PHY: physical layer.
NIC: Network Interface Card.
HBA: Host Bus Adapter.
TLP: Transaction Layer Packet according to the PCIe specification.
QPI: Quick Path Interconnect, a point-to-point processor interconnect developed by Intel, also called CSI (Common System Interface).

According to a first aspect, the invention relates to a transmission device comprising a processor configured:

to submit multiple read request messages over a host interface corresponding to buffers of one or more network packets, to assign for each network packet a unique packet identifier, to calculate for each read request message a start byte offset indicating a relative location of a first completion byte of the read request message within the original network packet, to store the unique packet identifier and the start byte offset as an entry within a transmission database; and upon arrival of completion data from the host interface for the submitted read request messages, for each read response message the processor is further configured: to associate the read response message with an entry of the transmission database and extract the packet identifier, to transform the read response message into a fabric cell by the following operations: to mark the fabric cell with the packet identifier, to mark the fabric cell with a byte offset indicating a relative start byte offset of the fabric cell within the associated network packet, wherein the byte offset is calculated by summing a relative start offset of the corresponding read request message and a location of the read response message within an entire stream of completion bytes of that read request message, to mark the fabric cell with a first flag if the fabric cell represents a first fabric cell of the network packet and a last flag if the fabric cell represents a last fabric cell of the network packet, to release the transmission database (113) entry if the fabric cell is marked with the last flag, and to submit the fabric cell over a fabric interface.
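The TX-side operations above can be sketched roughly as follows. This is an illustrative software model, not the device implementation; all class, field and tag names are hypothetical, and the step labels in the comments refer to the operations listed above.

```python
# Hypothetical model of the TX device: read responses are turned into
# fabric cells tagged with a packet identifier and a byte offset, in
# whatever order the completions happen to arrive.

from dataclasses import dataclass

@dataclass
class TxEntry:
    packet_id: int
    start_offset: int     # offset of the request's first completion byte
    bytes_seen: int = 0   # completion bytes consumed so far for this request

class TxDevice:
    def __init__(self):
        self.db = {}      # request tag -> TxEntry (the transmission database)

    def submit_read(self, tag, packet_id, start_offset):
        # steps (a)-(c): record the identifier and the start byte offset
        self.db[tag] = TxEntry(packet_id, start_offset)

    def on_completion(self, tag, payload, first=False, last=False):
        # build and emit one fabric cell per read response message
        entry = self.db[tag]
        cell = {
            "pid": entry.packet_id,                           # mark identifier
            "offset": entry.start_offset + entry.bytes_seen,  # mark byte offset
            "first": first, "last": last,                     # mark flags
            "payload": payload,
        }
        entry.bytes_seen += len(payload)
        if last:
            del self.db[tag]          # release the database entry
        return cell                   # submit over the fabric interface

tx = TxDevice()
tx.submit_read("A", packet_id=7, start_offset=0)
tx.submit_read("B", packet_id=7, start_offset=100)  # second buffer of packet 7
cell = tx.on_completion("B", b"x" * 40)  # B's response arrives before A's
print(cell["offset"])  # 100: the cell is placeable despite out-of-order arrival
```

Note that no per-packet reorder buffer appears anywhere in the model: each completion leaves as a cell immediately, which is the cut-through property claimed.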

By marking the fabric cell with the packet identifier and with a byte offset indicating a relative start byte offset of the fabric cell within the associated network packet, wherein the byte offset is calculated by summing a relative start offset of the corresponding read request message and a location of the read response message within an entire stream of completion bytes of that read request message, a corresponding RX device is able to assign the received cells directly to buffer addresses in the correct sequence. Therefore, no extra buffering is required for reconstructing the correct transmission sequence. The TX device can thus provide an ultra-low latency transfer of data packets between the memory subsystems of two hosts. By the marking, an interleaved and out-of-order data delivery scheme can be realized.

In a first possible implementation form of the transmission device according to the first aspect, read response messages that belong to different read request messages arrive out-of-order with respect to a read request submission order. By applying the marking scheme, out-of-order delivery can be tolerated without extra buffering delay for reconstructing the original sequence.

In a second possible implementation form of the transmission device according to the first aspect as such or according to the first implementation form of the first aspect, the TX device is configured to submit multiple read request messages over the host interface before processing a read response message.

By submitting multiple read request messages over the host interface before processing a read response message, the transmission channel can be efficiently used.

In a third possible implementation form of the transmission device according to the first aspect as such or according to any of the preceding implementation forms of the first aspect, a single read request message is responded with multiple read response messages.

By replying to a single read request message with multiple read response messages, long packets can be partitioned into smaller packets that can be processed efficiently. Further, delay can be reduced when processing shorter packets.

In a fourth possible implementation form of the transmission device according to the first aspect as such or according to any of the preceding implementation forms of the first aspect, the read response messages of different read request messages are interleaved with one another upon arrival. The TX device is thus able to process interleaved and out-of-order data and thereby guarantees an efficient usage of the data interfaces.

In a fifth possible implementation form of the transmission device according to the first aspect as such or according to any of the preceding implementation forms of the first aspect, the host interface comprises a PCI Express interface or a QPI interface.

By using a PCI Express interface or a QPI interface, the transmission device can be applied in standard end-to-end systems where PCI Express or QPI is used as the standard data transmission.

In a sixth possible implementation form of the transmission device according to the first aspect as such or according to any of the preceding implementation forms of the first aspect, the read request messages comprise PCIe read requests and the read response messages comprise completion TLPs.

By using PCIe read requests and completion TLPs, the transmission interface can be easily operated in a PCI Express system that is an industry standard.

In a seventh possible implementation form of the transmission device according to the first aspect as such or according to any of the preceding implementation forms of the first aspect, the packet identifier is set in a header field of the fabric cell and the byte offset is set in a header field of the fabric cell.

When the packet identifier and the byte offset are set in a header field of the fabric cell, processing speed can be increased, as the header is at the beginning of the fabric cell. When parsing the fabric cell, the relevant information can thus be found quickly.

In an eighth possible implementation form of the transmission device according to the first aspect as such or according to any of the preceding implementation forms of the first aspect, the TX device comprises: a DMA engine configured to process the submission of the multiple read request messages and the read response messages; and an open transmission database configured to store the unique packet identifier and the start byte offset.

According to a second aspect, the invention relates to a transmission system, comprising: a TX device according to the first aspect as such or according to any of the preceding implementation forms of the first aspect; and a host memory coupled to the transmission device by the host interface, wherein the host memory is configured to process the multiple read request messages submitted from the TX device and to respond to them with read response messages.

By using such a TX device, the transmission system can provide an ultra-low latency transfer of data packets between the memory subsystems of two hosts. An interleaved and out-of-order data delivery scheme can be realized.

According to a third aspect, the invention relates to a reception device comprising a processor configured to perform the following operations upon reception of a fabric cell: if the fabric cell is marked with a first flag: extract a packet identifier from the fabric cell, allocate a new RX buffer from an RX ring buffer of a host memory obtaining an RX buffer address, associate the packet identifier and the RX buffer address by adding them as an entry in a reassembly database, write a payload of the fabric cell to a host memory address corresponding to the RX buffer address incremented by a byte offset extracted from the fabric cell; if the fabric cell is not marked with a first flag: extract a packet identifier from the fabric cell, lookup the packet identifier in the reassembly database and extract the RX buffer address therefrom, write a payload of the fabric cell to a host memory address corresponding to the RX buffer address incremented by a byte offset extracted from the fabric cell; if the fabric cell is marked with a last flag perform on top of the above operations the following: delete the entry in the reassembly database after the payload of the fabric cell is written to the host memory address, notify a driver that a new network packet has arrived, notify the driver of any error conditions that were encountered.
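The reception-side operations above can be sketched as follows. The buffer model is deliberately simplified (a plain byte array stands in for a host-memory RX buffer, and the fixed buffer size is invented), so this is an illustration of the reassembly logic, not the device implementation.

```python
# Hypothetical model of the RX device: each arriving cell is written
# straight into its host buffer at (buffer address + byte offset), so no
# reorder buffer is needed even for interleaved cells.

class RxDevice:
    def __init__(self, buffer_size=256):
        self.reassembly_db = {}   # packet id -> RX buffer (reassembly database)
        self.buffer_size = buffer_size
        self.delivered = []       # packets handed to the driver

    def on_cell(self, pid, offset, payload, first=False, last=False):
        if first:
            # allocate a new RX buffer and record it against the packet id
            self.reassembly_db[pid] = bytearray(self.buffer_size)
        buf = self.reassembly_db[pid]   # lookup for non-first cells
        # direct placement: write the payload at the cell's byte offset
        buf[offset:offset + len(payload)] = payload
        if last:
            # delete the entry and notify the driver of the new packet
            self.delivered.append(bytes(buf[:offset + len(payload)]))
            del self.reassembly_db[pid]

rx = RxDevice()
# cells of two packets arriving interleaved with one another
rx.on_cell(pid=1, offset=0, payload=b"Hel", first=True)
rx.on_cell(pid=2, offset=0, payload=b"AB", first=True)
rx.on_cell(pid=2, offset=2, payload=b"CD", last=True)
rx.on_cell(pid=1, offset=3, payload=b"lo", last=True)
print(rx.delivered)  # [b'ABCD', b'Hello']
```

Because placement depends only on the identifier and offset carried in each cell, the interleaving of packets 1 and 2 above costs nothing: each payload lands at its final host address the moment it arrives.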

When the reception device processes fabric cells carrying a packet identifier and a byte offset written by the corresponding transmission device, the reception device is able to assign the received cells directly to buffer addresses in the correct sequence. Therefore, no extra buffering is required for reconstructing the correct transmission sequence. The RX device can thus provide an ultra-low latency transfer of data packets between the memory subsystems of two hosts. By the marking of the fabric cells, an interleaved and out-of-order data delivery scheme can be realized.

In a first possible implementation form of the reception device according to the third aspect, the RX device comprises: a DMA engine configured to process the reception of the fabric cells; and an open reassembly database configured to store the packet identifier and the RX buffer address.

When the open reassembly database is implemented on the reception device, memory accesses for evaluating the transmission states are very fast.

According to a fourth aspect, the invention relates to a reception system, comprising: an RX device according to the third aspect as such or according to the first implementation form of the third aspect; and a host memory coupled to the reception device by a host interface, wherein the host memory comprises at least one RX buffer and an RX buffer ring holding addresses of the at least one RX buffer.

By using such an RX device, the reception system can provide an ultra-low latency transfer of data packets between the memory subsystems of two hosts. An interleaved and out-of-order data delivery scheme can be realized.

According to a fifth aspect, the invention relates to a transmission method, comprising: (1) submitting multiple read request messages over a host interface corresponding to buffers of one or more network packets,

(a) assigning for each network packet a unique packet identifier, (b) calculating for each read request message a start byte offset indicating a relative location of a first completion byte of the read request message within the original network packet,

(c) storing the unique packet identifier and the start byte offset as an entry within a transmission database; and

(3) upon arrival of completion data (503) from the host interface for the submitted read request messages, for each read response message:

(a) associating the read response message with an entry of the database and extracting the packet identifier, (b) transforming the read response message into a fabric cell by the following operations:

(c) marking the fabric cell with the packet identifier,

(d) marking the fabric cell with a byte offset indicating a relative start byte offset of the fabric cell within the associated network packet, wherein the byte offset is calculated by summing a relative start offset of the corresponding read request message and a location of the read response message within an entire stream of completion bytes of that read request message,

(e) marking the fabric cell with a first flag if the fabric cell represents a first fabric cell of the network packet and a last flag if the fabric cell represents a last fabric cell of the network packet,

(f) releasing the transmission database entry if the fabric cell is marked with the last flag, and

(g) submitting the fabric cell over a fabric interface.

By marking the fabric cell with the packet identifier and with a byte offset indicating a relative start byte offset of the fabric cell within the associated network packet, wherein the byte offset is calculated by summing a relative start offset of the corresponding read request message and a location of the read response message within an entire stream of completion bytes of that read request message, a corresponding reception method, e.g. implemented in a reception device, is able to assign the received cells directly to buffer addresses in the correct sequence. Therefore, no extra buffering is required for reconstructing the correct transmission sequence. The transmission method can thus provide an ultra-low latency transfer of data packets between the memory subsystems of two hosts. By the marking, an interleaved and out-of-order data delivery scheme can be realized.
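The byte-offset summation described above can be illustrated with a small sketch (packet and request sizes are hypothetical, and the function name is illustrative):

```python
# Sketch of the byte-offset calculation described above (hypothetical sizes).
# The offset of a fabric cell within its network packet is the sum of the
# corresponding read request's start offset and the position of the read
# response within that request's completion byte stream.

def cell_byte_offset(request_start_offset: int, completion_position: int) -> int:
    """Relative start offset of a fabric cell within its network packet."""
    return request_start_offset + completion_position

# A 2048-byte packet fetched with two 1024-byte read requests: the second
# request covers bytes 1024..2047. If its completions arrive as 256-byte
# messages, the third completion carries bytes 1536..1791 of the packet.
print(cell_byte_offset(1024, 2 * 256))  # 1536
```

Because each summand is known locally (the start offset from the transmission database entry, the position from counting completion bytes of that request), the offset can be computed without waiting for any other completion.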

According to a sixth aspect, the invention relates to a reception method, comprising: performing the following operations upon reception of a fabric cell:

(4a) if the fabric cell (Cell A1) is marked with a first flag:

(i) extracting a packet identifier from the fabric cell,

(ii) allocating a new RX buffer from an RX ring buffer of a host memory, thereby obtaining an RX buffer address,

(iii) associating the packet identifier and the RX buffer address by adding them as an entry in a reassembly database, and

(iv) writing a payload of the fabric cell to a host memory address corresponding to the RX buffer address incremented by a byte offset extracted from the fabric cell;

(4b) if the fabric cell (Cell A1) is not marked with a first flag:

(i) extracting a packet identifier from the fabric cell, (ii) looking up the packet identifier in the reassembly database and extracting the RX buffer address therefrom, and

(iii) writing a payload of the fabric cell to a host memory address corresponding to the RX buffer address incremented by a byte offset extracted from the fabric cell;

(4c) if the fabric cell (Cell A1) is marked with a last flag, performing, in addition to the above operations, the following:

(i) deleting the entry in the reassembly database after the payload of the fabric cell is written to the host memory address,

(ii) notifying a driver that a new network packet has arrived, and (iii) notifying the driver of any error conditions that were encountered.

When the reception method processes fabric cells whose payload includes a packet identifier and a byte offset written into the payload by the corresponding transmission method, the reception method is able to assign the received cells directly to buffer addresses in the correct sequence. Therefore, no extra buffering is required for reconstructing the correct transmission sequence. The reception method can thus provide an ultra-low latency transfer of data packets between the memory subsystems of two hosts. By the marking of the fabric cells, an interleaved and out-of-order data delivery scheme can be realized.
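The reception operations (4a) to (4c) above can be sketched as a minimal model (all names are illustrative, not from the specification; host memory is modeled as a bytearray, the RX ring as a list of free buffer addresses, and error notification is omitted):

```python
# Minimal model of reception operations (4a)-(4c); illustrative names only.

class Receiver:
    def __init__(self, host_memory: bytearray, rx_ring: list):
        self.mem = host_memory      # models RX host memory
        self.rx_ring = rx_ring      # free RX buffer addresses posted by the driver
        self.reassembly = {}        # packet identifier -> RX buffer address
        self.completed = []         # stands in for driver notifications

    def on_cell(self, pid, byte_offset, payload, first, last):
        if first:                            # (4a): allocate a buffer, record entry
            addr = self.rx_ring.pop(0)
            self.reassembly[pid] = addr
        else:                                # (4b): look up the existing entry
            addr = self.reassembly[pid]
        # (4a-iv)/(4b-iii): write payload at RX buffer address + byte offset
        self.mem[addr + byte_offset:addr + byte_offset + len(payload)] = payload
        if last:                             # (4c): release entry, notify driver
            del self.reassembly[pid]
            self.completed.append(pid)

rx = Receiver(bytearray(16), rx_ring=[0, 8])
rx.on_cell(pid=7, byte_offset=0, payload=b"hell", first=True, last=False)
rx.on_cell(pid=7, byte_offset=4, payload=b"o", first=False, last=True)
print(bytes(rx.mem[0:5]), rx.completed)  # b'hello' [7]
```

Note that the write address depends only on the cell's own header fields and the reassembly entry, which is why cells of the same packet may be placed in any arrival order.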

Aspects of the invention provide a system and method for ultra-low latency transfer of data packets between the memory subsystems of two hosts using an interleaved and out-of-order data delivery scheme.

The methods, systems and devices described herein may be implemented as software in a Digital Signal Processor (DSP), in a micro-controller or in any other side-processor, or as a hardware circuit within an application-specific integrated circuit (ASIC). The invention can be implemented in digital electronic circuitry, or in computer hardware, firmware, software, or in combinations thereof, e.g. in available hardware of conventional mobile devices or in new hardware dedicated for processing the methods described herein.

BRIEF DESCRIPTION OF THE DRAWINGS

Further embodiments of the invention will be described with respect to the following figures, in which: Fig. 1 shows a block diagram illustrating an end-to-end switching system 100 including a transmission device, a TX host memory, a reception device and an RX host memory according to an implementation form;

Fig. 2 shows a timing diagram 200 illustrating a timing of completion packets and fabric cells with respect to read requests according to an implementation form;

Fig. 3 shows a schematic diagram of a transmission method 300 according to an implementation form;

Fig. 4 shows a schematic diagram of a reception method 400 according to an implementation form; and

Fig. 5 shows a conventional timing diagram 500 illustrating a delay Δt in the timing of completion packets and fabric cells with respect to read requests.

DETAILED DESCRIPTION OF EMBODIMENTS OF THE INVENTION

Fig. 1 shows a block diagram illustrating an end-to-end switching system 100 including a TX device 110, a TX host memory 120, an RX device 130 and an RX host memory 140 according to an implementation form. The figure shows a single direction; both devices may also operate in the opposite direction.

At the foundation of the system 100 there is a group of hosts which are interconnected by a switched fabric 105 through "interface devices", i.e. the TX device 110 and the RX device 130. Each RX/TX device 110, 130 has direct access to its attached host's memory 120, 140 and serves as a gateway for the host into the fabric 105. The networking protocol that is implemented above the fabric 105 can be Ethernet, InfiniBand, RapidIO, Fibre Channel, Cell or another protocol. The basic logical unit of communication for this networking protocol is referred to as a "network packet".

The communication model between the host and the RX/TX device 110, 130 is implemented by a device driver and typically managed through a set of TX rings 123 and RX rings 143. The TX ring 123 is used by the driver for posting buffers 121 of network packets that should be transmitted by the TX device 110 into the fabric 105. The RX ring 143 is used by the driver for posting buffers 141 that the RX device 130 is expected to fill with network packets that arrive on its fabric interface 107 (for the description below, a single RX buffer per network packet is assumed). The RX/TX device 110, 130 directly accesses buffers (for read or for write) through its DMA mechanisms. The functionality of each RX/TX device 110, 130 is split into a TX engine 111 and an RX engine 131. The TX engine 111 includes a DMA mechanism that can directly access the memory buffer(s) 121 where a network packet is located, fetch the data and transmit it over the network link 105 as a group of cells. The RX engine 131 includes a DMA mechanism that can receive network packet cells and directly place their payload in a host buffer 141 until a complete network packet is constructed. The RX engine 131 may support the placement of multiple packets concurrently by maintaining an Open Reassembly database 133, which is explained in more detail below.
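The ring-based driver/device communication model described above can be sketched as follows (illustrative only; a deque stands in for the shared descriptor ring, and all names are assumptions, not from the specification):

```python
# Sketch of the driver/device ring model: the driver posts buffer descriptors
# on a TX ring and the device consumes them in order for DMA reads.

from collections import deque

class TxRing:
    def __init__(self):
        self.descriptors = deque()

    def post(self, buffer_addr: int, length: int):
        """Driver side: post a network-packet buffer for transmission."""
        self.descriptors.append((buffer_addr, length))

    def consume(self):
        """Device side: fetch the next posted buffer descriptor, or None."""
        return self.descriptors.popleft() if self.descriptors else None

ring = TxRing()
ring.post(0x1000, 1500)      # driver posts a 1500-byte packet buffer
print(ring.consume())        # (4096, 1500)
```

An RX ring works symmetrically: the driver posts empty buffers and the device fills them with arriving packets.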

In an implementation form, the physical interface that connects the RX/TX device 110, 130 to the host memory cluster 120, 140 is PCI Express.

The cell-based switched fabric 105 that interconnects all devices 110, 130 may have any fabric topology (Clos, fat tree, other) and there are no restrictions on the routing algorithms that are implemented by it. Specifically, the fabric 105 may support active-active topologies using multi-path algorithms without impacting the correctness of the algorithm presented below. The level of multi-pathing affects only the number of entries in the Open Reassembly database 133.

The end-to-end operation of transferring a network packet between a source host and a destination host is composed of the following steps:

1. The TX engine submits multiple PCIe read requests over the PCIe bus that correspond to the buffers of one or more network packets.

1a. For each network packet the engine assigns a unique packet identifier (PID).

1b. For each PCIe read request the engine calculates a start byte offset which indicates the relative location of the first completion byte of the request within the original network packet.

1c. These parameters are stored in an entry within the Open Transmission database.

2. The host's memory subsystem processes the multiple read requests and responds to them with completion packets. Completion packets that belong to different read requests may arrive out-of-order with respect to the read request submission order. Also, since the memory subsystem may respond to a single read request with multiple completion packets, the completion packets of different read requests may be interleaved with one another as they arrive at the interface device.

3. As completion data for the submitted read requests arrives from the host, the TX engine performs the following operations for each completion TLP:

3a. Associate the completion TLP with an Open Transmission database entry and extract the PID.

3b. Transform the completion TLP into a fabric cell

3c. Set in the fabric cell header the PID field

3d. Set in the fabric cell header the ByteOffset field which indicates the cell's relative start offset within the associated network packet. The ByteOffset field is calculated by summing the corresponding read request's relative start offset and the location of the completion TLP within the entire completion byte stream of that read request.

3e. Set in the fabric cell header the "First" flag (which indicates the first fabric cell of a network packet) and "Last" flag (which indicates the last fabric cell of a network packet) fields.

3f. If the "Last" flag is set, release the Open Transmission database entry

3g. Inject the fabric cell into the fabric

Table 1: steps performed at a transmission side of an end-to-end connection.
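The per-completion processing of steps 3a to 3g in Table 1 can be sketched as follows (a minimal model under stated assumptions: the Open Transmission database is keyed by a hypothetical per-request tag, cells are plain dicts, and a list stands in for the fabric interface; none of these names come from the specification):

```python
# Sketch of TX-side steps 3a-3g: transforming completions into fabric cells.

class TxEngine:
    def __init__(self):
        self.open_tx = {}   # tag -> Open Transmission DB entry (steps 1a-1c)
        self.fabric = []    # stands in for the fabric interface

    def track_request(self, tag, pid, start_offset, packet_len):
        # Steps 1a-1c: record PID and start byte offset for the read request.
        self.open_tx[tag] = {"pid": pid, "start": start_offset,
                             "seen": 0, "packet_len": packet_len}

    def on_completion(self, tag, payload):
        entry = self.open_tx[tag]                 # 3a: associate, extract PID
        offset = entry["start"] + entry["seen"]   # 3d: ByteOffset summation
        cell = {                                  # 3b/3c: build the cell
            "pid": entry["pid"],
            "byte_offset": offset,
            "first": offset == 0,                                     # 3e
            "last": offset + len(payload) == entry["packet_len"],     # 3e
            "payload": payload,
        }
        entry["seen"] += len(payload)
        if cell["last"]:                          # 3f: release the DB entry
            del self.open_tx[tag]
        self.fabric.append(cell)                  # 3g: inject into the fabric
        return cell

tx = TxEngine()
tx.track_request(tag=1, pid=7, start_offset=0, packet_len=8)
tx.track_request(tag=2, pid=7, start_offset=4, packet_len=8)
cell = tx.on_completion(tag=1, payload=b"AAAA")
print(cell["first"], cell["last"], cell["byte_offset"])  # True False 0
```

Each completion is turned into a cell immediately, without waiting for the other completions of the same packet, which is the basis of the cut-through behavior shown in Fig. 2.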

When the RX engine receives a fabric cell, it performs the following operations:

4a. If this is a "First" fabric cell:

i. Extract PID

ii. Allocate a new RX buffer from the RX ring

iii. Associate cell.PID and the RX buffer by adding them as an entry in the Open Reassembly database

iv. Write the payload of the cell to the host memory address (RX buffer address + cell.ByteOffset)

4b. If this is not a "First" fabric cell:

i. Extract PID

ii. Look up cell.PID in the Open Reassembly database and extract the RX buffer address

iii. Write the payload of the cell to the host memory address (RX buffer address + cell.ByteOffset)

4c. If this is a "Last" fabric cell, then on top of the above operations the engine performs the following:

i. Delete the Open Reassembly database entry after the cell contents are written to the buffer.

ii. Notify the driver that a new network packet has arrived

iii. Notify the driver of any error conditions that were encountered

Table 2: steps performed at a reception side of an end-to-end connection.

The operation of the TX engine 111 and the RX engine 131 of an RX/TX device 110, 130 as described above guarantees that the logical view of the communication model is maintained and that network packets are delivered appropriately from the source to the destination. Additionally, the result of the flow described above is that each network packet is written to the destination host in the original byte order in which it has arrived from the source host's memory. This implies that by using this technique a truly zero-buffering (or pure cut-through) delivery scheme is achieved, as shown below with respect to Fig. 2. According to Fig. 2, illustrated and described below, in the optimal delivery without the invention the transfer of Cell A1 starts only after the full packet A is read, meaning only after CPL A3 arrives, and only subsequently are Cell B1, Cell B2 and Cell B3 transferred. In an implementation, the TX device 110 is configured to perform the steps 1, 1a, 1b, 1c and 3, 3a to 3g as described above with respect to Table 1. In an implementation, the host memory 120 is configured to perform step 2 described above in Table 1.

In an implementation, the RX device 130 is configured to perform the steps 4a, 4b and 4c as described above with respect to table 2.

TX device 110 and RX device 130 may form an end-to-end system for transmitting network packets between two hosts. The hosts may communicate by transmitting send() and receive() commands at higher layers, e.g. by using driver software.

Fig. 2 shows a timing diagram 200 illustrating a timing of completion packets and fabric cells with respect to read requests according to an implementation form. Read requests A, B 501 are submitted one after the other on the PCIe bus. The corresponding read completions 503 may arrive in the following order (starting from the left): B1, A1, A2, B2, B3, A3. A standard device interface would need to store and forward the completion data 503 before composing it into packets 505 and submitting the packets 505 into the fabric. An end-to-end switching system 100 as described above with respect to Fig. 1 may not require such store-and-forward buffering of the completion data 503. Instead, the completion data 503 is delivered without additional store-and-forward buffering: it is transformed into a stream of network cells 207 just upon arrival of each read completion, such that a latency saving 207 over the standard system can be realized.
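The Fig. 2 scenario can be sketched end-to-end (packet contents, sizes and buffer addresses are hypothetical): completions for read requests A and B arrive interleaved, each is forwarded immediately as a cell, and the receiver still reconstructs both packets in their original byte order.

```python
# Sketch of the Fig. 2 arrival order B1, A1, A2, B2, B3, A3: two 6-byte
# packets A and B, delivered as 2-byte cells, reassembled via PID + ByteOffset.

cells = [  # (pid, byte_offset, payload, first, last), in arrival order
    ("B", 0, b"BB", True,  False),   # B1
    ("A", 0, b"AA", True,  False),   # A1
    ("A", 2, b"AA", False, False),   # A2
    ("B", 2, b"BB", False, False),   # B2
    ("B", 4, b"BB", False, True),    # B3
    ("A", 4, b"AA", False, True),    # A3
]

rx_ring = [0, 8]                  # two free 8-byte RX buffers
reassembly, memory = {}, bytearray(16)
for pid, off, payload, first, last in cells:
    addr = rx_ring.pop(0) if first else reassembly[pid]   # (4a)/(4b)
    if first:
        reassembly[pid] = addr
    memory[addr + off:addr + off + len(payload)] = payload
    if last:                                              # (4c)
        del reassembly[pid]

print(bytes(memory[0:6]), bytes(memory[8:14]))  # b'BBBBBB' b'AAAAAA'
```

No cell is buffered on the way: every write lands at its final host memory address as soon as the cell arrives, regardless of the interleaving.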

Fig. 3 shows a schematic diagram of a transmission method 300 according to an implementation form. The transmission method 300 includes:

(1) submitting multiple read request messages over a host interface corresponding to buffers of one or more network packets,

(a) assigning for each network packet a unique packet identifier,

(b) calculating for each read request message a start byte offset indicating a relative location of a first completion byte of the read request message within the original network packet,

(c) storing the unique packet identifier and the start byte offset as an entry within a transmission database; and

(3) upon arrival of completion data from the host interface for the submitted read request messages, for each read response message:

(a) associating the read response message with an entry of the database and extracting the packet identifier,

(b) transforming the read response message into a fabric cell by the following operations:

(c) marking the fabric cell with the packet identifier, (d) marking the fabric cell with a byte offset indicating a relative start byte offset of the fabric cell within the associated network packet, wherein the byte offset is calculated by summing a relative start offset of the corresponding read request message and a location of the read response message within an entire stream of completion bytes of that read request message,

(e) marking the fabric cell with a first flag if the fabric cell represents a first fabric cell of the network packet and a last flag if the fabric cell represents a last fabric cell of the network packet,

(f) releasing the transmission database entry if the fabric cell is marked with the last flag, and

(g) submitting the fabric cell over a fabric interface.

Item (2) illustrates the processing of the multiple read requests by a host memory. In one example, items (1) and (3) belong to the transmission method 300. In one example, items (1), (2) and (3) belong to the transmission method 300.

In an implementation, the method 300 is implemented in a TX device 110 as described above with respect to Fig. 1.

Fig. 4 shows a schematic diagram of a reception method 400 according to an implementation form. The reception method 400 includes performing the following operations upon reception of a fabric cell:

(4a) if the fabric cell is marked with a first flag:

(i) extracting a packet identifier from the fabric cell, (ii) allocating a new RX buffer from an RX ring buffer of a host memory, thereby obtaining an RX buffer address,

(iii) associating the packet identifier and the RX buffer address by adding them as an entry in a reassembly database, and

(iv) writing a payload of the fabric cell to a host memory address corresponding to the RX buffer address incremented by a byte offset extracted from the fabric cell;

(4b) if the fabric cell is not marked with a first flag:

(i) extracting a packet identifier from the fabric cell,

(ii) looking up the packet identifier in the reassembly database and extracting the RX buffer address therefrom, and (iii) writing a payload of the fabric cell to a host memory address corresponding to the RX buffer address incremented by a byte offset extracted from the fabric cell;

(4c) if the fabric cell (Cell A1) is marked with a last flag, performing, in addition to the above operations, the following:

(i) deleting the entry in the reassembly database after the payload of the fabric cell is written to the host memory address,

(ii) notifying a driver that a new network packet has arrived, and

(iii) notifying the driver of any error conditions that were encountered.

In an implementation, the method 400 is implemented in an RX device 130 as described above with respect to Fig. 1.

From the foregoing, it will be apparent to those skilled in the art that a variety of methods, systems, computer programs on recording media, and the like, are provided. The present disclosure also supports a computer program product including computer executable code or computer executable instructions that, when executed, causes at least one computer to execute the performing and computing steps described herein.

Many alternatives, modifications, and variations will be apparent to those skilled in the art in light of the above teachings. Of course, those skilled in the art readily recognize that there are numerous applications of the invention beyond those described herein. While the present invention has been described with reference to one or more particular embodiments, those skilled in the art recognize that many changes may be made thereto without departing from the scope of the present invention. It is therefore to be understood that within the scope of the appended claims and their equivalents, the invention may be practiced otherwise than as specifically described herein.