

Title:
BUFFERING PACKETS ACCORDING TO THEIR ADDRESSES
Document Type and Number:
WIPO Patent Application WO/2006/046028
Kind Code:
A1
Abstract:
A communication interface for providing an interface between a first data processing apparatus and a data link, the first data processing apparatus being connectable to a second data processing apparatus via the data link, the communication interface comprising: an input port for receiving from the first data processing apparatus groups of data for transmission to the second data processing apparatus, each group of data having a destination address indicating a region of memory in the second data processing apparatus; a plurality of buffers for storing groups of data received by the input port; and a control unit arranged to select, for each such group of data, one of the plurality of buffers in which to store the group of data, the control unit selecting the said one of the plurality of buffers in dependence on the destination address of the group.

Inventors:
POPE STEVE LESLIE (GB)
RIDDOCH DAVID JAMES (GB)
ROBERTS DEREK EDWARD (GB)
Application Number:
PCT/GB2005/004125
Publication Date:
May 04, 2006
Filing Date:
October 26, 2005
Assignee:
LEVEL 5 NETWORKS INC (US)
POPE STEVE LESLIE (GB)
RIDDOCH DAVID JAMES (GB)
ROBERTS DEREK EDWARD (GB)
International Classes:
H04L49/901; G06F12/10; G06F12/1081
Domestic Patent References:
WO2004079981A22004-09-16
Foreign References:
US20030112817A12003-06-19
Attorney, Agent or Firm:
Slingsby, Philip Roy (Bedford House John Street, London WC1N 2BF, GB)
Claims:
CLAIMS
1. A communication interface for providing an interface between a first data processing apparatus and a data link, the first data processing apparatus being connectable to a second data processing apparatus via the data link, the communication interface comprising: an input port for receiving from the first data processing apparatus groups of data for transmission to the second data processing apparatus, each group of data having a destination address indicating a region of memory in the second data processing apparatus; a plurality of buffers for storing groups of data received by the input port; and a control unit arranged to select, for each such group of data, one of the plurality of buffers in which to store the group of data, the control unit selecting the said one of the plurality of buffers in dependence on the destination address of the group.
2. A communication interface as claimed in claim 1 wherein the control unit is further arranged to determine the destination addresses of groups of data stored in at least one of the buffers and, for each group of data, preferentially select as the said one of the buffers a buffer containing a group of data having a destination address matching the destination address of that group of data.
3. A communication interface as claimed in claim 1 or claim 2 wherein the plurality of buffers are first-in first-out buffers.
4. A communication interface as claimed in any preceding claim further comprising a data store and a data manager for storing in the data store data indicating the content of each of the plurality of buffers.
5. A communication interface as claimed in claim 4 wherein the said data indicating the content of each of the plurality of buffers includes the destination addresses of groups of data stored in each of the buffers.
6. A communication interface as claimed in claim 5 wherein the control unit is arranged to determine the destination addresses of groups of data stored in at least one of the buffers by accessing the data store.
7. A communication interface as claimed in any preceding claim wherein the destination address includes a destination aperture.
8. A communication interface as claimed in any preceding claim wherein the destination address includes an indication of the second data processing apparatus.
9. A communication interface as claimed in claim 7 as dependent on claim 2 wherein the control unit is arranged to determine that a pair of destination addresses match if the destination aperture of one group of data is the same as the destination aperture of another group of data.
10. A communication interface as claimed in any preceding claim further comprising a packet creation unit for creating packets of data for transmission to the second data processing apparatus, the packet creation unit being arranged to: receive a first group of data from one of the buffers; determine the destination address of the first group of data; insert at least part of the first group of data into a packet; receive a second group of data from the said one of the buffers; determine the destination address of the second group of data; and only if the destination addresses of the first and second groups of data match, insert at least part of the second group of data into the packet.
11. A communication interface as claimed in claim 10 wherein the packet creation unit is further arranged to, if the destination addresses of the first and second groups of data do not match: create a further packet; and insert at least part of the second group of data into the further packet.
12. A communication interface as claimed in claim 11 wherein the packet creation unit is further arranged to, if the destination addresses of the first and second groups of data do not match: terminate the packet comprising the first group of data.
13. A communication interface as claimed in any of claims 10 to 12 wherein, if the destination addresses of the first and second groups of data match, the packet creation unit is arranged not to insert any parts of the destination address of the second packet that have already been inserted into the packet.
14. A communication interface as claimed in any preceding claim wherein each group of data is associated with a priority level indicating a priority with which the group is to be transmitted to the second data processing apparatus by the communication interface.
15. A communication interface as claimed in claim 14 wherein the control unit is further arranged to select the said one of the plurality of buffers in dependence on the priority level of the group.
16. A communication interface as claimed in claim 14 or claim 15 wherein the control unit is further arranged to determine the priority levels of groups of data stored in at least one of the buffers and, for each group of data, preferentially select as the said one of the buffers a buffer containing a group of data having a priority level matching the priority level of the group of data.
17. A communication interface as claimed in any of claims 14 to 16 as dependent on claim 4 wherein the data indicating the content of each of the plurality of buffers includes the priority levels of groups of data stored in each of the buffers.
18. A communication interface as claimed in claim 17 as dependent on claim 6 wherein the control unit is further arranged to determine the priority levels of groups of data stored in at least one of the buffers by accessing the data store.
19. A communication interface as claimed in any preceding claim further comprising a packet creation unit for forming a stream of data for transmission in a packet over the data link to the second data processing apparatus, the packet creation unit being arranged to: select a first one of the plurality of buffers in dependence on the priority levels of groups of data stored therein; retrieve a first group of data from the first one of the plurality of buffers; and insert at least a part of the first group into a packet; whereby the packet can be created so as to include groups of data of a high priority in preference to groups of data of a lower priority.
20. A communication interface as claimed in claim 19 wherein the packet creation unit is further arranged to select the first one of the plurality of buffers in dependence on the destination addresses of groups of data stored therein.
21. A system comprising a first data processing apparatus, a second data processing apparatus and a communication interface for providing an interface between the first data processing apparatus and a data link, the first data processing apparatus being connectable to a second data processing apparatus via the data link, the communication interface comprising: an input port for receiving from the first data processing apparatus groups of data for transmission to the second data processing apparatus, each group of data having a destination address indicating a region of memory in the second data processing apparatus; a plurality of buffers for storing groups of data received by the input port; and a control unit arranged to select, for each such group of data, one of the plurality of buffers in which to store the group of data, the control unit selecting the said one of the plurality of buffers in dependence on the destination address of the group.
22. A method for buffering received groups of data in a communication interface between a first data processing apparatus and a data link whereby the first data processing apparatus is connectable to a second data processing apparatus, the method comprising: receiving from the first data processing apparatus a group of data for transmission to the second data processing apparatus, the group of data having a destination address indicating a region of memory in the second data processing apparatus; selecting one of a plurality of buffers of the communication interface in dependence on the destination address of the received group of data; and storing the received group of data in the selected buffer.
23. A method as claimed in claim 22 wherein the selecting step further comprises determining the destination addresses of further groups of data stored in at least one of the plurality of buffers and selecting as the said one of the buffers a buffer containing a group of data having a destination address matching the destination address of the received group of data.
Description:
BUFFERING PACKETS ACCORDING TO THEIR ADDRESSES

The present invention relates to a communication interface in a communications network, and to a method for buffering data in a communication interface.

When data is to be transferred between two devices over a data channel, each of the devices must have a suitable network interface to allow it to communicate across the channel. The devices and their network interfaces use a protocol to form the data that is transmitted over the channel, so that it can be decoded at the receiver. The data channel may be considered to be or to form part of a network, and additional devices may be connected to the network.

The Ethernet system is used for many networking applications. Gigabit Ethernet is a high-speed version of the Ethernet protocol, which is especially suitable for links that require a large amount of bandwidth, such as links between servers or between data processors in the same or different enclosures. Devices that are to communicate over the Ethernet system are equipped with network interfaces that are capable of supporting the physical and logical requirements of the Ethernet system. The physical hardware component of network interfaces are referred to as network interface cards (NICs), although they need not be in the form of cards: for instance they could be in the form of integrated circuits (ICs) and connectors fitted directly on to a motherboard.

Where data is to be transferred between cooperating processors in a network, it is common to implement a memory mapped system. In a memory mapped system communication between the applications is achieved by virtue of a portion of one application's virtual address space being mapped over the network onto another application. The "holes" in the address space which form the mapping are termed apertures. A particular location within an aperture can be identified by means of an "offset".

Figure 1 illustrates a mapping of the virtual address space (X0 - Xn) onto another virtual address space (Y0 - Yn) via a network. In such a system a CPU that has access to the X0 - Xn memory space could access a location X1, for writing the contents of a register r1 to that location, by issuing the store instruction [st r1, X1]. A memory management unit (MMU) is employed to map the virtual memory onto physical memory locations.

The following steps would then be taken:

1. The CPU emits the contents of r1 (say value 10) as a write operation to virtual address X1.
2. The MMU (which could be within the CPU) turns the virtual address X1 into physical address PCI1 (this may include page table traversal or a page fault).
3. The CPU's write buffer emits the "write 10, PCI1" instruction, which is "caught" by the controller for the bus on which the CPU is located, in this example a PCI (Input/Output bus subsystem) controller. The instruction is then forwarded onto the computer's PCI bus.
4. A NIC connected to the bus and interfacing to the network "catches" the PCI instruction and forwards the data to the destination computer at which virtual address space (Y0 - Yn) is hosted.
5. At the destination computer, which is assumed to have equivalent hardware, the network card emits a PCI write transaction to store the data in memory.
6. The receiving application has a virtual memory mapping onto the memory and may read the data by executing a "load Y1" instruction.

These steps are illustrated in figure 2. This figure illustrates that at each point where the hardware store instruction passes from one hardware device to another, a translation of the address from one address space to another may be required. Note also that a very similar chain of events supports read operations, and that PCI is assumed here, but not required, as the host IO bus implementation.

Hence the overall memory space mapping (X 0 - X n ) → (Y 0 - Y n ) is implemented by a series of sub-mappings as follows:

{X0 - Xn} (application's virtual address space)
→ {PCI0 - PCIn} (processor 1 address space)
→ {PCI'0 - PCI'n} (PCI bus address space)
→ Network (mapping not shown)
→ {PCI'0 - PCI'n} (destination PCI bus address space)
→ {mem0 - memn} (destination memory address space)
→ {Y0 - Yn} (destination application's virtual address space)

The step marked in figure 2 as "Network" requires the NIC / network controller to forward the transaction to the correct destination host in such a way that the destination can continue the mapping chain. This is achieved by means of further memory apertures.
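The chain of sub-mappings above can be sketched as a composition of simple base-offset translations. All base addresses and function names below are invented for illustration; real mappings would be held in page tables and aperture tables, not fixed offsets.

```python
# Hedged sketch of the mapping chain: each address space is modelled as a
# base-offset translation of the previous one. Base addresses are invented.

def make_mapping(src_base, dst_base):
    """Return a function translating an address from one space to the next."""
    def translate(addr):
        return addr - src_base + dst_base
    return translate

virt_to_phys = make_mapping(0x10000, 0x40000)  # {X0..Xn} -> processor address space
phys_to_pci  = make_mapping(0x40000, 0x80000)  # -> PCI bus address space
pci_to_dest  = make_mapping(0x80000, 0xC0000)  # network hop -> destination PCI bus
dest_to_mem  = make_mapping(0xC0000, 0x20000)  # -> destination memory
mem_to_virt  = make_mapping(0x20000, 0x50000)  # -> {Y0..Yn}

def x_to_y(addr):
    """Compose the sub-mappings to realise the overall (X0..Xn) -> (Y0..Yn) map."""
    for stage in (virt_to_phys, phys_to_pci, pci_to_dest, dest_to_mem, mem_to_virt):
        addr = stage(addr)
    return addr
```

The point of the composition is that no single device knows the whole mapping; each hop translates only into the next address space in the chain.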

In typical protocols for passing data over a network, bursts of data, which comprise a number of bytes, are formed into a packet for transmission from one device to another. The packet will typically comprise a header which includes identifiers of a place in the address space of the destination device in which the data is intended to be stored. For example, the header might include an indication of an aperture in the memory of the destination device. The packet corresponding to that header could then include bytes of content data intended for that destination aperture. In general, each burst including bytes of content data will have associated with it an aperture and an offset indicating a location in the memory of the destination device. In most known protocols, all bytes within a burst will relate to the same destination aperture.

When compiling a packet from a series of bursts, it is preferable to group together bursts relating to the same destination aperture, so that the packet header, which generally includes an indication of that destination aperture, would be applicable to all the data within the packet. It is then common to include within the content portion of the packet an indication of the destination offset of each burst before including the burst itself. Thus, the content portion of a packet might be formulated as follows:

offset1, burst1, offset2, burst2, offset3, burst3

For example, in a given protocol, an offset could occur every 64 bytes. Due to the packet space taken up by offsets and by packet headers, the density of content data in a packet can be quite low.
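The density point can be made concrete with a small calculation; the header, offset and burst sizes below are assumed purely for illustration and are not taken from any particular protocol.

```python
def payload_density(header_len, offset_len, burst_len, n_bursts):
    """Fraction of a packet's bytes that are content data, given that each
    burst is preceded by an offset field and the packet carries one header.
    All sizes are in bytes and are illustrative assumptions."""
    total = header_len + n_bursts * (offset_len + burst_len)
    return (n_bursts * burst_len) / total
```

With an assumed 40-byte header, 8-byte offsets and 64-byte bursts, three bursts give a density of 0.75; the density falls further as bursts shrink or headers grow, which is the inefficiency the text describes.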

In general, the overhead of changing apertures, and thus beginning a new packet, is high. In particular, if a new aperture relates to a different central processing unit (CPU), the new packet would generally need to be in a new physical layer. It is therefore efficient to transfer over a network a packet containing as much content data as is permitted by a given protocol, with all of the content data relating to a single aperture specified in the packet header.

In a typical prior art network arrangement, if a dual processor within a data processing apparatus is running two applications simultaneously, it might be producing two parallel streams of data for transmission to another data processing apparatus within the network. These two streams might be intended for two different apertures in the destination memory. For example, the first stream might include data for aperture1 while the second stream includes data for aperture2. Each new burst will be intended for a new offset within the apertures. Thus, the data produced by the dual processor might be in the order illustrated below:

aperture1, offset1; aperture2, offset1; aperture1, offset2; aperture2, offset2; etc.

This effect could also be seen within a single application on a single processor, if multiple Direct Memory Access (DMA) engines were employed. In DMA, a NIC is arranged to pull data from memory for transmission, instead of a CPU pushing data towards the NIC. In general, it is necessary for data intended for more than one aperture to be packetised at the same time.

If a single buffer is used at a communication interface to receive this data, the bursts for aperture1 will become interleaved with the bursts for aperture2. FIFO (first-in first-out) buffers tend to be used in interfaces, and this means that once data is applied to a buffer it cannot be reordered within it. It can therefore be problematic to create packets containing only data intended for one aperture. This can result in sub-optimal sized packets being emitted by the communication interface onto the network, and efficiency can be severely limited.
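The interleaving problem can be illustrated with a toy model (the aperture and burst names are invented): a single FIFO preserves the interleaved arrival order, while one FIFO per aperture keeps each aperture's bursts together so a packet can later be filled from one buffer.

```python
from collections import deque

# Two streams produce bursts for two apertures; they arrive interleaved.
arrivals = [("ap1", "b0"), ("ap2", "b0"), ("ap1", "b1"), ("ap2", "b1")]

# A single FIFO preserves the interleaved order and cannot be reordered.
single_fifo = deque(arrivals)

# One FIFO per aperture keeps each aperture's bursts together, so a packet
# can later be filled entirely from one buffer.
per_aperture = {}
for aperture, burst in arrivals:
    per_aperture.setdefault(aperture, deque()).append(burst)
```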

In some protocols it is possible to assign a priority to a burst of data. For example, there might be three levels of priority, with 0 indicating high, 1 indicating medium and 2 indicating low priority. This can allow for certain data to be transmitted through a network preferentially. For example, some data such as speech data might be time-critical, while other data could acceptably be sent over a longer period of time.

It would be desirable to provide an improved method and apparatus for buffering data or for forming packets in a communication interface, so as to mitigate some or all of the above-identified problems.

According to the first aspect of the present invention there is provided a communication interface for providing an interface between a first data processing apparatus and a data link, the first data processing apparatus being connectable to a second data processing apparatus via the data link, the communication interface comprising: an input port for receiving from the first data processing apparatus groups of data for transmission to the second data processing apparatus, each group of data having a destination address indicating a region of memory in the second data processing apparatus; a plurality of buffers for storing groups of data received by the input port; and a control unit arranged to select, for each such group of data, one of the plurality of buffers in which to store the group of data, the control unit selecting the said one of the plurality of buffers in dependence on the destination address of the group.

In embodiments of the invention, the communication interface can thereby store together groups of data intended for the same destination aperture, and this can facilitate packet formation so that packets can advantageously contain relatively large amounts of data all intended for the same address.

The control unit could be further arranged to determine the destination addresses of groups of data stored in at least one of the buffers and, for each group of data, preferentially select as the said one of the buffers a buffer containing a group of data having a destination address matching the destination address of that group of data.

The plurality of buffers could suitably be first-in first-out buffers. In such an arrangement, it is generally not possible to re-order data once it has been stored, so that embodiments of the invention are especially advantageous.

The communication interface could also comprise a data store, and a data manager for storing in the data store data indicating the content of each of the plurality of buffers. This data could include the destination addresses of groups of data stored in each of the buffers. The control unit could then be arranged to determine the destination addresses of groups of data stored in at least one of the buffers by accessing the data store.
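The selection behaviour just described can be sketched as follows. The class and method names are invented, and the data store is modelled as a set of destinations per buffer; real hardware would use different structures.

```python
class ControlUnit:
    """Hedged sketch: a data store records which destination addresses each
    buffer currently holds, and a buffer already holding a matching
    destination is preferentially selected."""

    def __init__(self, n_buffers):
        self.buffers = [[] for _ in range(n_buffers)]
        self.data_store = [set() for _ in range(n_buffers)]

    def select(self, dest):
        # Preferentially pick a buffer already holding this destination.
        for i, dests in enumerate(self.data_store):
            if dest in dests:
                return i
        # Otherwise prefer an empty buffer.
        for i, buf in enumerate(self.buffers):
            if not buf:
                return i
        return 0  # fallback: all buffers occupied by other destinations

    def store_group(self, dest, group):
        i = self.select(dest)
        self.buffers[i].append((dest, group))
        self.data_store[i].add(dest)
        return i
```

In this sketch, groups for the same aperture accumulate in the same buffer, which is what later allows densely packed single-aperture packets.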

The destination address could include a destination aperture or an indication of the second data processing apparatus.

Where the control unit is arranged to select a buffer on the basis of a destination address match, it could determine that the destination addresses match if the destination aperture of one group of data is the same as the destination aperture of another group of data.

The communication interface could additionally comprise a packet creation unit for creating packets of data for transmission to the second data processing apparatus, the packet creation unit being arranged to: receive a first group of data from one of the buffers; determine the destination address of the first group of data; insert at least part of the first group of data into a packet; receive a second group of data from the said one of the buffers; determine the destination address of the second group of data; and only if the destination addresses of the first and second groups of data match, insert at least part of the second group of data into the packet.

The packet creation unit may further be arranged to, if the destination addresses of the first and second groups of data do not match: create a further packet; and insert at least part of the second group of data into the further packet. The packet creation unit could also be arranged to, if the destination addresses of the first and second groups of data do not match: terminate the packet comprising the first group of data. If the destination addresses of the first and second groups of data match, the packet creation unit could be arranged not to insert any parts of the destination address of the second packet that have already been inserted into the packet.
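The packet creation rule described in the preceding two paragraphs can be sketched as below. This is a simplification: real packets would carry headers and offsets, and the function name is invented.

```python
def packetise(groups):
    """Sketch of the packet creation rule: a group joins the current packet
    only if its destination matches; otherwise the current packet is
    terminated and a further packet is created."""
    packets = []
    current_dest = None
    for dest, data in groups:
        if not packets or dest != current_dest:
            packets.append({"dest": dest, "payload": []})
            current_dest = dest
        packets[-1]["payload"].append(data)
    return packets
```

Because the buffers upstream already segregate groups by destination, runs of matching destinations are long and few packets are terminated early.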

Optionally, each group of data could be associated with a priority level indicating a priority with which the group is to be transmitted to the second data processing apparatus by the communication interface. The control unit could then be arranged to select the said one of the plurality of buffers in dependence on the priority level of the group. The control unit could also be arranged to determine the priority levels of groups of data stored in at least one of the buffers and, for each group of data, preferentially select as the said one of the buffers a buffer containing a group of data having a priority level matching the priority level of the group of data.

Where a data store is utilised, the data indicating the content of each of the plurality of buffers could include the priority levels of groups of data stored in each of the buffers. The control unit could optionally be arranged to determine the priority levels of groups of data stored in at least one of the buffers by accessing the data store.

The communication interface could further comprise a packet creation unit for forming a stream of data for transmission in a packet over the data link to the second data processing apparatus, the packet creation unit being arranged to: select a first one of the plurality of buffers in dependence on the priority levels of groups of data stored therein; retrieve a first group of data from the first one of the plurality of buffers; and insert at least a part of the first group into a packet; whereby the packet can be created so as to include groups of data of a high priority in preference to groups of data of a lower priority. The packet creation unit could be further arranged to select the first one of the plurality of buffers in dependence on the destination addresses of groups of data stored therein.
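Priority-based buffer selection can be sketched as follows, using the example scheme above in which 0 indicates high and 2 indicates low priority; the function and its representation of empty buffers are illustrative assumptions.

```python
def select_by_priority(buffer_priorities):
    """Pick the index of the buffer whose pending data is most urgent
    (0 = high, 2 = low); None entries denote empty buffers."""
    candidates = [(p, i) for i, p in enumerate(buffer_priorities) if p is not None]
    if not candidates:
        return None
    return min(candidates)[1]
```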

According to a second aspect of the present invention there is provided a system comprising a first data processing apparatus, a second data processing apparatus and a communication interface for providing an interface between the first data processing apparatus and a data link, the first data processing apparatus being connectable to a second data processing apparatus via the data link, the communication interface comprising: an input port for receiving from the first data processing apparatus groups of data for transmission to the second data processing apparatus, each group of data having a destination address indicating a region of memory in the second data processing apparatus; a plurality of buffers for storing groups of data received by the input port; and a control unit arranged to select, for each such group of data, one of the plurality of buffers in which to store the group of data, the control unit selecting the said one of the plurality of buffers in dependence on the destination address of the group.

According to a third aspect of the present invention there is provided a method for buffering received groups of data in a communication interface between a first data processing apparatus and a data link whereby the first data processing apparatus is connectable to a second data processing apparatus, the method comprising: receiving from the first data processing apparatus a group of data for transmission to the second data processing apparatus, the group of data having a destination address indicating a region of memory in the second data processing apparatus; selecting one of a plurality of buffers of the communication interface in dependence on the destination address of the received group of data; and storing the received group of data in the selected buffer.

The selecting step could further comprise determining the destination addresses of further groups of data stored in at least one of the plurality of buffers and selecting as the said one of the buffers a buffer containing a group of data having a destination address matching the destination address of the received group of data.

The invention will now be described by way of example with reference to the accompanying drawings, in which:
figure 1 illustrates mapping of one address space on to another over a network;
figure 2 illustrates a prior art memory mapped architecture;
figure 3 is a schematic diagram of a data transmission system;
figures 4 and 5 illustrate mapping of bits of an address;
figure 6 illustrates memory space apertures and their management domains;
figure 7 illustrates features of a port;
figure 8 illustrates a queue with control blocks;
figure 9 illustrates a dual queue mechanism;
figure 10 shows an example of an outgoing aperture table;
figure 11 shows an example of an incoming aperture table;
figure 12 shows the steps in a PCI write for an outgoing aperture;
figure 13 illustrates the operation of pointers in fixed and variable length queues;
figure 14 shows the structure of an Ethernet packet;
figure 15 illustrates the structure of a burst in frame;
figure 16 shows a data burst;
figure 17 illustrates data reception into memory;
figure 18 illustrates a series of buffers in a communication interface;
figure 19 shows a communication network connecting a pair of devices; and
figure 20 shows a representation of data packets.

Figure 3 is a schematic diagram of a data transmission system whereby a first data processing unit (DPU) 20 can communicate with a second data processing unit 21 over a network link 22. Each data processing unit comprises a CPU 23, 24 which is connected via a memory bus 25, 26 to a PCI controller 27, 28. The PCI controllers control communications over respective PCI buses 29, 30, to which are connected NICs 31, 32. The NICs are connected to each other over the network. Other similar data processing units can be connected to the network to allow them to communicate with each other and with the DPUs 20, 21. Local random access memory (RAM) 33, 34 is connected to each memory bus 25, 26.

The data transmission system described herein implements several significant features: (1) dynamic caching of aperture mappings between the NICs 31, 32; (2) a packet oriented setup and teardown arrangement for communication between the NICs; and (3) the use of certain bits that are herein termed "nonce bits" in the address space of one or both NICs.

Dynamic Caching of Aperture Entries

A small number of aperture mappings can be stored efficiently using a static table. To implement this, a number of bits (the map bits) of an address are caught by the address decode logic of an NIC and are used as an index into an array of memory which contains the bits that are used for reversing the mapping (the remap bits). For example, in a system of the type illustrated in figure 3 an NIC might receive over the PCI bus 29 a request for reading or writing data at a specified local address. The NIC stores a mapping that indicates the remote address that corresponds to that local address, the transformation being performed by substituting one or more of the bits of the local address. For example, the second and third nibbles of the address could be substituted. In that case, to access the remote address that corresponds to a local address of 0x8210BEEC, the NIC would access the mapping table, determine the mapping for bits "21" (suppose the mapping is to bits "32") and then address the corresponding remote address (in this example 0x8320BEEC). (See figure 4.)

This method is scalable up to a few hundred or thousand entries depending on the implementation technology used (typically FPGA or ASIC) but is limited by the space available within the device that is used to hold the mapping table. A superior method of implementation is to store the mappings in a larger store (to which access is consequently slower) and to cache the most recently used mappings in an associative memory that can be accessed quickly. If a match for the bits that are to be substituted is found in the associative memory (by a hardware search operation) then the remap is made very quickly. If no match is found the hardware must perform a secondary lookup in the larger memory (in either a table or tree structure). Typically the associative memory will be implemented on the processing chip of the NIC, and the larger memory will be implemented off-chip, for example in DRAM. This is illustrated in figure 5. This method is somewhat similar to the operation of a TLB on a CPU; however here it is used for an entirely different function: i.e. for the purpose of aperture mapping on a memory mapped network card.
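The two-level lookup can be sketched as follows, using the worked example above (local address 0x8210BEEC with map bits "21" remapped to "32"). The dictionary stands in for the on-chip associative memory, the full table for the larger off-chip store, and the eviction policy is an arbitrary placeholder; the class name and structure are invented.

```python
class ApertureMapper:
    """Hedged sketch of the cached aperture remapping described above."""

    MAP_SHIFT = 20   # second and third nibbles of a 32-bit address
    MAP_MASK = 0xFF

    def __init__(self, full_table, cache_size=4):
        self.full_table = full_table   # map bits -> remap bits (larger store)
        self.cache = {}                # stands in for the associative memory
        self.cache_size = cache_size

    def remap(self, addr):
        map_bits = (addr >> self.MAP_SHIFT) & self.MAP_MASK
        if map_bits in self.cache:
            remap_bits = self.cache[map_bits]        # fast associative hit
        else:
            remap_bits = self.full_table[map_bits]   # slow secondary lookup
            if len(self.cache) >= self.cache_size:
                self.cache.pop(next(iter(self.cache)))  # placeholder eviction
            self.cache[map_bits] = remap_bits
        cleared = addr & ~(self.MAP_MASK << self.MAP_SHIFT)
        return cleared | (remap_bits << self.MAP_SHIFT)
```

A first access to a given aperture pays the slow lookup; subsequent accesses with the same map bits hit the cache, mirroring the TLB-like behaviour described above.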

In practice, the mapping information must contain all the address information required to transmit a packet over a network. This is discussed in more detail below.

Packet oriented connection setup and tear down protocol

A protocol will now be described for establishing a connection between two applications' address spaces using apertures, where there are two administration domains (one belonging to each of the communicating hosts). The general arrangement is illustrated in figure 6. In domain A there is a host A having a virtual address space A and an NIC A that can access the virtual address space. In domain B there is a host B having a virtual address space B and an NIC B that can access the virtual address space. The NICs are connected together over a network.

In this example mapping entries for devices in domain A can only be set by the operating system on host A. A further implementation, in which an application A running on host A is allowed to set some (but not all) bits on an aperture mapping within domain A, is described below.

The connection protocol to be described uses IP (Internet Protocol) datagrams to transfer packets from one host to another (just as for standard Ethernet networks). The datagrams are addressed as <host:port> where <host> is the network identifier of the destination host and <port> is an identifier for the application within the host (note that each application may have a number of allocated ports corresponding to different network connections). It will be appreciated that the present protocol could be used over transport protocols other than IP.

In the present protocol the connection setup proceeds as follows, assuming host A wishes to make an active connection to a passive (accepting) host B on which an application B is running.

7. Application B publishes its accepting internet address <host B :port B>; this can be accessed over the network in the normal way.

8. Application A (which for convenience will be referred to as host A) presents a request to Operating System A for the creation of an incoming aperture onto memory within host A to be used for communication. Once this aperture has been defined its details are programmed on NIC A so that incoming network writes that are directed to addresses in that virtual space will be directed onto the corresponding real addresses in memory A. The aperture will be given a reference address: in-index A.

9. Host A sends an IP datagram to <host B :port B> which contains the connect message:

[CONNECT/in-index A ]

Note that the full IP datagram will also contain source and destination IP addresses (and ports), as normal.

10. The connect message is received by application B. The message may be received either directly at user level or by the operating system (according to the status of the dual event queue), as described later.

11. Host B recognises the message as being a request to connect to B, offering the aperture in-index A. Using rules pre-programmed at B (typically for security reasons), host B will decide whether to reject or accept the connection. If B decides to accept the connection, it creates (or uses a pre-created) incoming aperture which is mapped onto memory B and is given the reference address in-index B. Host B may choose to create a new port for the connection: port'B. Host B sends back to host A an accept message as an IP datagram:

[ACCEPT/port'B/in-index B] to host A. Note that the full IP datagram will also contain source and destination IP addresses (and ports), as normal.

Once this has been received, each host has created an aperture, each NIC is set up to perform the mapping for requests to read or write in that aperture, and each host knows the reference address of the other host's aperture.
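The CONNECT/ACCEPT exchange above can be sketched in software. This is a minimal illustration only: the string framing of the messages, the function names and the accept/reject rule are all assumptions, not part of the protocol as specified.

```python
# Sketch of the connection setup exchange. Host A sends a CONNECT
# carrying its incoming aperture index; host B applies its
# pre-programmed rule and, if accepting, replies with a (possibly new)
# port and its own incoming aperture index.

def make_connect(in_index_a: int) -> str:
    return f"CONNECT/{in_index_a}"

def handle_connect(msg: str, accept: bool, in_index_b: int, new_port_b: int):
    """Host B's side: decide whether to reject or accept."""
    assert msg.startswith("CONNECT/")
    if not accept:
        return None                        # connection rejected
    # B maps an incoming aperture onto memory B and replies.
    return f"ACCEPT/{new_port_b}/{in_index_b}"
```

After this exchange each side would program its NIC with an outgoing aperture targeting the other side's published in-index.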

12. Following the messaging discussed so far, both hosts create outgoing apertures. A creates an aperture which maps application A's virtual address space onto NIC A outgoing aperture out-index A. This outgoing aperture maps onto [host B :in-index B], which maps onto memory B. Host B creates a similar outgoing aperture out-index B which maps onto memory A. By this means, bi-directional communication is possible through the memory mapped regions. At any time the applications may send a message to the port which is associated with the memory mapping. Such messages may be used to carry out of band data, for example:

13.A CLOSE message to indicate that the connection and hence memory mappings should be closed down

14. An ALIVE message to request a response from a non-responding application [ALIVEACK would be the response]

15. An ERROR message, which is generated by any hardware element on the data path that has detected a data transfer error. This message is important because it allows feedback to be provided from the memory mapped interface.

Note that where an application already has a virtual address mapping onto an outgoing aperture, step 6 reduces to a request for the NIC to map the outgoing aperture onto a particular host's incoming aperture. This is described further in terms of user level connection management below.

Dual Event Queues

In the present context a port will be considered to be an operating system specific entity which is bound to an application, has an address code, and can receive messages. This concept is illustrated in figure 7. One or more incoming messages that are addressed to a port form a message queue, which is handled by the operating system. The operating system has previously stored a binding between that port and an application running on the operating system. Messages in the message queue for a port are processed by the operating system and provided by the operating system to the application to which that port is bound. The operating system can store multiple bindings of ports to applications so that incoming messages, by specifying the appropriate port, can be applied to the appropriate application.

The port exists within the operating system so that messages can be received and securely handled no matter what the state of the corresponding application. It is bound (tethered) to a particular application and has a message queue attached. In traditional protocol stacks, e.g. in-kernel TCP/IP, all data is normally enqueued on the port message queue before it is read by the application. (This overhead can be avoided by the memory mapped data transfer mechanism described herein.)

In the scheme to be described herein, only out of band data is enqueued on the port message queue. Figure 7 illustrates this for a CONNECT message. In figure 7, an incoming packet E, containing a specification of a destination host and port (field 50), a message type (field 51) and an index (field 52), is received by NIC 53. Since this data is a CONNECT message it falls into the class of out of band data. However, it is still applied to the message queue 54 of the appropriate port 55, from where it can be read by the application that has been assigned by the operating system to that port.

A further enhancement is to use a dual queue, associated with a port. This can help to minimise the requirement to make system calls when reading out of band messages. This is particularly useful where there are many messages, e.g. a high connection rate (as for a web server) or a high error rate (as may be expected for Ethernet).

At the beginning of its operations, the operating system creates a queue to handle out of band messages. This queue may be written to by the NIC and may have an interrupt associated with it. When an application binds to a port, the operating system creates the port and associates it with the application. It also creates a queue to handle out of band messages for that port only. That out of band message queue for the port is then memory mapped into the application's virtual address space such that it may de-queue events without requiring a kernel context switch.

The event queues are registered with the NIC, and there is a control block on the NIC associated with each queue (and mapped into either or both the OS or application's address space(s)).

A queue with control blocks is illustrated in figure 8. The queue 59 is stored in memory 60, to which the NIC 61 has access. Associated with the queue are a read pointer (RDPTR) 62a and a write pointer (WRPTR) 63a, which indicate the points in the queue at which data is to be read and written next. Pointer 62a is stored in memory 60. Pointer 63a is stored in NIC 61. Mapped copies of the pointers, RDPTR' 62b and WRPTR' 63b, are stored in the other of the NIC and the memory than the original pointers. In the operation of the system:

16. The NIC can determine the space available for writing by comparing RDPTR' and WRPTR, both of which it stores locally.

17. The NIC generates out of band data when it is received in a datagram and writes it to the queue 59.

18. The NIC updates WRPTR and WRPTR' when the data has been written, so that the next data will be written after the last data.

19. The application determines the space available for reading by comparing RDPTR and WRPTR' as accessed from memory 60.

20. The application reads the out of band data from queue 59 and processes the messages.

21. The application updates RDPTR and RDPTR'.

22. If the application requires an interrupt, then it (or the operating system on its behalf) sets the IRQ 65a and IRQ' 65b bits of the control block 64. The control block is stored in memory 60 and is mapped onto corresponding storage in the NIC. If these bits are set, then the NIC will also generate an interrupt at step 18.
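The pointer comparisons in the steps above amount to standard circular-buffer arithmetic. The sketch below models them; the queue size and the convention of keeping one slot empty to distinguish full from empty are assumptions for illustration.

```python
# Sketch of the event-queue pointer arithmetic: the writer (NIC)
# compares its local WRPTR with the mapped copy RDPTR'; the reader
# (application) compares its local RDPTR with the mapped copy WRPTR'.

QUEUE_SIZE = 64   # illustrative number of slots

def space_for_writing(rdptr_copy: int, wrptr: int) -> int:
    """Slots the NIC may write without overtaking the reader
    (one slot is kept empty so that full != empty)."""
    return (rdptr_copy - wrptr - 1) % QUEUE_SIZE

def data_for_reading(rdptr: int, wrptr_copy: int) -> int:
    """Entries the application may read."""
    return (wrptr_copy - rdptr) % QUEUE_SIZE
```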

If an interrupt is generated, then firstly the PCI interrupt line is asserted to ensure the computer's interrupt handler is executed, and secondly a message is delivered into the operating system's queue. In general, this queue can handle many interrupt types, such as hardware failure, but in this case the OS queue contains the message [ODBDATA:PORT], indicating that out of band data has been delivered to the application queue belonging to [PORT]. The OS can examine the data in queue 59 and take appropriate action. The usual situation will be that the application is blocked or descheduled and the OS must wake it (mark it as runnable to the scheduler).

This dual queue mechanism enables out of band data to be handled by the application without involving the OS - while the application is running. Where the application(s) is blocked, the second queue and interrupt enable the OS to determine which of potentially many application queues have had data delivered. The overall arrangement is illustrated in figure 9.

The out of band (OOB) queue holds out of band data, which are:

23. Error events associated with the port

24. Connection setup messages and other signalling messages from the network and other applications.

25. Data delivery events, which may be generated either by the sending application, the NIC, or the receiving OS.

If the queue is to contain variable sized data then the size of the data part of each message must be included at the start of the message.

When applications are to communicate in the present system over shared memory, a single work queue can be shared between two communicating endpoints using non-coherent shared memory. As data is written into the queue, write pointer (WRPTR) updates are also written by the transmitting application into the remote network-mapped memory to indicate the data valid for reading. As data is removed from the queue, read pointer (RDPTR) updates are written by the receiving application back over the network to indicate free space in the queue.

These pointer updates are conservative and may lag the reading or writing of data by a short time, but this means that a transmitter will not initiate a network transfer of data until buffer space is available at the receiver, and the low latency of the pointer updates means that the amount of queue buffer space required to support a pair of communicating endpoints is small. The event mechanism described above can be used to allow applications to block on full/empty queues and to manage large numbers of queues via a multiplexed event stream, which is scalable in terms of CPU usage and response time.

Variable length data destined for an event queue would be delivered to a second queue. This has the advantage of simplifying the event generation mechanism in hardware. Thus the fixed size queue contains simple events and pointers (sizes) into the variable length queue.

26. As shown in figure 13, the difference between RDPTR 1 and WRPTR 1 indicates the valid events in the queue, and also the number of events, because they are of fixed size.

27. The event Var 10 (for illustration) indicates that a variable sized event of size 10 words has been placed on the variable sized queue.

28. The difference between WRPTR 2 and RDPTR 2 indicates only the number of words which are in the variable sized queue, but the application is able to dequeue the first event in its entirety by removing 10 words.

29. The application indicates processing of an event to the NIC by updating the RDPTR in the NIC's memory:
a. for the static queue, by the number of events processed multiplied by the size of each event;
b. for the variable sized queue, by the number of words consumed (i.e. the same for both cases).

30. The data on the variable length queue may also contain the size (e.g. if it is a UDP/IP packet).
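The fixed/variable dual queue can be sketched as follows. The event encoding (tuples, with "Var" events carrying a word count) is an assumption made for illustration; it is not a format specified by the text.

```python
# Sketch of dequeuing from the dual-queue pair: a fixed-size event
# either stands alone or names the number of words of payload that
# must be removed from the variable-length queue in one go.

def dequeue_event(fixed_q: list, var_q: list):
    event = fixed_q.pop(0)
    if event[0] == "Var":                  # e.g. ("Var", 10)
        size = event[1]
        payload = var_q[:size]             # take the whole variable event
        del var_q[:size]                   # consume 'size' words
        return event, payload
    return event, None                     # simple fixed-size event
```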

Enhanced Aperture Mappings and "Nonce Bits"

In this implementation, additional bits, termed "nonce bits", are provided in order to protect against malfunctioning or malicious hardware or software inadvertently writing to apertures. To illustrate this, the following network mapping will be discussed:

<virtual memory address> → <PCI address> → <host:in-index> → ...

... <network packet> → <PCI address> → <physical memory address> → ...

... <virtual memory address>

When performing the mapping to <host:in-index> the NIC is able to create an outgoing packet which is addressed by <host:in-index>. This will be recognized by the NIC that receives the packet as being a packet intended for processing as an aperture packet, rather than as a packet intended to pass via a port to a corresponding application. Thus the packet is to be presented to the incoming aperture lookup hardware.

It should first be noted that under the scheme described above, the PCI address to which the data is sent encodes both the aperture mapping and an offset within the aperture. This is because the NIC can form the destination address as a function of the address to which the message on the PCI bus was directed. The address received by the NIC over the PCI bus can be considered to be formed of (say) 32 bits which include an aperture definition and a definition of an offset in that aperture. The offset bits are also encoded in the outgoing packet to enable the receiving NIC to write the data relative to the incoming aperture base. In the case of a data write the resulting network packet can be considered to comprise data together with a location definition comprising an offset, an in-index and an indication of the host to which it is addressed. At the receiving NIC at the host this will be considered as instructing writing of the data to the PCI address that corresponds to that aperture, offset by the received offset. In the case of a read request the analogous operation occurs. This feature enables an aperture to be utilized as a circular queue (as described previously) between the applications and avoids the requirement to create a new aperture for each new receive data buffer.

In this implementation the network packet also contains the nonce bits. These are programmed into the aperture mapping during connection setup and are intended to provide additional security, enabling apertures to be reused safely for many connections to different hosts.

The processing of the nonce bits for communications between hosts A and B is as follows:

31. At host A a random number is selected as nonce A.

32. Nonce A is stored in conjunction with an aperture in-index A.

33. A connect message is sent to host B to set up communications in the way generally as described above. In this example the message also includes nonce A. Thus the connect message includes port B, in-index A, nonce A.

34. On receiving the connect message, host B stores in-index A and nonce A in conjunction with outgoing aperture B.

35. Host B selects a random number as nonce B.

36. Nonce B is stored in conjunction with an aperture in-index B.

37. An accept message is sent to host A to accept the set up of communications in the way generally as described above. In this example the message also includes nonce B. Thus the accept message includes port B', in-index B, nonce B.

38. Host A stores in-index B and nonce B in conjunction with outgoing aperture A.

Once the connection is set up to include the nonce bits, all packets sent from A to B via outgoing aperture A will contain nonce B. When a packet is received, NIC B will look up in-index B and compare the received nonce value with that programmed at B. If they differ, the packet is rejected. This is very useful if a malfunctioning application holds onto a stale connection: it may transmit a packet which has a valid [host:in-index] address, but the packet would have old nonce bits, and so would be rejected.
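The receive-side nonce check can be sketched as below. The aperture table contents and field names are illustrative only; the check itself (reject on nonce mismatch or unknown aperture) is as described above.

```python
# Sketch of the nonce check performed by the receiving NIC: look up
# the aperture addressed by in-index and reject the packet if the
# carried nonce does not match the value programmed at connection
# setup (e.g. a stale nonce from an old connection).

incoming_apertures = {7: {"nonce": 0xA, "base": 0x10000}}  # illustrative

def accept_packet(in_index: int, packet_nonce: int) -> bool:
    aperture = incoming_apertures.get(in_index)
    if aperture is None:
        return False                           # unknown aperture
    return aperture["nonce"] == packet_nonce   # stale nonce => reject
```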

Remembering that the user level application has a control block for the out of band queue, this control block can also be used to allow control of the apertures associated with the application, in such a way that connection setup and tear down may be performed entirely at user level.

Note that only some parts of the aperture control block are user programmable; others must be programmed only by the operating system.

39. User programmable bits include: nonce bits, destination host bits.

40. Operating system programmable bits include:
a. base address of incoming aperture (this prevents an application from corrupting memory buffers by mistake or malintent);
b. source host bits (this prevents an application from masquerading as originating from another host).

For an untrusted application, kernel connection management would be performed. This means that out of band data would be processed only in the kernel, and no programmable bits would be made available to the application.

An example of an outgoing aperture table is shown in figure 10. Each row of the table represents an aperture and indicates the attributes of that aperture. It should be noted that:

41. A number of aperture sizes may be supported. These will be grouped such that the base address also gives the size of the aperture. Alternatively, a size field can be included in the aperture table.

42. The type field indicates the Ethernet type to use for the outgoing packet. It also indicates whether the destination address is a 4 byte IPv4 address or a 16 bit cluster address. (IPv6 addresses or other protocol addresses could equally be accommodated.) The type field also distinguishes between event and data packets within the cluster. (An event packet will result in a fixed size event message appearing on the destination's event queue.)

43. The PCI base address is OS programmable only; other fields may be programmed by the application at user level depending on the system's security policy.

44. The source Ethernet address, source IP and cluster address, and possibly other information, are common to all entries and stored in per-NIC memory.

45. In all cases addressing of the outgoing Ethernet packet is either <Ethernet MAC><IP host : IP port> (in the case of a TCP/IP packet) or <Ethernet MAC><CI host : CI in-index : CI nonce : CI aperture offset> (in the case of a CI (computer interface) packet).

(n.b. the offset is derived from the PCI address issued).

46. Each aperture is allocated an initial sequence number. This is incremented by the hardware as packets are processed and is optionally included in cluster address formats.

An example of an incoming aperture table is shown in figure 11. Each row of the table represents an aperture and indicates the attributes of that aperture. The incoming aperture is essentially the reverse of the outgoing aperture. It should be noted that:

47. As well as the size being optionally encoded by having fixed size tables, the EthType can be optionally encoded by grouping separate aperture tables.

48. The sequence number fields are optional and the receiver can set:
a. whether sequence checking should be done;
b. the value of the initial sequence number.

If sequence checking is done, this must also be communicated as part of the connection protocol, which could conveniently be performed in a similar way to the communication of nonce values from one host to another.

49. Similarly to outgoing apertures, some information is per-NIC, e.g. IP address and Ethernet address.

50. For application level robustness it is possible to "narrow" down an aperture by specifying an address and size which specify a range that lies within the default range. This might be done when the application level data structure is of a smaller size, or different alignment, than the default aperture size and fine grained memory protection is required.

51. The map address is either the PCI address which the NIC should emit in order to write to memory for the aperture, or else a local (to the NIC's SRAM) pointer to the descriptor for the event queue.

A PCI write for an outgoing aperture is processed as shown in figure 12. The steps are as follows.
a. A PCI burst is emitted whose address falls within the range allocated to the NIC.
b. The NIC's address decoder captures the burst and determines that the address is within the range of the apertures. (It could otherwise be a local control write.)
c. Depending on the aperture size (which is coarsely determined from the address), the address is split into <base:offset>. E.g. for a 1k aperture, the bottom 10 bits would be the offset. The base is fed into the aperture table cache to match the required packet header information.
d. Depending on the Ethernet packet type field, either an IP/Ethernet or CI/Ethernet packet header is formed.
e. The CI packet would, for instance, include the following fields:

Data (containing the data payload of the PCI burst)

Checksum (calculated by hardware over the contents of the header)

Offset (by the address decoder)

Sequence number

Nonce

Aperture index

Cl Host cluster address

6. If a number of PCI bursts arrive for a particular host, then they may be packed into a single Ethernet frame, with compression techniques applied to remove redundant header information.

7. In the present system a system-specific CRC or checksum is used to provide end-to-end protection and is appended to the data portion of the packet. Although the Ethernet packet also contains a CRC, it may be removed and recalculated on any hop (e.g. at a switch) and so does not provide protection against internal (e.g. switch-specific) corruptions.

8. If the sequence number is applied, then it is incremented and written back to the aperture table entry.
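The <base:offset> split of step c can be sketched as follows for a 1k aperture. This ignores details such as subtracting the NIC's allocated address range and variable aperture sizes, which a real decoder would handle; it simply shows the bit split.

```python
# Sketch of the address split for a 1 KB aperture: the bottom 10 bits
# of the captured PCI address are the offset within the aperture, and
# the remaining upper bits select the aperture table entry.

APERTURE_BITS = 10                          # 1 KB aperture => 10 offset bits

def decode_pci_address(pci_addr: int):
    offset = pci_addr & ((1 << APERTURE_BITS) - 1)
    base = pci_addr >> APERTURE_BITS        # feeds the aperture table cache
    return base, offset
```

For example, address 0x2403 splits into aperture base 9 and offset 3.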

For incoming packets, the reverse operation takes place. The incoming aperture is looked up and checked to be:
f. valid;
g. sequence number expected matches that of the packet;
h. nonce matches (or port);
i. expected Ethernet source address;
j. expected IP or CI source addresses (which may be specified as a netmask to allow a range of source addresses to be matched).
Any one or more of these checks may be implemented or omitted, depending on the level of security required.

This lookup returns a field of (base + extent) for the aperture. The offset is checked against the extent to ensure that an out of aperture access is not made, and a PCI write is formed and emitted on the receiver's PCI bus with the format:

.... DATA2 DATA1 base + offset

If the PCI bus is stalled (say on DATAN), a new PCI transaction will be emitted:

.... DATAN+1 DATAN base + offset + N

Similarly if consecutive such data packets arrive they may be coalesced into larger PCI bursts simply by removing the redundant intermediate headers.

Protocol Scheme

One example of a protocol scheme that can be used in the above system will now be described.

In the present system, data is written into an aperture in bursts, each of which consists of an address offset value followed by one or more data words. An Ethernet frame can contain more than one burst. In the protocol described herein all the bursts in a single frame are applied to the same memory aperture.

Each burst contains a start address and then a sequence of 32-bit data words with byte-enables.

Figure 14 shows the structure of an Ethernet frame (which may also be termed a packet). The frame has a 14-byte header 205, comprising the destination MAC address 200, the source MAC address 201, and a 16-bit type code or 'Ethertype' field 202 that defines the way that the frame payload is to be used. At the end of the frame is a checksum 203. The user data 206 carried in the frame is interpreted based on the type code contained in the header. To implement the present protocol for Ethernet packets a type code distinct from those indicative of other protocols would be used. Fields in the header are filled according to network byte order (i.e. big-endian), for consistency with other networking protocols.

Ethernet specifies a minimum packet length of 64 bytes. In the present protocol packets shorter than this are padded to the required length with bytes containing all- zeros. (Typically such padding is automatically added by Ethernet MAC chips.) The present protocol allows all-zero padding at the end of any packet. Bursts within a packet can also be padded with zeros. Other data forms, such as escape words, could alternatively be used as padding.

The user data section 206 of a packet according to the present protocol comprises a 6-byte preamble 207 followed by one or more bursts. The preamble 207 is made up as follows:

52. Protocol Version number (208) (4 bits)

53. Source Number (209) (12 bits) - this indicates the identity of the source of the subsequent data

54. Aperture Number (210) (12 bits) - this identifies the aperture in the destination unit to which the subsequent data is addressed.

55. Nonce (211) (4 bits)

56. Sequence Number (212) (16 bits) - a separate sequence is kept for each aperture. The fields could be changed in size, and this could be indicated by the allocation of a different version number to each defined format of the fields.
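The 6-byte preamble (4 + 12 + 12 + 4 + 16 = 48 bits) can be packed as below. The exact ordering of the fields within the 48 bits is an assumption; the text specifies only the field widths and network byte order.

```python
# Sketch of packing the preamble fields (version:4, source:12,
# aperture:12, nonce:4, sequence:16 bits) into 6 bytes, big-endian,
# assuming the fields are laid out in the order listed above.

def pack_preamble(version: int, source: int, aperture: int,
                  nonce: int, seq: int) -> bytes:
    bits = (version << 44) | (source << 32) | (aperture << 20) \
           | (nonce << 16) | seq
    return bits.to_bytes(6, "big")          # network byte order
```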

Figure 15 illustrates the structure of a burst in a frame. The burst is made up as follows:

57. Address word (220) (32 bits) (including 8 flag bits, one of which is set to indicate the start of the burst (SOB))

58. Data words (221) (can include embedded Escape Words and Checksum Words if required) - these contain the actual user data to be conveyed

59. Escape word (222) with EOB set (see below)

60. Last data word (223)

61. Checkword (224) - made up of two 16-bit CRCs, both calculated over the burst

Bursts are not of fixed length. To allow the receiver to identify the end of a burst, the end of each burst is flagged by the use of an escape word. The escape word is identified by having its bytes 1 to 3 equal to a defined constant value, in this example hex C1E5CA. Byte 0 of the escape word contains flag bits, which apply to the next 32-bit data word. The flag bits are defined as follows:

62. bit 0 - BV0 - byte 0 of the next word is valid

63. bit 1 - BV1 - byte 1 of the next word is valid

64. bit 2 - BV2 - byte 2 of the next word is valid

65. bit 3 - BV3 - byte 3 of the next word is valid

66. bit 4 - SOB - the next word is Start-Of-Burst

67. bit 5 - EOB - the next word is End-Of-Burst

68. bit 6 - CKS - the next-but-one word is a checkword

69. bit 7 - reserved, set to zero

It is possible that a word may appear in the user data that has its bytes 1 to 3 equal to the defined constant value. To indicate that such a word is valid, the unit that generates the frame must insert an escape word before such a word. Bits 0 to 3 of that escape word are set to indicate that the subsequent word is valid.

An escape word may also be inserted into a burst to indicate that the following data word contains one or more invalid bytes. To achieve this the appropriate ones of bits 0 to 3 of that escape word are not set, so as to indicate that corresponding bytes of the subsequent word are invalid.
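The transmit-side escaping rule can be sketched as below. The placement of the flag bits in the low byte of the word is an assumption (the text says byte 0 carries the flags but does not fix the in-word byte order), and the flag values follow the bit definitions listed above.

```python
# Sketch of escape-word insertion on transmit: any data word whose
# bytes 1 to 3 equal the escape constant 0xC1E5CA must be preceded by
# an escape word whose byte-valid flags BV0-BV3 are all set, marking
# the following word as ordinary valid data.

ESCAPE_CONST = 0xC1E5CA      # occupies bytes 1 to 3 of an escape word
ALL_BYTES_VALID = 0x0F       # BV0 | BV1 | BV2 | BV3 in flag byte 0

def escape_stream(words):
    out = []
    for w in words:
        if (w >> 8) & 0xFFFFFF == ESCAPE_CONST:
            # insert an escape word; the real data word follows unchanged
            out.append((ESCAPE_CONST << 8) | ALL_BYTES_VALID)
        out.append(w)
    return out
```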

Escape words followed by "checkpoint" checkwords (see below) may be inserted into a burst to reduce the amount of data that has to be buffered at a receiving NIC before it can be safely shipped to memory. This will be described in more detail below.

Bursts according to the present protocol do not contain any explicit length count field. The end of the burst is indicated by an escape word. If EOB is flagged then CKS must also be flagged. The checksum word at the end of each burst is mandatory. Thus the shortest possible burst is as illustrated in figure 16. This comprises three words: an escape word 230 with EOB and CKS set, a single data word 231 and a checksum word 232. In this example, the escape word takes the place of the address word.

Each burst begins with an address word which in normal usage indicates the offset into the memory aperture of the receiver at which the data in the burst is to be written. The address value field occupies bytes 1 to 3 of the address word (24 bits). Byte 0 of the address word contains flag bits having the same format and meaning as those of the escape word. These flag bits apply to the first data word of the burst. The SOB flag bit is set in the first word of a burst, guaranteeing that the beginning of a burst can be distinguished from padding words, which have all 32 bits set to zero.

Each burst ends with a checkword. Checkwords may also be added at intervals within a burst. In the present protocol the checkword comprises two 16-bit CRC fields, together forming 32 bits of check data. The methods by which the two CRCs are calculated are selected so that the use of two blocks of check data provides additional error detection capability over either of the 16-bit blocks of check data individually, but without requiring such intensive processing as would be needed to calculate a single 32-bit block of check data by similar algorithms. Other schemes such as a 32-bit CRC could also be used (with a different version of the protocol).

Both of the 16-bit CRCs are formed by cyclic redundancy check (CRC) algorithms. Both of the fields are computed over the same data, beginning with the ethertype field of the Ethernet frame header and working progressively through the packet. For the purposes of computing the CRC fields, the checkwords themselves are assumed to contain the value all-zero.

The methods for forming the CRCs are as follows:

1. The first CRC field uses the coefficients (the generator polynomial) which are the standard set known as 'X25'. The CRC value is seeded with the 16-bit value 'all-one' at the beginning of each packet. This CRC occupies bytes 0 and 1 of the checkword.

2. The second CRC field uses the coefficients which are the standard set known as 'USB CRC-16'. As with the other CRC field, the CRC value is seeded with the 16-bit value 'all-one' at the beginning of each packet. This CRC occupies bytes 2 and 3 of the checkword.

Other methods could be used to generate one or both of the CRCs, and either or both of the CRCs could be replaced by check data of a form other than a CRC.
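The two CRCs can be sketched with the standard catalogued parameter sets for CRC-16/X-25 and CRC-16/USB (reflected polynomials 0x8408 and 0xA001, all-one seed). The final inversion (xor-out of 0xFFFF) follows those catalogue definitions; the text above specifies only the seeding, so that detail is an assumption.

```python
# Sketch of the dual 16-bit CRC checkword. Both CRCs are computed over
# the same data with a reflected (LSB-first) bitwise algorithm.

def crc16(data: bytes, poly: int, init: int = 0xFFFF) -> int:
    """Reflected CRC-16: poly=0x8408 gives X.25, poly=0xA001 gives USB."""
    crc = init
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ poly if crc & 1 else crc >> 1
    return crc ^ 0xFFFF        # final inversion per the catalogue forms

def checkword(data: bytes) -> bytes:
    x25 = crc16(data, 0x8408)  # assumed to occupy bytes 0 and 1
    usb = crc16(data, 0xA001)  # assumed to occupy bytes 2 and 3
    return x25.to_bytes(2, "big") + usb.to_bytes(2, "big")
```

The byte order of the two fields within the checkword is likewise an assumption for illustration.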

This method of forming the checkwords has a number of advantages. First, Ethernet frames are protected in transit by a 32-bit CRC (the Ethernet frame checksum or FCS), which is typically generated and checked by the MAC chips that drive each link. However, there are forms of data corruption that the FCS cannot protect against. Switches can strip and recalculate the FCS; if this happens then the packet payload is not protected inside the switch itself. Switches (and routers) can mangle packets in ways which (often caused by program failures) are quite different to the errors (of a statistical nature) that would be introduced by electrical interference on a link. Also, routers are bound to recalculate the FCS if they change a packet's IP header, for example by reducing the hop count. Second, by not relying on the Ethernet FCS the present protocol opens up the possibility of cutting latency by using a MAC device which does not buffer a complete Ethernet packet on receive: for example by using cut-through forwarding techniques as described in our co-pending patent application entitled "Managing Data Transmission". Third, it adopts a valuable compromise between the relatively intensive processing that would be needed to generate a 32-bit checksum, and the lower guarantee of data integrity that would be given by a 16-bit checksum.

It is possible that an escape word could be corrupted during transmission, causing it to be treated as a data word at the receiver. This could result in a 'runaway packet', which could possibly have the effect of the destination memory being overwritten with junk data. To prevent this, the data from a received burst is not written to memory until a valid checksum word covering that data has been successfully received. In longer bursts, the latency and amount of buffering that is needed can be kept in check by including "checkpoint" checkwords at pre-set intervals. Checkpoint checkwords are formed in the same way as final checkwords, computing the CRCs for the checkpoint checkwords over all the data in the packet beginning with the ethertype field of the Ethernet frame header and working progressively through the packet up to the word of the checkpoint checkword itself. For the purposes of computing the CRC fields, the checkpoint checkword that is being computed is assumed to contain the value all-zero.

At the receiver the checkwords are verified by using the same algorithms as at the transmitter on the received data. If the verification is successful (i.e. if the CRCs calculated at the receiver match those received in the checkwords) then the data is processed appropriately at the receiver. If the verification is unsuccessful then steps may be taken to have the data retransmitted.
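The computation and verification of a checkword as described above can be sketched as follows. This is an illustrative model only: a 32-bit CRC is assumed for concreteness (the specification leaves the checksum width open), and the function and field names are not part of the specification.

```python
import zlib

CHECKWORD_LEN = 4  # assumed: a 32-bit checkword


def compute_checkword(packet_from_ethertype: bytes, checkword_offset: int) -> int:
    """Compute a checkpoint or final checkword over all data from the
    ethertype field up to the checkword position, with the checkword
    field itself treated as all-zero (as described above)."""
    covered = packet_from_ethertype[:checkword_offset] + b"\x00" * CHECKWORD_LEN
    return zlib.crc32(covered) & 0xFFFFFFFF


def verify_checkword(packet_from_ethertype: bytes, checkword_offset: int) -> bool:
    """Recompute the CRC at the receiver and compare with the received value."""
    received = int.from_bytes(
        packet_from_ethertype[checkword_offset:checkword_offset + CHECKWORD_LEN],
        "big")
    return compute_checkword(packet_from_ethertype, checkword_offset) == received
```

A transmitter would compute the checkword over the zero-filled packet and then write it into the checkword field; the receiver applies the same algorithm to the received bytes.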

Where packets contain more than one checkword, it is possible that a single packet may include both good data (i.e. data for which the CRCs agree at the receiver) and bad data (i.e. data for which the CRCs do not agree at the receiver). Data may also be determined to be bad at the receiver if the information in the packet header is not internally consistent, or does not agree with the current state of the receiver, for instance if:

- The ethertype of the packet is not that which is expected by the receiver

- The 4-bit version number of the packet is invalid

- The aperture number specified in the packet is undefined at the receiver

- The source number does not match the one that is recorded at the receiver as being valid for the specified aperture

- The sequence number is not credible according to a checking algorithm implemented by the receiver. For instance, the algorithm may treat as invalid packets whose sequence number precedes that of a previously received packet, and/or packets that are received out of sequence (including repeated packets).

- The Ethernet source address and/or the destination MAC address are not as expected by the receiver.
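The header consistency checks listed above can be sketched as a single predicate. The dictionary field names and receiver-state structure below are illustrative assumptions, not taken from the specification; the sequence-number rule shown (strictly consecutive) is just one possible checking algorithm.

```python
def header_checks_pass(hdr: dict, state: dict) -> bool:
    """Return True only if the packet header is internally consistent and
    agrees with the receiver's current state, per the checks listed above."""
    ap = hdr["aperture"]
    if hdr["ethertype"] != state["expected_ethertype"]:
        return False                       # unexpected ethertype
    if hdr["version"] not in state["valid_versions"]:
        return False                       # invalid 4-bit version number
    if ap not in state["apertures"]:
        return False                       # aperture undefined at receiver
    ap_state = state["apertures"][ap]
    if hdr["source"] != ap_state["valid_source"]:
        return False                       # source not valid for this aperture
    if hdr["seq"] != ap_state["last_seq"] + 1:
        return False                       # out-of-sequence or repeated packet
    if hdr["dst_mac"] != state["expected_mac"]:
        return False                       # unexpected destination MAC
    return True
```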

For additional protection, the sequence number could be incremented by a non-obvious algorithm, or encrypted. This would make it very difficult to perform "man in the middle" attacks.

Some classes of error are dealt with by passing the packet to a kernel software stack. Others cause the packet to be discarded and an event token issued from the receiver of the packet to the transmitter to signal that the error has occurred. In response to the error token the transmitter can take action to rectify the error, for example by re-sending the erroneous packet to the receiver.

Errors that indicate that the traffic on an aperture is damaged - for instance in the case of a dropped or repeated sequence number - cause reception on the relevant aperture to be stopped and an event token to be issued to the transmitter.

Event tokens can be generated by a transmitting NIC and sent to the receiver to indicate an event. At the receiver the event token is enqueued for the attention of the process that 'owns' the aperture to which the event token applies. Queues of event tokens are referred to as "event queues". Each event token consists of one 32-bit word made up as follows:

- bits 31-16 - bits 15-0 of the aperture number to which the event token applies

- bits 15-8 - reserved

- bits 7-4 - bits 3-0 of a pointer index in the specified aperture number

- bits 3-0 - bits 3-0 of an indicator of the type of the event

A number of types of event can be defined, including pointer updates and error reports.

The pointer index field of the event token is only valid if the event token is of type pointer update. In this case it identifies which of a pre-defined set of pointer locations was written to. A typical implementation might be to define four pointer locations at byte offsets 0, 64, 128 and 192 from the base of each aperture, representing them with pointer index values of 0, 1, 2 and 3.

Where an event token reports an error that cannot be resolved to a valid aperture, the aperture number field is not used and the token is sent to a central logging queue at the receiver.
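The 32-bit event token layout given above (aperture number in bits 31-16, reserved bits 15-8, pointer index in bits 7-4, event type in bits 3-0) can be packed and unpacked as follows. The function names are illustrative.

```python
def pack_event_token(aperture: int, ptr_index: int, ev_type: int) -> int:
    """Pack an event token: bits 31-16 carry bits 15-0 of the aperture
    number, bits 7-4 the pointer index, bits 3-0 the event type.
    Bits 15-8 are reserved and left zero."""
    return ((aperture & 0xFFFF) << 16) | ((ptr_index & 0xF) << 4) | (ev_type & 0xF)


def unpack_event_token(token: int) -> dict:
    """Recover the fields of a 32-bit event token."""
    return {
        "aperture": (token >> 16) & 0xFFFF,
        "ptr_index": (token >> 4) & 0xF,
        "ev_type": token & 0xF,
    }
```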

As explained above, at the beginning of a burst is an indication of the memory address at the receiver at which the data in the burst is to be written. The data is intended to be written to that and subsequent addresses. There will be a checksum at the end of the burst, and once that checksum has been verified the data may safely be written. If that were the only checksum in the burst then, in order to ensure safe operation, the whole burst would have to be buffered until that checksum had been verified: otherwise the address might have been received incorrectly, and if the data were written at the incorrect address it would overwrite the data already there. However, an intermediate checksum in the burst can reduce the amount of buffering that is needed. Once a checksum covering the address has been verified, it is known to an acceptable level of confidence that the address has been received correctly, and none of the subsequent data in the burst needs to be buffered: it can be written straight to the appropriate place in the memory. If a subsequent checksum indicates that the data has been received incorrectly then the data already stored to memory can be marked as invalid, and the data can be re-sent.

One method for performing this will now be described in more detail with reference to figure 17.

Figure 17 illustrates the flow of packets 255 from a transmitter 250 over a data link 251 to a receiver 252. At the receiver the packets are interpreted by an interface device 253 and received data can be written to a memory 254. The memory 254 may in practice be an aperture. Each packet is formed as described above and includes one or more bursts each including a field specifying the address in memory 254 at which the data of the burst is to be written, the data of the burst itself, and a terminating checksum. A burst may also include one or more intermediate checksums between the address specifier and the terminating checksum.

When a burst is received the specified address (A) is determined. The received data to be written at that address is then buffered in a local buffer 256 in the interface device 253 until a checksum in the packet is reached. If the checksum is verified by the interface device the address is assumed to have been correctly received, and so the interface device sets a write pointer W operating on memory 254 to the specified address A. The data is written to the write pointer, and the write pointer is incremented as the data is written so that it points to the location in the memory at which the next received data is to be written. The interface device also maintains a checked pointer C operating on memory 254. The checked pointer is initially set to address A. When a checksum in the packet is reached and verified the checked pointer C is updated to the current position of the write pointer W. If the checksum is not verified the checked pointer C is not altered.

As described above, an application running at the receiver is associated with memory 254. When the interface device verifies a checksum it transmits a "P" message to the application associated with the memory to which the data covered by the checksum was written. The P message indicates that data has been successfully written and specifies the addresses between which the successfully written data lies (i.e. the value of the C pointer before and after verification). The P message indicates to the application that the data is now ready for use. If a checksum is not verified then the interface device transmits a "B" message to the application. The B message indicates that data has not been successfully written and specifies the addresses between which the incorrectly written data lies (i.e. the value of the C pointer and the value of the W pointer). The application can then cause the interface device 253 to request the transmitter 250 to retransmit the data intended to be written between those pointer values.
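The write-pointer/checked-pointer scheme of figure 17 can be sketched as follows. This is a simplified model under stated assumptions: memory is a bytearray, each checkword's verification result is supplied as a boolean, the initial buffering of data before the address-covering checksum is elided, and the class and message names are illustrative.

```python
class BurstReceiver:
    """Model of the W/C pointer scheme: W advances as data is written,
    C advances only past data covered by a verified checksum, and P/B
    messages report the address ranges of good and bad data."""

    def __init__(self, memory: bytearray):
        self.memory = memory
        self.messages = []          # ("P", start, end) or ("B", start, end)

    def receive_burst(self, address: int, segments) -> None:
        """segments: list of (data, checksum_ok) pairs, one pair per
        checkpoint/final checkword in the burst."""
        w = c = address             # write pointer W and checked pointer C
        for data, checksum_ok in segments:
            # after the address has been validated, data is written
            # straight to memory rather than buffered
            self.memory[w:w + len(data)] = data
            w += len(data)
            if checksum_ok:
                self.messages.append(("P", c, w))  # data between C and W good
                c = w                              # C catches up with W
            else:
                # data between C and W is untrusted; it can be marked
                # invalid and retransmission requested
                self.messages.append(("B", c, w))
                break
```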

When bursts contain intermediate checksums this method allows the amount of data that has to be buffered before writing to be reduced. It also allows cut-through forwarding to be used on the final hop of data link 251 to receiver 252 without the need to buffer the whole packet in order to perform error correction.

Some applications do not require this level of error recovery and operate correctly so long as the NIC does not deliver any corrupt data, and informs the application of either data corruptions or lost data. In the absence of other information, the application must perform retransmission through negotiation with its communicating peer application.

Also, for other applications, the pointer updates are transmitted over the network as part of the data stream. The error recovery described above can take place so long as the pointer updates are all logged via the event queue.

Figure 18 shows buffers in a communication interface according to an embodiment of the present invention. In this example, the buffers are FIFOs. Three priority levels are shown: P0, P1 and P2, with P0 representing the highest priority. For each level of priority there is a pair of FIFOs, F1 and F2. Bursts of data received at the input port of the communication interface are sorted in dependence on their priority and their destination address, and more specifically in dependence on their destination aperture. Thus, a burst having a high priority would be assigned to one of the buffers F1 or F2 for storing P0 data. F1 might hold a series of bursts all intended for one particular aperture in the destination memory, and F2 might hold bursts intended for a different aperture; a received burst would be assigned to one of those two buffers according to its destination aperture. Blocks 180 represent bursts stored in the buffers.

In a particularly preferred embodiment, the number of buffers available in a communication interface for storing data of a given priority will be equal to the maximum of the number of CPUs within the host that are expected to be connected to the communication interface and the number of independent DMA units on the communication interface. For example, the buffers shown in Figure 18 could be used in a network in which two data processors are connected to one another. This facilitates packet formation because, over a short timescale, most data bursts from a given CPU or a given DMA unit will contain data intended for a single destination aperture. When creating a packet, bursts of data can be read sequentially from one FIFO and formed into a packet, and it will not be necessary to switch between different FIFOs to fill a packet with data intended for one destination aperture. This can allow improved efficiency.

In one embodiment, a list, table or other data store is provided in the communication interface for recording the contents of each buffer. This store can then be checked each time a burst is received at an input port of the interface, and compared with the destination aperture and/or priority of the incoming burst, so that an appropriate buffer can be selected. If a burst is received and nothing in the data store indicates that a burst with a matching destination aperture is already stored in one of the buffers, a buffer could be selected on other criteria: for example, the burst could be assigned to the buffer with the most space remaining, or to the buffer of the highest priority. By this means bursts destined for the same apertures will tend to be stored in the same buffers.
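The buffer-selection rule just described can be sketched as follows. The data structures (a dictionary of per-priority FIFO lists, each FIFO tagged with the aperture it currently holds) are illustrative assumptions; the fall-back criterion shown is the "most space remaining" option mentioned above.

```python
def select_buffer(buffers: dict, priority: int, aperture: int) -> dict:
    """Prefer a buffer of the given priority already holding bursts for
    the same destination aperture; otherwise fall back to the buffer of
    that priority with the most space remaining."""
    candidates = buffers[priority]          # e.g. the F1/F2 pair of figure 18
    for fifo in candidates:
        if fifo["aperture"] == aperture:    # match recorded in the data store
            return fifo
    # no match: pick the emptiest buffer and record its new aperture
    fifo = max(candidates, key=lambda f: f["capacity"] - len(f["bursts"]))
    fifo["aperture"] = aperture
    return fifo
```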

When reading data out of the buffers in order to form packets for transmitting over the network, the buffers storing the highest priority data are preferably accessed more often than those of lower priority. This could be achieved in a number of ways. For example, when forming a packet for a given aperture, the data for which is stored in the F1 buffers of Figure 18, a packet creation unit could first access the P0 F1 buffer and retrieve a number of bursts intended for that aperture. It could then access the P1 F1 buffer and retrieve a smaller number of bursts for that aperture. It could then access the P2 F1 buffer and retrieve a smaller number still of bursts for that aperture. This cycle could be repeated, such that more data is read from the P0 buffer than from the other two when creating a packet. Alternatively, the packet creation unit could retrieve a fixed number of bursts from a buffer each time it accesses a buffer, but could access the P0 buffer more frequently than the P1 buffer, and so on. This would also have the effect of prioritising the higher priority bursts.
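The first variant described above (taking progressively fewer bursts from each lower priority level per cycle) amounts to a weighted drain of the FIFOs. The weights 4/2/1 below are illustrative, not from the specification.

```python
def drain_for_packet(fifos_by_priority: dict, weights=(4, 2, 1)) -> list:
    """One cycle of priority-weighted reading: take up to weights[p]
    bursts from the FIFO at priority p (P0 first), so more data is read
    from higher-priority buffers per cycle."""
    bursts = []
    for priority, weight in enumerate(weights):
        fifo = fifos_by_priority[priority]
        for _ in range(weight):
            if fifo:                       # FIFO may run empty mid-cycle
                bursts.append(fifo.pop(0))
    return bursts
```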

Figure 19 shows a communication system suitable for implementing embodiments of the present invention. The system comprises a first data processing apparatus 201 and a second data processing apparatus 202. A communication interface 200 connects the first data processing apparatus to the second data processing apparatus via a data link 203. The interface 200 comprises an input port 204 for receiving data from the first data processing apparatus. A control unit 205 is arranged to allocate one of a plurality of buffers 206 to each incoming group of data. A data store 207, such as a memory chip, holds data indicating the content of each of the buffers, and is controlled by means of a data manager 208. A packet creation unit 209 is provided for forming packets from the data in the buffers 206 for transmission over the data link 203.

Figure 20 illustrates the efficiency advantages that can be obtained by use of certain embodiments. Figure 20A illustrates a typical prior art Ethernet data packet. The Ethernet address is included in the header, together with the destination address of the packet and a sequence number. Additional data may also be inserted in the packet header.

Following the header, an offset of a burst of data is inserted to indicate the location in the destination aperture in which the subsequent content data is to be stored. The content data itself is then inserted, followed by a checksum, CRC. The sequence of offset-content data-CRC continues for further bursts.

In contrast, figure 20B shows a packet that can be achieved by using improved methods described below by means of an example. The packet header in figure 20B is the same as that in figure 20A. Following the header, an offset indicating a location in the destination aperture at which content data is to be stored is inserted into the packet. This offset is then followed by a series of bursts of data with no further offsets. The content data can then be followed by a checksum, CRC.

The data arriving at an input port of the communication interface will have been formed into bursts by an IO bus operating in accordance with a protocol such as PCI or PCI-Express. This protocol will transfer data using separate address and data phases, such that address bits (including an aperture and an offset) will be transferred to the input port of the communication interface in one phase, and data bits will be transferred in a different phase. The communication interface can thus distinguish between address bits and data bits.

To formulate packets in a communication interface, the packet creation unit receives a first group of data, for example burstA, from the input port of the communication interface. The group of data is intended for a particular destination aperture. The packet creation unit forms a packet header including the destination address; in this example, the destination aperture is included in the header. More specifically, burstA is intended for an offset, offsetA, within that aperture. Suppose that the first group of data, burstA, includes 64 bytes of content data. The offset, offsetA, is inserted into the packet after the packet header (as shown in figure 20B), followed by the 64 bytes of content data.

The packet creation unit then receives a second group of data, burstB, which is intended for the same destination aperture. The packet creation unit then determines from burstB the offset for which the content data in that burst is intended. If that offset, offsetB, indicates a location corresponding to [offsetA plus 64 bytes], that is, following sequentially on from the location in which the last byte of content data of burstA is to be stored, then the packet creation unit inserts the content data of burstB directly into the packet without inserting offsetB. It can be seen from figure 20B that omitting superfluous offsets from a packet can allow a considerably higher density of content data per packet.

On the other hand, if the offset of the second group of data, offsetB, does not identify a location which follows sequentially from the location in which the byte of data most recently inserted into the packet is to be stored, then offsetB can be inserted into the packet as in previously known implementations, as shown in figure 20A.
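The offset-elision scheme of figure 20B can be sketched as follows. The 4-byte big-endian offset encoding is an illustrative assumption (the specification does not fix the field format), and the terminating checksum is omitted for brevity.

```python
def build_packet(header: bytes, bursts) -> bytes:
    """Assemble packet content per figure 20B: an explicit offset is
    inserted only when a burst does not follow on contiguously from the
    previous burst's data. bursts is a list of (offset, data) pairs."""
    packet = bytearray(header)
    expected = None                 # offset the next burst would need to
                                    # have to be contiguous
    for offset, data in bursts:
        if offset != expected:
            packet += offset.to_bytes(4, "big")   # explicit offset needed
        packet += data              # contiguous data carries no offset
        expected = offset + len(data)
    # a terminating checksum (CRC) would follow here
    return bytes(packet)
```

With contiguous bursts only the first offset is emitted, giving the higher content-data density described above.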

The applicant hereby discloses in isolation each individual feature described herein and any combination of two or more such features, to the extent that such features or combinations are capable of being carried out based on the present specification as a whole in the light of the common general knowledge of a person skilled in the art, irrespective of whether such features or combinations of features solve any problems disclosed herein, and without limitation to the scope of the claims. The applicant indicates that aspects of the present invention may consist of any such individual feature or combination of features. In view of the foregoing description it will be evident to a person skilled in the art that various modifications may be made within the scope of the invention.
