Title:
TCP/IP OFFLOAD SYSTEM
Document Type and Number:
WIPO Patent Application WO/2017/046582
Kind Code:
A1
Abstract:
A TCP/IP offload system is disclosed. The system comprises a software TCP/IP module (12) arranged to be executed on a host processor (10), and a hardware TCP/IP device (20). The hardware TCP/IP device (20) is arranged to pass responsibility for a TCP connection to the software TCP/IP module (12) when an exception event occurs. This can allow the amount of hardware required to implement the TCP/IP device to be minimized, and/or allow the TCP/IP device to be optimized for low latency. A hardware parser for parsing documents encoded in a markup language is also disclosed.

Inventors:
SHAH SANJAY (GB)
HIGGS CHRISTOPHER (GB)
PATEL MUKESH (GB)
Application Number:
PCT/GB2016/052835
Publication Date:
March 23, 2017
Filing Date:
September 14, 2016
Assignee:
NANOSPEED TECH LTD (GB)
International Classes:
H04L29/06; H04L29/08
Foreign References:
US20030158906A1 2003-08-21
US8549345B1 2013-10-01
US7389462B1 2008-06-17
US20040111523A1 2004-06-10
US20070233886A1 2007-10-04
Other References:
HYONG-YOUB KIM ET AL: "TCP offload through connection handoff", OPERATING SYSTEMS REVIEW, ACM, NEW YORK, NY, US, vol. 40, no. 4, 18 April 2006 (2006-04-18), pages 279 - 290, XP058201506, ISSN: 0163-5980, DOI: 10.1145/1218063.1217962
Attorney, Agent or Firm:
CLEVELAND (GB)
Claims:
CLAIMS

1. A TCP/IP offload system comprising:

a software TCP/IP module arranged to be executed on a host processor; and

a hardware TCP/IP device,

wherein the hardware TCP/IP device is arranged to pass responsibility for a TCP connection to the software TCP/IP module when an exception event occurs.

2. A system according to claim 1, wherein the hardware TCP/IP device is arranged to check the validity of data packets transferred over the TCP connection and to indicate an exception event if it is determined that a data packet is not valid.

3. A system according to claim 2, wherein the hardware TCP/IP module is arranged to check the validity of a data packet by checking at least one of a checksum and a sequence number.

4. A system according to claim 3, wherein the checksum and/or sequence number are contained in a packet header.

5. A system according to any of the preceding claims, wherein the software TCP/IP module is arranged to establish the TCP connection.

6. A system according to claim 5 or 6, wherein the hardware TCP/IP device is arranged to transfer data over the TCP connection once the connection has been established.

7. A system according to claim 5 or 6, wherein the hardware TCP/IP device is arranged to detect that the TCP connection is established, and to assume responsibility for the TCP connection when it is detected that the connection is established.

8. A system according to any of the preceding claims, wherein the hardware TCP/IP device is arranged to pass responsibility for the connection to the software TCP/IP module when the connection is to be terminated.

9. A system according to any of the preceding claims, wherein the hardware TCP/IP device is arranged to pass data packets to and/or from an application program executing on the host processor when the hardware TCP/IP device has responsibility for the connection.

10. A system according to any of the preceding claims, wherein the software TCP/IP module is arranged to process data packets transferred over the TCP connection in parallel with the hardware TCP/IP device.

11. A system according to claim 10, wherein data packets processed by the software TCP/IP module are passed to and/or from an application program when responsibility for the TCP connection is passed to the software TCP/IP module.

12. A system according to claim 10 or 11, wherein packets processed by the software TCP/IP module are discarded when the hardware TCP/IP device has responsibility for the TCP connection.

13. A system according to any of claims 10 to 12, wherein the software TCP/IP module is arranged to check the validity of the data packets.

14. A system according to any of the preceding claims, wherein TCP sequence numbers are used to define a transition point to transfer responsibility for the TCP connection between the software TCP/IP module and the hardware TCP/IP device.

15. A system according to any of the preceding claims, wherein the system is arranged to support multiple simultaneous TCP connections.

16. A system according to claim 15, wherein the hardware TCP/IP device further comprises means for determining which TCP connection a data packet relates to.

17. A system according to any of the preceding claims, wherein the hardware TCP/IP device is implemented on a field programmable gate array.

18. A system according to any of the preceding claims, wherein the system comprises an operating system executing on the host processor, and the software TCP/IP module is part of the operating system.

19. A computer system comprising a host processor and a TCP/IP offload system according to any of the preceding claims.

20. A system according to any of the preceding claims, further comprising a hardware parser for parsing documents encoded in a markup language.

21. A system according to claim 20, wherein the hardware parser is operable with a plurality of different markup languages.

22. A system for parsing documents encoded in a markup language, the system comprising a hardware parser, wherein the hardware parser is operable with a plurality of different markup languages.

23. A system according to claim 21 or 22, the hardware parser comprising a compiler arranged to compile incoming configuration files to microcode.

24. A system according to claim 23, wherein the configuration files comprise at least one of FIX/FAST templates or XML schema.

25. A system according to claim 23 or 24, wherein the microcode is stored in an instruction memory.

26. A system according to any of claims 23 to 25, wherein the hardware parser is arranged to execute the microcode to extract desired fields from incoming documents.

27. A system according to any of claims 21 to 26, wherein the hardware parser comprises a plurality of hardware decoders connected in parallel.

28. A system according to claim 27, wherein the hardware parser comprises a selector for selecting the output of a decoder.

29. A system according to claim 28 when dependent on claim 23, wherein the selector is controllable by the microcode.

30. A method of offloading TCP/IP processing from a host processor, the method comprising managing a TCP connection on a hardware TCP/IP device, and passing responsibility for the TCP connection to a software TCP/IP module executing on the host processor when an exception event occurs.

31. A method according to claim 30, further comprising using the software TCP/IP module to establish the TCP connection, and passing responsibility for the TCP connection to the hardware TCP/IP device once the connection has been established.

32. A method of using hardware to parse documents in a plurality of different markup languages, the method comprising compiling incoming configuration files to microcode, and executing the microcode to extract desired fields from incoming documents.

Description:
TCP/IP OFFLOAD SYSTEM

The present invention relates to a TCP/IP offload system, and in particular a TCP/IP offload system for offloading TCP/IP processing from a host computer system to dedicated hardware.

TCP (Transmission Control Protocol) and IP (Internet Protocol) are part of the Internet protocol suite which is used to deliver data over a network. TCP is a connection orientated protocol which provides a guaranteed delivery mechanism, while IP is used to deliver data packets from the source to the destination based on the IP addresses in the packet headers. The TCP/IP protocols are typically implemented as software executing on a host computer system. The software used to implement the TCP and IP protocols is sometimes referred to as a TCP/IP stack due to the layered nature of the Internet protocol suite.

TCP/IP was originally designed for relatively low speed connections. However with the development of modern high bandwidth networks TCP/IP is now increasingly being used at high speeds. As the speed of the connection increases, the amount of processing required to implement the TCP/IP protocols increases. Software implementations of a high speed TCP/IP stack on a host computer system typically consume a large amount of the computer's processing power. TCP/IP offload engines have therefore been developed which move the processing from the host computer system to dedicated hardware in a network interface card (NIC).

Current hardware implementations of TCP/IP offload engines require a significant amount of hardware in order to implement the TCP/IP protocols. This makes the hardware complex and more prone to error. Furthermore, current hardware implementations are not optimized for low latency (i.e. the time taken to process a data packet).

It would therefore be desirable to provide a TCP/IP offload system with less complex hardware and/or which is optimized for low latency.

According to a first aspect of the present invention there is provided a TCP/IP offload system comprising:

a software TCP/IP module arranged to be executed on a host processor; and

a hardware TCP/IP device,

wherein the hardware TCP/IP device is arranged to pass responsibility for a TCP connection to the software TCP/IP module when an exception event occurs.

The present invention may provide the advantage that data transfer over a TCP connection can be handled by the hardware TCP/IP device, while exceptions can be handled by the software TCP/IP module. This in turn can allow the amount of hardware required to implement the TCP/IP device to be minimized, and/or allow the TCP/IP device to be optimized for low latency.

By an exception event it is preferably meant any event where a received datagram is not as expected, such as packets out of sequence, packets with bad checksum or expected but missing packets. Exceptions such as these are relatively rare but nevertheless must be handled effectively. By using the software TCP/IP stack to handle at least some of these exceptions, the hardware TCP/IP device can be optimized for low complexity and low latency.

Preferably the hardware TCP/IP device is arranged to check the validity of data packets transferred over the TCP connection and to indicate an exception event if it is determined that a data packet is not valid. This can help to ensure that data forwarded by the hardware TCP/IP device is valid and/or provide a mechanism for identifying exception events. The validity of a data packet may be checked, for example, by checking at least one of a checksum and a sequence number.

Preferably the checksum and/or sequence number are contained in a packet header.
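By way of illustration only, the validity check described above might be modelled in software as follows. This is a minimal sketch in Python and not the hardware implementation; the names `Segment`, `ExceptionEvent` and `check_segment` are hypothetical and do not appear in the application, and the checksum verification itself is assumed to be performed elsewhere.

```python
from dataclasses import dataclass
from enum import Enum


class ExceptionEvent(Enum):
    NONE = "none"
    BAD_CHECKSUM = "bad checksum"
    OUT_OF_SEQUENCE = "out of sequence"


@dataclass
class Segment:
    seq: int           # sequence number carried in the TCP header
    checksum_ok: bool  # result of verifying the header checksum (assumed computed elsewhere)
    payload: bytes


def check_segment(segment: Segment, expected_seq: int) -> ExceptionEvent:
    """Return the exception event, if any, indicated by one received segment."""
    if not segment.checksum_ok:
        return ExceptionEvent.BAD_CHECKSUM
    if segment.seq != expected_seq:
        return ExceptionEvent.OUT_OF_SEQUENCE
    return ExceptionEvent.NONE
```

In such a model, any result other than `ExceptionEvent.NONE` would be the trigger for passing responsibility for the connection to the software TCP/IP module.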

In a preferred embodiment of the present invention, the software TCP/IP module is arranged to establish the TCP connection. By allowing the software TCP/IP module to handle connection set up, the hardware TCP/IP device may be further simplified, which may further improve the reliability of the device. Preferably the hardware TCP/IP device is arranged to transfer data over the TCP connection once the connection has been established. For example, the hardware TCP/IP device may be arranged to detect that the TCP connection is established, and to assume responsibility for the TCP connection when it is detected that the connection is established.

Alternatively or additionally, the software TCP/IP module may be arranged to handle connection closure. Thus the hardware TCP/IP device may be arranged to pass responsibility for the connection to the software TCP/IP module when the connection is to be terminated.

Preferably the hardware TCP/IP device is arranged to pass data packets to and/or from an application program executing on the host processor when the hardware TCP/IP device has responsibility for the connection.

In order to facilitate the handling of exceptions, the software TCP/IP module may be arranged to process data packets transferred over the TCP connection in parallel with the hardware TCP/IP device. This can allow the software TCP/IP module quickly to assume responsibility for the TCP connection when an exception event occurs, since the software TCP/IP module will have been processing the same data packets (possibly at a higher latency). Data packets processed by the software TCP/IP module may then be passed to and/or from an application program when responsibility for the TCP connection is passed to the software TCP/IP module.

However, packets processed by the software TCP/IP module are preferably discarded when the hardware TCP/IP device has responsibility for the TCP connection.

Preferably the software TCP/IP module is arranged to check the validity of the data packets, to help ensure the validity of its own data and/or to identify an exception event.

In a preferred embodiment connection set up and closure, and all exception events, are handled by the software TCP/IP module. However in other embodiments some of these functions may be handled by the hardware TCP/IP device. For example, the hardware TCP/IP device may be able to handle some exception events without the need to pass responsibility for the TCP connection to the software TCP/IP module.

Preferably TCP sequence numbers are used to define a transition point to transfer responsibility for the TCP connection between the software TCP/IP module and the hardware TCP/IP device. This may help to ensure that the packets remain in sequence.

The system may be arranged to support multiple simultaneous TCP connections. In order to help achieve this, the hardware TCP/IP device may further comprise means for determining which TCP connection a data packet relates to. This may be done, for example, by reading the IP address from the packet header, and comparing the IP address to a list of IP addresses in established TCP connections.
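Purely as an illustrative sketch, the lookup described above could be modelled as a table keyed on the IP address read from the packet header. The class `ConnectionTable` and its method names are hypothetical. The table is keyed on the IP address alone because that is what the text describes; a practical design might also include the port numbers in the key, which is an assumption beyond the text.

```python
from typing import Dict, Optional


class ConnectionTable:
    """Maps the remote IP address of an established connection to a connection id."""

    def __init__(self) -> None:
        self._by_ip: Dict[str, int] = {}

    def add(self, remote_ip: str, conn_id: int) -> None:
        # Called once a connection is established.
        self._by_ip[remote_ip] = conn_id

    def remove(self, remote_ip: str) -> None:
        # Called when a connection is closed; unknown addresses are ignored.
        self._by_ip.pop(remote_ip, None)

    def lookup(self, remote_ip: str) -> Optional[int]:
        # Returns the connection id, or None if no established connection matches.
        return self._by_ip.get(remote_ip)
```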

The hardware TCP/IP device may be implemented on a field programmable gate array (FPGA). This may provide a convenient and cost effective way of producing the required functionality. However it will be appreciated that any other type of hardware, such as an application specific integrated circuit (ASIC) or any other type of circuit, could be used instead.

Preferably the system comprises an operating system executing on the host processor, and the software TCP/IP module is part of the operating system. This can allow at least part of the software TCP/IP module to be a standard component, which may reduce cost and/or help to ensure reliability.

According to another embodiment of the present invention there is provided a computer system comprising a host processor and a TCP/IP offload system in any of the forms described above.

The TCP/IP offload system described above may be arranged to pass data packets to and/or from an application program executing on the host processor. This can allow documents to be exchanged between application programs over the TCP/IP connection.

Documents which are transferred between application programs may be structured in various different types of format. For example, document markup languages exist for encoding documents in a format which is both human-readable and machine-readable. Typically, software parsers are used for analysing such documents. However software parsers may not offer the lowest latency.

In accordance with an embodiment of the invention, the system further comprises a hardware parser for parsing documents encoded in a markup language. By implementing the parser in hardware, the speed with which a document can be uploaded and/or analysed may be increased.

Preferably the hardware parser is operable with a plurality of different markup languages. This can allow different document language implementations to be analysed in hardware. This aspect of the invention may also be provided independently. Thus, according to another aspect of the invention, there is provided a system for parsing documents encoded in a markup language, the system comprising a hardware parser, wherein the hardware parser is operable with a plurality of different markup languages.

This aspect of the invention may provide the advantage that different document language implementations can be analysed in hardware. This can provide a low latency path for the documents. The different document languages may be, for example, Extensible Markup Language (XML), Financial Information exchange (FIX), FAST (FIX Adapted for Streaming), or any other markup language.

Preferably the hardware parser comprises a compiler arranged to compile incoming configuration files to microcode. The configuration files may be a description of a type of the document, typically expressed in terms of constraints on the structure and content of documents of that type. For example, the configuration files may comprise at least one of FIX/FAST templates or XML schema. The microcode may then be stored in an instruction memory.

Preferably the hardware parser is arranged to execute the microcode to extract desired fields from incoming documents.
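As a loose software analogy only, the compile-then-execute arrangement described above might be sketched as follows. The instruction format (`EXTRACT`, `HALT`), the function names and the dictionary-style configuration are all hypothetical and are not taken from the application. The example document uses the FIX tag=value encoding, in which fields are separated by the SOH character.

```python
SOH = "\x01"  # FIX field delimiter


def compile_config(wanted):
    """Compile {field name: FIX tag} into a list of (opcode, tag, name) instructions."""
    ordered = sorted(wanted.items(), key=lambda item: item[1])
    program = [("EXTRACT", tag, name) for name, tag in ordered]
    program.append(("HALT", None, None))
    return program


def execute(program, document):
    """Run the instruction list over one document and return the extracted fields."""
    fields = dict(item.split("=", 1) for item in document.split(SOH) if "=" in item)
    extracted = {}
    for opcode, tag, name in program:
        if opcode == "HALT":
            break
        if opcode == "EXTRACT" and str(tag) in fields:
            extracted[name] = fields[str(tag)]
    return extracted
```

For example, compiling `{"msg_type": 35, "symbol": 55}` and executing the result over a FIX message would extract only tags 35 and 55 and ignore the remaining fields.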

Preferably the hardware parser comprises a plurality of hardware decoders connected in parallel. The hardware parser may then further comprise a selector for selecting the output of a decoder. Preferably the selector is controllable by the microcode.
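The parallel decoder arrangement can be pictured, again only as a sketch, as several decoders that all receive the same input, with a selector choosing which output to use. The two decoders shown and the integer selector value are hypothetical; in the arrangement described above the selector would be driven by the microcode.

```python
def decode_text(raw: bytes):
    # Interpret the field as ASCII text.
    return raw.decode("ascii", errors="replace")


def decode_int(raw: bytes):
    # Interpret the field as an unsigned decimal integer, or None if it is not one.
    text = raw.decode("ascii", errors="replace")
    return int(text) if text.isdigit() else None


DECODERS = [decode_text, decode_int]


def decode_field(raw: bytes, select: int):
    """Feed every decoder the same input, then let the selector pick one output."""
    outputs = [decoder(raw) for decoder in DECODERS]
    return outputs[select]
```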

With some or all of the solutions described above, it becomes possible to make sense of the different language implementations on the hardware. This means it is possible to have a low latency path for the documents. This may be particularly advantageous where high speed access to data is required, such as in the high frequency trading/low latency trading communities.

Furthermore, the ability to update the output format without having to change the hardware design or reload the hardware image permits the decoding path to be changed dynamically.

Methods corresponding to any of the above aspects may also be provided. Thus, according to another aspect of the present invention there is provided a method of offloading TCP/IP processing from a host processor, the method comprising managing a TCP connection on a hardware TCP/IP device, and passing responsibility for the TCP connection to a software TCP/IP module executing on the host processor when an exception event occurs.

The method may further comprise using the software TCP/IP module to establish the TCP connection, and passing responsibility for the TCP connection to the hardware TCP/IP device once the connection has been established.

According to another aspect of the present invention there is provided a method of using hardware to parse documents in a plurality of different markup languages, the method comprising compiling incoming configuration files to microcode, and executing the microcode to extract desired fields from incoming documents.

Features of one aspect of the invention may be provided with any other aspect. Apparatus features may be provided with method aspects and vice versa.

Preferred features of the present invention will now be described, purely by way of example, with reference to the accompanying drawings, in which:

Figure 1 illustrates the process of establishing a TCP connection;

Figure 2 illustrates the process of data transmission on an open TCP connection;

Figure 3 illustrates the process of closing a TCP connection;

Figure 4 shows an overview of a TCP/IP offload system in an embodiment of the invention;

Figure 5 shows parts of a split stack TCP/IP offload system in more detail;

Figure 6 shows a block diagram of a socket manager;

Figure 7 is a state diagram of a TCP connection state block;

Figure 8 is a block diagram of a TCP transmission block;

Figure 9 is a block diagram of a TCP reception block; and

Figure 10 shows parts of a generic parser.

Before giving a detailed description of embodiments of the invention, some of the concepts behind embodiments of the invention will first be discussed.

Internet protocol suite

Transmission Control Protocol (TCP) and Internet Protocol (IP) are part of the Internet protocol suite, commonly referred to as TCP/IP. The suite uses a layered model, in which the application is layer 5, TCP is layer 4, IP is layer 3, MAC (or datalink) is layer 2 and the physical layer is layer 1. The software and/or hardware functionality used to implement the TCP and IP protocols is sometimes referred to as a TCP/IP stack.

TCP/IP data is transmitted in data packets. TCP is a connection orientated protocol which provides a guaranteed delivery mechanism, while IP has the task of delivering packets from the source to the destination based on the IP addresses in the packet headers. The maximum packet size is configurable, typically 1500 bytes. Each packet comprises a header followed by the payload, i.e. the underlying data transmitted. There is also a sequence number in TCP headers and a checksum in both TCP and IP headers to help determine whether the packet data is intact.

A TCP packet from NI (Network Interface) A to NI B requires a connection to be established. The steps involved are as follows:

• NI A sends a SYN (synchronize) packet to NI B

• NI B receives NI A's SYN

• NI B sends a SYN-ACK (synchronize-acknowledge) to NI A

• NI A receives NI B's SYN-ACK

• NI A sends an ACK (acknowledge) to NI B

• NI B receives NI A's ACK. The bi-directional TCP connection is now established.

Figure 1 illustrates the process of establishing a TCP connection.
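The exchange listed above can be sketched in Python as a simple replay of the six steps. This is an illustration only; the state names follow conventional TCP terminology and, like the function name `three_way_handshake`, are not taken from the application.

```python
from enum import Enum, auto


class TcpState(Enum):
    CLOSED = auto()
    LISTEN = auto()
    SYN_SENT = auto()
    SYN_RECEIVED = auto()
    ESTABLISHED = auto()


def three_way_handshake():
    """Replay the steps listed above; return both end states and the packets sent."""
    ni_a, ni_b = TcpState.CLOSED, TcpState.LISTEN
    sent = []  # (sender, packet type) in the order transmitted

    sent.append(("NI A", "SYN"))      # NI A sends a SYN to NI B
    ni_a = TcpState.SYN_SENT
    ni_b = TcpState.SYN_RECEIVED      # NI B receives NI A's SYN
    sent.append(("NI B", "SYN-ACK"))  # NI B sends a SYN-ACK to NI A
    ni_a = TcpState.ESTABLISHED       # NI A receives NI B's SYN-ACK
    sent.append(("NI A", "ACK"))      # NI A sends an ACK to NI B
    ni_b = TcpState.ESTABLISHED       # NI B receives NI A's ACK
    return ni_a, ni_b, sent
```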

Each packet sent contains a header. Within the header, certain fields determine the integrity and reliability of the packet. These fields include:

• The sequence number for this source to destination direction of data. Each packet has a 32-bit sequence number that is the accumulation of payload bytes. The source and destination details of the packet are also included in the header.

• A 16-bit checksum that is calculated over most of the packet.
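As a worked illustration of these two fields, the sketch below computes the standard 16-bit ones' complement Internet checksum over a byte string and advances a 32-bit sequence number by the payload length. The function names are hypothetical, and which bytes of the packet are covered by the checksum is left to the caller.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement of the ones' complement sum of 16-bit words."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold any carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def next_sequence_number(seq: int, payload_len: int) -> int:
    """Accumulate payload bytes into a 32-bit sequence number, wrapping at 2**32."""
    return (seq + payload_len) % 2**32
```

A receiver can verify a packet by running the same sum over the covered bytes with the transmitted checksum included; a result of zero indicates that no error was detected.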

After a connection is established, data is sent from NI A to NI B and from NI B to NI A. The sequence for each data packet where NI A sends NI B a data packet is:

• NI A sends a data packet to NI B

• NI B receives the data packet from NI A and confirms good checksum and sequence number contained in the header supplied in the packet.

• NI B sends an ACK packet back to NI A.

• NI A receives ACK from NI B. At this point NI A knows that the data transmission for the packet is complete.

Figure 2 illustrates the process of data transmission on an open TCP connection.

If in the above, the sequence number received by NI B is out of sequence, a resend request for the expected sequence number is made by NI B. It is also possible for a stream of packets, up to a fixed maximum number, to be sent in a pipeline without waiting for an ACK back to the sender. Once that maximum is reached, the sender needs to wait for at least one ACK before it is free to send the next packet. If NI A does not receive an ACK back from NI B in a predefined time, NI A will resend the packet.

A TCP connection may be closed by either party. The steps involved are as follows:

• NI A sends a FIN packet to NI B

• NI B receives NI A's FIN

• NI B sends an ACK to NI A

• NI B sends a FIN to NI A

• NI A receives NI B's ACK

• NI A receives a FIN from NI B

• NI A sends an ACK to NI B

• NI B receives NI A's ACK. The TCP connection is now closed.

Figure 3 illustrates the process of closing a TCP connection.

Other cases are also possible, such as if one side disappears without the due handshake. If a host closes a connection but still has not read all the incoming data, the host may send a reset flag RST instead of a FIN.

PCI Express

PCI Express (Peripheral Component Interconnect Express), officially abbreviated as PCIe, is a high-speed serial computer expansion bus standard designed to replace the older PCI, PCI-X, and AGP bus standards. The PCIe connection connects the CPUs on the motherboard (the host) to hardware cards plugged into the motherboard PCIe slots. The PCIe connection to the host is via a host driver that allows data to be copied to a series of registers on the hardware card and for the hardware card to update buffers in host memory.

TUN/TAP

In computer networking, TUN and TAP are virtual-network kernel devices. Being network devices supported entirely in software, they differ from ordinary network devices which are backed up by hardware network adapters.

TUN (network TUNnel) simulates a network layer device and it operates with layer 3 packets such as IP packets. TAP (network tap) simulates a link layer device and it operates with layer 2 packets such as Ethernet MAC frames. TUN is used with routing, while TAP is used for creating a network bridge.

Packets sent by an operating system via a TUN/TAP device are delivered to a host user-space program which attaches itself to the device. A user-space program may also pass packets into a TUN/TAP device. In this case, the TUN/TAP device delivers (or "injects") these packets to the operating-system network stack, thus emulating their reception from an external source.
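On Linux, a user-space program conventionally attaches to a TUN/TAP device by opening `/dev/net/tun` and issuing the `TUNSETIFF` ioctl. The following Python sketch shows that step. The helper names `build_ifreq` and `open_tap` and the interface name `tap0` are hypothetical, and the code is offered as an illustration of the mechanism rather than as part of the described system. Opening the device requires Linux and suitable privileges.

```python
import os
import struct

# Constants from the Linux header <linux/if_tun.h>.
TUNSETIFF = 0x400454CA
IFF_TUN = 0x0001    # layer 3 device (IP packets)
IFF_TAP = 0x0002    # layer 2 device (Ethernet frames)
IFF_NO_PI = 0x1000  # do not prepend the packet-information header


def build_ifreq(name: str, tap: bool = True) -> bytes:
    """Pack the interface name and flags in the layout expected by TUNSETIFF."""
    flags = (IFF_TAP if tap else IFF_TUN) | IFF_NO_PI
    return struct.pack("16sH", name.encode("ascii"), flags)


def open_tap(name: str = "tap0") -> int:
    """Attach to a TAP device and return its file descriptor."""
    import fcntl  # imported here because it exists only on Unix-like systems

    fd = os.open("/dev/net/tun", os.O_RDWR)
    fcntl.ioctl(fd, TUNSETIFF, build_ifreq(name))
    return fd
```

Once attached, reading from the descriptor yields frames sent by the operating system through the device, and writing to it injects frames into the operating-system network stack.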

TCP/IP offload engine

Software implementations of TCP/IP on host systems typically require significant computing power due to the need to handle the various aspects of the TCP protocol. These aspects include the following:

• Connection establishment

• Acknowledgment of packets

• Checksum and sequence number calculations

• Sliding window calculations for packet acknowledgement and congestion control.

• Connection termination.

TCP/IP offload engines (TOEs) have been developed which move the processing to dedicated hardware. These solutions are complex due to the need to handle connection setup and closure and all the exceptions such as retransmission requests and packet re-ordering. Even then, reliability of the TOEs offered is debatable, and they are not used nearly as extensively as software based TCP stacks. The reliability of software based TCP stacks is taken for granted and built into modern operating systems. However, software based TCP stacks typically suffer from the following problems:

1. High latency: typically, one-way latency through the stack is 20-40 microseconds.

2. Jitter: There is a spread of latencies. Sometimes, a packet takes 20 microseconds through the TCP stack. Sometimes, it takes 100 microseconds.

Embodiments of the invention

Current hardware implementations of TCP/IP offload engines are complex and do not offer the lowest latency. This is mainly due to the need to handle connection setup and closure and all the exceptions such as retransmission requests and packet re-ordering.

It is not trivial to handle the exception cases of bad TCP packets, such as packets out of sequence, packets with bad checksum or expected but missing packets on hardware. Exceptions such as these are relatively rare but nevertheless must be handled effectively.

In embodiments of the invention, a minimum size hardware implementation of TCP/IP is used that is optimized for low latency. The host TCP/IP stack is used to handle connection setup and closure and all of the exceptions and error conditions. Certain embodiments involve re-using an existing TCP software implementation (the Linux Kernel stack) and optimising the entire system for low latency in a non-congested network.

Embodiments of the invention use a split-stack or a hybrid TCP/IP offload engine (TOE) device. The TCP/IP functionality is partly implemented in hardware such as a Field Programmable Gate Array (FPGA) or any other suitable hardware, and partly utilizing a software TCP/IP stack executing on a host CPU.

Figure 4 shows an overview of a TCP/IP offload system in an embodiment of the invention. Referring to Figure 4, the system comprises a host computer system 10 and a hardware TCP/IP offload engine (TOE) device 20. The host computer system includes a processor, memory, an operating system, and the appropriate input/output devices in a known manner. A TCP/IP stack 12 is provided as part of the host computer system 10. The TCP/IP stack 12 is a software implementation of TCP/IP, which may be provided as part of the host computer system's operating system. In one embodiment the operating system is Linux, although any other operating system could be used instead. The TCP/IP stack 12 transfers data to and from the user application 14, which also executes on the host computer system. A MAC (datalink) layer 16 and a physical layer 18 provide connections to a network.

In the arrangement of Figure 4, the TCP/IP offload device 20 is connected between the application layer 14 and the MAC layer 16. The TOE device is connected to the host CPU such that Direct Memory Access (DMA) or Memory Mapped IO is possible, for example over a PCIe interface or using a soft or hard CPU on the hardware die.

This implementation comprises two distinct TCP/IP stacks: a simplified hardware TCP/IP stack 20 and a full software TCP/IP stack 12, in addition to management software and hardware logic to synchronise the two stacks and track additional state.

Figure 5 shows parts of the split stack TCP/IP offload system in more detail. Referring to Figure 5, the system comprises host computer system 10 and TCP/IP offload device 20. The host computer system 10 comprises a central processing unit (CPU) and memory which are arranged to execute various processes. These processes include TCP/IP stack 12, socket manager 22, and TUN/TAP device 24. The TOE device 20 is implemented in hardware such as an FPGA. The TOE device 20 comprises TCP connection state block 26, TCP reception block 28, reception multiplexer 30, TCP transmission block 32, transmission multiplexer 34, recovery block 36, and direct memory access (DMA) blocks 38, 40, 42, 44. The TCP/IP offload system is located between the application layer and the MAC (datalink) layer.

In the arrangement of Figure 5, a TCP connection is either owned by the TOE device 20 or the host TCP/IP stack 12. The host stack 12 is responsible for the TCP connection set-up and teardown. The socket manager 22 in the host initiates a connection (in "client" mode) or responds to a connection request (in "server" mode). It handles ARP (Address Resolution Protocol) requests and performs the SYN, SYN-ACK, ACK handshake with the remote TCP endpoint via the TUN/TAP interface 24.

When the TCP connection state block 26 in the TOE device 20 detects that a connection is established, it notifies the socket manager 22 via the TCP transmission block 32, and takes ownership of the connection until an exception event occurs or either end closes the connection.
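The ownership behaviour described here can be sketched as a small state machine. This is a hypothetical model for illustration and is not the state diagram of Figure 7: the state names `SETUP`, `HW_OWNED` and `CLOSING`, the event names and the transition table are all assumptions, and the return from recovery to hardware ownership is inferred from the wider description of the recovery block.

```python
from enum import Enum, auto


class ConnState(Enum):
    SETUP = auto()     # host stack is performing connection set-up
    HW_OWNED = auto()  # TOE device owns the connection
    RECOVERY = auto()  # host stack owns the connection after an exception
    CLOSING = auto()   # host stack is tearing the connection down


TRANSITIONS = {
    (ConnState.SETUP, "established"): ConnState.HW_OWNED,
    (ConnState.HW_OWNED, "exception"): ConnState.RECOVERY,
    (ConnState.RECOVERY, "recovered"): ConnState.HW_OWNED,
    (ConnState.HW_OWNED, "close"): ConnState.CLOSING,
    (ConnState.RECOVERY, "close"): ConnState.CLOSING,
}


def step(state: ConnState, event: str) -> ConnState:
    """Apply one event; events with no listed transition leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)


def hardware_owns(state: ConnState) -> bool:
    return state is ConnState.HW_OWNED
```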

The TCP reception block 28 validates received segments and provides received data to the application via the reception multiplexer 30. If a received segment is not valid, the TCP reception block 28 informs the TCP connection state block 26 to give control to the socket manager 22. The socket manager 22 then provides received data to the application via the recovery block 36 and reception multiplexer 30.

The host TCP/IP stack 12 processes all packets received for the connection, which are transferred to the TCP/IP stack 12 via DMA 38 and TUN/TAP 24. All payload generated by the application is also transmitted to the host TCP/IP stack 12 via TCP transmission block 32, DMA 42 and socket manager 22. This ensures that the receive and transmit byte-streams for the two TCP stacks are identical. Packets from the host TCP stack 12 are used by the TOE device 20 to check the validity of its own state but are not forwarded to the MAC while the TOE device owns the connection.

When the connection is owned by the TOE device 20, the transmission multiplexer 34 discards packets generated by the software stack 12 and only transmits packets generated by the TCP transmission block 32. The reception multiplexer 30 forwards TCP payload data from the TCP reception block 28 when the hardware owns the connection; otherwise the payload from the host via the recovery block 36 is used.
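The selection performed by the two multiplexers can be summarised in a short sketch. The function and type names are hypothetical. Forwarding the software stack's packets when the host owns the connection is an assumption; the text states only what happens to them while the TOE device owns it.

```python
from enum import Enum


class Owner(Enum):
    HARDWARE = "hardware"  # TOE device owns the connection
    SOFTWARE = "software"  # host TCP/IP stack owns the connection


def transmission_mux(owner: Owner, hw_packet: bytes, sw_packet: bytes) -> bytes:
    """Choose the packet forwarded towards the MAC; the other one is discarded."""
    return hw_packet if owner is Owner.HARDWARE else sw_packet


def reception_mux(owner: Owner, hw_payload: bytes, recovery_payload: bytes) -> bytes:
    """Choose the payload delivered to the application."""
    return hw_payload if owner is Owner.HARDWARE else recovery_payload
```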

The recovery block 36 tracks the number of bytes received from the far end while in the RECOVERY state and indicates to the TCP connection state block 26 when the receive data has been synchronised. To transfer ownership of a connection between the software and hardware stacks, the TCP sequence numbers are used to define the transition point. A transfer of ownership can be initiated by the hardware stack in response to a connection setup or an unhandled exception, or by the software on the host, for example when recovery is complete.
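One way in which a sequence number could define the transition point is sketched below. The description does not give the comparison logic, so this is an assumption: it uses the usual modulo 2^32 comparison of TCP sequence numbers, and the names seq_before and owner_of_byte are hypothetical.

```python
SEQ_MASK = 0xFFFFFFFF  # TCP sequence numbers are 32 bits wide and wrap around


def seq_before(a, b):
    """Return True if sequence number a precedes b, allowing for wrap-around."""
    return ((a - b) & SEQ_MASK) > 0x7FFFFFFF


def owner_of_byte(seq, transition_seq, old_owner, new_owner):
    """Bytes before the transition point stay with the old owner (assumed rule)."""
    return old_owner if seq_before(seq, transition_seq) else new_owner
```

For example, with a transition point of 100, the byte with sequence number 99 is attributed to the old owner and the byte with sequence number 100 to the new owner.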

The split stack TOE can support multiple simultaneous connections.

Parts of the split stack TCP/IP offload system are described in more detail below.

Socket Manager

The socket manager 22 is responsible for initiating and managing a TCP connection in the host.

A block diagram of the socket manager is shown in Figure 6. Referring to Figure 6, the socket manager 22 comprises socket manager state machine 50, TUN/TAP manager 52, TCP transmission DMA manager 54, and TCP reception data manager 56.

In operation, when a new connection is requested by the application, the socket manager state machine 50 retrieves information about the TAP state from the TUN/TAP manager 52. The socket manager state machine 50 also communicates with the TCP connection state block 26 in Figure 5 to indicate that a new connection has been requested, to ensure that the hardware TOE device 20 is ready to monitor the new connection. The socket manager state machine 50 then creates a socket via the TUN/TAP manager 52, connecting to the TAP interface in the host kernel.

When a connection close is requested by the application, or the hardware notifies the socket manager that the far end has closed the connection, then the socket manager state machine 50 ensures that the socket on the host is torn down correctly and indicates back to the TCP connection state block 26 that teardown is complete. The socket manager state machine 50 also records various metrics for debug and reporting.

The TUN/TAP manager 52 is responsible for creating a transient TAP device in the operating system kernel and managing the associated interface. The TAP device exposes two interfaces, a file descriptor based read/write interface to allow the user space program to emulate hardware by reading and writing raw Ethernet frames, and a socket interface for the user-space software to send and receive the byte-stream. On the file-descriptor interface, Ethernet frames generated by the kernel TCP stack are transferred to the TCP transmission block 32 via DMA 40. Similarly Ethernet frames received from the wire are copied to the host TCP/IP stack 12 via DMA 38 and injected into the TAP device by the TUN/TAP manager 52. On the socket interface, TCP transmit payload data is received from the TCP transmission DMA manager block 54 and data received from the socket is passed to the TCP reception data to application block 56.

These blocks manage the communication of payload data to and from the card, handling the DMA interaction and passing debug metrics to the Socket Manager.

TCP Connection State Block

Referring back to Figure 5, the TCP connection state block 26 is responsible for keeping track of the state of a TCP connection in the TOE device. The state of the connection is updatable by the TCP reception block 28 and the TCP transmission block 32. The TCP connection state block 26 also reads a signal sw_conn_state from TUN/TAP 24 via TCP transmission block 32, indicating the connection state of the host TCP/IP stack 12. Figure 7 is a state diagram of the TCP connection state block 26. Referring to Figure 7, the possible states are: a not connected state 60; an established state 62; a recovery state 64; and a closing state 66.

In operation, the TCP connection state block 26 is initially in the not connected state 60. When the host TCP/IP stack has established a connection, this is detected by the TCP connection state block 26 by reading the signal sw_conn_state. The TCP connection state block then transitions from the not connected state 60 to the established state 62. In the established state, the hardware TCP/IP stack owns the connection.

While in the established state 62, if reception SEQ (packet sequence) or reception ACK is not as expected, the TCP connection state transitions to the recovery state 64. In this state the TCP connection state block 26 sends a signal sw_owns_conn to the host, to indicate that the software owns the connection. This is detected by the socket manager 22, which takes control of the connection.

If reception FIN or reception RST is received while in the established state, the TCP connection state transitions to the closing state 66. In the closing state, the TCP connection state block 26 sends a signal sw_owns_conn to the host, to indicate that the software owns the connection. This is detected by the socket manager 22, which takes control of the connection.
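The state diagram of Figure 7 can be summarised as a transition table. The sketch below is illustrative only. The event names are hypothetical, and two transitions are assumptions not spelled out for Figure 7: the return from the recovery state to the established state once the data is synchronised, and the return from the closing state to the not connected state once teardown is complete.

```python
from enum import Enum, auto


class State(Enum):
    NOT_CONNECTED = auto()  # state 60
    ESTABLISHED = auto()    # state 62
    RECOVERY = auto()       # state 64
    CLOSING = auto()        # state 66


class Event(Enum):
    SW_CONNECTED = auto()         # sw_conn_state shows the host stack is connected
    RX_SEQ_ACK_MISMATCH = auto()  # reception SEQ or ACK not as expected
    RX_FIN_OR_RST = auto()        # FIN or RST received
    RESYNCHRONISED = auto()       # assumed: recovery complete
    TEARDOWN_COMPLETE = auto()    # assumed: host reports teardown complete


TRANSITIONS = {
    (State.NOT_CONNECTED, Event.SW_CONNECTED): State.ESTABLISHED,
    (State.ESTABLISHED, Event.RX_SEQ_ACK_MISMATCH): State.RECOVERY,
    (State.ESTABLISHED, Event.RX_FIN_OR_RST): State.CLOSING,
    (State.RECOVERY, Event.RESYNCHRONISED): State.ESTABLISHED,
    (State.CLOSING, Event.TEARDOWN_COMPLETE): State.NOT_CONNECTED,
}


def next_state(state, event):
    """Return the next state; events with no listed transition are ignored."""
    return TRANSITIONS.get((state, event), state)


def hardware_owns_conn(state):
    """The hardware stack owns the connection only in the established state."""
    return state is State.ESTABLISHED
```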

TCP transmission block

A block diagram of the TCP transmission block 32 is shown in Figure 8.

Referring to Figure 8, the TCP transmission block comprises parser 70, TCP transmission processing block 72, and header generation block 74.

In operation, the parser 70 extracts the transmit sequence and acknowledgement numbers, Maximum Segment Size (MSS) and flags from transmitted packets forming the connection setup handshake. The TCP transmission processing block 72 increments sequence numbers, acknowledges all received packets, and honours the transmit window. A validation section 76 within the TCP transmission processing block 72 validates the data to be transmitted. The header generation block 74 constructs TCP and IP headers and sends the packet to the MAC via the transmission multiplex block 34. The header generation block 74 also sends the transmit payload and headers to the socket manager 22 via DMA 42 (see Figure 5).

TCP reception block

Figure 9 shows a diagram of the TCP reception block 28. Referring to Figure 9, the TCP reception block comprises parser 80, flow look up block 82, TCP process block 84, FIFO (first in first out) memory 86, TCP reception processing block 88 and multiplexer 90. In operation the parser 80 extracts the TCP/IP headers from received packets. The flow lookup module 82 uses the extracted headers to determine which connection the packets relate to. During connection set-up the receive window and receive sequence number are extracted and a new flow is created. Received packets are processed by the TCP process module 84. The TCP header information from packets that match an established connection as defined by source and destination IP address and port numbers are passed to the TCP reception processing block 88. A validation section 92 within the TCP reception processing block verifies the checksum and that the sequence numbers match those expected by the connection state. If the packet passes this check then the payload is forwarded from the FIFO 86 to the application by the multiplexer 90, and the connection state is updated to indicate the reception of the packet.
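The two checks made by the validation section 92 can be sketched in software as follows. This is a simplified model under stated assumptions: RxState and receive_segment are hypothetical names, the argument covered stands for all bytes covered by the TCP checksum (pseudo-header, TCP header including the checksum field, and payload), and a segment that fails either check is simply reported by returning None, which corresponds to an exception event.

```python
from dataclasses import dataclass
from typing import Optional

SEQ_MOD = 1 << 32  # TCP sequence numbers wrap at 2^32


def ones_complement_sum(data: bytes) -> int:
    """16-bit one's complement sum used by the Internet checksum."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold the carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return total


def internet_checksum(data: bytes) -> int:
    """Checksum value that a sender places in the checksum field."""
    return ~ones_complement_sum(data) & 0xFFFF


@dataclass
class RxState:
    expected_seq: int  # sequence number of the next in-order byte


def receive_segment(state: RxState, seq: int, payload: bytes,
                    covered: bytes) -> Optional[bytes]:
    """Return the payload if the segment is valid, otherwise None."""
    if ones_complement_sum(covered) != 0xFFFF:
        return None  # checksum failure
    if seq != state.expected_seq:
        return None  # segment re-ordered or dropped on the network
    state.expected_seq = (seq + len(payload)) % SEQ_MOD
    return payload
```

The connection state is only updated when a segment passes both checks, so a rejected segment leaves the expected sequence number unchanged.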

An exception event occurs when the received packet does not contain the expected SEQ and ACK values, indicating that a packet has been re-ordered or dropped on the network. The TCP reception processing block 88 notifies the socket manager 22 that an exception event has occurred via TCP connection state block 26. The TCP connection state block then drops into the RECOVERY state 64 and remains there until the sequence numbers for both transmit and receive are synchronised.

In the embodiments of the invention described above, simplifying the TOE device allows it to be optimized for low footprint (size of hardware) and low latency. Typically 99.99% of Ethernet traffic would be expected to pass through the TOE device. For the exceptions and TCP administration tasks, the traffic would be handled through the software TCP stack. This in turn would normally be slower but richer in functionality. However, if desired, some exception events could be handled within the hardware. The lower footprint can allow a larger number of connections to be supported, or can enable the TCP/IP processing to run at a higher frequency, providing another source of latency improvement. By dispensing with the retransmission buffer and receive re-order queue, valuable high-bandwidth internal memory resources can be freed up on the hardware device.

Furthermore, simpler hardware has fewer potential points of failure, which offers inherent reliability.

Generic parser

The split stack TCP/IP offload system described above can improve the latency of packets passing through the TCP/IP stack. The payload from the TCP/IP stack is typically forwarded to an application executing on the host processor. Such applications may be used, for example, to receive and/or send documents over the network.

Various document markup languages exist for encoding documents in a format which is both human-readable and machine-readable. For example, Extensible Markup Language (XML) is commonly used to express rich structured data in ASCII format. Websites that frequently update their content often provide an XML feed. An application program that wishes to make use of this content will upload and parse the XML feed.

The Financial Information eXchange (FIX) protocol is a rich and flexible tag-delimited language used by exchanges globally as a major standard for trading interfaces. A related protocol is known as FAST (FIX Adapted for Streaming). XML and FIX parsers are generally implemented in software. However, in order to further improve the latency with which data can be received, it may be desirable in some situations for hardware implementations to be provided. For example, in the case of market data, it may be desirable to have a low latency path from market data to the decision to trade and/or the instruction to trade.
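As an illustration of the tag-delimited structure of FIX, the following minimal software sketch splits a message in the classic tag=value encoding into its fields. It assumes the standard SOH (0x01) field delimiter; the function name parse_fix and the sample message are hypothetical, and no validation against a message definition is attempted.

```python
SOH = "\x01"  # field delimiter used by the FIX tag=value encoding


def parse_fix(message):
    """Split a FIX tag=value message into a list of (tag, value) pairs."""
    fields = []
    for field in message.split(SOH):
        if not field:
            continue  # skip the empty string after the trailing delimiter
        tag, _, value = field.partition("=")
        fields.append((int(tag), value))
    return fields


# Hypothetical fragment: BeginString (tag 8), MsgType (tag 35), Symbol (tag 55).
sample = "8=FIX.4.2\x0135=D\x0155=VOD.L\x01"
fields = parse_fix(sample)
```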

Both XML and FIX are ASCII formats, i.e. it is possible to take FIX and XML formatted files and open them with standard office productivity tools. The usage of each is specific to the version of the application. Further, there are various versions of the XML and FIX protocols. For example, the version and acceptable language definition of the FIX protocol used by the London Stock Exchange is currently FIX 5.0 while that of the Chicago Mercantile Exchange is FIX 4.2. Examples of situations in which XML may be used include:

• reasonably complex configuration files, such as the allowable language structure for an implementation of FIX, and

• to provide data streams such as the EBS FX market data feed that supplies continuous FX price data.

EBS (Electronic Broking Services) is a wholesale electronic trading platform used to trade foreign exchange (FX). The language definition for both XML and FIX is rich. The structure usable for any given implementation is provided in a configuration file. This configuration file will define what is and is not acceptable for each message type, typically expressed in terms of constraints on the structure and content of documents of that type. These constraints are generally expressed using some combination of grammatical rules governing the order of elements, Boolean predicates that the content must satisfy, data types governing the content of elements and attributes, and more specialized rules such as uniqueness and referential integrity constraints. Typically (for example with FIX/FAST) the message definitions are provided by templates, which are known by both parties in the communication. In the case of XML, the message definitions are contained in an XML schema. Current hardware implementations of XML and FIX parsers are generally not complete implementations. Specific fields may be sought and processed without the explicit validity checks that are expected of robust systems.

Alternatively, it is possible for the entire parsing activity to be managed by the host server housing the hardware and for the normalised results to be sent back to the hardware. This is not ideal because of the inherent latencies introduced traversing the bus (e.g. PCIe) to the host and back, and this is not capitalising on the inherent processing capabilities of the hardware.

In the case of software implementations, generic open source and commercially available libraries exist that enable software to utilise and construct FIX and XML messages with relative ease. However no such equivalent products are available in hardware. The reason for this is primarily the engineering effort required to implement this in hardware compared to that required with software based solutions.

In accordance with an embodiment of the invention, a solution is provided to implement generic XML and FIX parsers in hardware. The parsers may be implemented using FPGA or any other suitable hardware, in a similar way to the TCP engine. The hardware solution can be as flexible as the software equivalent, including a configuration file that defines the language and version implementation. This permits new protocols or message versions to be supported without having to modify the hardware. It is also possible to dynamically adjust output formats, for example by extracting a new field that was previously not of interest.

In one embodiment the generic parser is used to allow connectivity via hardware to the electronic trading community globally. However the generic parser may be used in many other applications.

In the present embodiment a special purpose execution engine, a generic parser, is provided to permit run-time programmable parsing of this category of protocols. Parsing FIX/FAST, SBE (Simple Binary Encoding) or XML requires the support of various constructs - for example tracking nesting levels, converting values, saving data from the input into temporary storage and looping. For compressed message streams, many fields may be expressed in bit widths that are not integer multiples of 8 or use stop-bit encoding, therefore the generic parser operates at bit-level granularity.
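Stop-bit encoding, as used by FAST, carries seven data bits in each byte and sets the most significant bit on the last byte of a field. A minimal software sketch of decoding an unsigned integer in this encoding is given below; the function name decode_stop_bit_uint is hypothetical and the sketch is not a description of the hardware decoder.

```python
def decode_stop_bit_uint(data, offset=0):
    """Decode a stop-bit encoded unsigned integer.

    Returns (value, next_offset). Each byte contributes its low 7 bits and
    the byte whose top bit is set terminates the field.
    """
    value = 0
    pos = offset
    while True:
        if pos >= len(data):
            raise ValueError("input ended before a stop bit was found")
        byte = data[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if byte & 0x80:  # stop bit set: this was the last byte of the field
            return value, pos
```

For example, the three bytes 0x39 0x45 0xA3 decode to the value 942755, because only the last byte has its top bit set.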

Figure 10 shows parts of a generic parser in accordance with an embodiment of the invention. Referring to Figure 10, the generic parser 100 comprises schema compiler 102, instruction memory 104, decoded microcode block 106, barrel shifter 108, decoder block 110, value selector 112, comparator 114, RAM 116, stack 118, registers 120 and operation selector 122. In operation, schema files are shared between the sender and receiver ahead of processing. The schema compiler 102 takes the incoming configuration files such as FIX/FAST templates or XML schema 124 and compiles them down to decoded microcode which is stored in the instruction memory 104. The microcode is a set of instructions that operates directly on the hardware. Thus the hardware is effectively a specific purpose processor. The microcode executes on the parser to extract desired fields from the incoming byte stream.

The incoming byte stream (for example from reception multiplexer block 30 in Figure 5) is fed to barrel shifter 108. The barrel shifter 108 is a digital circuit that shifts the incoming byte stream by a specified number of bits in each clock cycle to convert the input byte stream into an input bit stream 126. The input bit stream is fed to decoder block 110.
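A software analogue of this byte-to-bit conversion is a reader that consumes an arbitrary number of bits per step. The sketch below is illustrative only: the class name BitReader is hypothetical and most-significant-bit-first ordering is an assumption.

```python
class BitReader:
    """Consume an arbitrary number of bits per step from a byte string."""

    def __init__(self, data):
        self._value = int.from_bytes(data, "big")  # whole input as one integer
        self._bits_left = len(data) * 8

    def read(self, nbits):
        """Return the next nbits bits as an unsigned integer (MSB first)."""
        if nbits > self._bits_left:
            raise ValueError("not enough input bits")
        self._bits_left -= nbits
        return (self._value >> self._bits_left) & ((1 << nbits) - 1)


reader = BitReader(bytes([0b10110011, 0b01000000]))
first = reader.read(3)   # the three leading bits 101
second = reader.read(7)  # the next seven bits, spanning the byte boundary
```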

The decoder block 110 contains various decoders connected in parallel. For example, the decoder block may contain decoders which are employed to convert ASCII digits into a binary integer number, or to determine the next field length for stop-bit encoding. Other examples are decoders which convert a variety of ASCII digits to floating point or fixed point values and/or signed/unsigned integer values, and decoders with delimiter detection e.g. stop bits or particular character sequence detection. The outputs of the various decoders are fed to value selector 112, which may select one or more of the outputs based on an output from the decoded microcode block 106.
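The arrangement of parallel decoders feeding a selector can be modelled in software as follows. This is a sketch under assumptions: the three decoder functions, the DECODERS table and the selector keys are hypothetical, and every decoder is evaluated on the same input window before one output is chosen, mirroring the parallel structure.

```python
def ascii_uint_decoder(window):
    """Convert the leading run of ASCII digits to an integer (None if absent)."""
    value = None
    for byte in window:
        if not 0x30 <= byte <= 0x39:
            break
        value = (value or 0) * 10 + (byte - 0x30)
    return value


def delimiter_decoder(window, delimiter=0x01):
    """Return the index of the first delimiter byte (None if absent)."""
    index = window.find(bytes([delimiter]))
    return None if index < 0 else index


def stop_bit_length_decoder(window):
    """Return the length in bytes of a stop-bit encoded field (None if absent)."""
    for i, byte in enumerate(window):
        if byte & 0x80:
            return i + 1
    return None


DECODERS = {
    "ascii_uint": ascii_uint_decoder,
    "delimiter": delimiter_decoder,
    "stop_bit_length": stop_bit_length_decoder,
}


def value_selector(window, select):
    """Evaluate all decoders on the same window and pick one output."""
    outputs = {name: decode(window) for name, decode in DECODERS.items()}
    return outputs[select]
```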

The decoded microcode block 106 may perform the following operations:

• Select data from one of the input decoders in decoder block 110.

• Push values onto the stack 118.

• Write values to an address in the output buffer (RAM 116). This allows templates to refer to previous outputs, which is used as a form of compression.

• Conditionally jump to a new microcode instruction.

• Unconditionally jump to a new microcode instruction.

• Consume bits from the input.

• Increment or decrement the register 120.

• Load register values.

• Compare input against a register or constant.

Many operations can be performed in parallel, for example pushing a value onto the stack, decrementing a register and conditionally jumping if the result is not zero can all be performed in a single cycle.

All values, including the number of bits or the selector for the input decoders may come from the input, a decoder, the stack, a register or from a constant operand embedded in the compiled microcode.
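To show how a small set of such operations can express a parsing loop, the following sketch interprets a toy instruction set in software. Everything in it is hypothetical: the instruction names, their operands and the example program are illustrative and are not the microcode format of the embodiment. Unlike the hardware, the sketch executes one operation per step rather than several in parallel.

```python
def run(program, data):
    """Interpret a toy microcode program over the input bytes; return the RAM."""
    bits = int.from_bytes(data, "big")
    bits_left = len(data) * 8
    regs = [0, 0, 0, 0]
    stack = []
    ram = {}
    pc = 0

    def consume(nbits):
        nonlocal bits_left
        if nbits > bits_left:
            raise ValueError("input exhausted")
        bits_left -= nbits
        return (bits >> bits_left) & ((1 << nbits) - 1)

    while True:
        op, *args = program[pc]
        pc += 1
        if op == "READ":      # consume bits from the input into a register
            regs[args[0]] = consume(args[1])
        elif op == "LOAD":    # load a constant into a register
            regs[args[0]] = args[1]
        elif op == "PUSH":    # consume bits from the input onto the stack
            stack.append(consume(args[0]))
        elif op == "STORE":   # pop the stack into RAM at the address in a register
            ram[regs[args[0]]] = stack.pop()
        elif op == "ADD":     # increment or decrement a register
            regs[args[0]] += args[1]
        elif op == "JZ":      # conditional jump if the register is zero
            if regs[args[0]] == 0:
                pc = args[1]
        elif op == "JMP":     # unconditional jump
            pc = args[0]
        elif op == "HALT":
            return ram
        else:
            raise ValueError("unknown operation: " + op)


# Parse an 8-bit count followed by that many 4-bit fields.
PROGRAM = [
    ("READ", 0, 8),   # 0: r0 = number of entries
    ("LOAD", 1, 0),   # 1: r1 = output address
    ("JZ", 0, 8),     # 2: finished when no entries remain
    ("PUSH", 4),      # 3: consume one 4-bit field
    ("STORE", 1),     # 4: write it to the output buffer
    ("ADD", 1, 1),    # 5: advance the output address
    ("ADD", 0, -1),   # 6: one entry fewer to go
    ("JMP", 2),       # 7: loop
    ("HALT",),        # 8
]

result = run(PROGRAM, bytes([0x02, 0xAB]))
```

Here the input declares two entries, so the two 4-bit values 0xA and 0xB are written to output addresses 0 and 1. The 4-bit field width shows why bit-level granularity is needed.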

With the solution described above, it becomes possible to make sense of the XML and/or FIX language implementations on the hardware. This means it is possible to have a critical low latency path, for example from market data to the decision to trade and the instruction to trade, on the hardware. This is particularly advantageous where high speed access to data is required, such as in the high frequency trading/low latency trading communities where it is desirable to provide the lowest possible latencies while providing an accelerated trading and market data gateway service. The ability to update the output format without having to change the hardware design or reload the hardware image permits the decoding path to be changed dynamically. This results in lower maintenance of a large install base - for example, when a protocol change is detected, the management software can compile new microcode from the updated templates without having to centrally generate and test a new hardware reconfiguration. Similarly it is possible to dynamically change the output format to add or remove fields during a trading session in order to optimise the data format based on the requirements of the algorithm.

It will be appreciated that features of one of the embodiments described above may be used with any of the other embodiments. Furthermore, the embodiments have been described by way of example only, and variations of detail are possible within the scope of the claims.