Title:
SWITCHED PATH AGGREGATION FOR DATA CENTERS
Document Type and Number:
WIPO Patent Application WO/2015/077878
Kind Code:
A1
Abstract:
Techniques for forwarding packets across a hierarchical organization of switches constituting a network for the purpose of interconnecting a large number of client systems. The network includes at least two layers of packet switching elements, each switching element of one layer being connected to switching elements of another layer. The method may be performed by a control system which is distinct from the packet switching elements. A topology of the network is acquired, and one or more paths are calculated between respective pairs of packet switching elements of one layer, via at least one packet switching element of another layer. Forwarding state is installed in each packet switching element traversed by a path, such that packets can be forwarded via the path. The paths are analyzed to find paths that can be aggregated, and at least two paths are aggregated into a switched path aggregation group.

Inventors:
CASEY LIAM (CA)
Application Number:
PCT/CA2014/051121
Publication Date:
June 04, 2015
Filing Date:
November 25, 2014
Assignee:
ROCKSTAR CONSORTIUM US LP (US)
CASEY LIAM (CA)
International Classes:
H04L12/28; H04L45/50; H04L45/02
Domestic Patent References:
WO2010037421A12010-04-08
Other References:
MORRIS, STEPHEN B.: "MPLS and Ethernet: Seven Things You Need To Know", INFORMIT, 17 December 2004 (2004-12-17), pages 1 - 9, Retrieved from the Internet
AL-FARES ET AL.: "A Scalable, Commodity Data Center Network Architecture", PROC. SIGCOMM'08, 17 August 2008 (2008-08-17), pages 63 - 74, XP058098076, Retrieved from the Internet, DOI: 10.1145/1402958.1402967
Attorney, Agent or Firm:
DANIELS IP SERVICES LTD (Kanata, Ontario K2K 2X3, CA)
Claims:
CLAIMS

1. In a network comprising a first edge bridge, a second edge bridge and a
plurality of intermediate network elements, wherein there are more than two possible communication paths between the first edge bridge and the second edge bridge and all possible communication paths between the first edge bridge and the second edge bridge traverse at least one intermediate network element, a method of forwarding Ethernet frames at the first edge bridge, the method comprising:

associating a respective path identifier with each of a plurality of communication paths pre-established between the first edge bridge and the second edge bridge; and

responsive to determining that a first client Ethernet frame is to be forwarded from the first edge bridge to the second edge bridge:

selecting a first one of the communication paths pre-established between the first edge bridge and the second edge bridge;

encapsulating the client Ethernet frame with a header comprising the path identifier associated with the selected first one of the communication paths; and

forwarding the encapsulated client Ethernet frame on the selected first one of the communication paths.

2. The method of claim 1, further comprising, responsive to determining that a second client Ethernet frame is to be forwarded to the second edge bridge: selecting a second one of the communication paths pre-established between the first edge bridge and the second edge bridge;

encapsulating the client Ethernet frame with a header comprising the path identifier associated with the selected second one of the communication paths; and

forwarding the encapsulated client Ethernet frame on the selected second one of the communication paths.

3. The method of claim 1, wherein the communication paths are Multiprotocol Label Switched (MPLS) Paths and the associated path identifiers are MPLS labels.

4. The method of claim 1, wherein the communication paths are Provider Backbone Bridging - Traffic Engineering (PBB-TE) Ethernet Switched Paths (ESPs) and the associated path identifiers are Backbone VLAN identifiers (B-VIDs).

5. The method of claim 3, wherein the header encapsulating the client Ethernet frame further comprises a MAC address of the second edge bridge.

6. The method of claim 1, wherein the step of selecting a first one of the communication paths pre-established between the first edge bridge and the second edge bridge comprises applying a load balancing process.

7. The method of claim 6, wherein applying a load balancing process comprises performing a calculation using values of fields in the header of the first client Ethernet frame to determine an index value, wherein the calculation determines the same index when applied to client Ethernet frames with the same values of fields in their headers and determines a different index for at least some client Ethernet frames when one or more values of the fields in their headers are different.

8. The method of claim 1, wherein the step of associating a respective path identifier with each of a plurality of communication paths pre-established between the first edge bridge and the second edge bridge comprises receiving one or more messages associating a distinct respective path identifier with each of the plurality of communication paths between the first edge bridge and the second edge bridge.

9. The method of claim 8, wherein:

receiving one or more messages associating a distinct respective path identifier with each of the plurality of communication paths between the first edge bridge and the second edge bridge comprises associating a Switched Path Aggregation Group (SPAG) with a MAC address of the second edge bridge wherein the SPAG comprises a set of at least two of the received path identifiers; and

selecting a first one of the communication paths pre-established between the first edge bridge and the second edge bridge comprises selecting a first one of the path identifiers of the SPAG set.

10. The method of claim 9, wherein:

receiving one or more messages comprises receiving a path status message originating from the second edge bridge, the path status message comprising both a path identifier and a path status field; and selecting a first one of the communication paths comprises:

matching the path identifier of the received path status message with one of the path identifiers associated with the communication paths pre-established between the first edge bridge and the second edge bridge;

examining the value of the path status field to determine whether or not the communication path associated with the matched one of the path identifiers is in an operational state; and adding the matched one of the path identifiers to the SPAG set responsive to determining that the communication path associated with the matched one of the path identifiers is in an operational state.

11. The method of claim 10, further comprising removing the matched one of the path identifiers from the SPAG set responsive to determining that the
communication path associated with the matched one of the path identifiers is not in an operational state.

12. In a hierarchical network comprising at least two layers of packet switching elements, each packet switching element of a first layer being connected to a plurality of packet switching elements of a second layer, a method of establishing packet switched paths performed by a control system distinct from the packet switching elements, the method comprising:

determining a path between a respective pair of packet switching elements of the first layer, via at least one packet switching element of the second layer;

installing forwarding state in each packet switching element traversed by the path, such that packets can be forwarded via the path; and

notifying at least one of the respective pair of packet switching elements of the first level that the path is a candidate to be aggregated by the at least one of the respective pair of packet switching elements into a switched path aggregation group.

13. The method of claim 12, wherein determining the path between the respective pair of packet switching elements is responsive to receiving messages comprising topology information from packet switching elements.

14. The method of claim 13, wherein the topology information in a message received from a particular packet switching element comprises an identity of the particular packet switching element and an identity of each other packet switching element to which the particular packet switching element is currently connected.

15. The method of claim 14, wherein the identity of each other packet switching element to which the particular packet switching element is currently connected is determined using an instance of the Link Layer Discovery Protocol (LLDP).

16. The method of claim 12, wherein the packet switching elements are Ethernet switches and the path between the respective pair of packet switching elements comprises an Ethernet Switched Path (ESP).

17. The method of claim 16, wherein installing forwarding state in each packet switching element traversed by the path comprises installing a forwarding entry in each intermediate Ethernet switch on the ESP wherein the forwarding entry associates a Backbone VLAN identifier (B-VID) and a Destination Backbone Medium Access Control (B-MAC) address of the ESP with an egress port of the Ethernet switch.

18. The method of claim 16, wherein notifying at least one of the respective pair of packet switching elements of the first level that the path is a candidate to be aggregated by the at least one of the respective pair of packet switching elements into a switched path aggregation group comprises notifying the at least one of the respective pair of packet switching elements that is configured as the ESP originating Backbone Edge Bridge (BEB) of the B-VID of the ESP.

19. The method of claim 12, wherein the packet switching elements are Label Switched Routers and the path between the respective pair of packet switching elements comprises a Label Switched Path (LSP).

20. The method of claim 19, wherein installing forwarding state in each packet switching element traversed by the path comprises installing a forwarding entry in each intermediate Label Switched Router on the LSP wherein the forwarding entry associates an incoming Multi-Protocol Label Switching (MPLS) label with an outgoing label and next hop.

21. The method of claim 12, wherein notifying at least one of the respective pair of packet switching elements of the first level that the path is a candidate to be aggregated by the at least one of the respective pair of packet switching elements into a switched path aggregation group further comprises providing in a notification an identification for the switched path aggregation group.

22. The method of claim 21, wherein a traffic profile is associated with the switched path aggregation group, the traffic profile being used to determine in part which subset of arriving packets is to be forwarded on any member path of the switched path aggregation group.

23. The method of claim 12, performed by a control system comprising a general purpose computing element.

24. The method of claim 12, performed by a control system comprising a working element and a standby element.

25. The method of claim 12, performed by a control system comprising a plurality of working elements.

26. The method of claim 12, wherein installing forwarding state in each packet switching element traversed by the path such that packets can be forwarded via the path and notifying at least one of the respective pair of packet switching elements of the first level that the path is a candidate to be aggregated by the at least one of the respective pair of packet switching elements into a switched path aggregation group each comprises sending messages over a
management network to the packet switching elements on the path.

27. The method of claim 26, wherein the management network is an Out of Band (OOB) network.

28. The method of claim 26, wherein the management network is an IP routed
network.

29. The method of claim 12, wherein the control system distinct from the packet switching elements comprises a Software Defined Networking (SDN) controller.

30. The method of claim 29, wherein installing forwarding state in each packet switching element traversed by the path such that packets can be forwarded via the path and notifying at least one of the respective pair of packet switching elements of the first level that the path is a candidate to be aggregated by the at least one of the respective pair of packet switching elements into a switched path aggregation group each comprises sending messages to the packet switching elements on the path in accordance with a version of the OpenFlow protocol.

31. An underlay fabric system for transporting, from ingress ports to egress ports, packets of one or more overlay networks, the underlay fabric system comprising:

an underlay fabric controller;

a plurality of interconnected backbone core bridges (BCBs) wherein each BCB is operable to receive instructions from the underlay fabric controller, the instructions comprising instructions to add, modify and/or delete forwarding entries of the BCB; and a plurality of backbone edge bridges (BEBs) each comprising at least one ingress port, wherein each BEB is connected to at least one BCB and at least one pair of the plurality of BEBs is not directly connected to each other, each BEB being operable to:

receive instructions from the underlay fabric controller, the instructions comprising instructions to add switched paths to particular switched path aggregation groups,

wherein the instructions to each BCB and each BEB would, when completely carried out, result in a full mesh of switched path aggregation groups interconnecting all possible pairs of the plurality of BEBs.

32. The underlay fabric system of claim 31, wherein the BCBs are Ethernet Switches and the interconnections between BCBs are Ethernet links.

33. The underlay fabric system of claim 32, wherein the switched path aggregation groups comprise Ethernet Switched Paths (ESPs).

34. The underlay fabric system of claim 33, wherein the forwarding entries of each BCB comprise a B-VID value and a destination B-MAC address value for each ESP that transits the BCB, wherein the destination B-MAC address is a MAC address of a BEB comprising at least one egress port and distinct ESPs have differing B-VID values.

35. The underlay fabric system of claim 31, wherein each BEB is further operable to receive at least one respective message from each other BEB of the plurality of BEBs, wherein each received message comprises a field reporting a status of a forward switched path from the BEB receiving the message to a respective other BEB of the plurality of BEBs and is transported over a reverse switched path from the respective other BEB to the BEB receiving the message.

36. The underlay fabric system of claim 35, wherein each BEB is further operable to carry out the received instructions to add the forward switched path installed between the BEB receiving the message and the respective other BEB to a particular switched path aggregation group responsive to determining a value of the field reporting the status of the forward switched path in the received message.

37. The underlay fabric system of claim 31, wherein the at least one ingress port of a BEB is operable to be configured to be one or more client logical ports and wherein the configuration of at least one client logical port comprises an association of the at least one client logical port with a community of interest.

38. The underlay fabric system of claim 37, where the association of the at least one client logical port with a community of interest comprises an association of the at least one client logical port with an I-component Service Instance Identifier (I-SID).

39. The underlay fabric system of claim 37, wherein each BEB is further operable: to receive a client packet on one of its client logical ports;

to determine an egress BEB for the received client packet, the egress BEB being one of the plurality of BEBs, the determination being based, at least in part, on a destination address field of the received client packet;

responsive to determining the egress BEB for the received client packet, to select, from the full mesh of switched path aggregation groups
interconnecting all possible pairs of the plurality of BEBs, a switched path aggregation group that interconnects with the determined egress BEB; to select one of the packet switched paths added to the selected switched path aggregation group;

to encapsulate the received client packet with an encapsulation that identifies the selected one of the packet switched paths added to the selected switched path aggregation group and that further comprises a field identifying the community of interest associated with the client logical port on which the client packet was received; and

to forward the encapsulated client packet towards the BCB that is a next hop of the selected one of the packet switched paths added to the selected switched path aggregation group.

40. The underlay fabric system of claim 37, wherein each BEB is further operable: to receive from a BCB on a packet switched path that terminates at the BEB, a client packet encapsulated with an encapsulation that identifies the packet switched path and further comprises a field identifying a community of interest;

to remove the encapsulation from the client packet;

to select an egress client logical port based on a combination of the community of interest and a value of at least one header field of the de-encapsulated client packet; and

to forward the client packet on the selected egress client logical port.

41. The underlay fabric system of claim 31, wherein each BCB is further operable: to receive a packet encapsulated with an encapsulation that identifies a packet switched path;

to match one of the forwarding entries added responsive to receiving
instructions from the underlay fabric controller with the identity of the packet switched path identified by the encapsulation; and

to forward the received packet according to the matched forwarding entry.

42. The underlay fabric system of claim 31, wherein at least one of the BCBs is further operable to send a message addressed to the underlay fabric controller, the message reporting topology information.

43. The underlay fabric system of claim 31, wherein the underlay fabric controller is operable to send messages to at least one BCB, the messages comprising instructions to add, modify and/or delete forwarding entries of the at least one BCB.

44. The underlay fabric system of claim 43, wherein the underlay fabric controller is further operable to send the messages to at least one BCB, the messages comprising instructions to add, modify and/or delete forwarding entries of the at least one BCB, responsive to the underlay fabric controller receiving a message reporting topology information.

45. The underlay fabric system of claim 31, wherein the underlay fabric controller comprises a computing system.

46. A method of transporting overlay network packets over a packet switched
network comprising a plurality of interconnected core packet switches and a plurality of edge switches having installed between each respective pair of edge switches a respective set of at least two traffic engineered packet switched paths, each traffic engineered path transiting at least one core packet switch and no two of the traffic engineered paths between any pair of edge switches transiting exactly the same set of core packet switches, the method comprising: receiving an overlay network packet at an ingress edge switch, the ingress edge switch being one of the plurality of edge switches;

determining an egress edge switch for the received overlay network packet, the egress edge switch being one of the plurality of edge switches, the determination being based, at least in part, on a destination address field of the received overlay network packet;

responsive to determining the egress edge switch for the received overlay network packet, selecting one traffic engineered packet switched path from the set of at least two traffic engineered packet switched paths installed between the ingress edge switch and the egress edge switch;

encapsulating the received overlay network packet with header fields associated with the selected traffic engineered packet switched path; and forwarding the encapsulated overlay network packet over the selected one traffic engineered packet switched path towards the egress edge switch.

47. The method of claim 46, further comprising:

receiving at the egress edge switch the encapsulated overlay network packet; removing from the encapsulated overlay network packet the encapsulation applied by the ingress edge switch;

determining a next hop in the overlay network of the overlay network packet based, at least in part, on the destination address field of the overlay network packet; and forwarding the overlay network packet toward the determined next hop.

Description:
SWITCHED PATH AGGREGATION FOR DATA CENTERS

Cross-Reference to Related Application

[0001] This application is based on Provisional Application No. 61/909,054, entitled "Ethernet Packet Trunk Aggregation for Data Centers", filed November 26, 2013. This Provisional Application is hereby incorporated by reference.

Background

[0002] Very large data centers comprise hundreds of thousands of computers, interconnected by many thousands of switches. Typically the latter are routing or Layer 2 / Layer 3 (L2/L3) switches which forward packets by IP address, rather than by bridging the packets based upon their MAC addresses.

[0003] The switches in a large data center are typically organized as a three level hierarchy of Top of Rack (ToR) switches (switches usually situated on top of a Rack of computers), Leaf switches and Spine switches. The Leaf level is also called the
Aggregation level, and the Spine level is also called the Core level.

[0004] Traditionally, communication patterns in a data center have mainly been what is termed "north/south". This pattern is typical of a multi-tenant data center hosting lots of smallish web sites: a web request for a particular web site arrives over the Internet at a core switch, from whence it is directed through an aggregation switch to a ToR switch serving a computer hosting the particular web site.

[0005] Data centers with "north/south" traffic patterns have typically been constructed with each switch dual homed to switches in a layer above, an arrangement frequently called the "double star architecture". This architecture has two core level switches that between them have to have the capacity to switch all traffic in the data center. A consequence of this is that data centers are limited to the capacity of the biggest (and most expensive) switches available with current technology. Further, if it is desired to use bridging throughout, the Spanning Tree protocol will work, but it leaves half the links and one of the two core level switches unutilized. One way this shortcoming is addressed is by deploying Split MLT (Multi-Link Trunking), also known as Multi-Chassis LAG (Link Aggregation), as described in US patent 7,173,934 to Lapuh et al.

[0006] The advent of multi-tenant data centers - data centers that host the web sites and applications of many different organizations (tenants) - and of cloud computing, with its "Platform as a Service" and "Infrastructure as a Service" offerings, has changed traffic patterns to be far more "east-west", that is, communication directly between computers attached to different ToR switches. With Network Function Virtualization becoming more popular, even accessing the humble single host web site now involves significant "east-west" traffic as incoming packets are inspected and processed by virtual intrusion detection systems, firewalls, load balancers, etc.

[0007] Consequently there has been a movement away from the traditional double star architecture to build data centers that can handle significant "east-west" traffic, that is, communication between computers attached to different ToR switches, and that can scale "horizontally". By horizontal scaling is meant increasing overall capacity by adding more switches rather than by replacing switches with higher capacity switches, as is required when deploying double star architectures. The design most frequently adopted to achieve these goals is a "folded Clos switch design".

[0008] In a folded Clos switch design, ToR, Leaf and Spine switches are all high radix switches, i.e. they have a large number of ports. In a true folded Clos organization there are no direct links between switches of the same level: all ports of each spine switch connect to leaf switches; leaf switch ports are of two types, those that connect to spine switches and those, perhaps of lower capacity, that connect to client ToR switches. Likewise ToR switch ports are of two types, those connected to Leaf switches and those connected to the computers/hosts in their rack. Packets are forwarded upwards towards the Spine and then downwards towards the destination ToR switch. In consequence, any path between any pair of ToR switches will be either two hops (when both ToR switches have at least one link to a common Leaf switch) or four hops (when the ToR switches are connected to different Leaf switches).

[0009] Al-Fares et al. adopt a special instance of a Clos topology called a "fat tree" to interconnect commodity Ethernet switches (Al-Fares et al., A Scalable, Commodity Data Center Network Architecture, Proc. SIGCOMM'08, August 17-22, 2008, pp. 63-74, hereby incorporated by reference). Fat trees will be used herein in the description of example embodiments, although implementations of the present technique are not limited to fat trees, or indeed only to forms of Clos topology. Figure 1 depicts such a fat tree organization for switches with 6 ports. Sets of 3 (half of 6) ToR and 3 leaf switches are grouped together in "pods". There are 6 pods and 9 (3 squared) Spine switches, with each Spine switch having a link to each pod.

[0010] In a data center with a folded Clos or fat tree backbone there will be a relatively large number of paths between each pair of ToR switches. For the 6 port switch fat tree organization of Figure 1 the number of paths between any two ToR switches in different pods is 9. For a data center constructed using switches with 24 ports there would be 144 such paths. It is highly desirable that the switching load of the data center be spread out over all of the spine switches and the links that feed them. As is well known, normal Ethernet bridging requires that potential paths be restricted to those on a spanning tree. In the current state of practice, IP routing is enabled at one or more levels of the hierarchy, so that Equal Cost Multipath (ECMP) forwarding can be used to spread traffic over many or all spine switches and links. (Note that Al-Fares eschews ECMP in favour of predefined routes and structured IP addresses, while the FabricPath product of Cisco Corporation uses routing protocols to establish Layer 2 paths and then a modified forwarding plane akin to routing, including modifying hop counts at each switch - see, for example, Ardica, M., Cisco FabricPath, Technology and Design, Chart Package, May 2013, available at: http://www.cisco.com/web/HR/ciscoconnect/2013/pdfs/Cisco FabricPath Technology and Design Max Ardica Technical Lead Engineering.pdf, hereby incorporated by reference.)

[0011] The main benefit of deploying a regular network organization such as folded Clos is that a large scale network can be built and scaled using just low cost commodity switches at all tiers. Scalability is not limited by expensive, state of the art, "big iron" switches at the top tiers.

[0012] However IP (OSI Layer 3 or L3) forwarding in switches requires more packet processing power than Ethernet (OSI Layer 2 or L2) forwarding, so, for a given packet throughput capacity, pure Layer 2 switches are cheaper than combined L2/L3 switches. Routing requires the continuous operation of a routing protocol, with more overheads and the potential for service affecting reconfigurations when there are switch or link failures (which, given the potential size of data center backbones, occur relatively frequently).

[0013] Another drawback of using Layer 3 forwarding is that it interferes with a desired mode of operation of being able to treat all servers (whether directly executing on hosts or as Domains (Virtual Machines)) as equally able to be executed at any host. Having to group related servers locally to the same Layer 3 subnet, as for example is required for load balancing servers deploying Direct Server Return, both limits the scale of a service and results in underutilization of computing resources.

[0014] Also pure Layer 2 switches, with less work to do in forwarding packets, consume less power, a significant factor in large scale data center deployments.

[0015] Accordingly, there is a need for a Layer 2 forwarding method for data centers that can spread traffic over all spine and leaf switches, preferably one that does not require forwarding table updates when there is a fault in the switching backbone. With such a Layer 2 forwarding method, the location of related Domains within a data center will not be constrained.

Summary of the Invention

[0016] Aspects of the present invention provide methods, apparatus and systems for forwarding packets across a hierarchical organization of switches constituting a network for the purpose of interconnecting a large number of client systems.

[0017] Some embodiments provide a method for spreading Ethernet packet streams over the plurality of packet switched paths between all the pairs of switches at a particular hierarchical level, be that the Leaf level or, alternatively, the ToR level.

[0018] Some embodiments use Ethernet Switched Paths (ESPs) of the IEEE 802.1 PBB-TE standard as the paths, wherein each path is uniquely identified by a combination of a destination backbone medium access control (B-MAC) address and a backbone VLAN identifier (B-VID). Advantageously, when there is at most one path between any pair of ToR ports there can be a one to one correspondence between Spine nodes and B-VIDs.

[0019] Some embodiments use a control or management entity to determine the ESPs, control the assignment of B-VIDs and install the paths in the switches of the data center network.

[0020] The present technique extends Ethernet Link Aggregation to aggregate packet switched paths into what appears to the aggregation client as a single Ethernet link.

[0021] One aspect of the invention provides a method of forwarding Ethernet frames at a first edge bridge of a network comprising the first edge bridge, a second edge bridge and a plurality of intermediate network elements, wherein there are more than two possible communication paths between the first edge bridge and the second edge bridge, and all possible communication paths between the first edge bridge and the second edge bridge traverse at least one intermediate network element. The method comprises: associating a respective path identifier with each of a plurality of communication paths pre-established between the first edge bridge and the second edge bridge; and responsive to determining that a first client Ethernet frame is to be forwarded from the first edge bridge to the second edge bridge: selecting a first one of the communication paths pre-established between the first edge bridge and the second edge bridge; encapsulating the client Ethernet frame with a header comprising the path identifier associated with the selected first one of the communication paths; and forwarding the encapsulated client Ethernet frame on the selected first one of the communication paths.
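As a concrete, albeit hypothetical, illustration of this method, the Python sketch below pre-associates a list of path identifiers with each destination edge bridge and then, for each frame to be forwarded, selects one of those paths, wraps the frame in a header carrying the selected identifier and hands it off for transmission. The class and method names are inventions of this sketch, and the simple round-robin selector is a placeholder; a deployed edge bridge would more likely use a flow-consistent hash such as the one described in paragraph [0056] below.

# Hypothetical sketch of the edge bridge method of paragraph [0021]; not the
# patent's prescribed implementation.
from itertools import cycle

class EdgeBridge:
    def __init__(self):
        # destination edge bridge -> iterator over pre-established path identifiers
        self.paths_to = {}

    def associate_paths(self, dest_bridge, path_ids):
        """Associate a path identifier with each pre-established path to dest_bridge."""
        self.paths_to[dest_bridge] = cycle(path_ids)

    def forward(self, dest_bridge, client_frame):
        """Select a path, encapsulate the client frame with its identifier, forward it."""
        path_id = next(self.paths_to[dest_bridge])   # placeholder selection policy
        encapsulated = (path_id, client_frame)       # header carries the path identifier
        return self.transmit(path_id, encapsulated)

    def transmit(self, path_id, frame):
        return frame  # stands in for transmission on the selected communication path

beb = EdgeBridge()
beb.associate_paths("second edge bridge", ["B-VID 101", "B-VID 102", "B-VID 103"])
beb.forward("second edge bridge", b"client Ethernet frame")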

[0022] Another aspect of the invention provides a method of establishing packet switched paths in a hierarchical network comprising at least two layers of packet switching elements, each packet switching element of a first layer being connected to a plurality of packet switching elements of a second layer. The method may be performed by a control system which is distinct from the packet switching elements and may comprise: determining a path between a respective pair of packet switching elements of the first layer, via at least one packet switching element of the second layer; installing forwarding state in each packet switching element traversed by the path, such that packets can be forwarded via the path; and notifying at least one of the respective pair of packet switching elements of the first level that the path is a candidate to be aggregated by the at least one of the respective pair of packet switching elements into a switched path aggregation group.

[0023] Yet another aspect of the invention provides an underlay fabric system for transporting, from ingress ports to egress ports, packets of one or more overlay networks. The underlay fabric system comprises: an underlay fabric control system; a plurality of interconnected backbone core bridges (BCBs) wherein each BCB is operable to receive instructions from the underlay fabric control system, the instructions comprising instructions to add, modify and/or delete forwarding entries of the BCB; and a plurality of backbone edge bridges (BEBs) each comprising at least one ingress port, wherein each BEB is connected to at least one BCB and at least one pair of the plurality of BEBs is not directly connected to each other, each BEB being operable to receive instructions from the underlay fabric control system, the instructions comprising instructions to add switched paths to particular switched path aggregation groups;
wherein the instructions to each BCB and each BEB would, when completely carried out, result in a full mesh of switched path aggregation groups interconnecting all possible pairs of the plurality of BEBs.

[0024] A further aspect of the invention comprises a method of transporting overlay network packets over a packet switched network which comprises a plurality of interconnected core packet switches and a plurality of edge switches having installed between each respective pair of edge switches a respective set of at least two traffic engineered packet switched paths. Each traffic engineered path transits at least one core packet switch and no two of the traffic engineered paths between any pair of edge switches transit exactly the same set of core packet switches. The method comprises: receiving an overlay network packet at an ingress edge switch, the ingress edge switch being one of the plurality of edge switches; determining an egress edge switch for the received overlay network packet, the egress edge switch being one of the plurality of edge switches, the determination being based, at least in part, on a destination address field of the received overlay network packet; responsive to determining the egress edge switch for the received overlay network packet, selecting one traffic engineered packet switched path from the set of at least two traffic engineered packet switched paths installed between the ingress edge switch and the egress edge switch; encapsulating the received overlay network packet with header fields associated with the selected traffic engineered packet switched path; and forwarding the encapsulated overlay network packet over the selected one traffic engineered packet switched path towards the egress edge switch.

Brief Description of the Drawings

[0025] Aspects of the present invention are pointed out with particularity in the appended claims. Embodiments of the present invention are illustrated by way of example in the following drawings in which like references indicate similar elements. The following drawings disclose various embodiments of the present invention for purposes of illustration only and are not intended to limit the scope of the invention. For purposes of clarity, not every component may be labeled in every figure. In the figures:

[0026] FIG. 1 is a representation of a 3 level fat tree organization of switches as might be deployed, at much bigger scale, in a data center;

[0027] FIG. 2 depicts a functional block diagram of the elements that perform link aggregation according to the IEEE Standard 802.1AX;

[0028] FIG. 3 is a representation of a PBB-TE Ethernet frame as used in Ethernet Switched Paths (ESPs);

[0029] FIG. 4 combines the network of FIG. 1 with functional block diagrams of the elements that perform switched path aggregation according to the present invention, to show an instance of a Switched Path Aggregation group (SPAG);

[0030] FIG. 5 is a refinement of an Aggregator of FIG. 4;

[0031] FIG. 6 is a flowchart of the steps for the installation and operation of SPAGs;

[0032] FIG. 7 is a representation of an underlay fabric comprising a small full mesh core network realizable using the present invention. Each of the connections shown is a Switched Path Aggregation Group; and

[0033] FIG. 8 depicts a Virtual Machine (VM) Host computer, advantageously enhanced to utilize the present invention.

Detailed Description

[0034] FIG. 1 is a representation of a 3 level fat tree organization of switches as might be deployed, at much bigger scale, in a data center. FIG. 1 depicts uniform switches each having 6 ports of equal capacity. In a real deployment the number of ports per switch would be much larger, say 64 or even 128. The top level, called the spine or core, consists of switches 111 to 119. Switches of the second and third levels are organized into pods 10 to 60. Within each pod there are shown 3 second level or leaf level switches (211, 212 and 213 in pod 10 and 261, 262 and 263 in pod 60) and 3 third level or Top of Rack (ToR) switches (311, 312 and 313 in pod 10 and 361, 362 and 363 in pod 60).

[0035] Within each pod 10 to 60, ports of the ToR switches belonging to the pod are joined by links to ports of the pod's Leaf switches for the bi-directional transmission of packets between the ToR level and the leaf level. Each of the spine switch ports has a link to a port of a leaf switch.

[0036] Not shown in FIG. 1 are the end systems that originate and terminate the packets switched by the three levels of switches. These end systems are typically each attached to one, or maybe two, of the ToR switch ports. These end systems are typically computing devices, hosting perhaps a single application or service, or, alternatively, hosting a plurality of virtual machines (VMs) or "containers" which in turn support applications or services. End systems in some pods may be predominantly storage or database nodes. Also, as Data Centers have connectivity to wide area networks, the organization of some pods may be different, for example lacking any ToR switches and instead having wide-area-capable leaf switches coupled to the spine switches and one or more wide area networks.

[0037] Referring to FIG. 1, it is clear that in a folded Clos type of network there are a large number of minimal hop paths available for transporting packets from one edge to the other. For example a packet originating at ToR switch 311 and destined for ToR switch 362 could traverse Leaf 211, Spine 111 and Leaf 261 or it could traverse Leaf 212, Spine 115, Leaf 262, to list 2 of the 9 possible paths. As the radix (number of ports) of the switches grows, the number of potential paths grows faster. It should be noted that, while in this example the ToR switches are designated as edges of the network, the present invention is not limited to the edges being ToR switches. Neither is the invention limited to three-level fat trees and folded Clos organizations of switches. Rather the invention is applicable to any network arrangement where there are multiple minimal hop paths between edges.

[0038] It is well known in the art that, when there are multiple "equal cost" paths available between a packet ingress point and its egress or termination point, it is advantageous to load share network capacity to spread independent packet streams over the available paths. When the packets are Layer 3 IP packets and the switches are IP routers, there is a well-known method, called Equal Cost Multi-Path (ECMP), for achieving this. Consistent with the connectionless nature of IP routing, each router along a path makes an independent determination of the outgoing link to use for forwarding a received IP packet.

[0039] However there is no equivalent to ECMP for spreading packets over multiple paths when it is desired to forward packets at the Layer 2 or Ethernet. Hitherto the approach to forwarding Ethernet packets across data centers, as embodied in the VXLAN solution of Cisco Corporation and the work of the IETF Network Virtualization over Layer 3 (NV03) working group, has been to encapsulate the Ethernet packets inside IP packets so as to be able to deploy ECMP. As well as requiring new
functionality at the edges to associate destination MAC addresses with IP tunnel addresses and having other disadvantages, this approach still requires that the switches be IP routers. The described embodiments of the current invention overcome these disadvantages, enabling all levels of switches in a data center to be lower cost Ethernet switches.

[0040] While hitherto a general multiple hop path packet spreading solution for Layer 2 forwarding has not been available, there is a well-established solution for single hop paths, i.e. links. Link Aggregation Groups (LAGs), also known as Multi-Link Trunks (MLTs), were first standardized in the IEEE 802.3ad Working Group as Section 43 of IEEE Standard 802.3-2000. This standard was subsequently published as IEEE 802.1AX. FIG. 2, derived from IEEE 802.1AX, depicts how two systems, such as bridges 410 and 412, can partner to treat multiple (point to point, full duplex) links between them, 400, as a single Aggregated Link. Link Aggregation allows MAC Clients, 410 and 412, to each treat sets of two or more ports (for example ports 440, 442 and 444 on MAC client 410 as one set, and ports 441, 443 and 445 on MAC client 412 as another set) respectively as if they were single ports.

[0041] The IEEE 802.1AX standard defines the Link Aggregation Control Protocol (LACP) for use by two systems that are connected to each other by one or more physical links to instantiate a LAG between them. In particular an Aggregation Key has to be associated with each Aggregation port that can potentially be aggregated together to form an Aggregator. The binding of ports to Aggregators within a system is managed by the Link Aggregation Control function for that system, which is responsible for determining which links may be aggregated, aggregating them, binding the ports within the system to an appropriate Aggregator 430, 432, and monitoring conditions to determine when a change in aggregation is needed. While binding can be under manual control, automatic binding and monitoring may occur through the use of the Link Aggregation Control Protocol (LACP). The LACP uses peer exchanges across the links to determine, on an ongoing basis, the aggregation capability of the various links, and continuously provides the maximum level of aggregation capability achievable between a given pair of Systems.

[0042] Advantageously LACP provides LAG management frames for identifying when members of the link aggregation group have failed. The response to failure is to reduce the number of links in the group, not to recalculate the topology. And increasing the number of paths in a LAG group is automatic too. If extra switching capacity is added to a system there should not be any co-ordination needed in adding extra links to LAG groups. If it is necessary to re-configure some links and switches, this should appear to endpoints as a reduction in active links until the Aggregation Control function is notified of the new resources.

[0043] Note that FIG. 2 depicts link aggregation between bridges, but 802.1AX link aggregation can also be used by end stations connecting to bridges.

[0044] A drawback of 802.1AX LAGs for folded Clos style networks is, as can be seen in FIG. 1, that each link from a particular switch terminates on a distinct system, so that there is no opportunity to form Link Aggregation Groups. Note however, that the embodiments described below can co-exist with deployments where each of the links between switches shown in FIG. 1 is in fact a plurality of links formed into an 802.1AX LAG. Such deployments may in fact be very common as data centers grow in size and change out old equipment for new, so that single links have differing capacities and more capacity is needed in different parts of the network.

[0045] An aspect of at least some embodiments of the present invention is to extend the operation of an Aggregator (430, 432) to bind logical ports instead of, or in addition to, physical ports (440 through 445). A logical port comprises the functionality that encapsulates a packet to be transmitted with a tunnel encapsulation so that the Ethernet packet can be transported to a peer system through a tunnel. Although the tunnels can be set up between directly connected immediate neighbours, their utility in the context of data centers is when they are established between switches of the same level (e.g. ToR layer) attached to different switches at the level above. In the example of FIG. 1 , which is not intended to limit the invention, a tunnel connecting a ToR switch (e.g. 311) to a ToR switch in a different pod (e.g. 361) would traverse 4 physical links.

[0046] There are various methods, well known in the art, for establishing packet switched paths as tunnels depending on the technology (e.g. bridging, routing or label switching) used to realize them. At least some embodiments of this invention are concerned with installing and aggregating packet switched paths into Switched Path Aggregation Groups (SPAGs). As described below, in preferred embodiments the tunnels are Ethernet Switched Paths (ESPs) but the invention is not limited to this type of packet switched path and other embodiments may, for example, use Multi-Protocol Label Switching (MPLS) label switched paths as tunnels.

[0047] Large-scale data centers have to deal with hundreds of thousands or even millions of individual Ethernet end points, each having a distinct Media Access Control (MAC) address. This is particularly so for large data centers in which Host computers each support tens or hundreds of Virtual Machines (VMs), each with its own virtual Network Interface Card (NIC), which in turn has one or more MAC addresses. High performance Ethernet switches that can handle such large numbers of MAC addresses are not cheap: they require huge amounts of expensive ternary content addressable memory (TCAM) or other associative structures to look up MAC addresses at line rate speeds and, by their very size, such structures consume a lot of power. The widely adopted way of avoiding the problem is to retain IP switching in the core of the data center, and limit Ethernet Bridging to either an emulated bridge in a Host computer or to the ToR switch. In the past this has limited virtual machine (VM) migration and put constraints on the placement of computing resources. As noted above, Cisco Corporation and the IETF NV03 working group propose using IP switches in the core to switch IP encapsulated Ethernet packets to achieve uniform MAC addressing across the data center.

[0048] However Ethernet standards have evolved to also allow for an Ethernet encapsulation process to take place. As specified in the IEEE 802.1ah Provider Backbone Bridging (PBB) standard, an ingress network element, a Backbone Edge Bridge (BEB) in IEEE 802.1 terminology, may encapsulate a simple "Customer" Ethernet frame, 500 in FIG. 3, with an outer MAC header 540 for transport across a core network called, in IEEE 802.1 terminology, a Service Provider Backbone network. A Service Provider Backbone network is comprised of the aforementioned BEBs and Backbone Core Bridges (BCBs). The backbone encapsulation header comprises a destination MAC address on the service provider's network (B-MAC DA) 542, a source MAC address on the service provider's network (B-MAC SA) 544, and a VLAN ID (B-VID) 546. BCBs need only work with the backbone MAC addresses and B-VIDs, thus substantially reducing the required table sizes of core switches. (A fourth component of the backbone encapsulation header is a service instance tag (I-SID) 548, but this is only of significance to BEBs.)
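The sketch below lays out the four backbone encapsulation fields just described as a simple record, purely as an illustration of the header structure; the class name, the example MAC addresses and the encapsulate helper are assumptions of this sketch rather than anything defined by the standard or the patent.

# Illustration of the backbone encapsulation fields 542 through 548.
from dataclasses import dataclass

@dataclass
class BackboneHeader:
    b_mac_da: str   # 542: destination backbone MAC address (the egress BEB)
    b_mac_sa: str   # 544: source backbone MAC address (the ingress BEB)
    b_vid: int      # 546: backbone VLAN identifier, a 12 bit value
    i_sid: int      # 548: service instance identifier, significant only to BEBs

def encapsulate(customer_frame, header):
    """A BEB prepends the backbone header; BCBs then forward on B-MAC DA and B-VID alone."""
    assert 0 < header.b_vid < 4095 and 0 <= header.i_sid < 2**24
    return (header, customer_frame)

frame = encapsulate(b"customer Ethernet frame 500",
                    BackboneHeader("02:00:00:00:03:63", "02:00:00:00:03:11", 101, 20001))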

[0049] Thus if PBB encapsulation is used for forwarding packets across a data center, the core switches, such as those at the leaf and spine levels, can be general standard high speed bridges. These BCBs have only to forward packets based on the B-MAC and B-VID values in the encapsulation headers. As the possible number of B-MAC and B-VID values that a switch will need to have associative memory for is far fewer than the number of "Customer" MAC addresses, this enables cheaper, more energy efficient data center switching facilities.

[0050] PBB by itself does not provide Ethernet Switched Paths, but an enhancement of PBB does. This enhancement was initially called Provider Backbone Trunks (PBT) as described in US Patents 8,194,668, 8,422,500, and published US patent applications 2013-0077495, 2013-0176906, 2008-0279196, 2005-0220096 assigned to Rockstar Consortium, hereby incorporated herein by reference in their entirety. Provider Backbone Bridges - Traffic Engineering (PBB-TE) is the standardization of PBT, first as IEEE 802.1Qay-2009, and now incorporated into the IEEE 802.1Q-2011 standard.

[0051] PBB-TE replaces the normal Ethernet spanning tree and flooding operations with explicit configuration of the MAC forwarding at each Backbone Bridge. Explicit MAC forwarding rules are configured for each B-VID in a range of Virtual LAN identifiers. Explicit configuration permits complete route freedom for paths defined between any pair of source B-MAC (544) and destination B-MAC (542) addresses, permitting the engineering of path placement. Multiple paths between pairs of source B-MAC and destination B-MAC addresses are distinguished by using distinct B-VIDs (546) as path identifiers. Referring to the previous example in FIG. 1 of two paths between ToR switch 311 and ToR switch 362, packets forwarded from 311, encapsulated with a header that has a destination B-MAC address (542) of ToR switch 362, can be forwarded on either path depending on whether they have a first B-VID or a second B-VID. In general, in a folded Clos type of organization, every potential path to a given edge could be configured by installing the appropriate B-VID/B-MAC DA forwarding entries in the switches that the paths traverse.
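A hedged sketch of this configuration step follows: a controller-like function installs a (B-VID, destination B-MAC) to egress-port entry in every bridge an ESP traverses, and each BCB forwards by exact match on that pair, with no learning or flooding. The class, the dictionary-based table and the port names are illustrative assumptions; the example path is the Leaf 211 / Spine 111 / Leaf 261 route between ToR switches 311 and 362 mentioned in paragraph [0037].

# Illustrative only: installing and using PBB-TE style forwarding entries for one ESP.
class Bridge:
    def __init__(self, name):
        self.name = name
        self.fdb = {}  # (b_vid, b_mac_da) -> egress port

    def install_entry(self, b_vid, b_mac_da, egress_port):
        self.fdb[(b_vid, b_mac_da)] = egress_port

    def forward(self, b_vid, b_mac_da):
        # only explicitly configured entries, no flooding or address learning
        return self.fdb[(b_vid, b_mac_da)]

def install_esp(path, b_vid, dest_b_mac):
    """Install forwarding state along an ordered list of (bridge, egress_port) hops."""
    for bridge, egress_port in path:
        bridge.install_entry(b_vid, dest_b_mac, egress_port)

leaf_211, spine_111, leaf_261 = Bridge("211"), Bridge("111"), Bridge("261")
install_esp([(leaf_211, "to spine 111"), (spine_111, "to leaf 261"), (leaf_261, "to ToR 362")],
            b_vid=101, dest_b_mac="B-MAC of ToR 362")
assert spine_111.forward(101, "B-MAC of ToR 362") == "to leaf 261"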

[0052] Packet switched paths or tunnels defined by PBB-TE B-VID/B-MAC DA forwarding entries in bridges are known as Ethernet Switched Paths (ESPs). FIG. 4 depicts 3 ESPs, 480, 482 and 484, installed across the radix 6 fat tree network of FIG. 1 between BEBs (ToR switches) 311 and 363. For the radix 6 fat tree of FIG. 4 there are 9 potential paths for ESPs. The paths of the 3 ESPs shown, 480, 482 and 484, happen to not share any leaf or spine level BCBs in common, which is an advantageous arrangement with respect to minimizing the number of switched paths impacted by a link or switch failure, but is not a necessary choice for load sharing of Ethernet packets.

[0053] FIG. 4 also depicts entities of the BEBs 311 and 363 that encapsulate and forward Ethernet frames over the Switched Path Aggregation Group (SPAG) composed of switched paths 480, 482 and 484. B-MAC clients 622 and 624 are responsible for encapsulating customer Ethernet frames, 500 in FIG. 3, with a Backbone Encapsulation Header 540. B-MAC clients may have additional functions, but minimally they have to determine a destination B-MAC address for the B-MAC DA field 542 based on the C-MAC DA field 502 of the customer frame. The value assigned to the B-MAC SA field 544 is a MAC address of the BEB. If, as may commonly be the case, the B-MAC client is a Virtual Switch Instance (VSI) serving a particular community of interest identified by a specific I-SID, then the B-MAC client will set the I-SID field 548 to the value of the specific I-SID.

[0054] The representations of BEBs 311 and 363 in FIG. 4 follow the conventions of IEEE 802.1 specifications to show that an encapsulated Ethernet frame is passed from the B-MAC clients 622 and 624 to the Aggregator entities 631 and 632 respectively. Within each Aggregator entity the B-VID Selector (641 and 642) determines the value to be inserted into the B-VID field 546 of the encapsulation header. Since an ESP is defined by the B-VID/B-MAC DA pair and the B-MAC DA has already been determined, selecting a B-VID from the set of those ESPs installed between the Aggregator's BEB and the destination BEB suffices to select the one ESP, from the set, on which the encapsulated Ethernet frame will be transported. The set of those ESPs installed between the Aggregator's BEB and the destination BEB is herein designated to be a Switched Path Aggregation Group (SPAG).

[0055] The above and following descriptions describe embodiments of switch path aggregation where the paths are ESPs but the invention is not limited to such embodiments. In particular the invention is also applicable to networks comprising MPLS Label Switched Routers (LSRs) and aggregating MPLS Label Switched Paths (LSPs) between Provider Edges.

[0056] In the case of Link Aggregation Groups it is well known that the process of selecting a link from those in the LAG has to be such that when applied to Ethernet frames with identical headers it must result in the same link being selected. This is to avoid any re-ordering of packets within a single application flow. There is a similar requirement for SPAGs: that the same switched path be selected for customer frames with identical headers, with the proviso that, when PBB encapsulation is being used, the I-SID value can be considered part of the customer packet header. Note that the combination of the I-SID field 548 with the header fields 502, 504, 506 (if present) and 508 of the customer frame is called an I-TAG. One method of selecting the B-VID for an encapsulated Ethernet frame is to hash its I-TAG and use the resulting value, modulo the number of switched paths in the SPAG, as an index into a table of B-VIDs. When a BEB serves a large number of different customer frame sources, this method should result in an almost even spreading of customer frames over the members of the SPAG, while ensuring that frames from a single flow are always forwarded over the same member of the SPAG. If the number of sources is very small however, as might be the case when the BEB is implemented in a Host computer, then extra fields from higher layer protocol headers might need to be incorporated in the hash operation.
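The selection rule described in this paragraph can be sketched in a few lines. The sketch below is illustrative only: it hashes the I-TAG fields of a customer frame and uses the result, modulo the number of paths in the SPAG, as an index into a table of B-VIDs, so identical headers always map to the same path. zlib.crc32 merely stands in for whatever hash a real implementation would choose, and the field names and example values are assumptions.

# Illustrative flow-consistent B-VID selection over the members of a SPAG.
import zlib

def select_b_vid(i_sid, c_mac_da, c_mac_sa, c_vid, spag_b_vids):
    """Return the B-VID of the SPAG member that will carry this flow."""
    i_tag = f"{i_sid}|{c_mac_da}|{c_mac_sa}|{c_vid}".encode()
    index = zlib.crc32(i_tag) % len(spag_b_vids)  # identical headers -> identical path
    return spag_b_vids[index]

spag = [101, 102, 103]  # B-VIDs of the three ESPs 480, 482 and 484 of FIG. 4
print(select_b_vid(20001, "aa:bb:cc:00:00:01", "aa:bb:cc:00:00:02", 10, spag))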

[0057] Once the ESP has been selected and the encapsulated packet headers have been fully constructed, the frame is forwarded onto the core network through the specific Provider Network Port (PNP), one of 661 through 666, that is the transmitter of the first hop of the selected ESP.

[0058] In a folded Clos organization, each ESP to a specific destination BEB transits distinct physical ports at the originating and destination BEBs. It should be noted that this need not be the case for less structured core network organizations. Also if the BEB is an Hswitch, being, as described below, a realization of a switch on a host computer, then the diversity of path routes is only exhibited in the core of the network as all paths will pass through the one or two links that the host computer has to one or two ToR switches.

[0059] When an encapsulated Ethernet frame transported over a member ESP 480, 482, or 484 of a SPAG arrives at a destination BEB it is forwarded to the Aggregator of that SPAG (631, 633), where the collector function (not shown) forwards the frame onto the B-MAC Client (622, 624 respectively).

[0060] It should be noted that the depiction of ToR switches 311 and 363 as Switched Path Aggregation enabled BEBs follows the style of IEEE 802.1 in showing frames being handed off between distinct sublayer functions. As those skilled in the art will appreciate, actual embodiments of the invention could optimize away distinct sublayers, so that the B-VID is selected as part of the process of determining the B-MAC DA and the PNP to be used.

[0061] Those skilled in the art will also know that the 802.1AX LACP allows for multiple Link Aggregation groups to exist between pairs of systems, including Link Aggregation groups with a single member link, and thus will appreciate that multiple Switched Path Aggregation groups could be constructed between pairs of Switch Path Aggregating edge systems. Multiple SPAGs between pairs of edge systems may prove advantageous in giving different types of Ethernet traffic different types of service. For example Storage Area Network (SAN) traffic could be directed over a SPAG where all the intermediate switches, BCBs, are IEEE 802.1Qbb or Priority-based Flow Control capable, while all other traffic is directed over a SPAG where the BCBs on the paths are plain Ethernet switches. Note this last example is not meant to suggest in any way that SPAGs are restricted to frames having a single setting of the so called "p-bits" in their B-VIDs.

[0062] FIG. 5 provides a more detailed look at the components that might comprise the Aggregator 631 of FIG. 4. Before Switched Paths can be aggregated into SPAGs, an Aggregation Controller 671 must determine the state of the candidate switched paths that could form a SPAG to another BEB. As will be elaborated on below, in some embodiments, each BEB's Aggregator will be informed of the parameters of each packet switched path available for its use. These parameters will usually comprise the tunnel packet header fields used to encapsulate the Ethernet frame to be transported over the packet switched path. In embodiments using Ethernet Switched Paths (ESPs) as tunnels, the parameters would comprise the destination B-MAC and B-VID, which, together with the source B-MAC address of the BEB itself, define the encapsulation header.

[0063] A further parameter to be determined is an identifier or index for the local switch port 661, 662 or 663 that initiates transmission on the particular packet switched path. In IEEE 802.1Q terminology these network core-facing ports are called Provider Network Ports (PNPs).

[0064] In some embodiments, particularly those supporting more than one SPAG instance between a pair of BEBs, each potential SPAG will have an associated SPAG identifier. The parameters of a packet switched path provided to the BEB would then include the SPAG identifier that the packet switched path is to be aggregated into. A further refinement would be to associate a traffic filter, specifying, in some fashion, a matching criterion for determining which customer frames are forwarded on which SPAG.
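A minimal sketch of the per-path parameter record described above, assuming ESP tunnels; the field names are illustrative only and not drawn from any standard.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class CandidateSwitchedPath:
        """Parameters a BEB's Aggregator might be given for one candidate packet switched path."""
        dest_b_mac: str                        # B-MAC DA of the peer BEB
        b_vid: int                             # backbone VLAN ID identifying this ESP
        pnp: int                               # local Provider Network Port transmitting the first hop
        spag_id: Optional[int] = None          # SPAG the path is to be aggregated into, if more than one exists
        traffic_filter: Optional[str] = None   # optional matching criterion for steering customer frames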

[0065] The Aggregation Controller 671 could directly take all the information it receives concerning the potential switched paths and immediately form a SPAG from the full set of them, by enabling the B-VID Selector 641 to use all the candidate B-VIDs for the SPAG's B-MAC DA. But since failures of links and switches are commonplace in data centers, and pre-configuration of paths could possibly have had undetected errors, most embodiments would place the notified switched paths into a candidate list and determine the status of each switched path in the candidate list. Switched paths determined to be operational would then be added to the SPAG, by, for example, making an operational ESP's B-VID available to the B-VID Selector 641.
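The candidate-list handling described above might, under the same assumptions, reduce to a loop of the following shape, where is_operational stands in for whatever test procedure (such as the control frame exchanges described next) a particular embodiment uses:

    def refresh_spag(candidates, is_operational, active_b_vids):
        """Move candidate switched paths into or out of the SPAG based on probed status.

        candidates      - iterable of (b_vid, dest_b_mac) pairs notified to the controller
        is_operational  - callable that probes one path end to end
        active_b_vids   - set of B-VIDs currently offered to the B-VID Selector 641
        """
        for b_vid, dest_b_mac in candidates:
            if is_operational(b_vid, dest_b_mac):
                active_b_vids.add(b_vid)        # operational path joins the SPAG
            else:
                active_b_vids.discard(b_vid)    # failed path is withdrawn from distribution
        return active_b_vids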

[0066] One method for the Aggregation Controller 671 of a BEB to determine whether a candidate packet switched path is actually capable of transporting Ethernet frames to the peer BEB is to encapsulate a series of test control frames and forward them along the switched path. The first step to forwarding a control frame is to pass it to a Control Parser/Multiplexor 651, which multiplexes control frames received from the Aggregation Controller with encapsulated customer frames received from the B-VID Selector 641 to form a single stream of frames to be transmitted by the relevant PNP 661, 662, or 663. The Control Parser/Multiplexor 651 also examines frames received at the PNPs and steers them to the Aggregation Controller 671 if they are control frames, and to the collector function (not shown) if they are encapsulated customer frames. If the Aggregation Controller subsequently receives a report control frame from the peer BEB indicating that the test frames it originated were received by the peer BEB, then it can conclude that the candidate packet switched path is operational and can be added to the SPAG.

[0067] As disclosed below, protocols for sending test and report control frames generally assume that every forward path has a matching reverse path. Thus the Aggregation Controller must be able to match up each packet switched path for which it is the originator with a corresponding packet switched path that it terminates. While it is not an absolute requirement that the reverse path traverse (in reverse order) the same links and switches as the forward path, it is generally advantageous, from the perspective of removing failed packet switched paths from SPAGs, that forward and reverse paths be the same. In the simplest embodiments, both the forward and reverse paths would share the same path identifier: for ESPs, the same B-VID, where the B-MAC DA of the forward ESP is the B-MAC SA of the reverse path and vice versa. For other embodiments, the Aggregation Controller will have to be informed of the parameters of the reverse path, such as a different B-VID, when it is informed of the parameters of the candidate path.

[0068] There are at least three protocol choices for sending test and report control frames. These three protocols are asynchronous and a single frame type serves as both a test and report frame. Both end points send to the other a sequence of test frames, which report state information derived in part from whether or not the end point has received test frames sent from the other end point. In particular, if an end point has not received any test frames for a period of time, it signals this in the state information field of the test frames that it is transmitting.

[0069] The first of the protocols is the Link Aggregation Control Protocol (LACP) of IEEE 802.1AX, extended to be operable over ESPs. LACP was originally designed to determine the status of the links attached to physical ports. LACP functions so that peer Link Aggregators (FIG. 2, 430, 432) synchronise between themselves which of the potentially multiple links 400 between them are currently part of the LAG, i.e. LACP ensures that the ports 440, 442, 444 over which the Distributor at one end 470 spreads traffic are matched to the ports from which the Collector at the other end 482 receives traffic. An ESP-extended version of LACP would work with logical ports, defined by the B-VID and B-MAC DA pair of the ESP they originate.

[0070] While LACP has a discovery phase in which it determines which links terminate on which neighbour, in embodiments where the ESP destination B-MAC address uniquely identifies the peer aggregation switch, the Aggregation Controller 671 need not perform a discovery phase. Rather the Aggregation Controller 671 can assume either that all the ESPs sharing the same B-MAC DA can be aggregated into one SPAG or that it will have been informed of which ESPs belong to which SPAG, and can move straight to the phase of determining if a candidate ESP is operable to carry encapsulated traffic. Other embodiments may execute a full LACP style frame exchange over each ESP pair so that an Aggregation Controller and its peer can reach agreement on the member ESPs (forward and reverse paths) that will form the Switched Path Aggregation group between them.

[0071] The second protocol is the IEEE 802.1ag Continuity Check Protocol (CCP). Alternative embodiments could use CCP to determine the continuity of an Ethernet Switched Path, i.e. whether it is operable to carry traffic. This is similar to the use of CCP to detect failures described in US Patent 7,996,559, "Automatic MEP provisioning in a link state controlled Ethernet network" by Mohan et al, hereby incorporated by reference. This in-band or data plane protocol is implemented in many switches. Note that the standard versions of LACP and CCP utilize standard multicast MAC addresses, which would lead to having to install extra forwarding table entries in each intermediate switch on a path. But Mohan describes substituting the end point unicast address (in the present application the destination B-MAC) for the multicast address in CCP, and a similar substitution could be effected for LACP.

[0072] The third protocol is Bidirectional Forwarding Detection as described in IETF RFC 5880, "Bidirectional Forwarding Detection (BFD)", by D. Katz and D. Ward, hereby incorporated by reference. Embodiments could use a version of BFD to detect failures in the forward or reverse paths of members of SPAGs. It should be noted that IETF RFC 5884, "Bidirectional Forwarding Detection (BFD) for MPLS Label Switched Paths (LSPs)" by Aggarwal et al., defines a version of BFD for MPLS LSPs, so BFD would be a natural choice for embodiments of the present invention where the packet switched paths aggregated into SPAGs are LSPs. Even though it would require formalization of some of the protocol details, a version of BFD for Ethernet Switched Paths might also be a preferred embodiment, both because of its inherent robustness and because of the flexibility it affords the end points to mutually set the frequency of the periodic BFD control frames. So, for example, when there is no traffic between a pair of BEBs, the interval between BFD control frames on each ESP of the one or more SPAGs established between them could be relaxed to one every 300ms, while, when a SPAG is carrying substantive traffic, the interval could be tightened to 0.3ms.
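A toy illustration of the interval adaptation just described; the 300 ms and 0.3 ms figures are the example values from the preceding paragraph, and the callables are invented placeholders for the real BFD machinery rather than any standardized API.

    import time

    IDLE_INTERVAL_S = 0.300    # one control frame every 300 ms when the SPAG is idle
    BUSY_INTERVAL_S = 0.0003   # one every 0.3 ms when the SPAG carries substantive traffic

    def keepalive_loop(send_control_frame, spag_is_busy, path_is_up):
        """Send BFD-style control frames on one member ESP, adapting the rate to load."""
        while path_is_up():
            send_control_frame()
            time.sleep(BUSY_INTERVAL_S if spag_is_busy() else IDLE_INTERVAL_S)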

[0073] Advantageously, continuing to run one of these protocols over each ESP after the ESP has been added to a SPAG can be used to maintain the SPAG with the automatic removal from the SPAG of failed ESPs and the automatic addition to the SPAG of newly restored or newly configured ESPs. Failures of links or nodes do not trigger an immediate distributed re-calculation of, and re-convergence to, a new set of paths, as would be the case when paths are installed responsive to spanning tree or routing protocol operation.

[0074] A further refinement of the present invention would be to augment the information carried in the test and report control frames so as to allow the Aggregator control function at each end to determine how congested each ESP in a SPAG is, and consequently to bias the assignment of flows to member ESPs towards assigning fewer flows to congested ESPs. Detecting congestion could be as simple as detecting lost control frames and otherwise comparing the jitter in arrival times with some baseline, or it could involve adding timestamps and/or other fields to the control frames to allow a finer deduction of the congestion state experienced by the control frame. It should be noted that an advantage of this refinement is that it does not require of intermediate switches that they perform special operations on the control frames, in keeping with a goal of at least some embodiments of this invention: reducing the complexity required in core switches.
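One hypothetical way to apply such a bias, assuming each member ESP has been given a congestion score derived from lost control frames and arrival-time jitter, is to weight the choice of member for each newly observed flow (per-frame selection for existing flows stays deterministic, so in-flow ordering is preserved):

    import random

    def pick_member_for_new_flow(members, congestion_score):
        """Choose a SPAG member ESP for a new flow, biased away from congested members.

        members          - list of B-VIDs currently in the SPAG
        congestion_score - dict mapping B-VID to a score in [0, 1], 0 meaning uncongested
        """
        weights = [max(1.0 - congestion_score.get(b_vid, 0.0), 0.05) for b_vid in members]
        return random.choices(members, weights=weights, k=1)[0]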

[0075] It is well known in the art that traffic flows within a data center can be categorized as "elephants" (large and long lasting) or "mice" (more numerous, but small and short lasting), and it may be policy to assign the different categories of traffic flow to different priorities. It might be found to be advantageous to send at least some of the test and report control frames at the same priority as the "elephant" flows, as congestion due to colliding "elephant" flows can be long lasting. Although care will be needed in management reporting to distinguish loss of a sequence of frames due to congestion, from loss due to a failure on an ESP's path, there are minimal consequences in mistakenly removing an ESP totally from a SPAG, rather than just reducing the number of flows assigned to it.

[0076] In a further aspect of the present invention, the packet switched paths that will be aggregated into SPAGs are set up by a management entity or controller. In a straightforward implementation one such controller might control all the participating switches in a data center. There are various methods known in the art to provide robustness using one or more standby controllers and implementations of the present technique are not limited to having a single management entity compute and install the required packet switched paths. Any of the schemes known in the art for master and standby or distributed control could be used.

[0077] Alternatively the overall controller function could be distributed amongst a plurality of controllers, with switches allocated in some fashion to be controlled by respective controllers. For simplicity the following description of controller functionality is related to a single operational controller, which, since it can be used, as will be presently described, to set up an Ethernet Underlay Network or Fabric, is herein called an Underlay Fabric Controller (UFC). However, as noted above, the UFC is not limited to such an embodiment and could be a distributed controller. The UFC could be implemented on a suitably programmed general purpose computing element or on a more specialized platform.

[0078] The UFC needs to determine the topology of the constituent BEB and BCB switches, to calculate a set of Ethernet Switched Paths (ESPs) that are to be the SPAG members between BEBs, to install the B-VID, B-DA forwarding table entries in the relevant switches to realize the ESPs, and to install in BEBs the information necessary for them to aggregate some or all of the ESPs terminating on another BEB into a Switched Path Aggregation Group. Below is described how each of these steps might be accomplished. Once SPAGs have been established, the UFC needs to monitor the core network, that is the underlay fabric, for changes in topology, brought about by link or node failures taking equipment out of service and by maintenance operations reinstating failed equipment and/or adding new equipment. Further the UFC may monitor the utilization of SPAGs, removing one or more member packet switched paths from underutilized SPAGs and adding extra packet switched paths to SPAGs suffering from congestion.
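The UFC cycle described above can be summarized, purely as a sketch with placeholder callables, as follows (the step numbers refer to FIG. 6, introduced below):

    def ufc_main_loop(acquire_topology, compute_esps, install_forwarding, notify_bebs, wait):
        """Skeleton of the Underlay Fabric Controller cycle."""
        previous_topology = None
        while True:
            topology = acquire_topology()                 # steps 602 / 610: acquire or re-acquire topology
            if topology != previous_topology:
                esps, spags = compute_esps(topology)      # step 604: evaluate ESPs and SPAG membership
                install_forwarding(esps)                  # step 606: push (B-VID, B-MAC DA) entries to switches
                notify_bebs(spags)                        # tell BEBs which paths aggregate into which SPAGs
                previous_topology = topology
            wait()                                        # pace the loop; utilization reports could also be folded in here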

[0079] As shown in FIG. 6, the first stage 602 is for the UFC to acquire the topology of the core network. While it is possible that the UFC is configured with complete topology information, most embodiments will involve at least some determination by constituent switches as to which other switches they themselves are connected to. The discovery phase of Link State Protocols such as IS-IS and OSPF provides one way of doing this. While IS-IS, in particular, could be run when there is no IP layer in place, there is a specific Ethernet layer protocol for switches to discover characteristics of their neighbours. Link Layer Discovery Protocol (LLDP), as now defined in IEEE 802.1AB-2009, is a widely implemented way for Ethernet switches to advertise their identity to the devices at the other end of each of their links. Switches regularly transmit an LLDP advertisement from each of their operational ports. While LLDP advertisements can carry extra information, such as meaningful text names for originating switches, the two essential items of information conveyed are the originating system identifier (called a chassis ID) and an originating port identifier. When a switch receives LLDP information from any of its neighbours, it uses it to populate, or update, a management information base (MIB) called the remote MIB.
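As a sketch of how the reported remote MIB contents might be turned into a topology, assuming each switch reports (local port, neighbour chassis ID, neighbour port) triples; the data layout is invented for illustration:

    def build_topology(remote_mibs):
        """Assemble a link list from the LLDP remote MIBs reported by each switch.

        remote_mibs maps a chassis ID to a list of
        (local_port, neighbour_chassis_id, neighbour_port) entries, mirroring the two
        essential items an LLDP advertisement conveys.
        """
        links = []
        for chassis_id, entries in remote_mibs.items():
            for local_port, peer_chassis, peer_port in entries:
                links.append(((chassis_id, local_port), (peer_chassis, peer_port)))
        return links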

[0080] There are various methods in the art by which the UFC can be informed of the topology information in the remote MIBs of all the switches in a particular system: perhaps the best known is the use of SNMP (Simple Network Management Protocol). As is known in the art, to be amenable to SNMP management each switch must have an IP address. One approach for communicating between switches and their management system or controller is to have a dedicated management port on each switch, and to establish and operate a separate management network alongside the main network that implements SPAGs. Alternatively, when the switches are all Ethernet switches, a spanning tree protocol VLAN, either the default VLAN (i.e. no VLAN-ID in the packet header) or one with a VLAN-ID separate from the range of B-VIDs used to support ESPs, could be dedicated to the management network. For either alternative, the out of band (OOB) management network will usually be a routed network, complete with one or more Dynamic Host Configuration Protocol (DHCP) servers to assign IP addresses to newly powered up switches. In one embodiment the management network might be a single subnet, with Ethernet spanning tree bridging for forwarding packets. In other embodiments a data center might be organized in a set of subnets, say one for each pod, and one covering the Spine switches, with IP routing between subnets.

[0081] It is understood that the invention is not limited in how the management network is constructed, nor indeed in whether topology information is conveyed to the UFC by way of an IP based protocol. In a Software Defined Networking (SDN) architectural embodiment, an operator might pre-configure for each switch a default ESP between it and an SDN controller. In embodiments minimizing pre-configuration operations, perhaps just those switches directly connected to the SDN controller would be pre-configured, and they in turn would advertise in their LLDP advertisements to their neighbours the B-VID and Destination B-MAC to be used to reach an SDN controller.

[0082] When it receives new or updated topology information the UFC performs an evaluation of which ESPs should be installed or removed across the core network (step 604). A key decision, which might be a network operator settable parameter, is how many members a SPAG should have so as to provide good load spreading and uniform loading of switches, while not running into forwarding information base memory constraints. As noted above, the number of potential paths that could be installed is very large for higher radix switches: each ESP requires at least 60 bits of forwarding table space at each switch in its path. Depending on the type of workload a data center has, a choice of a number such as 8 ESPs per SPAG might give nearly optimal load spreading without incurring unsupportable forwarding table costs.
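A back-of-envelope check of the forwarding-table cost, under the assumption that each entry is keyed by a 12-bit B-VID and a 48-bit B-MAC DA (the 60 bits cited above) and that a full mesh of SPAGs is installed; the specific input values in the example are arbitrary and purely illustrative:

    def spag_table_budget_bits(num_bebs, esps_per_spag, avg_hops_per_esp, bits_per_entry=60):
        """Rough total forwarding-table space consumed across the core by a full mesh of SPAGs."""
        directed_pairs = num_bebs * (num_bebs - 1)              # ESPs are unidirectional
        total_entries = directed_pairs * esps_per_spag * avg_hops_per_esp
        return total_entries * bits_per_entry

For instance, spag_table_budget_bits(288, 8, 5) comes to roughly 2 x 10^8 bits, or about 25 megabytes, spread over the whole core, which illustrates why the number of members per SPAG has to be bounded.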

[0083] For a uniform network organization, such as a fully populated fat tree, paths may be evaluated with the objective of uniformly distributing paths across all the switches, where paths between a pair of BEBs are maximally disjoint and the total number of paths passing through each Spine switch is the same (or as near to the same as the mathematics allows). In other embodiments, the basic uniform spreading of paths may be modified to accommodate differing link capacities, as might occur when some switches in a data center have been upgraded to a new model, while others remain with lower capacity links. Other embodiments might modify the assignment of paths to BEB pairs based on a traffic matrix reflecting either provisioned or measured traffic patterns. For example, if two pods in a data center are dedicated to Internet interfacing, relatively more traffic can be expected between them and pods serving hosts that are web servers than between the Internet interfacing pods and, say, database backend pods. Yet other embodiments may not assign any paths and SPAGs between particular pairs of BEBs when it is known from administrative or operational means that there is no customer traffic directly between the pair.
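For the uniform case, one hypothetical way of spreading the member ESPs of every BEB pair over the spine switches is a simple round-robin, which keeps the per-spine path counts as equal as the arithmetic allows and keeps the members of any one SPAG on distinct spines whenever the SPAG size does not exceed the number of spines:

    from itertools import cycle

    def assign_spines(beb_pairs, spine_switches, esps_per_spag):
        """Assign each member ESP of every BEB pair to a spine switch, round-robin."""
        assignment = {}
        spine_cycle = cycle(spine_switches)
        for pair in beb_pairs:
            # Consecutive picks from the cycle are distinct spines while esps_per_spag <= len(spine_switches).
            assignment[pair] = [next(spine_cycle) for _ in range(esps_per_spag)]
        return assignment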

[0084] It should be noted that the UFC may use the reported topology information in additional ways: it may have a component that checks reported configurations against an ideal for the chosen architecture and generates trouble tickets or work orders for the correction of any mis-wiring or mis-configuration of the constituent switches.

[0085] Once the routes of the packet switched paths that are going to be members of SPAGs have been determined, each participating switch may be directed by the management system to install the packet switched paths (step 606). Installing ESPs in the BCBs they transit comprises installing in the BCBs' forwarding information base the destination B-MAC and B-VID values for the ESPs. Those skilled in the art will appreciate that, in the terminology of Software Defined Networking (SDN), the management system would be called an SDN Controller, and the instructions to the switches would be conveyed over a Control Plane interface, perhaps using a version of the SDN related OpenFlow protocol. However other management protocols could be used. Neither are implementations of the present technique limited to having direct communication between the entity that computes the paths and each switch that has to install the forwarding entries that realize the paths. As is well known in the art, control plane protocols, such as Generalized Multi-Protocol Label Switching (GMPLS) and Resource Reservation Protocol - Traffic Engineering (RSVP-TE), allow for the hop-by-hop signalling of a switched path.
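Whatever southbound protocol carries it, the installation step amounts to one instruction per transit switch; the install_entry callable below is an invented stand-in for OpenFlow, SNMP or hop-by-hop signalling, not a real API:

    def install_esp(path_hops, dest_b_mac, b_vid, install_entry):
        """Install one ESP hop by hop.

        path_hops      - ordered list of (switch_id, egress_port) pairs along the route
        install_entry  - conveys one (B-VID, B-MAC DA) -> egress port entry to a switch
        """
        for switch_id, egress_port in path_hops:
            # Each BCB forwards purely on the (B-VID, B-MAC DA) pair, so this is all it needs to learn.
            install_entry(switch_id, dest_b_mac, b_vid, egress_port)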

[0086] It will be appreciated that, as well as being instructed as to forwarding table entries (and potentially reverse path forwarding checks), edge switches (BEBs), being the end points of SPAGs, may need to be notified as to which packet switched paths belong to which SPAG. Alternatively this information might be deduced from the destination B-MAC of each ESP, or might be determined by protocol exchanges over each packet switched path in a similar fashion to the exchanges defined in IEEE 802.1AX that determine which links can be aggregated into a particular LAG.

[0087] Implementations of the present technique are not limited to having a single management or control entity compute and install the required packet switched paths. Any of the schemes known in the art for master and standby control, or distributed control, or hierarchical control could be used.

[0088] For ongoing system operation, the UFC may need to detect changes in the topology (step 610) and re-evaluate the installed packet switched paths (step 604 again). The re-evaluation may be responsive to detecting load changes or congestion as well as detecting topology changes resulting from the failures of links or switches and the addition of new, or returned to service, links and switches. The routing of installed packet switched paths may be incrementally adjusted. Installed packet switched paths may be first removed from SPAGs or SPAG candidate lists and then uninstalled. New packet switched paths may be first installed and then added to SPAG candidate lists.

[0089] In some embodiments the UFC, in addition to receiving topology reports on a regular basis, might receive notification from BEBs when an ESP is removed from a SPAG because it has ceased to be operable. However it will be realized that, provided there are normally a sufficient number of ESPs in the SPAG that none of the ESPs becomes congested when the offered load is spread over the remaining operational ESPs, it probably suffices for the UFC to compute and configure replacement ESPs at the slower cadence of topology report reception. That said, there may be a requirement for a low frequency auditing process to detect any corruption of forwarding tables and the like that results in a packet switched path appearing to be operational to the UFC when in fact the BEBs have determined that it is not operational.

[0090] It should also be noted that the management system must track the allocation of B-VIDs to ensure that, for a given destination B-MAC, each path is properly defined.

[0091] Of particular interest in data centers is the use of Ethernet Virtual Private Networks (EVPNs) as overlay networks. Many data centers of the size that would benefit from the present invention have multiple tenants, each using a fraction of the data center's resources. While a large data center may support tens, even hundreds, of thousands of low usage web server instances, each instance being a web server of a different tenant, a common deployment model is for a data center to assign each tenant a respective number of computing hosts, either physical machines or virtual machines (VMs), with the tenant organizing what computations run on their assigned computing hosts and how its computing hosts communicate with each other. The owners of data centers may themselves be running a number of very large scale applications. Some of these may be tuned for responsiveness in answering Internet search queries, while others may be tuned for background computing, such as web crawling.

[0092] As the communication patterns within different types of applications are likely to be very different, the owner of a data center would not want to constrain what protocols their applications or their tenants' applications use. This is doubly so with the advent of Network Function Virtualization and Infrastructure as a Service. Further, the resource demand of applications is going to vary with time, and the operator will want to move resources such as host computers and VMs between applications and tenants. This is where EVPNs prove advantageous. An EVPN provides an application with the lowest common denominator for communication between its constituent parts, namely a Layer 2 local area network. The control software of a data center does not need to know anything about how a tenant or an application uses IP: whether it is IPv4 or IPv6, whether IP address spaces overlap, or indeed whether Layer 3 is used at all. Using EVPNs also facilitates moving VMs or other constituent parts of an application from one host to another host anywhere in the data center, since the VM or application constituent will still be reachable without needing to change any IP addresses.

[0093] In a beneficial deployment of Switched Path Aggregation Groups, SPAGs can be used as "links" in an Underlay Fabric. An Underlay Fabric is used to switch or transport packets of one or more overlay networks across the core switches of a network. For the reasons given above, Ethernet Virtual Private Networks (EVPNs) are the overlay networks of most interest for data centers. Narten et al. (in Problem Statement: Overlays for Network Virtualization, Internet Draft draft-ietf-nvo3-overlay-problem-statement-04, herein incorporated by reference), in relation to the IETF NVO3 Working Group, describes the general way that overlay networks are realized as "Map and Encap". When a packet of a specific overlay network instance arrives at a first-hop overlay device, i.e. an underlay fabric edge device, the device first performs a mapping function that maps the destination address (either L2 or L3) of the received packet into the corresponding destination address of the egress underlay fabric edge device, where the encapsulated packet should be sent to reach its intended destination. Once the mapping has been determined, the underlay fabric edge device encapsulates the received packet with an overlay header and forwards the encapsulated packet over the underlay fabric to the determined egress. As well as containing underlay fabric addresses, an overlay header provides a place to carry either a virtual network identifier or an identifier that is locally significant to the egress edge device. In either case, the identifier in the overlay header specifies at the egress to which specific overlay network instance the packet belongs when it is de-encapsulated.
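The generic behaviour can be condensed to the following sketch; mapping_table, encapsulate and forward are placeholders for the embodiment-specific mechanisms, and the frame object is assumed, purely for illustration, to expose its customer destination address as dst_mac:

    def map_and_encap(frame, vni, mapping_table, encapsulate, forward):
        """Sketch of the "Map and Encap" step at an ingress underlay fabric edge device."""
        egress_addr = mapping_table[(vni, frame.dst_mac)]   # "map": find the egress edge for this overlay instance
        packet = encapsulate(frame, egress_addr, vni)       # "encap": add the overlay header, carrying the identifier
        forward(packet)                                     # send over the underlay fabric towards the egress edge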

[0094] Because, as Narten et al. states, the NVO3 Working Group is focused on the construction of overlay networks that operate over an IP (L3) underlay transport, the (corresponding) address produced by the mapping function is an IP address and the core switches of the underlay fabric must be IP routers. While using IP routers that support Equal Cost Multi-Path (ECMP) does allow for load spreading across the underlay fabric core, a goal of at least some embodiments of the current invention is to reduce the cost of switching in data centers by eliminating the requirement that the switches have high performance routing capabilities. At least some embodiments of the current invention enable pure L2 (bridging) underlay fabrics, and deployment of SPAGs gives traffic-engineered load balancing superior to ECMP.

[0095] The utility of SPAGs as "links" in an underlay fabric is most manifest in data centers having folded Clos or similar organizations. These designs result in a myriad of distinct paths for packet flows between any pair of ToR switches. As noted above, in a fat tree design using Ethernet switches with as few as 24 ports, there are as many as 144 paths between any pair of ToR switches not in the same pod. If a modest number, say 8, of Ethernet Switched Paths were installed on distinct paths between every pair of ToR switches, and if these ESPs were subsequently aggregated into one or more SPAGs between each pair of ToR switches, a very resilient full mesh for transporting any overlay network packets across the data center core would result.

[0096] FIG. 7 depicts a small example of an underlay fabric for realizing EVPNs. Underlay Fabric Edge Switches 311, 312, 313, 361, 362 and 363, as might be the ToR switches of the first and last pods, 10 & 60 of FIG. 3, are full mesh connected by "links". The "links" are in fact a plurality of distinct packet switched paths, each transiting a series of core switches (not shown), and aggregated into a SPAG. In FIG. 7, the Aggregator 631 of Edge Switch 311 is shown as forming a SPAG 702 with Aggregator 633 of the Edge Switch 363. SPAG 702 is the aggregation of a plurality of packet switched paths, as may be ESPs 480, 482 and 484 of FIG. 4, installed between the Edge Switch pair 311 and 363 by an Underlay Fabric Controller (UFC).

[0097] Given the diversity of the paths that each installed ESP transits, the "link" between a pair of Edge Switches can, for all intents and purposes, be considered always available. Physical links, ports or even entire leaf or spine switches of a given packet switched path may fail and take an ESP out of service, but the capacity of the "link" would drop by only a small percentage (depending obviously on how many ESPs are configured between each pair of edge switches).

[0098] As noted above, in some embodiments there may be more than one traffic engineered SPAG between any pair of Edge Switches in an underlay fabric and different classes of packet traffic may be assigned to different SPAGs. Some embodiments may differentiate between SPAGs to the same egress switch based on overlay network type.

[0099] The logical entity that performs the underlay fabric edge "map and encap" function for an Ethernet Virtual Private Network (EVPN) overlay is known in the art as a virtual switch instance (VSI). Each EVPN instance supported by the underlay fabric will have a VSI representation at each underlay fabric edge that has a connection to a client entity of the EVPN instance. FIG. 7 depicts VSIs 711, 713, 761 and 763 of an EVPN "a" that has client connections (not shown) at four of the Underlay Fabric Edge switches: 311, 313, 361 and 363. Other EVPN instances may well have VSIs at a smaller or greater number of Underlay Fabric Edges than the depicted EVPN instance "a", depending on the number and distribution of client connections. Generally the overhead of a VSI is very small and Underlay Fabric Edges can support a large number of VSIs.

[0100] As is well known in the art, a fully meshed core network can be used to interconnect the root bridges of multiple non-overlapping spanning trees, resulting in a bigger Ethernet network (as described, for example, in Interconnections: Bridges and Routers by Radia Perlman, ISBN 0-201-56332-0, hereby incorporated by reference), provided that the forwarding behaviour of the root bridge is modified from normal learning bridge forwarding to be split horizon forwarding. (With split horizon forwarding, frames arriving on a core full-mesh port are not forwarded on any other core full-mesh port.) It will be realized, though, that in data centers, VSIs at underlay fabric edges may not need to have the full functionality of a root bridge of a spanning tree. The VSIs at underlay fabric edges may only need a split horizon forwarding capability because in data centers the connections to the EVPN clients will be connections to end systems (hosts, VMs etc), rather than connections to learning bridges. Thus, a VSI can be realized as little more than a grouping of forwarding table entries.

[0101] When a VSI 711 of an ingress underlay fabric edge 311 receives an overlay network Ethernet frame, in the form of a customer frame (FIG. 3, 500), on a client connection, it first performs the "map" step. This involves determining the egress underlay fabric edge 363 (one of 313, 361 and 363 in FIG. 7) where a peer VSI 763 will forward the customer frame towards its final destination as designated by the customer MAC destination address of the received customer frame (502). Then the "encap" step is performed for the ingress VSI 711, adding a backbone encapsulation header (FIG. 3, 540) to the customer frame with the destination B-MAC address (542) set to be a MAC address of the egress underlay fabric edge. The source B-MAC address field (544) is set to a MAC address of the ingress underlay fabric edge (311), while the I-component Service Instance Identifier (I-SID) field 548 is set to a community of interest identifier that uniquely identifies the overlay network instance, e.g. EVPN instance, within the underlay fabric and is associated with each of the constituent VSIs (711, 713, 761 and 763 in FIG. 7) of the specific overlay network instance.

[0102] The B-VID field 546 of the encapsulation header is set by the Aggregator 631 for SPAG 702 responsive to determining that the destination B-MAC address of the SPAG 702 is the address of the egress underlay fabric edge, i.e. matches the destination B-MAC address 542 of the encapsulation header, and then selecting the B-VID corresponding to one of the member ESPs of the SPAG, as described above.
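Pulling the two preceding paragraphs together, the ingress processing might be sketched as below; select_b_vid stands in for the Aggregator's per-frame member selection, the function and its dictionary result are invented for illustration, and the numeric comments refer to the fields of FIG. 3:

    def build_pbb_header(egress_b_mac, ingress_b_mac, i_sid, select_b_vid):
        """Assemble the backbone encapsulation fields added by the ingress VSI and Aggregator."""
        return {
            "B-DA":  egress_b_mac,                 # 542: B-MAC of the egress underlay fabric edge
            "B-SA":  ingress_b_mac,                # 544: B-MAC of this ingress edge
            "B-VID": select_b_vid(egress_b_mac),   # 546: member ESP chosen by the SPAG Aggregator
            "I-SID": i_sid,                        # 548: identifies the overlay network instance
        }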

[0103] When the encapsulated frame is received at the egress underlay fabric edge 363, after having been forwarded across the underlay fabric on the ESP identified by the selected B-VID and the mapped destination B-MAC address, the Aggregator 633 uses the I-SID value 548 to select the associated VSI 763, which in turn determines on which client connection the de-encapsulated frame should be further forwarded. The selected VSI must be the one associated with the overlay network instance, e.g. an EVPN instance.

[0104] Customer frames arrive at, and depart from, underlay fabric edge switches on what has been designated herein as "client connections". Client connections are also known in the art as access circuits (ACs). The customer frames from distinct communities of interest must be kept separated. When a subtending host computer is dedicated solely to the computations of one community of interest linked by a single overlay network instance (i.e. the host is dedicated to a single sandbox - see below), then the client connection may be embodied as a physical link terminating on a physical port of the underlay fabric edge. The physical port becomes dedicated to the VSI associated with the particular community of interest. In cases where a subtending host is hosting computations for multiple communities of interest, client connections must be logical links multiplexed onto a physical link and terminating at the underlay fabric edge as VSI logical ports. Customer frames are associated with VSI logical ports based on some form of multiplexing tag carried in the frame header. One such multiplexing tag would be the S-VID (506 of FIG. 3) of the IEEE 802.1ad or Provider Bridges Ethernet frame format. Another such multiplexing tag would be the E-TAG of the IEEE 802.1BR standard when the host computer's Network Interface Card (NIC) assumes the role of an IEEE 802.1BR Bridge Port Extender. Most current ToR switches already embrace IEEE 802.1BR technology. Note that in situations where the host is more than one hop away from the Underlay Fabric Edge Switch (e.g. when the underlay fabric edge functionality is incorporated in the aggregation or second level switches), it might be preferable to use the IEEE 802.1ad S-VID approach, as that approach can exploit spanning tree to provide alternate paths should an intermediate switch fail. With both of the above tags, there is the issue of configuring a mapping between communities of interest and the tags. Yet a third kind of tag would be the IEEE 802.1ah I-SID, which, since VSIs are identified by I-SIDs, would ameliorate the mapping problem. However no current standard allows the use of the I-SID field without also including the frame B-DA, B-SA and B-VID fields (which are added by Trunk Aggregators in at least some embodiments of this invention).

[0105] As mentioned above, systems such as Cisco Corporation's VXLAN, and those based on the IETF NVO3 working group's Internet Drafts mentioned above, use Layer 3, that is IP, networks as the underlay fabric for EVPNs. EVPN overlays using Layer 2, that is Ethernet, underlay fabrics are also known in the art, but because each suffers from deficiencies (which deficiencies at least some embodiments of the present invention overcome), hitherto all underlay fabrics for data center EVPNs have been Layer 3.

[0106] One putative Layer 2 data center underlay fabric is Provider Backbone Bridges (PBB). Although Virtual Switch Instances (VSIs) are not explicitly taught in the IEEE 802.1ah Provider Backbone Bridges (PBB) standard (now incorporated into IEEE 802.1Q-2011), the rationale for PBB was to allow the transport of I-Tagged customer Ethernet frames over a Provider Backbone Bridged network, and the 802.1Q I-components of Backbone Edge Bridges can be considered a generalization of Virtual Switch Instances performing the I-Tagging and destination BEB determination. However with PBB as the underlay network there can be no load spreading, as there is but one respective path between each pair of BEBs, the respective path being determined by the backbone Spanning Tree protocol. Worse, unless clever things are done configuring multiple VLANs, all paths will transit through the same spine switch.

[0107] Another putative Layer 2 data center underlay fabric is PBB-TE. US patent 8,194,668 to Friskney et al., hereby incorporated by reference, describes PBT (which subsequently was standardised as IEEE 802.1Qay Provider Backbone Bridges - Traffic Engineering (PBB-TE), now incorporated into the IEEE 802.1Q-2011 standard). US Patent 8,194,668 describes an EVPN service (called Virtual Private LAN Service or VPLS in 8,194,668) over ESPs (engineered connections) between BEBs (carrier edge nodes, PE-edge) using VSIs (virtual customer-address-space Ethernet switch instances). However IEEE 802.1Qay and US Patent 8,194,668 both lack the establishment of a plurality of ESPs between a pair of BEBs for the purpose of aggregating them into SPAGs for load balancing customer traffic between the pair of BEBs. Lacking the formation of SPAGs means that neither teaches the use of SPAGs in providing a highly resilient full mesh underlay fabric. The aggregation of a plurality of ESPs between BEBs into SPAGs in order to both load balance customer traffic and provide a highly resilient full mesh underlay fabric is a novel aspect of the disclosed embodiments.

[0108] Yet another putative Layer 2 data center underlay fabric is SPBM. US patent 7,688,756 to Allan et al., in describing Provider Link State Bridging (PLSB), subsequently standardized as Shortest Path Bridging (SPBM) in IEEE 802.1aq, specifies that VSIs might be identified by B-MAC addresses and so be clients of an SPBM underlay network. Allan mentions, and US patent 7,911,944 to Chiabaut et al. confirms, that in an SPBM underlay network there could be forwarding state installed for multiple shortest paths between a pair of BEBs. (US Patents 7,688,756 and 7,911,944 are hereby incorporated by reference.) However, while the forwarding state installed in BCBs for SPBM shortest paths and the forwarding state installed for ESPs both associate a B-VID and destination B-MAC with an outgoing port, the methods by which such forwarding state is installed are completely different. In at least some embodiments of the current invention, and as described above, bridges install ESP forwarding entries responsive to receiving, from a separate management or control system, instructions explicitly containing the entries. The essence of SPBM is that bridges themselves calculate their own forwarding entries responsive to receiving link state advertisements flooded from other bridges, and then determining the shortest paths between all pairs of bridges. If multiple shortest paths are found, then each bridge uses the same standardised set of tie breaking algorithms to select a pre-administered number (between 1 and 16) of shortest paths for which to install forwarding entries, after having assigned a pre-administered distinct B-VID to each of the selected shortest paths. However those familiar with the tie breaking of SPBM know that it is not appropriate for load spreading in networks, such as those in data centers, that have very large numbers of shortest paths. A method that uses a set of tie breaking algorithms to determine the paths of members of all the SPAGs is brittle, inflexible and incompatible with using the lowest possible cost switches in a data center.

[0109] The brittleness of SPBM with tie breaking arises from the tie breaking algorithms relying on the system identifiers assigned to the participating bridges. It is not clear, for any given network organization such as a 3 level fat tree, that there even exists an assignment of system identifiers to bridges that would spread out all of the selected paths uniformly across all spine switches rather than concentrating all paths through 16 of the spine switches, for example. But even if such an assignment were found, it would not hold up in the face of link failures, or the addition or removal of switches.

[0110] SPBM is inflexible as it only finds shortest paths and does not permit traffic engineering. If, for example, a 3 level Clos organization were augmented with "cut-through" links between selected leaf switches, then SPBM would consider only paths using the cut-through links when determining shortest paths between members of the pods they belong to.

[0111] SPBM requires that all the bridges in the system participate in a link state protocol, which requires that bridges have a certain degree of memory and processing power. The standardized set of tie breaking algorithms is only defined to work over a single area, but incorporating thousands or even tens of thousands of switches into a single area of flooded Link State Advertisements would require bridges to have memory and processing power way beyond what is practicable, let alone economic.

[0112] It is to be appreciated that the overlay networks supported by the traffic engineered load balancing Layer 2 underlay fabric disclosed herein are in no way limited to being EVPN instances. The B-MAC clients (FIG. 4, 622 & 624) of Aggregators 631 and 632 may be any type of MAC client: bridge relay functions, Virtual Switch Instances (VSIs), Routers, Virtual Routing Forwarders, logical ports terminating access circuits and, as will be discussed below, Virtual NICs of VMs. Thus, in addition to EVPN instances, the overlay networks that can be realized using the invention could include IP subnets, IP VPNs, Storage Area Networks (SANs) and so on, and different types of overlay network may be supported simultaneously at underlay fabric edges.

[0113] FIG. 8 depicts an example host computer organization for a computer hosting computations for multiple communities of interest. Such a host computer might have multiple logical links over a single Ethernet physical link connected to a ToR BEB supporting per community of interest VSIs and EVPN instances. In FIG. 8 the realization of a computation belonging to a particular community of interest is by way of Virtual Machines (VMs) 811, 812, 819.

[0114] As is well known in the art, VMs are complete instances of a total software image that is executable on a computer, wherein multiple independent VMs are supported by a host computer having a base software layer 840 variously called a Hypervisor, a Virtual Machine Monitor (VMM) or "Domain 0". VMs hosted on a computer share the CPUs 802 and one or more Network Interface Cards (NICs) 804 of the host. (Those skilled in the art know that the "Card" in Network Interface Card is an anachronism, and that the term NIC is used for any logic and hardware drivers of a network port coupled to a computer.)

[0115] In a multi-tenant data center, the VMs of each tenant, while sharing the resources of the data center, do not interact with each other. Indeed, it would be a breach of service level agreements and security if one tenant could interfere with another tenant's applications and services. Conversely, the VMs assigned to a single tenant need to be able to communicate with each other. The most general way that such communication can be realized is to provide each VM with its own virtual NIC (vNIC) 821, 822, 829, with each vNIC having its own (customer) MAC address, as if the VM had exclusive use of a real NIC.

[0116] Customer MAC (C-MAC) addresses assigned to the vNICs of VMs of a single tenant, or community of interest, need to be distinct. However, tagging Ethernet frames leaving a vNIC with an identifier specific to the tenant or community of interest permits use of the same C-MACs by different tenants, either accidentally or deliberately, while also allowing for the separation of tenants required in a data center. As discussed above, this identifier might be an I-SID (548 in FIG. 3) in the core of the network, while being an S-VID (506) on the client connection between a vNIC and a VSI of a BEB.

[0117] The Ethernet switching functionality implemented within the Hypervisor, herein denoted as the Host switch or Hswitch 850, is of interest for the present invention. In some embodiments the function of the Hswitch is to tag and direct Ethernet frames from the vNICs 851, 852, 859 over a hardware NIC 804, and to bridge Ethernet frames between local VMs that belong to the same community of interest. In the terms of IEEE 802.1Q, in some embodiments an Hswitch may be an I-BEB (a Backbone Edge Bridge comprising an I-Component), with the ToR switch it is connected to being a B-BEB (a Backbone Edge Bridge comprising a B-Component), and the link between Hswitch and ToR switch may then be an I-TAG Boundary LAN.

[0118] It will be understood that there may be differences in philosophy of operation in data centers that will give rise to different forms of linking between Hswitches and ToR switches. In some data centers each host has a single NIC attached to a ToR switch, on the basis that the likelihood of a communications failure that renders the host unusable is of the same order of magnitude as other host failures. In other data centers a host may have two NICs 804 dual homed to different ToR switches for reliability or performance reasons. VMs' vNICs may be assigned to one of the two NICs, with all vNICs being re-assigned to the other of the two NICs in the case of failure. Alternatively, different NICs could be assigned to different types of communication, typically storage operations vs other types of communications. Alternatively, Split MLT, as described above, could be deployed between pairs of ToR switches, so that the Hswitch can treat a plurality of NICs as a single (aggregated) link.

[0119] While the embodiments above assume that the clients of an EVPN or overlay network instance are VMs belonging to the same tenant or community of interest, those skilled in the art will know that VMs are not the only realization for the sharing of resources of a data center amongst multiple tenants. Tenants can "rent" complete host computers, VMs and, increasingly, Containers, for carrying out computations. Containers, being a reincarnation of the user processes of time sharing systems, are a lighter weight realization of VM functionality, where each tenant's programs on a host share a common operating system with all other programs on the host. In addition, tenants' programs can access private or virtualized instances of shared services such as Network Functions Virtualization (NFV) firewalls for handling their external Internet traffic, and block storage controllers for accessing the permanent storage that they have also rented. In this specification the term "Domain" is used to denote any collection of tightly coupled resources used for performing a single or multi-threaded computation at a host. This is an old usage of the term "Domain"; as befits its heritage, the open source XEN hypervisor calls its virtual machines "domains". In this specification "Domain" is used henceforth to include any instance of a VM or a Container, or an instance of a virtualized service, that can be dedicated to a single community of interest. The system that dispatches Domains to hosts will be called a Domain Controller.

[0120] It is possible to give a little more precision to the concept of "community of interest" in relation to Domains in a data center. A data center tenant, normally needing communications, computing and storage facilities, will be assigned one or more Domains that provide these facilities, free from interference from other tenants' computations and, to a first order, unaffected by other tenants' resource requests. The set of Domains of a tenant constitute a community of interest that will, in many embodiments, communicate directly with each other using a single dedicated overlay network instance, either an EVPN (a Layer 2 VPN) or an IP VPN (a Layer 3 VPN). The set of a tenant's Domains, together with the overlay network dedicated to their inter-communications, constitute a "sandbox", so called because a tenant is free to do what she likes within her sandbox, but is severely constrained in any interactions between domains within the sandbox and any services outside the sandbox.

[0121] The overlay VPN instance dedicated to a sandbox will have an administered VPN identifier that could also be used as an owner (or principal) identifier for the sandbox's Domains. In embodiments where the overlay is an EVPN, the VPN identifier would either be directly an I-SID or be mappable to an I-SID. Note that, depending on the nature of the applications or services a tenant wants to realize, the applications or services may be separated into multiple sandboxes, each with its own associated VPN identifier.

[0122] It will be realized that there are several possible choices within a Data Centre network for positioning Layer 2 underlay fabric edge functionality. One approach is to enhance ToR switches with SPAG edge functionality. In this embodiment, each ToR switch may comprise a set of Virtual Switch Instances (VSIs) as B-MAC Clients 622, 624 of the Aggregators 631, 633 that constitute the edges of the full mesh SPAG underlay fabric shown in FIG. 7. Each VSI is associated with a distinct I-SID identifying an EVPN instance dedicated to a sandbox, and has logical ports linked to those of the sandbox's Domains that are located on the host computers subtending off of the ToR switch.

[0123] An alternative choice for the location of underlay fabric edge functionality is the Hswitch. Such a placement has the advantage that ToR switches would require only the simple functionality of BCBs, that is, the ToR switches would not require any extra capabilities related to PBB encapsulation, SPAG Aggregators or VSIs. Instead these capabilities can be introduced piecemeal into the Host computer Hypervisor and/or Operating System implementations of Hswitches as the data center operator moves towards realizing a SPAG based Layer 2 underlay fabric over more and more of their infrastructure.

[0124] It will be understood that, when the Underlay Fabric edge switch (the BEB) is in fact an Hswitch embedded in a Host computer, the number of B-MAC addresses that the core switches (the BCBs) potentially have to deal with will be one or two orders of magnitude greater than if the edge switch functionality were exclusively the preserve of ToR switches. Installing all the ESP mappings needed to achieve full mesh multi-member SPAG connectivity with all other Hswitches may become impractical due to capacity limitations in core switch forwarding entry table space.

[0125] Should it become necessary to limit the number of (B-MAC, B-VID) forwarding table entries needed in BCBs, there are at least two techniques, deployable individually or combined. The first technique is to divide the hosts into virtual pods and assign sandboxes to individual virtual pods. Virtual pods can be of arbitrary size and shrink or grow as needed. Since all layer 2 inter-communication of Domains within sandboxes stays within a respective sandbox, the Underlay Fabric Controller (UFC) needs only to establish a full mesh of SPAGs per virtual pod, i.e. between the Hswitches of the hosts belonging to the virtual pod. Note that while typically a host would only be a participant in one virtual pod, special hosts, such as those that are gateways to shared services, might belong to many virtual pods. To help with the assigning of sandboxes to virtual pods, and the subsequent management of the sandboxes, virtual pods may be dedicated to particular types of sandboxes, e.g. sandboxes where all the Domains are Containers.

[0126] The second technique requires that Domain Controllers notify the UFC of the Hswitch's B-MAC address and the sandbox's I-SID whenever they install a domain of the sandbox on the Hswitch's host. By performing a centralized calculation similar to the distributed one performed by BCBs according to PLSB, the UFC can determine which pairs of Hswitches will potentially forward frames to each other given the current assignment of domains to hosts. This allows the UFC to install just the minimum number of ESP forwarding entries required to realize SPAGs that are actually of potential use to the underlay fabrics installed by clients in their current locations.
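A sketch of this centralized calculation, assuming the UFC simply records the (I-SID, Hswitch B-MAC) notifications it receives from Domain Controllers; the data layout and function name are illustrative only:

    from collections import defaultdict
    from itertools import combinations

    def required_hswitch_pairs(domain_placements):
        """Determine which Hswitch pairs need SPAGs, given the current placement of domains.

        domain_placements - iterable of (i_sid, hswitch_b_mac) notifications; two Hswitches
        need connectivity only if they host domains of the same sandbox (same I-SID).
        """
        hosts_per_sandbox = defaultdict(set)
        for i_sid, b_mac in domain_placements:
            hosts_per_sandbox[i_sid].add(b_mac)
        pairs = set()
        for b_macs in hosts_per_sandbox.values():
            pairs.update(combinations(sorted(b_macs), 2))
        return pairs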

[0127] Domain Controllers may also be responsible for migrating running domains to a new Host, as first described by Casey et al. in the paper 'A Domain Structure for Distributed Computer Systems', published in the Proceedings of the 6th Symposium on Operating Systems Principles in the ACM Operating Systems Review, Vol 11 No 5, Nov. 1977, hereby incorporated by reference.

[0128] As part of the Domain migration process, either the Domain is migrated to a host for which there is already an operational SPAG between the host and each of the hosts serving the other Domains of its sandbox, or new SPAGs have to be established so as to provide full mesh connectivity.

[0129] As an example, assume that the overlay networks are controlled using the Locator/Identifier Separation Protocol (LISP) as described in Internet Draft draft-maino-nvo3-lisp-cp-03, LISP Control Plane for Network Virtualization Overlays, by F. Maino et al., 18 Oct 2013, hereby incorporated by reference. The LISP mapping database holds mappings from sandbox specific addresses (MAC addresses for EVPN overlay networks, and IPv4 addresses and/or IPv6 addresses for Layer 3 overlay networks) to destination Hswitch B-MAC addresses. The Domain Controller can consult these mappings for a list of all Hswitch B-MAC addresses currently associated with a specific I-SID, i.e. with a specific sandbox. The Domain Controller might first try to determine if any of the Hswitches' hosts are suitable for receiving the migrating Domain (see Casey et al. referenced above). Otherwise the Domain Controller may develop a short list of B-MAC addresses of hosts that, according to its criteria, are suitable for receiving the migrating Domain, and may send the short list to the UFC. The UFC could then choose one host from the short list based on criteria such as minimising the number of new [B-VID, B-DA => port] mappings that the UFC has to install. For example, the UFC might choose a host in a pod that already has multiple members of the community of interest in it. Alternatively, the UFC may try to avoid hosts for which the local links are more congested than others (assuming that the UFC has a background activity of collecting traffic statistics). Once the UFC has made its choice, the UFC would notify the Domain Controller so that the work of moving the Domain, installing the new [B-VID, B-DA => port] mappings, and updating the LISP mapping database can proceed concurrently.

[0130] Finally, it should be noted that SPAGs are not limited in scope to single data centers. Rather, when there are multiple communication links between a group of data centers, packet switched paths could be constructed that span between data centers, and the inter-data center paths could be aggregated into resilient, load spreading inter-data-center SPAGs. The multiple communication links between two data centers may comprise wavelength channels (sometimes called "lambdas") in the same owned fiber, multiple fibers (preferably diversely routed) or a purchased service such as MPLS pseudo-wires or Metro Ethernet E-Lines. The packet switched paths may be homogeneous (e.g. Ethernet bridged both within the data centers and between data centers), or they could be heterogeneous, with Ethernet in the data centers and LSPs between the data centers.

[0131] The inter-data-center SPAGs could be used in the realization of a single Layer 2 underlay fabric but, given the difference in cross section of bandwidth within a data center compared to bandwidth between data centers, it would likely be advantageous to deploy virtual pods with only a small number of virtual pods having hosts in more than one data center. Advantageously, inter-data center virtual pods, when combined with domain migration mechanisms, provide a method for the orderly transfer of the complete live load of a data center to other data centers (as might be required when, for example, a typhoon is forecast to hit the originating data center). If a virtual pod is created at the receiving data center and merged with a virtual pod at the originating data center, then, after the full mesh of SPAGs is installed for the merged virtual pod, individual domains can be migrated from the originating data center to the receiving data center while still maintaining inter-communication with the other members of their sandboxes. Once all the domains of all the sandboxes assigned to the virtual pod have been migrated to the receiving data center, the hosts at the originating data center can be removed from the virtual pod, with the reclamation of the ESPs of the SPAGs between them and between them and the newly formed receiving data center virtual pod.

[0132] Although the invention has been described with reference to certain specific embodiments, various modifications thereof will be apparent to those skilled in the art without departing from the spirit and scope of the invention as outlined in the claims appended hereto.