Title:
HIGHLY-SCALABLE, SOFTWARE-DEFINED, IN-NETWORK MULTICASTING OF LOAD STATISTICS DATA
Document Type and Number:
WIPO Patent Application WO/2020/247400
Kind Code:
A1
Abstract:
In an embodiment, a computer-implemented method for highly-scalable, in-network multicasting of statistics data is disclosed. In an embodiment, a method comprises: receiving, from an underlay controller, a match-and-action table that is indexed using one or more multicast ("MC") group identifiers and includes one or more special MC headers; detecting a packet carrying statistics data; determining whether the packet includes an MC group identifier; in response to determining that the packet includes the MC group identifier: using the MC group identifier, retrieving a special MC header, of the one or more special MC headers, from the match-and-action table; generating an encapsulated packet by encapsulating the packet with the special MC header; and providing the encapsulated packet to an interface controller for transmitting the encapsulated packet to one or more physical switches.

Inventors:
SHAHBAZ MUHAMMAD (US)
HIRA MUKESH (US)
SURESH LALITH (US)
Application Number:
PCT/US2020/035768
Publication Date:
December 10, 2020
Filing Date:
June 02, 2020
Assignee:
VMWARE INC (US)
International Classes:
H04L12/18; H04L45/16
Foreign References:
US9432204B2 (2016-08-30)
US20150169345A1 (2015-06-18)
US20190036771A1 (2019-01-31)
Other References:
MUHAMMAD SHAHBAZ ET AL: "Elmo: Source-Routed Multicast for Cloud Services", arXiv.org, Cornell University Library, 201 Olin Library, Cornell University, Ithaca, NY 14853, 27 February 2018 (2018-02-27), XP081232197
MUHAMMAD SHAHBAZ: "Elmo: Source-Routed Multicast for Cloud Services", arXiv:1802.09815v1, 27 February 2018 (2018-02-27)
Attorney, Agent or Firm:
HEYMAN, Leonard E. et al. (US)
Claims:
CLAIMS

What is claimed is:

1. A computer-implemented method for highly-scalable, in-network multicasting of load statistics data, the method comprising:

receiving, from an underlay controller, a match-and-action table that is indexed using one or more multicast (“MC”) group identifiers and includes one or more special MC headers;

detecting a packet carrying statistics data;

determining whether the packet includes an MC group identifier;

wherein the MC group identifier identifies an MC group that includes one or more recipients of the statistics data;

in response to determining that the packet includes the MC group identifier:

using the MC group identifier, retrieving a special MC header, of the one or more special MC headers, from the match-and-action table;

generating an encapsulated packet by encapsulating the packet with the special MC header; and

providing the encapsulated packet to an interface controller for transmitting the encapsulated packet to one or more physical switches.

2. The computer-implemented method of Claim 1, further comprising: causing a physical switch, of the one or more physical switches, to:

detect the encapsulated packet that includes the special MC header;

determine whether the special MC header includes an identifier of the physical switch; and in response to determining that the special MC header includes an identifier of the physical switch: use a bitmap associated, in the special MC header, with the identifier of the physical switch to determine whether to replicate the encapsulated packet on any output port of the physical switch.

3. The computer-implemented method of Claim 2, further comprising: in response to determining one or more output ports of the physical switch on which to replicate the encapsulated packet, causing the physical switch to replicate the encapsulated packet on the one or more output ports.

4. The computer-implemented method of Claim 2, further comprising: in response to determining that the special MC header does not include an identifier of the physical switch, causing the physical switch to lookup a flow table using a tenant identifier and a destination IP address included in the special MC header to access and retrieve an s-rule, and use the s-rule to determine whether to replicate the encapsulated packet on any output port of the physical switch.

5. The computer-implemented method of Claim 2, wherein the statistics data comprises load statistic data; wherein a bitmap row of the bitmap associated, in the special MC header, with the identifier of the physical switch includes one or more bits; wherein nth bit of the one or more bits corresponds to nth output port of the physical switch; and wherein nth bit of the one or more bits indicates whether the encapsulated packet is to be replicated by the physical switch on nth output port of the physical switch.

6. The computer-implemented method of Claim 1, wherein the match-and-action table comprises one or more packet rules (“p-rules”); wherein a p-rule of the one or more p-rules indicates one or more output ports implemented on a physical switch on which the physical switch is to relay the encapsulated packet; and wherein the one or more p-rules are generated by the underlay controller based on join-group-requests and leave-group-requests.

7. The computer-implemented method of Claim 1, wherein the one or more recipients of the statistics data include one or more load balancers implemented in service virtual machines; and wherein the statistics data indicates CPU utilization of one or more virtual machines.

8. One or more non-transitory computer-readable storage media storing one or more computer instructions which, when executed by one or more processors, cause the one or more processors to perform:

receiving, from an underlay controller, a match-and-action table that is indexed using one or more multicast (“MC”) group identifiers and includes one or more special MC headers;

detecting a packet carrying statistics data;

determining whether the packet includes an MC group identifier;

wherein the MC group identifier identifies an MC group that includes one or more recipients of the statistics data;

in response to determining that the packet includes the MC group identifier:

using the MC group identifier, retrieving a special MC header, of the one or more special MC headers, from the match-and-action table;

generating an encapsulated packet by encapsulating the packet with the special MC header; and

providing the encapsulated packet to an interface controller for transmitting the encapsulated packet to one or more physical switches.

9. The one or more non-transitory computer-readable storage media of Claim 8, storing additional instructions for: causing a physical switch, of the one or more physical switches, to:

detect the encapsulated packet that includes the special MC header;

determine whether the special MC header includes an identifier of the physical switch; and in response to determining that the special MC header includes an identifier of the physical switch: use a bitmap associated, in the special MC header, with the identifier of the physical switch to determine whether to replicate the encapsulated packet on any output port of the physical switch.

10. The one or more non-transitory computer-readable storage media of Claim 9, storing additional instructions for: in response to determining one or more output ports of the physical switch on which to replicate the encapsulated packet, causing the physical switch to replicate the encapsulated packet on the one or more output ports.

11. The one or more non-transitory computer-readable storage media of Claim 9, storing additional instructions for: in response to determining that the special MC header does not include an identifier of the physical switch, causing the physical switch to lookup a flow table using a tenant identifier and a destination IP address included in the special MC header to access and retrieve an s-rule, and use the s-rule to determine whether to replicate the encapsulated packet on any output port of the physical switch.

12. The one or more non-transitory computer-readable storage media of Claim 9, wherein a bitmap row of the bitmap associated, in the special MC header, with the identifier of the physical switch includes one or more bits; wherein nth bit of the one or more bits corresponds to nth output port of the physical switch; and wherein nth bit of the one or more bits indicates whether the encapsulated packet is to be replicated by the physical switch on nth output port of the physical switch.

13. The one or more non-transitory computer-readable storage media of Claim 8, wherein the statistics data comprises load statistic data; wherein the match-and-action table comprises one or more packet rules (“p-rules”); wherein a p-rule of the one or more p-rules indicates one or more output ports implemented on a physical switch on which the physical switch is to relay the encapsulated packet; and wherein the one or more p-rules are generated by the underlay controller based on join-group-requests and leave-group-requests.

14. The one or more non-transitory computer-readable storage media of Claim 8, wherein the one or more recipients of the statistics data include one or more load balancers implemented in service virtual machines; and wherein the statistics data indicates CPU utilization of one or more virtual machines.

15. A hypervisor implemented in a computer host and configured to implement mechanisms for highly-scalable, in-network multicasting of statistics data, the hypervisor comprising: one or more processors;

one or more memory units; and

one or more non-transitory computer-readable storage media storing one or more computer instructions which, when executed by the one or more processors, cause the one or more processors to perform:

receiving, from an underlay controller, a match-and-action table that is indexed using one or more multicast (“MC”) group identifiers and includes one or more special MC headers;

detecting a packet carrying statistics data;

determining whether the packet includes an MC group identifier;

wherein the MC group identifier identifies an MC group that includes one or more recipients of the statistics data;

in response to determining that the packet includes the MC group identifier:

using the MC group identifier, retrieving a special MC header, of the one or more special MC headers, from the match-and-action table;

generating an encapsulated packet by encapsulating the packet with the special MC header; and

providing the encapsulated packet to an interface controller for transmitting the encapsulated packet to one or more physical switches.

16. The hypervisor of Claim 15, storing additional instructions for: causing a physical switch, of the one or more physical switches, to:

detect the encapsulated packet that includes the special MC header;

determine whether the special MC header includes an identifier of the physical switch; and in response to determining that the special MC header includes an identifier of the physical switch: use a bitmap associated, in the special MC header, with the identifier of the physical switch to determine whether to replicate the encapsulated packet on any output port of the physical switch.

17. The hypervisor of Claim 16, storing additional instructions for: in response to determining one or more output ports of the physical switch on which to replicate the encapsulated packet, causing the physical switch to replicate the encapsulated packet on the one or more output ports.

18. The hypervisor of Claim 16, storing additional instructions for: in response to determining that the special MC header does not include an identifier of the physical switch, causing the physical switch to lookup a flow table using a tenant identifier and a destination IP address included in the special MC header to access and retrieve an s-rule, and use the s-rule to determine whether to replicate the encapsulated packet on any output port of the physical switch.

19. The hypervisor of Claim 16, wherein the statistics data comprises load statistic data; wherein a bitmap row of the bitmap associated, in the special MC header, with the identifier of the physical switch includes one or more bits; wherein nth bit of the one or more bits corresponds to nth output port of the physical switch; and wherein nth bit of the one or more bits indicates whether the encapsulated packet is to be replicated by the physical switch on nth output port of the physical switch.

20. The hypervisor of Claim 15, wherein the match-and-action table comprises one or more packet rules (“p-rules”); wherein a p-rule of the one or more p-rules indicates one or more output ports implemented on a physical switch on which the physical switch is to relay the encapsulated packet; and wherein the one or more p-rules are generated by the underlay controller based on join-group-requests and leave-group-requests.

Description:
HIGHLY-SCALABLE, SOFTWARE-DEFINED, IN-NETWORK MULTICASTING OF LOAD STATISTICS DATA

Muhammad Shahbaz, Mukesh Hira, and Lalith Suresh

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] This application claims priority to, and the benefit of, pending U.S. Non-Provisional Patent Application No. 16/432,477, filed on June 05, 2019, and entitled “HIGHLY SCALABLE, SOFTWARE-DEFINED, IN-NETWORK MULTICASTING OF LOAD STATISTICS DATA”, which is incorporated herein by reference in its entirety.

BACKGROUND

[0002] Datacenter applications frequently exhibit one-to-many communication patterns and often require the network fabric to provide high throughput with minimal latency. These requirements can usually be met by conventional IP multicast networks. However, conventional IP multicast solutions suffer from scalability limitations and thus cannot support IP multicast traffic among, for example, hundreds of thousands of tenants. Because typical cloud network environments today must support hundreds of thousands of tenants, IP multicast networks rely on additional services to meet that requirement. Examples of such additional services include overlay multicast solutions, such as overlay IP multicast. In a cloud-based datacenter, however, overlay IP multicast is not scalable. Other known solutions are usually unicast-based and therefore introduce overheads and high latencies that negatively impact network throughput and CPU utilization.

[0003] Implementations of overlay IP multicast solutions in network switches are usually based on IP multicast routing protocols such as Protocol-Independent Multicast (“PIM”) and membership reporting protocols such as the Internet Group Management Protocol (“IGMP”). Implementations of those protocols, however, have several drawbacks, including scalability limitations. The switching chips that implement the multicast protocols can usually support a few thousand multicast groups, but not the hundreds of thousands of multicast groups required by today’s networks. Thus, the switching chips do not scale sufficiently to meet the demands of today’s multi-tenant cloud environments.

[0004] Another drawback of the PIM/IGMP multicast implementations is control plane churn. The churn is usually caused by the significant processing and communications overheads that occur when multicast group members join and leave multicast groups and the membership data is rapidly updated by the control plane to reflect those changes. Control plane churn is particularly undesirable when many multicast groups are implemented across a large number of tenants.

[0005] There have been some proposals in the research community to encode multicast distribution trees for multicast groups in communications packets. However, those approaches are based on complex encodings, such as Bloom filters, which unfortunately introduce significant latency.

SUMMARY

[0006] In an embodiment, the approach is implemented in software-defined networks (“SDNs”). In an SDN, an underlay-network controller may receive MC group join and leave requests from virtual machine tenants of multicast groups via an application programming interface (“API”). A virtual machine tenant may be an individual who accesses the SDN system to set up the multicasting environment and enable multicast communications in the SDN network. A request to join an MC group may include, for example, the MC group identifier, network addresses of the recipients of the MC group communications, and the like. Networks usually expose such APIs to the tenants and allow the tenants to request virtual machines, load balancers, firewalls, and other services. Different ways of defining an MC group using the underlay-network controller are described in, for example, the paper entitled “Elmo: Source-Routed Multicast for Cloud Services” by Muhammad Shahbaz et al., dated February 27, 2018 (arXiv:1802.09815v1), which is incorporated herein by reference in its entirety.
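
For illustration only, a join-group request of the kind described above might be represented as follows. This is a minimal sketch; the field names and values are hypothetical assumptions, not the actual API of the disclosed system.

```python
# Hypothetical illustration of MC group join/leave requests as described above.
# Field names ("action", "group_id", "tenant_id", "recipients") are assumptions,
# not part of the disclosed API.
join_request = {
    "action": "join",
    "group_id": 4242,                        # MC group identifier
    "tenant_id": "tenant-17",                # requesting virtual machine tenant
    "recipients": ["10.0.1.5", "10.0.2.9"],  # addresses of MC group recipients
}

leave_request = {"action": "leave", "group_id": 4242, "tenant_id": "tenant-17"}
```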

[0007] In an embodiment, an underlay-network controller generates and distributes match-action rules and the associated multicast group identifiers to the virtual switches as requested by the virtual machine tenants. The match-action rules are described later. The underlay-network controller may install, in the virtual switches, the corresponding special MC headers to be later encoded in the MC packets. That information, along with the network topology information, may be used by the virtual switches to recognize MC packets in data traffic, and forward the MC packets to the intended recipients. Typically, the underlay-network controller already maintains the information about the network topology, switches, and ports. The underlay-network controller may collect that information by performing various monitoring and management tasks such as the VM placement, load balancing, and others.

[0008] In an embodiment, multicast join and leave requests are received by a cloud manager. The cloud manager may receive the requests via an API. Cloud providers usually expose such an API to tenants to allow the tenants to request VMs, load balancer services, firewall services, and other services. A multicast group definition may include addresses of the tenants’ VMs that belong to the MC group. The manager is usually provided with the information about the physical locations of the tenants’ VMs, as well as with the information about the current network topology. The manager may also be provided with the information about the capabilities of the switches and unique identifiers of the switches that may be used to address the switches. Based on the received information, the manager computes the MC trees for the MC groups defined by the tenants, and uses the functionalities of a high-level language, such as the P4 language, to program the switches. The manager may also use the functionalities of the P4 Runtime API to control the interfaces of the switches and to install match-action rules in the switches. Once programmed, the switches are enabled to identify MC packets in packet flows and direct the MC packets to appropriate service VMs, including, for example, load balancer VMs and the like. Later, if the cloud manager is notified about occurrences of certain network events, such as communications link failures or group membership changes, the cloud manager may compute new match-action rules and send the updated match-action rules to the affected switches.
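
As a non-authoritative sketch of the tree-to-rule computation described above, a manager could derive, for each switch on a multicast distribution tree, a bitmap of output ports leading toward the group members. The tree representation and bit layout below are illustrative assumptions, not the encoding used by the described system.

```python
# Minimal sketch: derive per-switch output-port bitmaps from a multicast tree.
# The (switch_id, output_port) tree representation is an illustrative assumption.
def compute_port_bitmaps(tree_links):
    """tree_links: iterable of (switch_id, output_port) pairs lying on the
    multicast distribution tree.  Returns {switch_id: bitmap}, where bit n of
    a bitmap is set if the switch must replicate the packet on port n."""
    bitmaps = {}
    for switch_id, port in tree_links:
        bitmaps[switch_id] = bitmaps.get(switch_id, 0) | (1 << port)
    return bitmaps

# Example: switch "tor-1" forwards on ports 0 and 3, "spine-2" on port 1.
print(compute_port_bitmaps([("tor-1", 0), ("tor-1", 3), ("spine-2", 1)]))
# {'tor-1': 9, 'spine-2': 2}
```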

[0009] In an embodiment, since the MC membership groups are configured by underlay-network controllers and/or cloud managers, and not by overlay network entities, the process of configuring the MC groups and enabling the routing of MC communications to the members of the MC groups is performed more efficiently and faster than using traditional approaches. For example, since the underlay-network controllers and/or cloud managers are provided with the information about the network topology and the information about the MC groups, they may configure the switches and the special MC headers without generating the MC-related overlay traffic or processing the MC-related overlay traffic. Furthermore, configuring the switches to handle the MC-related traffic involves programming the underlay-network controllers and/or cloud managers with relatively compact program code. This is not achievable using conventional approaches. Moreover, as described below, the presented approach allows including packet forwarding bitmaps in MC packets themselves, and thus eliminates the need for the switches to maintain MC group forwarding tables, which traditionally are large and time-consuming to access.

[0010] A virtual machine tenant may use an API to issue requests in which the tenant may specify the groups it wants to join or leave. Upon receiving the requests via the API, an underlay-network controller or a cloud manager may generate the necessary rules and install the rules on the hypervisor switches. Once this is done, the controller/manager may notify the tenant to cause the tenant to start transmitting the MC traffic using its virtual machines.

[0011] In an embodiment, a highly scalable, low-latency, in-network multicast approach for encoding multicast trees of multicast groups in headers of communications packets is disclosed. In the approach, multicast (“MC”) packets are encapsulated with special MC headers that include MC tree information. The payloads of the MC packets are used to distribute MC data to members of the MC groups. In an embodiment, the payloads of the MC packets are used to distribute load statistics data of virtual machines (“VMs”) from, for example, service VMs executing on host computers to load balancers. Based on the received load statistics data, the load balancers determine how to distribute tasks to the service VMs. The load balancers may be implemented in service virtual machines (“SVMs”).
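
Purely as an illustrative sketch, and not as part of the claimed method, a load balancer that receives such load statistics might steer the next task to the least-loaded service VM:

```python
# Illustrative only: choose the service VM with the lowest reported CPU
# utilization, based on load statistics received over the MC group.
def pick_target(cpu_stats):
    """cpu_stats: {vm_name: cpu_utilization_percent} built from received MC packets."""
    return min(cpu_stats, key=cpu_stats.get)

print(pick_target({"svm-1": 72.0, "svm-2": 35.5, "svm-3": 58.0}))  # svm-2
```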

[0012] The in-network MC approach for encoding MC tree information in special MC headers of MC packets enables cooperation among an underlay controller, a central control plane, virtual switches, and physical switches. While conventionally a central control plane is used to set up MC groups, in the presented approach an underlay controller is used to set up the MC groups, including setting up the MC groups for communicating load statistics data. A central control plane is a software program, or a set of programs, that may be configured to, among other things, receive, via an overlay network, multicast group join and leave requests. An underlay controller uses an API to receive, via an underlay network, MC group join and leave requests, and based on the received requests, the underlay controller generates a match-and-action table that includes packet rules used to generate special MC headers.

[0013] The special MC headers are used by virtual switches implemented in hypervisors to encapsulate the MC packets. A virtual switch is a software program, or set of programs, that allows VMs, containers, or other workloads, to communicate with each other. The virtual switch may be part of a hypervisor, as it is described in detail later. A hypervisor is computer software, firmware or hardware that creates and runs VMs, which may in turn execute a plurality of containers. A computer on which the hypervisor executes the VMs is called a host. The hypervisor provides guest operating systems and virtual operating platforms to the VMs, and manages the execution of the guest operating systems. A physical switch is usually a programmable physical device configured to switch packets in a physical computer network.

[0014] In an embodiment, a virtual switch of a hypervisor encapsulates an MC packet with a special MC header that includes MC distribution tree information of the MC group. That information otherwise would have to be stored in advance in each physical switch in the network and for each MC group. However, in the presented approach, the MC tree information is encapsulated inside the packets themselves and distributed by the packets as the packets are routed, and thus, the physical switches do not need to generate and maintain data structures for storing the MC group information for millions of MC groups.

[0015] MC tree information may be encoded using bitmaps included in special MC headers of packets, and physical switches use the bitmaps to switch the packets according to the bitmap information. For example, upon receiving an MC packet that includes a special MC header and an identifier of an MC group, a physical switch uses a bitmap included in the special MC header to determine how to switch the packet so that it reaches the members of the MC group.

[0016] In an embodiment, upon receiving an MC packet that includes a special MC header, a physical switch parses the special MC header and determines whether an identifier of the switch is included in the MC header. If it is, then the physical switch extracts a corresponding packet rule (“p-rule”) from the special MC header, determines the output ports based on the p-rule, and replicates the MC packet on the ports indicated by the p-rule. However, if the switch does not find, in an MC header of the MC packet, a p-rule associated with the switch’s identifier, then the switch may rely on so-called s-rules.

[0017] An s-rule is defined by an underlay controller and provided to a physical switch by the controller. The s-rule indicates the ports implemented on the switch on which the switch should relay a packet that belongs to an MC group. The s-rules are used to provide support for the presented approach when a bitmap itself would be too large to be included in a special MC header as the size of MC headers is usually limited. This may happen when the multicast tree includes a vast number of physical switches and a bitmap that would encode all p-rules for all switches would be too large to be included in a special MC header. Hence, if the switch does not find its own identifier (or a p-rule) in a special MC header of the MC packet, then the switch may perform a lookup in a flow table using an identifier of a tenant and a destination IP address included in the header of the MC packet to access and retrieve the corresponding s-rule. Based on the s-rule, the switch may determine the switch’s output ports on which to replicate the MC packet.
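
The p-rule/s-rule decision described in the preceding paragraphs can be sketched as follows; the data structures, field names, and lookup key are assumptions made for illustration rather than the switch's actual pipeline.

```python
# Sketch of the p-rule / s-rule forwarding decision described above (illustrative only).
def output_ports(bitmap):
    """Expand a port bitmap into a list of output-port numbers."""
    return [n for n in range(bitmap.bit_length()) if bitmap & (1 << n)]

def forward(my_switch_id, p_rules, s_rule_table, tenant_id, dst_ip):
    """p_rules: {switch_id: bitmap} carried in the special MC header.
    s_rule_table: {(tenant_id, dst_ip): bitmap} installed by the underlay controller."""
    if my_switch_id in p_rules:                     # p-rule found in the header
        return output_ports(p_rules[my_switch_id])
    s_rule = s_rule_table.get((tenant_id, dst_ip))  # fall back to an s-rule
    return output_ports(s_rule) if s_rule is not None else []

# Example: the header carries no p-rule for "spine-2", so the s-rule applies.
print(forward("spine-2", {"tor-1": 0b1001},
              {("tenant-17", "239.1.1.1"): 0b0110},
              "tenant-17", "239.1.1.1"))  # [1, 2]
```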

BRIEF DESCRIPTION OF THE DRAWINGS

[0018] In the drawings:

[0019] FIG. 1 is a block diagram depicting an example physical implementation view of an example logical network environment for a highly-scalable, software-defined, in-network multicasting of load statistics data.

[0020] FIG. 2A is a block diagram depicting an example arrangement of physical switches.

[0021] FIG. 2B is a block diagram depicting an example special MC header.

[0022] FIG. 3 is a time chart depicting an example process for a highly-scalable, software-defined, in-network multicasting of load statistics data.

[0023] FIG. 4 is an example flow chart for a highly-scalable, software-defined, in-network multicasting of load statistics data.

DETAILED DESCRIPTION

[0024] In the following description, for the purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the method described herein. It will be apparent, however, that the present approach may be practiced without these specific details. In some instances, well-known structures and devices are shown in a block diagram form to avoid unnecessarily obscuring the present approach.

[0025] 1. EXAMPLE PHYSICAL IMPLEMENTATIONS

[0026] FIG. 1 is a block diagram depicting an example physical implementation view of an example logical network environment for a highly-scalable, software-defined, in-network multicasting of load statistics data. In the depicted example, environment 100 includes a management plane 10, a central control plane 20 (also referred to as an SDN central control plane 20), an underlay controller 21, a plurality of hosts 110, 1110, and a physical network 190. In an embodiment, management plane 10 and central control plane 20 may be implemented as distributed or clustered systems and may be combined into a single combined management/central controller server or cluster.

[0027] Host 110/1110 may include a hypervisor 160/1160, hardware components 180/1180, and other components not depicted in FIG. 1. Hypervisor 160/1160 may be implemented as a software layer that supports execution of multiple virtualized computing instances of virtual machines. Hypervisor 160/1160 may use uplinks 170/1170 to provide connectivity to and from physical network interface controllers (“PNICs”) 182/1182, respectively. Hypervisor 160/1160 may include a virtual switch 140/1140 and may provide connectivity to and from one or more virtual machines, including a VM 120/1120. Each VM may itself host multiple endpoints or workloads, such as namespace containers (not shown). Each workload has one or more overlay IP addresses associated with it and is therefore reachable over a logical overlay network.

[0028] Hardware components 180/1180 may include hardware processors, memory units, data storage units, and physical network interfaces, some of which are not depicted in FIG. 1. Hardware components 180/1180 may also include PNICs 182/1182, respectively, which may provide connectivity to routers and switches of physical network 190.

[0029] Virtual switch 140/1140, or a separate software component (not shown) operating in conjunction therewith, may be configured to monitor and manage data traffic that is communicated to and from hypervisor 160/1160, respectively. Implementations of virtual switch 140/1140 may vary and may depend on a type of product in which the switch is deployed as a virtualization medium. For example, virtual switch 140/1140 may be implemented as part of hypervisor 160/1160, as it is depicted in FIG. 1, and as it is in the vSphere® and KVM® hypervisors. Alternatively, although not depicted in FIG. 1, virtual switch 140/1140 may be implemented as a hardware component, or with hardware assistance, or as part of a user space, or within a privileged virtual machine, often referred to as a Domain Zero or Root Partition. Examples of such architectures include the Hyper-V® and Xen® hypervisors.

[0030] VMs 120 and 1120 may be realized as complete computational environments. VMs 120 and 1120 conceptually contain virtual equivalents of hardware and software components of the physical computing systems. VMs 120 and 1120 may be instantiated as virtualized computing instances. The instances may be equipped with their own resources, may be assigned their own workloads, and may be configured to perform their own tasks assigned to the workloads. Virtual resources associated with VMs 120 and 1120 may include virtual CPUs, virtual memory, virtual disks, virtual network interface controllers and the like. VMs 120 and 1120 may be configured to execute guest operating systems and guest applications.

[0031] Physical network 190 may include local area networks and/or wide area networks and may utilize various hardware and software configurations. For example, physical network 190 may include one or more routers (not shown), one or more switches 195, 196 and 197, and one or more switch ports 191, 1191. Physical switches 195, 196 and 197 may be programmable physical switches, and thus may be configured to receive and store s-rules, parse special MC headers, and determine, based on the special MC headers and s-rules, whether and how to relay the MC packets on ports of the physical switches.

[0032] 1.1. MANAGEMENT PLANE

[0033] Management plane 10 is a software application, or a set of applications, which, when executed, is used to manage and monitor network services configured on entities of the overlay network. Management plane 10 may be configured to, for example, receive conventional configuration requests from MC group members, and process the requests to generate configuration instructions. A request for a VM to join a logical overlay network multicast group may require the host, and more precisely the tunnel endpoint of the host on which the VM runs, to join a corresponding underlay network multicast group. Details of this relationship are described by way of example in U.S. Patent 9,432,204, invented by Jianjun Shen et al. and granted August 30, 2016, which is incorporated herein by reference in its entirety. Management plane 10 may also be configured to send the configuration instructions to central control plane 20 to instruct plane 20 to generate, based on the instructions, MC membership groups, and implement the MC groups on host computers 110 and 1110.

[0034] 1.2. CENTRAL CONTROL PLANE

[0035] Central control plane 20 is a software application, or a set of software applications, which, when executed, is used to receive configuration files and instructions from management plane 10 and use the received data to configure and control entities in the network.

[0036] 1.3. UNDERLAY CONTROLLER

[0037] Underlay controller 21 is a software application, or a set of software applications, which, when executed, is used to create and modify MC groups and provide information to virtual switches to facilitate communications between members of the MC groups. Underlay controller 21 may receive MC group join and leave requests from virtual machine tenants of MC groups via an API. A request to join an MC group may include, for example, the MC group identifier, network addresses of the recipients of the MC group communications, and the like. The APIs may be used to allow the tenants to request virtual machines, load balancers, firewalls, and other services. Upon receiving the request, underlay controller 21 may include the requestor in the MC group. Using the topology information, underlay controller 21 may also determine information for routing packets to the members of the MC group. Furthermore, underlay controller 21 may generate rule configuration instructions, and generate a match-and-action table, p-rules and s-rules. Underlay controller 21 may transmit the match-and-action table to virtual switches 140 and 1140 to request implementing the tables and rules on the hosts. Furthermore, underlay controller 21 may transmit the s-rules to physical switches 191, 195, 196, 197 and 1191 implemented in physical network 190 via connections 112 and 1112, respectively.
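
A minimal, non-authoritative sketch of the controller-side bookkeeping described above is shown below; the class and the table layout are illustrative assumptions rather than underlay controller 21's actual implementation.

```python
# Illustrative sketch: an underlay controller tracking MC group membership and
# emitting a match-and-action table indexed by MC group identifier.
class UnderlayControllerSketch:
    def __init__(self):
        self.members = {}  # group_id -> set of recipient addresses

    def join(self, group_id, recipient):
        self.members.setdefault(group_id, set()).add(recipient)

    def leave(self, group_id, recipient):
        self.members.get(group_id, set()).discard(recipient)

    def build_match_action_table(self, header_for_group):
        """header_for_group: callable that turns a member set into a special MC
        header (p-rules); treated here as a black box."""
        return {gid: header_for_group(m) for gid, m in self.members.items()}

ctrl = UnderlayControllerSketch()
ctrl.join(4242, "10.0.1.5")
ctrl.join(4242, "10.0.2.9")
table = ctrl.build_match_action_table(lambda members: {"recipients": sorted(members)})
print(table)  # {4242: {'recipients': ['10.0.1.5', '10.0.2.9']}}
```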

[0038] 2. EXAMPLE ARRANGEMENT OF PHYSICAL SWITCHES

[0039] FIG. 2A is a block diagram depicting an example arrangement of physical switches. The depicted arrangement includes one or more top-of-rack (“TOR”) physical switches 212, 214, and 216, one or more spine physical switches 222, 224 and 226, and a core physical switch 232. The arrangement may include multiple layers of spine switches. Spine switches 222, 224 and 226 may be connected to core switch 232 via ports (not shown).

[0040] In an embodiment, TOR physical switches 212, 214, and 216 detect packets, such as packets 200, 201, and 202, parse the detected packets to determine output ports, if any, and replicate the received packets on the determined output ports so that spine physical switches 222, 224 and/or 226 may detect the packets.

[0041] 3. MULTICASTING LOAD STATISTICS DATA

[0042] In an embodiment, a highly-scalable, in-network multicasting approach is used by computer hosts to multicast load statistics of VMs to load balancers (not shown). The load statistics of the VMs may include CPU utilization by the VMs. Based on the received load statistics, the load balancers may determine to which VMs the traffic should be directed. The load balancers may be implemented in SVMs or other service devices.

[0043] In an embodiment, agent software modules (not shown in FIG. 1) implemented in hosts 110 and 1110 receive load statistics of VMs 120 and 1120 from hypervisors 160 and 1160, as hypervisors 160 and 1160 usually have the load statistics available. The agents generate MC packets, include the load statistics in the payloads of the packets, include an MC group identifier of the load balancer MC group in the headers of the packets, and multicast the packets to the load balancers.
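
For illustration, an agent of the kind described above might assemble a statistics packet as sketched below; the four-byte group-identifier prefix and JSON payload are simplifying assumptions, not the actual wire format.

```python
import json
import struct

# Illustrative sketch: an agent wrapping VM load statistics in an MC packet.
# The 4-byte group-identifier prefix is an assumed layout, not the real format.
def build_stats_packet(mc_group_id, cpu_stats):
    payload = json.dumps(cpu_stats).encode()         # load statistics as the payload
    return struct.pack("!I", mc_group_id) + payload  # MC group identifier up front

pkt = build_stats_packet(4242, {"vm-120": 41.5, "vm-1120": 63.0})
print(struct.unpack("!I", pkt[:4])[0])  # 4242
```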

[0044] 4. HANDLING SPECIAL MC HEADERS BY PHYSICAL SWITCHES

[0045] In an embodiment, a packet comprises a VXLAN header followed by a special MC header. The special MC header is used to program a physical switch to handle MC packets. The special MC header includes the MC distribution tree information that otherwise would have to be provided to each of the physical switches and stored by each of the switches.

[0046] Upon detecting a packet having a special MC header, a physical switch determines whether its own identifier is included in the special MC header. If the switch identifier is included in the special MC header, then the switch retrieves, from the special MC header, a bitmap row associated with the switch identifier, and replicates the packet on the ports that are marked with “1” in the bitmap row. But if the switch’s identifier is not included in the special MC header, then the switch may use s-rules, as described above.

[0047] FIG. 2B is a block diagram depicting an example special MC header 250. A special MC header of an MC packet may include identifiers 252 of physical switches and a bitmap 254. A bitmap row of bitmap 254 is associated with an identifier of identifiers 252. Bitmap 254 encodes an MC tree for an MC group by indicating, to the physical switches, the ports on which the physical switches should relay the MC packet so that the packet can reach the members of the MC group.

[0048] In an embodiment, a row of bitmap 254 corresponds to a p-rule. A p-rule indicates the ports implemented on a physical switch on which the switch should relay the packet. For a given identifier of identifiers 252, bitmap 254 includes a row of binary data that indicates on which ports, if any, the corresponding physical switch needs to place a received MC packet. In an embodiment, “1” in the nth bit of a row of bitmap 254 that corresponds to a particular physical switch indicates that the particular switch should replicate the received MC packet on its nth output port to relay the packet, while “0” in the mth bit of such a row indicates that the particular switch should not replicate the received MC packet on its mth output port.

[0049] Typically, a switch identifier is assigned to one individual physical switch, and thus a bitmap row corresponding to the physical switch includes instructions for all ports of the switch.
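
A toy representation of the header layout of FIG. 2B, using the nth-bit convention of paragraph [0048], might look as follows; the concrete layout and the switch identifiers are assumptions made for illustration.

```python
# Toy model of a special MC header as in FIG. 2B: switch identifiers (252)
# paired with per-switch bitmap rows (254).  Bit n of a row addresses port n.
special_mc_header = {
    "tor-1":   0b00001001,  # replicate on ports 0 and 3
    "spine-2": 0b00000010,  # replicate on port 1
}

def ports_from_row(row):
    """List the output ports whose bits are set to 1 in a bitmap row."""
    return [n for n in range(row.bit_length()) if (row >> n) & 1]

for switch_id, row in special_mc_header.items():
    print(switch_id, "->", ports_from_row(row))
# tor-1 -> [0, 3]
# spine-2 -> [1]
```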

[0050] In other implementations, one switch identifier may be assigned to a plurality of physical switches. If a switch identifier is assigned to a plurality of physical switches, then an MC packet having a special MC header that includes such an identifier is switched to each switch in the plurality, and each of those switches uses its own s-rules to determine whether it needs to relay the MC packet on its ports.

[0051] If a physical switch is a Layer 2 (“L2”) programmable switch, then the switch can use the disclosed approach. A programmable L2 switch is configured to parse an Ethernet header of a detected packet and determine whether the Ethernet header has a certain flag set to indicate that the packet includes a special MC header. If that flag is set, then the switch accesses and extracts a special MC header and uses the special MC header to determine how to replicate the detected packet.

[0052] If a physical switch is a legacy switch, then the switch is not configured to identify special MC headers in detected packets. A legacy switch may ignore the special MC headers and rely on its MC group table. Since legacy switches may maintain only relatively small MC tables, they can become bottlenecks in implementing highly scalable multicasting.

[0053] 5. HIGHLY-SCALABLE, SOFTWARE-DEFINED, IN-NETWORK MULTICASTING

[0054] FIG. 3 is a time chart depicting an example process for a highly-scalable, software-defined, in-network multicasting of load statistics data. The processing is performed by underlay controller 21, hypervisor 160/1160, and physical switch 212/214/216/222/224/226/232.

[0055] 5.1. PROCESSING PERFORMED BY AN UNDERLAY CONTROLLER

[0056] In an embodiment, underlay controller 21 receives multicast group join and leave requests from potential and present members of MC groups. Based on the requests, underlay controller 21 may generate a match-and-action table that can be indexed using MC group identifiers and that includes special MC headers containing p-rules. The members of the MC groups may include load balancers that want to receive load statistics of VMs. Underlay controller 21 may send the match-and-action tables with the special MC headers to the hypervisors.

[0057] In an embodiment, underlay controller 21 generates s-rules and transmits them to virtual switches 140 and 1140, and physical switches 212, 214, 216, 222, 224, 226, and 232.

[0058] 5.2. PROCESSING PERFORMED BY A HYPERVISOR

[0059] FIG. 4 is an example flow chart for a highly-scalable, software-defined, in-network multicasting of load statistics data. The steps described in FIG. 4 may be performed by a hypervisor, such as hypervisor 160/1160.

[0060] In step 402, a hypervisor receives and stores a match-and-action table that includes special MC headers.

[0061] In step 404, the hypervisor detects a packet. Upon detecting the packet, the hypervisor determines whether the packet is an MC packet. For example, the hypervisor may parse the packet header, and determine whether the header includes an MC group identifier. If, in step 408, the hypervisor determines that the packet is an MC packet, then the hypervisor proceeds to step 410; otherwise, the hypervisor proceeds to step 409, in which the hypervisor processes the packet conventionally.

[0062] In step 410, upon determining that the packet is an MC packet, the hypervisor or a virtual switch implemented in the hypervisor, uses the MC group identifier to retrieve a special MC header from the match-and-action table. The special MC header includes p-rules, which include identifiers of physical switches and bitmap rows associated with the identifiers.

[0063] In step 412, the hypervisor, or the virtual switch of the hypervisor, encapsulates the MC packet with the special MC header.

[0064] In step 414, the hypervisor provides the encapsulated packet to a network interface controller (“NIC”) to allow the NIC to send the packet to physical network 190. Hence, the special MC header is prepended to a packet before the NIC encapsulates the packet with additional headers to facilitate sending the packet to network 190.
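
Steps 402 through 414 can be summarized in the following sketch; the function and field names are illustrative assumptions rather than the hypervisor's actual interfaces.

```python
# Illustrative sketch of steps 402-414: classify, look up, encapsulate, hand off.
def process_packet(packet, match_action_table, send_to_nic):
    """packet: dict with an optional 'mc_group_id' field; match_action_table maps
    MC group identifiers to special MC headers (the table stored in step 402)."""
    group_id = packet.get("mc_group_id")           # steps 404-408: is it an MC packet?
    if group_id is None or group_id not in match_action_table:
        return "processed conventionally"          # step 409
    special_header = match_action_table[group_id]  # step 410: retrieve the special MC header
    encapsulated = {"special_mc_header": special_header, "inner": packet}  # step 412
    send_to_nic(encapsulated)                      # step 414: hand off to the NIC
    return "multicast"

sent = []
print(process_packet({"mc_group_id": 4242, "payload": b"stats"},
                     {4242: {"tor-1": 0b1001}}, sent.append))  # multicast
```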

[0065] 5.3. PROCESSING PERFORMED BY A PHYSICAL SWITCH

[0066] Referring again to FIG. 3, upon detecting an MC packet, a programmable physical switch parses an Ethernet header of the packet to determine whether a special MC header is included in the packet. If it is included, then the physical switch extracts the special MC header and determines whether the header includes an identifier of the switch and a p-rule that applies to the switch.

[0067] If the switch determines that the special MC header includes an identifier of the switch and a p-rule that applies to the switch, then the switch uses the p-rule to determine the ports on which the switch should replicate the packet, and thus relay the packet.

[0068] If the physical switch determines that the special MC header does not include an identifier of the switch, then the switch drops the packet, or uses an s-rule, as described above.

[0069] 6. IMPROVEMENTS PROVIDED BY CERTAIN EMBODIMENTS

[0070] In an embodiment, an in-network multicasting of load statistics data provides high scalability for multicasting because it provides the MC tree information to physical switches in the MC packets themselves. This relieves the physical switches of the need to maintain huge MC tables, which, even at their typical maximum sizes, are still too small to handle hundreds of thousands of MC tenants.

[0071] Including MC tree information in special MC headers and adding the MC headers to MC packets is less burdensome than having physical switches maintain huge MC tables. For example, in a datacenter topology that includes 27K hosts, a datacenter implementing the presented approach by adding special 325-byte-long MC headers to MC packets may support a million MC groups while requiring only about 1.1K multicast entries in the flow tables maintained by the physical switches.

[0072] In an embodiment, the presented approach solves the multicasting problem in today’s datacenters, which often support a vast number of multicast groups. Since typical physical switches do not have enough memory to maintain MC tables for such a vast number of groups, without the presented approach the physical switches could not handle the multicasting demands of today’s datacenters.

[0073] 7. IMPLEMENTATION MECHANISMS

[0074] The present approach may be implemented using a computing system comprising one or more processors and memory. The one or more processors and memory may be provided by one or more hardware machines. A hardware machine includes a communications bus or other communication mechanism for addressing main memory and for transferring data between and among the various components of the hardware machine. The hardware machine also includes one or more processors coupled with the bus for processing information. The processor may be a microprocessor, a system on a chip (SoC), or other type of hardware processor.

[0075] Main memory may be a random-access memory (RAM) or other dynamic storage device. It may be coupled to a communications bus and used for storing information and software instructions to be executed by a processor. Main memory may also be used for storing temporary variables or other intermediate information during execution of software instructions to be executed by one or more processors.

[0076] 8. GENERAL CONSIDERATIONS

[0077] Although some of various drawings may illustrate a number of logical stages in a particular order, stages that are not order dependent may be reordered and other stages may be combined or broken out. While some reordering or other groupings may be specifically mentioned, others will be obvious to those of ordinary skill in the art, so the ordering and groupings presented herein are not an exhaustive list of alternatives. Moreover, it should be recognized that the stages could be implemented in hardware, firmware, software or any combination thereof.

[0078] The foregoing description, for purposes of explanation, has been presented with reference to specific embodiments. However, the illustrative embodiments above are not intended to be exhaustive or to limit the scope of the claims to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The embodiments were chosen to best explain the principles underlying the claims and their practical applications, to thereby enable others skilled in the art to best use the embodiments with various modifications as are suited to the uses contemplated.

[0079] Any definitions set forth herein for terms contained in the claims may govern the meaning of such terms as used in the claims. No limitation, element, property, feature, advantage or attribute that is not expressly recited in a claim should limit the scope of the claim in any way. The specification and drawings are to be regarded in an illustrative rather than a restrictive sense.