


Title:
VERTICALLY-TIERED CLIENT-SERVER ARCHITECTURE
Document Type and Number:
WIPO Patent Application WO/2014/112975
Kind Code:
A1
Abstract:
Systems and methods of vertically aggregating tiered servers in a data center are disclosed. An example method includes partitioning a plurality of servers in the data center to form an array of aggregated end points (AEPs). Multiple servers within each AEP are connected by an intra-AEP network fabric and different AEPs are connected by an inter-AEP network. Each AEP has one or multiple central hub servers acting as end-points on the inter-AEP network. The method includes resolving a target server identification (ID). If the target server ID is the central hub server in the first AEP, the request is handled in the first AEP. If the target server ID is another server local to the first AEP, the request is redirected over the intra-AEP fabric. If the target server ID is a server in a second AEP, the request is transferred to the second AEP.

Inventors:
CHANG JICHUAN (US)
FARABOSCHI PAOLO (ES)
RANGANATHAN PARTHASARATHY (US)
Application Number:
PCT/US2013/021568
Publication Date:
July 24, 2014
Filing Date:
January 15, 2013
Assignee:
HEWLETT PACKARD DEVELOPMENT CO (US)
International Classes:
G06F9/44; G06F15/16; H04L47/22
Domestic Patent References:
WO2009158680A1 (2009-12-30)
Foreign References:
US20100195974A1 (2010-08-05)
US20020129274A1 (2002-09-12)
US20090327412A1 (2009-12-31)
US20060206747A1 (2006-09-14)
EP2189904A1 (2010-05-26)
Other References:
See also references of EP 2946304A4
Attorney, Agent or Firm:
CHANG, Marcia Ramos et al. (Intellectual Property Administration, 3404 E. Harmony Road, Mail Stop 3, Fort Collins, CO, US)
Claims:
CLAIMS

1. A method of vertically aggregating tiered servers in a data center, comprising:

partitioning a plurality of servers in the data center to form an array of aggregated end points (AEPs), wherein multiple servers within each AEP are connected by an intra-AEP network fabric and different AEPs are connected by an inter-AEP network, and each AEP has one or multiple central hub servers acting as end-points on the inter-AEP network;

resolving a target server identification (ID) for a request from an AEP-local server at a central hub server in a first AEP;

if the target server ID is the central hub server in the first AEP, handling the request at the central hub server in the first AEP and responding to the requesting server;

if the target server ID is another server local to the first AEP, redirecting the request over the intra-AEP fabric to the server local to the first AEP; and

if the target server ID is a server in a second AEP, transferring the request to the second AEP.

2. The method of claim 1, wherein partitioning the plurality of servers is based on communication patterns in the data center, and wherein partitioning the plurality of servers is statically performed by connecting the servers and AEPs, or dynamically performed wherein a network fabric between servers can be programmed after deployment.

3. The method of claim 1, further comprising buffering packets at the central hub server and sending multiple buffered packets together as a single request to a server identified by the target server ID.

4. The method of claim 1, further comprising at least one of sending the multiple buffered packets based on a number of packets accumulated and sending the buffered packets when a latency threshold is satisfied.

5. A system comprising:

a plurality of servers forming an array of aggregated end points (AEPs), wherein multiple servers within each AEP are connected by an intra-AEP network fabric and different AEPs are connected by an inter-AEP network, and each AEP has one or multiple central hub servers acting as end-points on the inter-AEP network;

a central hub server in a first AEP, the central hub server resolving a target server identification (ID) for a request from an AEP-local server at a central hub server in a first AEP:

handling the request at the central hub server in the first AEP and responding to the requesting server, if the target server ID is the central hub server in the first AEP; and

redirecting the request over the intra-AEP fabric to the server local to the first AEP, if the target server ID is another server local to the first AEP; and

transferring the request to the second AEP if the target server ID is a server in a second AEP.

6. The system of claim 5, wherein the central hub server receives a response to the request from the second AEP after transferring the request to the second AEP to increase networking performance/power efficiency.

7. The system of claim 5, wherein the central hub server further sends a response to a local requesting server within the first AEP.

8. The system of claim 5, wherein individual servers within the AEP are physically co-located in a same chassis or circuit board.

9. The system of claim 5, wherein individual servers within the AEP are physically co-located in a same integrated circuit or system-on-chip.

10. The system of claim 5, wherein the central hub server disaggregates an aggregated packet before delivering individual responses.

11. The system of claim 5, wherein the intra-AEP fabric can be a higher performance and better cost/power efficiency fabric than the inter-AEP fabric.

12. The system of claim 5, wherein the plurality of servers is custom partitioned in the AEP to optimize for specific access or traffic patterns.

13. The system of claim 5, wherein the central hub server uses application- or protocol-level semantics to resolve the target server ID.

14. The system of claim 5, further comprising a buffer-and-forward subsystem to aggregate packets before sending the packets together as a single request to a server identified by the target server ID.

15. The system of claim 5, further comprising sending the buffered packets when a latency threshold is satisfied.

Description:
VERTICALLY-TIERED CLIENT-SERVER ARCHITECTURE

BACKGROUND

[0001] Today's scale-out data centers deploy many (e.g., thousands of) servers connected by high-speed network switches. Large web service providers, such as but not limited to search engines, online video distributors, and social media sites, may employ a large number of certain kinds of servers (e.g., frontend servers) while using fewer of other kinds of servers (e.g., backend servers). Accordingly, the data center servers may be provided as logical groups. Within each logical group, servers may run the same application but operate on different data partitions. For example, an entire dataset may be partitioned among the servers within each logical group, sometimes using a hashing function for load balancing, to achieve high scalability.

[0002] Data center networks typically treat all servers in different logical groups as direct end points in the network, and thus do not address traffic patterns found in scale-out data centers. For example, state-of-the-art deployments may use Ethernet or InfiniBand networks to connect logical groups having N frontend servers and M memcached servers (a total of N+M end points). These networks use more switches, which cost more in both capital expenditures (e.g., cost is nonlinear with respect to the number of ports) and operating expenditures (e.g., large switches use significant energy). Therefore, it can be expensive to build a high-bandwidth data center network with this many end points.

BRIEF DESCRIPTION OF THE DRAWINGS

[0003] Figures 1a-b are diagrams of an example data center which may implement a vertically-tiered client-server architecture.

[0004] Figures 2a-b show an example vertically-tiered client-server architecture.

[0005] Figures 3a-d illustrate example operations in a vertically-tiered client-server architecture.

[0006] Figure 4 is a flowchart of example operations in a vertically-tiered client-server architecture.

DETAILED DESCRIPTION

[0007] General-purpose distributed memory caching (also known as memcached) computing systems are examples of one of the tiers used in scale-out data centers. For example, many web service providers, such as but not limited to search engines, online video distributors, and social media sites, utilize memcached computing systems to provide faster access to extensive data stores. Memcached computing systems maintain frequently accessed data and objects in a local cache, typically in transient memory that can be accessed faster than databases stored in nonvolatile memory. As such, memcached servers reduce the number of times the database itself needs to be accessed, and can speed up and enhance the user experience on data-driven sites.

[0008] Memcached computing systems may be implemented in a client-server architecture. A key-value associative array (e.g., a hash table) may be distributed across multiple servers. Clients use client-side libraries to contact the servers. Each client may know all of the servers, but the servers do not communicate with each other. Clients contact a server with queries (e.g., to store or read data or objects). The server determines where to store or read the values. That is, servers maintain the values in transient memory when available. When the transient memory is full, the least-used values are removed to free more transient memory. If the queried data or object has been removed from transient memory, then the server may access the data or object from the slower nonvolatile memory, typically residing on backend servers. Addressing the cost and power inefficiency of data center networks is at the forefront of data center design.
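By way of illustration only, the following Python sketch (hypothetical server names and helper function, not taken from the application) shows how a memcached-style client library might map a key to one of several servers by hashing the key. A simple modulo scheme is used for readability; production client libraries typically use consistent hashing so that membership changes remap only a fraction of the keys.

    import hashlib

    # Hypothetical list of memcached servers known to the client library.
    MEMCACHED_SERVERS = ["mc-0:11211", "mc-1:11211", "mc-2:11211", "mc-3:11211"]

    def server_for_key(key: str, servers=MEMCACHED_SERVERS) -> str:
        """Pick the server responsible for a key by hashing the key."""
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return servers[int(digest, 16) % len(servers)]

    if __name__ == "__main__":
        for key in ("user:42", "session:abc", "thumbnail:9931"):
            print(key, "->", server_for_key(key))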

[0009] The systems and methods disclosed herein implement hardware and software to provide low-cost, high-throughput networking capabilities for data centers. The data centers may include multiple tiers of scale-out servers. But instead of connecting all nodes in multiple tiers (e.g., as direct peers and end points in the data center network), the number of end points is reduced by logically aggregating a subgroup of servers from two (or more) tiers as a single end point to the network, referred to herein as an Aggregated End Point (AEP). Within the Aggregated End Point (AEP), a group of servers from different tiers can be connected using a low-power, low-cost, yet high-bandwidth and low-latency local fabric within an AEP. For example, the servers may be connected using a PCIe bus or other local fabrics that are appropriate for short-distance physical neighborhoods. A global network may then be used to connect the subgroup end points among the servers. While this is conceptually similar to aggregating multiple functionalities within a single larger server (e.g., a scale-up model), this configuration has the additional advantage of being compatible with a distributed (e.g., a scale-out) model. Scale-out models are more immune to failures than scale-up models, and can leverage multiple smaller and less expensive servers.
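As a rough, purely illustrative model of this grouping (the class and server names below are hypothetical, not from the application), an AEP can be represented as a set of member servers with one designated central hub, and the data center as a collection of AEPs whose hubs are the only end points on the inter-AEP network:

    from dataclasses import dataclass, field

    @dataclass
    class AEP:
        """One aggregated end point: servers joined by an intra-AEP fabric."""
        name: str
        central_hub: str                             # the server acting as the inter-AEP end point
        members: set = field(default_factory=set)    # every server inside this AEP

    @dataclass
    class DataCenter:
        aeps: list

        def inter_aep_endpoints(self):
            # Only the central hubs appear as end points on the inter-AEP network.
            return [aep.central_hub for aep in self.aeps]

    dc = DataCenter(aeps=[
        AEP("aep-0", central_hub="mc-0", members={"mc-0", "fe-0", "fe-1"}),
        AEP("aep-1", central_hub="mc-1", members={"mc-1", "fe-2", "fe-3"}),
    ])
    print(dc.inter_aep_endpoints())   # ['mc-0', 'mc-1'] -- two end points for six servers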

[0010] Presenting various numbers and configurations of servers in different tiers as a vertically aggregated, tiered architecture can achieve benefits of network aggregation without needing special hardware support. In addition, the architecture reduces processing overhead for small packets (typical in memcached servers executing large web applications) by aggregating and forwarding small packets at the protocol- or application-level.

[0011] Before continuing, it is noted that as used herein, the terms "includes" and "including" mean, but are not limited to, "includes" or "including" and "includes at least" or "including at least." The term "based on" means "based on" and "based at least in part on."

[0012] Figures 1a-b are diagrams of an example data center which may implement a vertically-tiered client-server architecture. Figure 1a is a physical representation 100 of the data center. Figure 1b is a topological representation 101 of the data center corresponding to Figure 1a. The data center 100 may include server architectures which continue to increase traditional server densities, and as such, may be referred to as "scale-out" data centers. Scale-out data centers may include any number of components, as illustrated in the figures.

[0013] The data center may be implemented with any of a wide variety of computing devices, such as, but not limited to, servers, storage, appliances (e.g., devices dedicated to providing a service), and communication devices, to name only a few examples of devices which may be configured for installation in racks. Each of the computing devices may include memory and a degree of data processing capability at least sufficient to manage a communications connection with one another, either directly (e.g., via a bus) or indirectly (e.g., via a network). At least one of the computing devices is also configured with sufficient processing capability to execute the program code described herein.

[0014] An example architecture may include frontend (FE) servers 110a-c presenting to client devices (not shown). Each of the frontend servers 110a-c may be connected via the data center network 120 to backend servers 130a-b (e.g., memcached servers). For purposes of illustration, the data center may execute an online data processing service accessed by the client computing devices (e.g., Internet users). Example services offered by the data center may include general purpose computing services via the backend servers 130a-b. For example, services may include access to data sets hosted on the Internet or as dynamic data endpoints for any number of client applications, such as search engines, online video distributors, and social media sites. Services also include interfaces to application programming interfaces (APIs) and related support infrastructure which were previously the exclusive domain of desktop and local area network computing systems, such as application engines (e.g., online word processing and graphics applications), and hosted business services (e.g., online retailers).

[0015] Clients are not limited to any particular type of devices capable of accessing the frontend servers 110a-c via a network such as the Internet. In one example, the communication network includes the Internet or other mobile communications network (e.g., a 3G or 4G mobile device network). Clients may include, by way of illustration, personal computers, tablets, and mobile devices. The frontend servers 110a-c may be any suitable computer or computing device capable of accessing the backend servers 130a-b. Frontend servers 110a-c may access the backend servers 130a-b via the data center network 120, such as a local area network (LAN) and/or wide area network (WAN). The data center network 120 may also provide greater accessibility in distributed environments, for example, where more than one user may have input and/or receive output from the online service.

[0016] As shown in Figure 1, the data center 100 may include N frontend (FE) servers and M backend (e.g., memcached) servers. In an example, N is much larger than M. Network communication in the data center 100 has the following characteristics. Intra-tier traffic is very light because servers within a scale-out tier (e.g., memcached) either communicate very little or do not communicate at all. Any pair of servers across two tiers can communicate, logically forming a complete bipartite graph. In addition, the sizes of different tiers (e.g., the number of servers within a tier) can be very different. For example, the ratio of server counts between the web frontend tier and the memcached tier can be four or more.
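As a purely hypothetical worked example of this imbalance: with a 4:1 ratio and M = 1,000 memcached servers, there would be N = 4,000 frontend servers, so a flat network that treats every server as a peer must provision N + M = 5,000 end points, whereas grouping each memcached server with its four frontend servers into an aggregated end point (as described above) leaves only the 1,000 hubs as end points on the inter-AEP network, a 5x reduction.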

[0017] The systems and methods disclosed herein implement a multi-level aggregation architecture within the data center. Aggregation is illustrated in Figure 1a by dashed lines 140. In an example, the multi-level aggregation architecture is a vertically tiered architecture. The term "vertically tiered" architecture refers to tiers which may be collapsed (e.g., two tiers collapsed into one tier) to reduce the architecture height.

[0018] The context for the vertically-tiered client-server architecture relates to common use cases. Without losing generality, the architecture may be implemented as a frontend+memcached multi-tier data center, similar to configurations that a large web application (e.g., a social media site) employs. In an example, an efficient and high-bandwidth local network (e.g., PCIe) may be combined with an Ethernet (or similar) network to provide low overhead and packet aggregation/forwarding. This approach addresses the network hardware bandwidth/port-count bottleneck, offers reduced overhead for handling small packets, and enhances memory capacity management and reliability. An example is discussed in more detail with reference to Figures 2a-b.

[0019] It is noted that the components shown in the figures are provided only for purposes of illustration of an example operating environment, and are not intended to limit implementation to any particular system.

[0020] Figures 2a-b show an example vertically-tiered client-server architecture. Figure 2a is a physical representation 200 of the data center. Figure 2b is a topological representation 201 of the data center corresponding to Figure 2a. In an example, the multi-tier data center is shown as it may be represented as an array of approximately equal-sized subgroups (e.g., two are shown in the figure, 210 and 212) connected via an inter-AEP fabric. Each of the subgroups 210 and 212 represents a single aggregated end point (AEP) in the data center network.

[0021] Within each AEP 210 and 212, one server may serve as a central hub (illustrated by servers 240 and 242, respectively). The central hub interfaces with other AEPs and serves as the intra-AEP traffic switch. For example, central hub 220a in AEP 210 may interface with the central hub 242 in AEP 212. Different servers in each AEP 210 and 212 can be interconnected via a local fabric in AEP 210 (and fabric 232 in AEP 212). In an example, the local fabric may be a cost-efficient, energy-efficient, high-speed fabric such as PCIe.

[0022] The traffic patterns among the servers within each AEP and across AEPs are known. As such, the fabric can also be optimized (tuned) to support specific traffic patterns. For example, in a frontend/memcached architecture, frontend (FE) servers talk to memcached nodes, but there is near-zero traffic between FE servers or between memcached servers. Thus, the memcached servers may be chosen as the hubs within different AEPs.

[0023] For purposes of illustration, the second-tier server (the memcached server) in each AEP aggregates memcached requests within the AEP using protocol-level semantics to determine the target server. For example, the frontend server may use consistent hashing to calculate the target memcached server for a given memcached <key, value> request. These requests are transferred over the intra-AEP fabric (e.g., PCIe links) to the hub node. The hub node calculates a corresponding target server ID.

[0024] In this illustration, if the target is the central hub itself, the central hub server handles the request and sends the response back to the AEP-local frontend server. If the target server is a remote server (e.g., in one of the other AEPs), then the central hub buffers the requests. For example, the request may be buffered based on the target server ID. In another example, the request may be buffered based on the target AEP ID, for example, if multiple servers can be included in one AEP for further aggregation. When the buffer accumulates sufficient packets, the central hub translates these requests into one multi-get request (at the application protocol level) or a jumbo network packet (at the network protocol level) and forwards the request to the target.
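The hub-side decision just described can be sketched as follows; the CentralHub class, its helper names, and the CRC-based hash are assumptions chosen for illustration and stand in for whatever protocol-level resolution (e.g., consistent hashing) an actual deployment would use:

    import zlib

    class CentralHub:
        """Minimal stand-in for the central hub of an AEP; all names are illustrative."""

        def __init__(self, server_id, local_servers, memcached_servers):
            self.server_id = server_id
            self.local_servers = set(local_servers)       # reachable over the intra-AEP fabric
            self.memcached_servers = memcached_servers    # every possible target in the data center

        def resolve_target(self, key):
            # Stand-in for protocol-level resolution, e.g. consistent hashing on the key.
            return self.memcached_servers[zlib.crc32(key.encode()) % len(self.memcached_servers)]

        def route(self, key):
            target = self.resolve_target(key)
            if target == self.server_id:
                return "handle locally, reply to the requesting frontend"
            if target in self.local_servers:
                return f"redirect to {target} over the intra-AEP fabric (e.g., PCIe)"
            return f"buffer, then forward to {target}'s AEP over the inter-AEP network"

    # AEP containing hub mc-0 and a second local memcached server mc-3.
    hub = CentralHub("mc-0", {"mc-0", "mc-3"}, ["mc-0", "mc-1", "mc-2", "mc-3"])
    for key in ("user:1", "user:2", "user:3", "user:4"):
        print(key, "->", hub.route(key))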

[0025] It is noted that while aggregation need not be implemented in every instance, aggregation can significantly reduce the packet processing overhead for small packets. However, this can also result in processing delays, for example, if there are not enough small packets for a specific target ID to aggregate into a single request. Thus, a threshold may be implemented to avoid excessive delay. In an example, if the wait time of the oldest packets in the buffer exceeds a user-specified aggregation latency threshold, even if the buffer does not have sufficient packets to aggregate into a single request, the central hub still sends the packets when the threshold is satisfied, for example, to meet latency Quality of Service (QoS) standards.
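One possible (hypothetical) shape of such a buffer-and-forward subsystem is sketched below; the packet-count and latency parameters are illustrative defaults, not values specified by the application. Buffered requests for a target are flushed either when enough have accumulated or when the oldest one has waited past the latency threshold:

    import time
    from collections import defaultdict

    class BufferAndForward:
        """Buffer small requests per target and flush on packet count or latency."""

        def __init__(self, max_batch=16, max_wait_s=0.002, send_batch=None):
            self.max_batch = max_batch          # packet-count threshold
            self.max_wait_s = max_wait_s        # latency (QoS) threshold
            self.send_batch = send_batch or (lambda target, batch: None)
            self.buffers = defaultdict(list)    # target -> [(enqueue_time, request), ...]

        def enqueue(self, target, request):
            self.buffers[target].append((time.monotonic(), request))
            if len(self.buffers[target]) >= self.max_batch:
                self._flush(target)             # enough packets accumulated

        def poll(self):
            """Called periodically: flush any target whose oldest packet waited too long."""
            now = time.monotonic()
            for target, entries in list(self.buffers.items()):
                if entries and now - entries[0][0] >= self.max_wait_s:
                    self._flush(target)

        def _flush(self, target):
            batch = [req for _, req in self.buffers.pop(target, [])]
            if batch:
                self.send_batch(target, batch)  # e.g., one multi-get or jumbo packet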

[0026] In any event, when the target server receives the request packet(s) (either separately or aggregated), the requests are processed and sent back as a response packet to the source. It is noted that the response packets can be sent immediately, or accumulated in a buffer as an aggregated response. The source server receives the response packet, disaggregates the response packet into multiple responses (if previously aggregated), and sends the response(s) to the requesting frontend servers within the AEP.

[0027] While this example illustrates handling requests according to a memcached protocol, it is noted that other interfaces may also be implemented. For example, the request may be organized in shards, where the target server can be identified by disambiguating through a static hashing mechanism (such as those used with traditional SQL databases), or other distributed storage abstractions.

[0028] Before continuing, it is noted that while the figure depicts two tiers of servers for purposes of illustration, the concepts disclosed herein can be extended to any number of multiple tiers. In addition, the tiers may include any combination of servers, storage, and other computing devices in a single data center and/or multiple data centers. Furthermore, the aggregation architecture described herein may be implemented with any "node" and is not limited to servers (e.g., memcached servers). The aggregation may be a physical and/or a logical grouping.

[0029] It is noted that in Figures 2a-b and 3a-d, the physical network is shown by solid lines linking components, and the communication of packets between nodes is over the physical network and illustrated by dash-dot-dash lines in Figures 3a-d.

[0030] Figures 3a-d illustrate example operations in a vertically-tiered client-server architecture 300. The operations may be implemented, at least in part, by machine-readable instructions (such as, but not limited to, software or firmware). The machine-readable instructions may be stored on a non-transient computer-readable medium and are executable by one or more processors to perform the operations described herein. Modules can be integrated within a self-standing tool, or may be implemented as agents that run on top of existing devices in the data center. The program code executes the function of the architecture. It is noted, however, that the operations described herein are not limited to any specific implementation with any particular type of program code.

[0031] In the examples in Figures 3a-d, a plurality of servers form AEPs 310 and 312, shown here in simplified form. In an example, the plurality of servers is custom partitioned in each AEP 310 and 312 to optimize for specific access patterns. For example, the mix of server types within an AEP can vary based on the capacity requirements at different tiers, and the network topology within and across AEPs can be customized to fit server communication patterns.

[0032] Each AEP 310 and 312 is connected via an inter-AEP fabric or network 350 (e.g., Ethernet). One of the servers in each AEP is designated as a central hub server. For example, a central hub server 320 is shown in AEP 310, and another central hub server 322 is shown in AEP 312. All of the node servers are interconnected within each AEP via an intra-AEP fabric (e.g., PCIe). For example, node 340 is shown connected to central hub 320 in AEP 310; and node 342 is shown connected to central hub 322 in AEP 312.

[0033] It is noted that other fabrics may also be implemented. In an example, the intra-AEP fabric is faster than the inter-AEP fabric. During operation, the central hub server 320 receives a request 360. The central hub server 320 resolves a target server identification (ID) for the request 360. In an example, the central hub server uses protocol-level semantics to resolve the target server ID (for example, using consistent hashing and AEP configuration to calculate the target ID in the frontend+memcached example illustrated above).

[0034] In Figure 3a, a use case is illustrated wherein the target server ID is the central hub server 320 itself. In this example use case, the central hub server 320 handles the request and responds. For example, the central hub server may send a response 370.

[0035] In Figure 3b, a use case is illustrated wherein the target server ID is a local server 340 in the first AEP 310. In this example use case, the central hub server transfers the request 360 over the intra-AEP fabric to the local server 340 in the first AEP 310. It is also possible for the requesting local server to resolve the target ID and, when the target server is a local peer within the AEP, to directly communicate with the target local server.

[0036] In Figure 3c, a use case is illustrated wherein the target server ID is a remote server 342 in the second AEP 312. In this example use case, the central hub server 320 transfers the request to the central hub server 322 in the second AEP 312. After transferring the request to the second AEP 312, the request is handled by a server in the second AEP 312 identified by the target server ID. The central hub server 320 in the first AEP 310 then receives a response to the request from the second AEP 312.

[0037] In Figure 3d, a use case is illustrated wherein a buffer-and-forward subsystem 380 is implemented to aggregate packets 380a-b before sending the packets together as a single aggregated request 382 to a server identified by the target server ID. Upon receiving an aggregated request 382, the central hub server 322 either processes the aggregated request by itself, or disaggregates the aggregated request packet before sending disaggregated requests to the local servers identified by the target server ID. Likewise, upon receiving an aggregated response 385, the central hub server 320 disaggregates the aggregated response packet 385 before issuing the disaggregated responses 370a-b.
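A minimal sketch of the pack/unpack step described above, using a deliberately simplified message format (the application does not specify one; an actual implementation might emit a memcached multi-get or a jumbo frame):

    from collections import namedtuple

    Request = namedtuple("Request", ["source", "key"])   # illustrative only

    def aggregate_requests(requests):
        """Pack several small get requests into one multi-get style message."""
        return ("multi_get", [r.key for r in requests])

    def disaggregate_response(reply_values, requests):
        """Match the values in an aggregated reply back to the original requesters."""
        return [(r.source, v) for r, v in zip(requests, reply_values)]

    reqs = [Request("fe-0", "user:1"), Request("fe-1", "user:2")]
    print(aggregate_requests(reqs))                       # one inter-AEP message
    print(disaggregate_response(["alice", "bob"], reqs))  # per-frontend responses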

[0038] In an example, an aggregation threshold may be implemented by the buffer-and-forward subsystem 380. The aggregation threshold controls wait time for issuing packets, thereby achieving the benefits of aggregation without increasing latency. By way of illustration, packets may be buffered at the central hub server 320 and the buffered packets may then be issued together as a single request 382 to a server identified by the target server ID. In an example, aggregation may be based on number of packets. That is, the aggregated packet is sent after a predetermined number of packets are collected in the buffer. In an example, aggregation may be based on latency. That is, the aggregated packet is sent after a predetermined time. In another example, the aggregation threshold may be based on both a number of packets and a time, or a combination of these and/or other factors.

[0039] The architectures described herein may be customized to optimize for specific access patterns in the data center. Protocol-level or application-level semantics may be used to calculate target node IDs and aggregate small packets to further reduce software-related overheads. As such, vertical aggregation of tiered scale-out servers reduces the number of end points, and hence reduces the cost and power requirements of data center networks. In addition, using low-cost, high-speed fabric within the AEP improves performance and efficiency for local traffic.

[0040] Before continuing, it should be noted that the examples described above are provided for purposes of illustration, and are not intended to be limiting. Other devices and/or device configurations may be utilized to carry out the operations described herein. For example, a "server" could be as simple as a single component on a circuit board, or even a subsystem within an integrated system-on-chip. The individual servers may be co-located in the same chassis, circuit board, integrated circuit (IC), or system-on-chip. In other words, the implementation of an AEP is not intended to be limited to a physically distributed cluster of individual servers, but could be implemented within a single physical enclosure or component.

[0041] Figure 4 is a flowchart of example operations in a vertically-tiered client-server architecture. Operations 400 may be embodied as logic instructions on one or more computer-readable media. When executed on a processor, the logic instructions cause a general-purpose computing device to be programmed as a special-purpose machine that implements the described operations. In an example, the components and connections depicted in the figures may be used.

[0042] Operation 410 includes partitioning a plurality of servers in the data center to form a first aggregated end point (AEP). The first AEP may have fewer external connections than the individual servers. Operation 420 includes connecting a central hub server in the first AEP to at least a second AEP via an intra-AEP fabric. Operation 430 includes resolving a target server identification (ID) for a request at the central hub server.

[0043] If at decision block 440 it is determined that the target server ID is the central hub server, operation 441 includes handling the request at the central hub server, and in operation 442 responding to a frontend (FE) server.

[0044] If at decision block 450 it is determined that the target server ID is a server local to the first AEP, operation 451 includes transferring the request over the intra-AEP fabric to the server local to the first AEP, and in operation 452 responding to the central hub server, which then responds to the frontend (FE) server.

[0045] If at decision block 460 it is determined that the target server ID is a remote server (e.g., a server in the second AEP), operation 461 includes transferring the request to the second AEP, and in operation 462 responding to the central hub server, which then responds to the frontend (FE) server. It is noted that the central hub server at the second AEP may handle the request, or further transfer the request within the second AEP.

[0046] In an example, partitioning the servers is based on communication patterns in the data center. Besides the bipartite topology exemplified in the frontend+memcached use case, other examples can include active-active redundancy, server-to-shared-storage communication, and others. It is noted that the operations described herein may be implemented to maintain redundancy and autonomy, while increasing speed of the aggregation of all servers after partitioning.
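By way of illustration, partitioning by communication pattern for the bipartite frontend+memcached case might look like the following sketch, which groups the frontend servers evenly around one memcached hub per AEP; the ratio handling and names are assumptions for illustration only:

    def partition_into_aeps(frontends, memcached_nodes):
        """Group servers into AEPs, one memcached hub per AEP (illustrative).

        Frontends are spread evenly across the memcached nodes, reflecting the
        bipartite frontend-to-memcached traffic pattern described above.
        """
        aeps = []
        per_aep = len(frontends) // len(memcached_nodes)
        for i, hub in enumerate(memcached_nodes):
            members = frontends[i * per_aep:(i + 1) * per_aep]
            # Leftover frontends (if any) could be assigned round-robin; omitted for brevity.
            aeps.append({"hub": hub, "members": [hub] + members})
        return aeps

    frontends = [f"fe-{i}" for i in range(8)]
    memcached_nodes = ["mc-0", "mc-1"]
    for aep in partition_into_aeps(frontends, memcached_nodes):
        print(aep["hub"], "->", aep["members"])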

[0047] The operations shown and described herein are provided to illustrate example implementations. It is noted that the operations are not limited to the ordering shown. Various orders of the operations described herein may be automated or partially automated.

[0048] Still other operations may also be implemented. In an example, an aggregation threshold may be implemented to control the wait time for issuing packets, to achieve the benefits of aggregation without increasing latency. The aggregation threshold addresses network hardware cost and software processing overhead, while still maintaining latency QoS. Accordingly, the aggregation threshold may reduce cost and improve power efficiency in tiered scale-out data centers.

[0049] By way of illustration, operations may include buffering packets at the central hub server and sending the buffered packets together as a single request to a server identified by the target server ID. Operations may also include sending the buffered packets based on the number of packets in the buffer. Operations may also include sending the buffered packets based on the latency of the packets in the buffer.

[0050] It is noted that the examples shown and described are provided for purposes of illustration and are not intended to be limiting. Still other examples are also contemplated.