Title:
HIERARCHICALLY SCALABLE, FAULT-TOLERANT DATA ACCESS
Document Type and Number:
WIPO Patent Application WO/2023/131812
Kind Code:
A1
Abstract:
A system for distributed shared storage of data records, the system comprising one or more physical nodes and comprising a plurality of logical nodes. Each logical node comprises first and second interfaces for communicating with other logical nodes and client applications, respectively. The plurality of logical nodes consists of multiple disjoint subgroups, each comprising multiple logical nodes. A single logical node from each subgroup is designated as a leader of the respective subgroup, and a single leader is elected leader of the leaders. The leader of the leaders maintains a first log recording allocations to each subgroup of a respective disjoint subset of a set of shared resources and distributes a copy of the first log and updates of the first log to each of the remaining leaders. Each leader maintains a subgroup log recording metainformation for each resource in the subset of shared resources allocated to the subgroup.

Inventors:
MÁTRAY PÉTER (HU)
GÉHBERGER DÁNIEL (CA)
NÉMETH GÁBOR (HU)
Application Number:
PCT/IB2022/050045
Publication Date:
July 13, 2023
Filing Date:
January 04, 2022
Assignee:
ERICSSON TELEFON AB L M (SE)
International Classes:
G06F11/18; G06F11/20; G06F11/30
Foreign References:
US20170024453A12017-01-26
Other References:
CHARAPKO, ALEKSEY et al.: "Scaling Strongly Consistent Replication", arXiv.org, 21 January 2021 (2021-01-21), pages 1-16, XP081863018
FAZLALI, MOHAMMAD et al.: "Raft Consensus Algorithm: an Effective Substitute for Paxos in High Throughput P2P-based Systems", arXiv.org, 4 November 2019 (2019-11-04), pages 1-19, XP081525096
ONGARO, D. et al.: "In search of an understandable consensus algorithm", USENIX ATC, 2014
JUNQUEIRA, F.P.; REED, B.C.; SERAFINI, M.: "Zab: High-performance broadcast for primary-backup systems", 2011 IEEE/IFIP 41st International Conference on Dependable Systems & Networks (DSN), IEEE, 2011
LAMPORT, L.: "Paxos made simple", ACM SIGACT News, vol. 32, no. 4, 2001, pages 18-25, XP055323729, DOI: 10.1145/568425.568433
THALER, DAVID; RAVISHANKAR, CHINYA: "A Name-Based Mapping Scheme for Rendezvous", University of Michigan Technical Report
Attorney, Agent or Firm:
HOMILLER, Daniel P. (US)
Claims:
CLAIMS

What is claimed is:

1. A system for distributed shared storage of data records, the system comprising one or more physical nodes and comprising a plurality of logical nodes, wherein each logical node is implemented on a respective one or more of the one or more physical nodes and comprises a first interface configured to communicate with others of the logical nodes and a second interface configured to communicate with one or more client applications, wherein: the plurality of logical nodes consists of multiple disjoint subgroups of logical nodes, each subgroup comprising a plurality of logical nodes; a single logical node from each of the multiple subgroups is designated as a leader of the respective subgroup; a single one of the leaders is designated as leader of the leaders; the leader of the leaders is configured to maintain a first log recording allocations to each subgroup of a respective disjoint subset of a set of shared resources, and is configured to distribute a copy of the first log and updates of the first log to each of the remaining leaders; and the leader of each subgroup is configured to maintain a subgroup log recording metainformation for each resource in the subset of shared resources allocated to the subgroup.

2. The system of claim 1, wherein the leader of each subgroup is configured to distribute a copy of the first log and updates of the first log to each logical node in the subgroup.

3. The system of claim 1 or 2, wherein the leaders of the subgroups are configured to elect the leader of the leaders from among the leaders of the subgroups, in response to determining that no leader of the leaders has been elected or that a previous leader of the leaders is unreachable.

4. The system of any one of claims 1-3, wherein the leader of the leaders is configured to: detect that a first subgroup has fewer than a threshold number of logical nodes; in response, merge the logical nodes of the first subgroup to one or more others of the multiple subgroups, thereby eliminating the first subgroup; update the first log; and distribute the updated first log or an update of the first log to each of the leaders of subgroups other than the subgroup of the leader of the leaders.

5. The system of any one of claims 1-4, wherein the leader of the leaders is configured to: detect that a first subgroup has more than a threshold number of logical nodes; in response, split the first subgroup into two subgroups; update the first log; and distribute the updated first log or an update of the first log to each of the leaders of subgroups other than the subgroup of the leader of the leaders.

6. The system of any one of claims 1-5, wherein the leader of the leaders is configured to allocate, to each subgroup, a respective one of the disjoint subsets of the shared resources.

7. The system of claim 6, wherein the leader of the leaders is configured to perform said allocation in response to detecting that the number of subgroups has changed.

8. The system of claim 6, wherein the leader of the leaders is configured to perform said allocation using any of: hash values corresponding to the shared resources; lexicographic values corresponding to the shared resources; and a highest random weight algorithm applied to values corresponding to the shared resources.

9. The system of any one of claims 1-8, wherein each leader is configured to report, to the leader of the leaders, a change in logical nodes in the respective subgroup.

10. A node (800) configured to participate as a leader node in distributed shared storage of data records by a plurality of nodes, the node (800) comprising: processing circuitry (810); and interface circuitry (820) operatively coupled to the processing circuitry (810) and configured to communicate with other ones of the plurality of nodes and with one or more clients; wherein the processing circuitry (810) is configured to: participate in an election of a leader of leaders, from among a group of nodes comprising the node and two or more other leader nodes; receive, from the leader of leaders, an allocation of a subset of a set of shared resources, the subset consisting of shared resources to be stored by a subgroup of nodes managed by the node; and manage a consistency algorithm to store the subset of shared resources among the subgroup of the nodes managed by the node, wherein managing the consistency algorithm comprises maintaining a subgroup log recording metainformation for each resource in the subset of shared resources allocated to the subgroup.

11. The node (800) of claim 10, wherein the processing circuitry (810) is further configured to: receive, from the leader of leaders, a log recording allocations to each of a plurality of subgroups of a respective disjoint subset of the set of shared resources; and distribute the log to the nodes of the subgroup of nodes managed by the node (800).

12. The node (800) of claim 10 or 11, wherein the processing circuitry (810) is further configured to report, to the leader of the leaders, a change in logical nodes in the respective subgroup.

13. The node (800) of any one of claims 10-12, wherein the processing circuitry (810) is further configured to, in response to the node being elected the leader of the leaders: maintain a council log recording allocations to each subgroup of a respective disjoint subset of a set of shared resources, and distribute a copy of the council log and updates of the council log to each of the remaining leader nodes.

14. The node (800) of claim 13, wherein the processing circuitry (810) is further configured to, while the node is acting as leader of the leaders: detect that a first subgroup has fewer than a threshold number of logical nodes; and, in response, merge the logical nodes of the first subgroup to one or more others of the multiple subgroups, thereby eliminating the first subgroup; update the council log; and distribute the updated council log or an update of the council log to each of the leaders of subgroups other than the subgroup of the node (800).

15. The node (800) of claim 13 or 14, wherein the processing circuitry (810) is further configured to, while the node is acting as leader of the leaders: detect that a first subgroup has more than a threshold number of logical nodes; in response, split the first subgroup into two subgroups; update the council log; and distribute the updated council log or an update of the council log to each of the leaders of subgroups other than the subgroup of the node (800).

16. The node (800) of any one of claims 13-15, wherein the processing circuitry (810) is further configured to, while the node is acting as leader of the leaders, allocate, to each subgroup, a respective one of the disjoint subsets of the shared resources.

17. The node (800) of claim 16, wherein the processing circuitry (810) is configured to perform said allocation in response to detecting that the number of subgroups has changed.

18. The node (800) of claim 16, wherein the processing circuitry (810) is configured to perform said allocation using any of: hash values corresponding to the shared resources; lexicographic values corresponding to the shared resources; and a highest random weight algorithm applied to values corresponding to the shared resources.

19. A method, for distributed shared storage of data records in a system comprising one or more physical nodes and comprising a plurality of logical nodes, wherein each logical node is implemented on a respective one or more of the one or more physical nodes and comprises a first interface configured to communicate with others of the logical nodes and a second interface configured to communicate with one or more client applications, wherein the plurality of logical nodes consists of multiple disjoint subgroups of logical nodes, each subgroup comprising a plurality of logical nodes, and wherein a single logical node from each of the multiple subgroups is designated as a leader of the respective subgroup and a single one of the leaders is designated as leader of the leaders, the method comprising: maintaining (910), by the leader of the leaders, a first log recording allocations to each subgroup of a respective disjoint subset of a set of shared resources and distributing a copy of the first log and updates of the first log to each of the remaining leaders; and maintaining (920), by the leader of each subgroup, a subgroup log recording metainformation for each resource in the subset of shared resources allocated to the subgroup.

20. A method, in a node configured to participate as a leader node in distributed shared storage of data records by a plurality of nodes, the method comprising: participating (1010) in an election of a leader of leaders, from among a group of nodes comprising the node and two or more other leader nodes; receiving (1020), from the leader of leaders, an allocation of a subset of a set of shared resources, the subset consisting of shared resources to be stored by a subgroup of nodes managed by the node; and managing (1030) a consistency algorithm to store the subset of shared resources among the subgroup of the nodes managed by the node, wherein managing the consistency algorithm comprises maintaining a subgroup log recording metainformation for each resource in the subset of shared resources allocated to the subgroup.

21. The method of claim 20, further comprising: receiving, from the leader of leaders, a log recording allocations to each of a plurality of subgroups of a respective disjoint subset of the set of shared resources; and distributing the log to the nodes of the subgroup of nodes managed by the node.

22. The method of claim 20 or 21, wherein the method further comprises reporting, to the leader of the leaders, a change in logical nodes in the respective subgroup.

23. The method of any one of claims 20-22, wherein the method further comprises, in response to being elected the leader of the leaders: maintaining a council log recording allocations to each subgroup of a respective disjoint subset of a set of shared resources, and distributing a copy of the council log and updates of the council log to each of the remaining leader nodes.

24. The method of claim 23, wherein the method further comprises, while the node is acting as leader of the leaders: detecting that a first subgroup has fewer than a threshold number of logical nodes; in response, merging the logical nodes of the first subgroup to one or more others of the multiple subgroups, thereby eliminating the first subgroup; updating the council log; and distributing the updated council log or an update of the council log to each of the leaders of subgroups other than the subgroup of the node.

25. The method of claim 23 or 24, wherein the method further comprises, while the node is acting as leader of the leaders: detecting that a first subgroup has more than a threshold number of logical nodes; in response, splitting the first subgroup into two subgroups; updating the council log; and distributing the updated council log or an update of the council log to each of the leaders of subgroups other than the subgroup of the node.

26. The method of any one of claims 23-25, wherein the method further comprises, while the node is acting as leader of the leaders, allocating, to each subgroup, a respective one of the disjoint subsets of the shared resources.

27. The method of claim 26, wherein the method comprises performing said allocation in response to detecting that the number of subgroups has changed.

28. The method of claim 26, wherein the method comprises performing said allocation using any of: hash values corresponding to the shared resources; lexicographic values corresponding to the shared resources; and a highest random weight algorithm applied to values corresponding to the shared resources.

Description:
HIERARCHICALLY SCALABLE, FAULT-TOLERANT DATA ACCESS TECHNICAL FIELD The present disclosure is generally related to data storage in networks and is more particularly related to techniques for providing scalability in fault-tolerant distributed databases. BACKGROUND With the fifth-generation (5G) wireless networks currently being deployed, most telecommunication applications are being transformed to become cloud-native. Traditionally monolithic components are being re-architected into smaller, independent, and reusable functional pieces (microservices), while also being redesigned to become hardware and cloud runtime independent (cloud agnostic). As a result, cloud-based telecom systems are expected to deliver vastly improved flexibility and manageability, without compromising the high level of reliability their native counterparts provide. In other words, applications (or network functions) should become independently scalable, gradually upgradable, while also enduring the failure/termination/relocation of individual application instances without any impact on end user experience. One of the key enablers for achieving all these far-reaching goals is to separate the important application states from the life cycle of the application. This is done by introducing an extra service in the system: a database. Applications can push their state changes (e.g., the metadata on the handled end-user sessions) to an external database and, once the application is scaled, relocated, restarted, or failed over, it can read back those states and resume its task. This design pattern is often referred to as state externalization. As a result of this approach, individual application instances and hardware elements (such as physical servers or switches) may be treated as disposable and exchangeable. The cornerstone of the system’s reliability, however, is the database: the externalized states need to be kept available so that applications can access them whenever they need to. Consequently, the database must be distributed among multiple nodes to achieve resilience, while the stored data items need to be replicated to multiple database nodes to avoid data loss upon the failure of a database instance. To keep replicas continuously in sync while users update data, distributed databases typically apply a consensus protocol, like Raft [2], ZAB [3] or Paxos [4]. The consensus protocol ensures that all state changes are applied in a consistent manner on each distributed node, regardless of potential failures or network hiccups. A fundamental problem with most widely used consensus protocols (including Raft, ZAB, and many practical Paxos variants) is that they do not scale well with cluster size, i.e., with the number of nodes across which the database is distributed. With the distributed memory-sharing system called DAL [1], it has been seen that Raft can serve a typical telco application only if the application is adapted to minimize the number of required Raft operations. Even in that case, the cluster size is limited by Raft to about 10-20 nodes. Consensus protocols naturally involve heavy coordination between cluster nodes, to reach agreement on state changes. To make this coordination mechanism practical, most consensus protocols designate a special coordinator node, the leader, who is responsible for orchestrating the protocol. This singularity of the coordinator node is the reason why consensus quickly becomes a bottleneck when scaling the cluster size. 
For this very reason, typical strongly consistent data stores suggest using 3, 5 or 7 nodes in a cluster, but not much more, e.g., etcd [5]. For many telco applications, however, it is necessary to deploy clusters with 20+ nodes. For this to work, the database's consensus layer must be scalable. There are a few published references in this area. The most notable published approach is CockroachDB [6], which applies a divide-and-conquer strategy. CockroachDB partitions the key space into contiguous ranges (e.g., all keys starting with "B" to "C") and handles key replication within each range independently. Each key in a range is typically stored on 3 nodes, and those 3 nodes form a consensus group, coordinating consistent reads and writes to keys within the range. Whenever the amount of keys/data in a range increases above a threshold, the range is split into two smaller ranges, and a new consensus group of 3 nodes is created. Hence, CockroachDB can use many independent consensus groups, always keeping the load on any one group limited. These consensus groups are then multiplexed onto the nodes in the database cluster, meaning that a single database node typically hosts replicas from multiple ranges, hence taking part in multiple consensus groups. Other popular databases apply the same approach as CockroachDB, such as the TiKV key-value database [7] or the Apache Ozone object store for Hadoop big data applications [8].

Some problems remain, however. Prior approaches do not always scale well with the number of keys/amount of data in the system. These approaches do not support low-latency applications well. In addition, these approaches do not readily support cluster-wide consensus. Further, system survivability is limited in these systems, as system-wide meta-information is typically replicated to only a few nodes.

SUMMARY

This document describes a system and method addressing the outlined scalability, response time and survivability constraints of previous key-value database clusters using consensus algorithms. Solutions described herein build on hierarchical cluster formation, where the lowest-level clusters (groups) manage the actual key-value information, while the higher levels manage membership and sharding-related meta-information. The solution is described in detail here for two levels, enabling scaling clusters up to hundreds of nodes. However, the approach can be applied to more levels if necessary. Various embodiments of the techniques and systems described in detail below enhance scalability by applying hierarchical cluster formation: a leader is elected in each consensus group and a "president," i.e., a "leader of the leaders," is independently elected from among the leaders. These embodiments enable cluster-wide consensus by facilitating information sharing between the consensus groups. Global information such as group membership and leadership status, as well as the key allocation description, can be distributed to all nodes of the cluster. Since the metadata used for key lookup stays bounded in size, it can be replicated to all nodes with negligible cost. This helps achieve lower latency and better cluster survivability characteristics. The key space, on the other hand, can be divided into disjoint groups, e.g., based on ranges of key hashes.
An example system according to some of the embodiments described below is for distributed shared storage of data records, where the system comprises one or more physical nodes and a plurality of logical nodes, each logical node being implemented on a respective one or more of the one or more physical nodes and comprising a first interface configured to communicate with others of the logical nodes and a second interface configured to communicate with one or more client applications. In this example system, the plurality of logical nodes comprises multiple disjoint subgroups of logical nodes, each subgroup comprising a plurality of logical nodes. Further, a single logical node from each of the multiple subgroups is designated as a leader of the respective subgroup, and a single one of the leaders is designated as leader of the leaders. The leader of the leaders, which might be called a "president," for example, is configured to allocate, to each subgroup, a respective disjoint subset of a set of shared resources. The leader of the leaders is configured to maintain a first log recording those allocations and is further configured to distribute a copy of the first log and updates of the first log to each of the remaining leaders. The leader of each subgroup is configured to distribute a copy of the first log and updates of the first log to each logical node in the subgroup and is configured to maintain a subgroup log recording metainformation for each resource in the subset of shared resources allocated to the subgroup.

Also described below are nodes configured to participate as a leader node in distributed shared storage of data records by a plurality of nodes. An example of such a node comprises processing circuitry and interface circuitry operatively coupled to the processing circuitry and configured to communicate with other ones of the plurality of nodes and with one or more clients. The processing circuitry is configured to participate in an election of a leader of leaders, from among a group of nodes comprising the node and two or more other leader nodes, and to receive, from the leader of leaders, an allocation of a subset of a set of shared resources, the subset consisting of shared resources to be stored by a subgroup of nodes managed by the node. The processing circuitry is further configured to manage a consistency algorithm to store the subset of shared resources among the subgroup of the nodes managed by the node, where managing the consistency algorithm comprises maintaining a subgroup log recording metainformation for each resource in the subset of shared resources allocated to the subgroup.

Example methods are also detailed below. One example method corresponds to the system summarized above, and is for distributed shared storage of data records in a system comprising one or more physical nodes and comprising a plurality of logical nodes, where each logical node is implemented on a respective one or more of the one or more physical nodes and comprises a first interface configured to communicate with others of the logical nodes and a second interface configured to communicate with one or more client applications. In this system, the plurality of logical nodes consists of multiple disjoint subgroups of logical nodes, each subgroup comprising a plurality of logical nodes, and a single logical node from each of the multiple subgroups is designated as a leader of the respective subgroup and a single one of the leaders is designated as leader of the leaders.
The example method comprises maintaining, by the leader of the leaders, a first log recording allocations to each subgroup of a respective disjoint subset of a set of shared resources and distributing a copy of the first log and updates of the first log to each of the remaining leaders. The example method also comprises maintaining, by the leader of each subgroup, a subgroup log recording metainformation for each resource in the subset of shared resources allocated to the subgroup. Similarly, methods are also described for a node configured to participate as a leader node in distributed shared storage of data records by a plurality of nodes. An example of such a method comprises participating in an election of a leader of leaders, from among a group of nodes comprising the node and two or more other leader nodes, and receiving, from the leader of leaders, an allocation of a subset of a set of shared resources, the subset consisting of shared resources to be stored by a subgroup of nodes managed by the node. The method further comprises managing a consistency algorithm to store the subset of shared resources among the subgroup of the nodes managed by the node, where managing the consistency algorithm comprises maintaining a subgroup log recording metainformation for each resource in the subset of shared resources allocated to the subgroup.

As will be shown below, various embodiments of the disclosed systems and techniques facilitate scalability of distributed data sharing to large system sizes, by avoiding central processing unit (CPU) and networking hot spots via the distribution of the leader's work to a group of leaders (the council). Embodiments also facilitate high survivability of global metadata (e.g., cluster or sharding configuration), and support low-latency applications. Furthermore, leader re-election is not a stop-the-world event, contrary to, for example, a regular Raft system. Speed is improved, as the first lookup of data takes one round-trip time (RTT), versus two RTTs in earlier systems. Furthermore, the techniques can be generalized to systems of extreme scale, by introducing more layers into the hierarchy.

BRIEF DESCRIPTION OF THE FIGURES

Figure 1 illustrates a logical system view.
Figure 2 illustrates example nodes, from a logical point of view.
Figure 3 illustrates an example scale-out procedure.
Figure 4 shows example messages for a node's join process.
Figure 5 illustrates an example scale-in procedure.
Figure 6 illustrates an example allocation of keys to shard groups, using hash values for the keys.
Figure 7 illustrates an example of group and shard allocations.
Figure 8 is a block diagram showing an example node for distributed shared storage of data records.
Figure 9 is a process flow diagram illustrating an example method for distributed shared storage.
Figure 10 is a process flow diagram illustrating another example method for distributed shared storage.

DETAILED DESCRIPTION

The previously described approaches to distributed data sharing split the key space into smaller ranges and manage replication and consensus independently for each range. In this context, the term "key" refers to a shared data element, or "resource," where each key has a value (which may in turn be a collection of values). The term "shard" is commonly used to refer to a partition of data in a database or search engine – thus each consensus group, which comprises several nodes, may manage a corresponding shard.
By dynamically adjusting the number of ranges, and thus the number of shards and corresponding consensus groups, the previously described approaches can keep the load on each consensus group limited. However, despite these desirable properties, there are also some unwanted consequences of these approaches.

First, these approaches do not always scale well with the amount of keys/data in the system. When a lot of user data is stored, the number of ranges (and hence, the number of consensus groups) can grow very high. Due to multiplexing many consensus groups onto every database node, the consensus-related CPU overhead on any given host can grow high.

Second, these previous approaches do not support low-latency applications well. As multiple, perhaps very many, consensus groups are multiplexed onto the same set of hosts, a significant amount of network communication overhead is introduced between the nodes. To overcome this problem, prior systems typically use batching to optimize internode communication. For example, if consensus group A and consensus group B both require communication from node N1 to node N2 (e.g., sending heartbeats or AppendEntry messages), then instead of sending two separate messages from N1 to N2, a coalescing layer ensures that the two messages are combined into a single transport-layer message. This method does reduce networking overhead, but it introduces extra latency. For instance, the TiKV documentation says [7]: "TiKV uses an event loop to drive all the processes in a batch manner. It polls all the Raft groups to drive the Raft state machine every 1000ms and accepts the requests from the outside clients and Raft messages sent by other nodes through the notification mechanism of the event loop." The 100-1000 millisecond latency introduced by batching is not compatible with many telco or industrial applications.

A third problem is that the architecture of the previously described approaches does not lend itself to creating cluster-wide consensus, as it applies consensus groups that are independent from one another. Cluster-wide consensus may be beneficial or even required, such as when updating the current cluster or sharding configuration. On the one hand, this may potentially cause consistency/ordering issues in global configuration changes. On the other hand, when clients/nodes access such global metadata, they must reach out to remote hosts to fetch it, introducing extra delays. Namely, the previous approaches store and replicate the meta-information describing the range-to-node mapping in the same way that regular user data is stored and replicated. In the case of CockroachDB, for example, this means that the mapping (the top-level and the second-level index) is stored on three nodes each. When first accessing a key range, the system must first fetch the relevant parts of the top-level and second-level index to learn which nodes store the range containing the key. This mechanism is good enough for most applications, but some latency-sensitive applications, like telco data plane functions, might require the lowest possible data access latency, even on the first lookup.

A fourth problem with previously described systems is limited system survivability. As described immediately above, crucial system-wide meta-information is replicated only to a few nodes (three, by default). This means that if the "wrong" three nodes fail, the entire database cluster is rendered unusable, regardless of how many other nodes are left alive.
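To make the first-lookup cost concrete, the following minimal Python sketch contrasts the two lookup patterns just described. It is purely illustrative: the node names, the dictionaries standing in for remote nodes, and the round-trip counting are assumptions made for the example, not part of any existing database's API.

```python
# Round-trip arithmetic for a first key lookup; purely illustrative, with
# hypothetical node layouts (not the actual CockroachDB or TiKV wire protocols).

# Prior-art style: the range-to-node index itself is stored on only a few nodes,
# so a node that has not cached it must fetch the index first (RTT 1) and only
# then fetch the value from the owning node (RTT 2).
def first_lookup_remote_index(key, index_replicas, data_nodes):
    rtts = 1                                  # RTT 1: ask an index replica who owns the key
    owner = index_replicas[0][key]
    rtts += 1                                 # RTT 2: ask the owner for the value
    return data_nodes[owner][key], rtts

# Hierarchical approach: the bounded shard map is replicated to every node, so
# the owner is resolved locally and the value is fetched in a single round trip.
def first_lookup_local_map(key, local_shard_map, data_nodes):
    owner = local_shard_map[key]              # resolved locally, no network hop
    return data_nodes[owner][key], 1          # RTT 1: fetch the value

index_replicas = [{"session-42": "node-2"}]   # index kept on only a few nodes
data_nodes = {"node-2": {"session-42": "session-state"}}
print(first_lookup_remote_index("session-42", index_replicas, data_nodes))  # 2 RTTs
print(first_lookup_local_map("session-42", {"session-42": "node-2"}, data_nodes))  # 1 RTT
```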
Various embodiments of the techniques and systems described in this document enhance scalability by applying hierarchical cluster formation. According to these techniques, a leader is elected in each consensus group, and a "president" is independently elected from among the leaders. Note that the term "president" is used herein for convenience. The president may be understood as simply a "leader of the leaders." The leaders of the various consensus groups may be referred to as "the council," again for convenience. Various embodiments of the techniques described herein enable cluster-wide consensus by facilitating information sharing between the consensus groups. This way, global information like group membership and leadership status, as well as the key allocation description, may be distributed to all nodes of the cluster, in some embodiments. Since the metadata used for key lookup stays bounded in size, it can be replicated to all nodes with negligible cost, even as the size of the database expands. This helps achieve lower latency and better cluster survivability characteristics. The key space, on the other hand, is divided into disjoint groups (each with its own leader), to reduce the load on any given node. This division may be based on ranges of key hashes, for example.

In short, the techniques described herein improve the scalability of distributed key-value stores by employing a multi-level hierarchical consensus algorithm. The higher level is responsible for distributing the whole key space into disjoint shards, while the lower layer keeps track of metadata handling for each such shard separately. The layers store their states in replicated form and operate in consensus, so that state changes are applied in a definite order. This document describes a two-level hierarchy that enables scaling clusters up to 100-400 nodes (assuming 10-20 nodes in each consensus group). If desired, adding more layers can enable scaling to higher cluster sizes.

In the description that follows, the terms "metadata" and "metainformation" are used interchangeably, to refer to data about data. There are different categories of metadata/metainformation discussed herein. In the two-level hierarchy that is the focus of the following discussion, there is metadata/metainformation regarding the division of nodes into groups and regarding the leaders of those groups – this information is managed at the "president" level. There is also metadata/metainformation regarding the allocation of key ranges to the various groups, i.e., regarding the sharding of the database. Again, this metadata/metainformation is managed at the president level, for the two-level example. Still further, there is metadata/metainformation regarding the storage of keys among members of each consensus group, i.e., information indicating where each key is stored. This metadata/metainformation is managed at the group level, by the leader of each group.

The systems and techniques described herein build on top of consensus algorithms. One such algorithm is called Raft [2], which is considered one of the best implementable solutions today. All leader-follower style consensus algorithms work with similar components and concepts, which are introduced in this section. The new solutions disclosed here are described in the context of the Raft algorithm, and the description below highlights the additions required to the original protocol. It should be appreciated, however, that the new techniques and systems may be built on different consensus algorithms as well.
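The following Python sketch summarizes the metadata categories just listed as simple data structures, assuming the two-level hierarchy. The class and field names are illustrative only; the text above does not prescribe any concrete representation.

```python
# Illustrative data model for the two metainformation levels; all names are assumed.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class CouncilState:
    """Managed at the president level and replicated to every leader
    (and, in some embodiments, onward to every node in the cluster)."""
    epoch: int = 0                                                           # council election counter
    members: Dict[str, List[str]] = field(default_factory=dict)             # group id -> node ids
    leaders: Dict[str, Tuple[str, int]] = field(default_factory=dict)       # group id -> (leader id, term)
    shard_ranges: Dict[str, Tuple[int, int]] = field(default_factory=dict)  # group id -> key hash range

@dataclass
class ShardGroupState:
    """Managed by each shard group's leader for the keys allocated to that group."""
    term: int = 0                                                            # group election counter
    key_locations: Dict[str, List[str]] = field(default_factory=dict)       # key -> nodes storing its value
```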
For better understanding, an overview of the Raft algorithm as a typical consensus algorithm is presented first, and then the new system and terminology are described. Raft is a consensus algorithm in which participants (nodes) share a common understanding of some abstract state that is evolved in strictly ordered, unambiguous steps, called log entries. The ordered sequence of entries is called the log. Nodes elect a leader, and only the leader can append new entries to the log and communicate this change to the others (the followers). The log either progresses normally or, if for some reason the communication between the nodes fails, it does not progress at all. Furthermore, there is always at most one leader. If there is no leader, or the leader is missing due to network failures, a new election is held, which either produces a consensus on a new leader or fails, in which case a new election is started. The solutions described herein use multiple modified consensus groups to handle the whole data space.

Following is a discussion of some of the terminology used in the present description. It should be understood, however, that the inventive techniques and systems described here may be implemented using different terms.

Figure 1 shows an exemplary logical view of the new system. The figure shows nine logical nodes, which represent a database cluster in this example case. These logical nodes can be deployed on bare-metal servers or in virtualized environments (e.g., virtual machines or containers). A given logical node will generally correspond to a single physical node, such as a server. A given physical node, on the other hand, may correspond to a single logical node, or to two or more. The use of the term "logical node" herein should therefore be understood to imply the existence of a physical node upon which the logical node is instantiated. The example scenario illustrated in Figure 1 contains four consistency groups, organized into two hierarchy levels, where the hierarchy levels are represented by circles. As can be observed, each node in the illustrated hierarchical system participates in one or two consistency groups, compared to the prior art, where nodes participate in a high number of consistency groups. Participation in two consistency groups is not a limit on the presently disclosed techniques, but keeping the number relatively low significantly reduces metainformation-related communication overhead. The higher level, which may be referred to as the "council," is the middle circle, containing nodes 1, 4 and 7. Node 1 is elected from among the leaders, or council members, as president. This higher level is responsible for cluster-wide operations, such as cluster membership and allocation of key shards to lower-level consistency groups. As the figure shows, the leaders of the lower-level groups are part of the council. Details of the operation of the council and the allocation of shards are provided below. The introduction of the council enables creating consistent global metainformation, which can then be distributed to all nodes. This results in latency improvements, since there is no need to first query the global metainformation from a subset of nodes. Furthermore, this approach makes the solution highly robust.

Terminology used herein includes the following:
- Key & value: Key-value stores associate a distinct data value with each unique key stored. The key therefore is a handle for setting or getting the value.
- Shard/key shard: A group of keys (and associated values) that are handled together. Key shards are disjoint: each key/value pair is present in only one shard. Which keys belong to each shard at any given moment is not fixed but can and does change over time.
- Node (or logical node): A participating entity in the consensus algorithm. Typically, each physical or virtual computer in a server cluster runs a single node process.
- Shard group: A cluster of nodes keeping the record for a single key shard. Shard groups operate as described in [8], especially the remote procedure call (RPC) and states in Fig. 3.1 of [8]. A shard group may be referred to as a consistency group, a consensus group, or simply a "group." When considered relative to the council (defined below), the shard group may be considered a "subgroup."
- Leader: A decision-making node in a shard group. There is at most one leader in each shard group at any single point in time. If no leader is detected (either because there was none to start with or due to a network or other failure), the members of the shard group will start an election process.
- Council: A consensus group consisting of the leader of each shard group. The council's log keeps track of changes to key shard ranges and their mapping to shard groups.
- President: A decision-making node in the council. Analogous to the leader role; in particular, the system guarantees that there is at most one president at any time. The president may be regarded as the "leader of the leaders." In implementations using more than two layers of hierarchy, there will be a leader at each level.

Each leader (and the president, for the council) is selected by the member nodes of each group. Further terms for describing this process include:
- Election: The process by which nodes decide on one of them taking the role of the leader. This either results in an unequivocal decision or in no decision at all, in which latter case the election is repeated.
- Term: Ordinal number of election processes in a shard group, counting both those that successfully ended in electing a leader and those that did not. In any case, whenever there is a leader, it has a term associated with it, which is a number starting at 1. The term is per-shard-group, in that each shard group may be on a different term at any given time.
- Epoch: The ordinal number of election processes in the council. Analogous to the term of shard groups.
- Log: Ordered sequence of entries that is agreed on by all nodes in the consensus group. The log is written only by the leader for shard groups and by the president for the council.

Figure 2 illustrates the internals of a node, from a logical perspective, and shows that nodes in the system communicate over an internal network. The solutions described herein require the network to enable unique identification of nodes, but different technologies can be imagined (IPv4, IPv6, InfiniBand, etc.). Clients of the system connect to the nodes over an external network that might be implemented using the same technology as the internal network. However, certain optimizations might also be applied, such as using shared-memory communication between a client and a node if they are deployed on the same physical server. The figure shows that clients might connect to any of the nodes in the system. Turning to the internals of nodes, the consensus algorithm of nodes taking part in the council maintains a council log to store the higher-level metadata of the council and the current epoch.
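As a minimal illustration of the log concept defined above, the sketch below shows a leader-only append path and in-order replication to a follower. It is not the Raft implementation of [2]; message passing, elections, and commit rules are omitted, and all names are assumptions made for the example.

```python
# Minimal sketch of a leader-appended, strictly ordered log (not full Raft).
class ReplicatedLog:
    def __init__(self, node_id):
        self.node_id = node_id
        self.entries = []                      # ordered sequence of (term, command)
        self.commit_index = -1                 # index of the last committed entry

    def append_as_leader(self, term, command):
        # Only the current leader (or the president, for the council log) appends.
        self.entries.append((term, command))
        return len(self.entries) - 1

    def replicate_to_follower(self, follower, up_to):
        # Followers receive entries from the leader and apply them in log order.
        follower.entries = list(self.entries[:up_to + 1])
        follower.commit_index = up_to

leader_log = ReplicatedLog("node-1")
follower_log = ReplicatedLog("node-2")
idx = leader_log.append_as_leader(term=3, command=("allocate-shard", "group-2", (34, 66)))
leader_log.replicate_to_follower(follower_log, idx)
print(follower_log.entries, follower_log.commit_index)
```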
As will be detailed below, the president allocates nodes to shard groups and also manages the allocation of the key space to shards, by splitting and merging shards. The leader of each shard group may also add, to its shard group log, information indicating the allocations of nodes to groups and/or the allocation of the key space; as a result, this information is distributed to all nodes in the cluster. The shard group log contains the current term for the respective shard group, and information about the assigned keys. An important aspect of the techniques described herein is ensuring cluster scalability via the introduction of multi-level meta-information handling.

For the actual storage of keys and values at the shard group level, at least two implementations are possible:
- One option is to replicate the keys and values of the assigned shard to all nodes in the group. As a result, any node in the group can service data access requests directly. In practice this implies that all operations, along with the data (creation, modification, etc.), are replicated, via the shard group log, to the nodes of the group.
- Another option is that the consistency protocol on the shard group level replicates meta-information about the actual location of the allocated keys via the shard group log. This solution enables separating the storage and replication of the meta-information (keys) from the actual values (data). This is the solution applied in DAL [1] as well; the replicated meta-information about each key is a list of nodes storing the data itself. While this creates another level of indirection, the location can be cached, and this solution enables moving the data freely between nodes to ensure the best possible data locality for applications.

Following are details about how the topology information needs to be handled in this multi-level consensus architecture. Different procedures required for the invention to function are described, along with discussion of optional features. Options and implementation suggestions regarding sharding-related aspects are then provided. Finally, details are provided for how the Raft [2] algorithm needs to be modified to be applicable in the new system and method.

Management of topology information (global metadata):
- Beyond identification of servers in the council (IP address, UUID, etc.), the identification of each leader is accompanied by its respective term from its own shard group. This information is kept alongside the cluster configuration information, replicated first in the council and then in all shard groups by their leaders, and used for filtering out communication from old leaders. By "cluster configuration information" is meant the metainformation about the shard groups and the key space allocations.
- The council handles consensus updates of the key shard ranges for each shard group. This mapping is updated during scaling events (when the number of nodes changes). The mapping, together with the identity and role of all nodes, is called the topology information of the cluster; it is replicated in the same council-then-groups fashion to each leader node as a log entry and then, in some embodiments, to each member of the groups.

Details of an example scale-out procedure are illustrated in Figure 3 and Figure 4. Note that in Figure 3, double-lined boxes signify blocks that can have pluggable decision algorithms, examples of which are given below. As shown in Figures 3 and 4, a new node joins the cluster through the president, which assigns it to a shard group.
Neither the specific details of how nodes can find the president nor the specifics of the decision algorithm are essential to an understanding of the new techniques and systems described herein, but examples are given below. Upon learning of its assigned shard group, the new node joins it; once this is finished, the group's leader registers the new node and reports it to the president. The updated topology information is first replicated by the council and then to each node through their respective leaders. If the size of the shard group would increase above a certain threshold as a consequence of adding the new node, the president may initiate a splitting of the group into two. Consequently, the number of groups increases by one, as does the size of the council, and some reshuffling of the key ranges between groups may happen. Examples are given below. Splitting may be executed in two phases. First, the president decides which nodes should stay in the old group and which should form the new one, and this is logged and replicated as the new topology information on both levels. Second, any necessary transfer of key range metadata between the participating groups is performed. Further details are provided below.

Figure 5 illustrates an example of a scale-in process, i.e., the process for when a node leaves the cluster. Again, double-lined boxes in the figure signify blocks that can have pluggable decision algorithms, examples of which are given below. A node may leave a shard group either gracefully, if requested by the president due to decreasing client activity, or because of a compute or networking failure, in which case it simply disappears from the perspective of the rest of the cluster. In the latter case its leader propagates this information to the president, and the topology information is updated and replicated to all nodes. If the size of the group decreases below a threshold, the president may decide to merge the remainder of the shard group into the other groups. In this case there will be one fewer group (and leader), but otherwise the process is similar to splitting, in that the new topology information is first replicated in the council and then in all remaining groups. The nodes of the old group transfer their respective key range metadata to the remaining groups according to some algorithm (see an example below), and the old group's leader is demoted to the follower role.

Part of the procedure shown in Figure 5 corresponds to a shard leader change, e.g., where the node leaving the cluster is the leader of a shard. According to this procedure:
- If a leader node leaves a group, then the rest of the group executes an election as normal. Until the election is completed, that shard group does not make any key metadata changes, but metadata lookups and the rest of the groups are unaffected.
- [Optional] If the leaving leader was also the president, then the council holds its own election in parallel. During this process no new nodes can join, nor can any other topology changes be logged, such as the above leader change.
- Once the new leader is elected for the group, this fact is reported to the council as soon as the council has a president, if there was an election there too. In all cases, the new leader's identity will eventually be replicated in both the council and all shard groups, as usual.

Several options are possible for discovery of the president, as well as for node assignment.
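Before turning to the president-discovery and node-assignment options, the following Python sketch illustrates the threshold-driven split and merge decisions described above. The thresholds, group naming, and merge policy are assumptions made for the example; as noted above, these decision blocks are pluggable.

```python
# Illustrative split/merge decisions made by the president; thresholds and the
# merge policy are assumed here, since the actual decision algorithms are pluggable.
MIN_GROUP_SIZE = 3
MAX_GROUP_SIZE = 7

def on_node_joined(groups, group_id, node_id):
    """groups: dict mapping group id -> list of node ids; returns topology log entries."""
    groups[group_id].append(node_id)
    entries = [("node-joined", group_id, node_id)]
    if len(groups[group_id]) > MAX_GROUP_SIZE:       # group grew too large: split it
        old = groups.pop(group_id)
        groups[group_id] = old[: len(old) // 2]      # nodes staying in the old group
        new_id = group_id + "-split"
        groups[new_id] = old[len(old) // 2:]         # nodes forming the new group
        entries.append(("group-split", group_id, new_id))
    return entries

def on_node_left(groups, group_id, node_id):
    groups[group_id].remove(node_id)
    entries = [("node-left", group_id, node_id)]
    if len(groups[group_id]) < MIN_GROUP_SIZE and len(groups) > 1:
        leftovers = groups.pop(group_id)             # group too small: merge its nodes away
        targets = sorted(groups)
        for i, n in enumerate(leftovers):
            target = targets[i % len(targets)]
            groups[target].append(n)
            entries.append(("node-merged", target, n))
        entries.append(("group-removed", group_id))
    return entries
```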
New nodes that are joining the cluster can learn of the president through broadcasting on well-known ports, for example, or via any DNS-based service, e.g., as in a Kubernetes implementation. Regarding the assignment of new nodes to a shard group, the president may assign new nodes to the shard group with the least current load or the highest unused compute capacity, or according to explicit configuration, programmatic or otherwise. Alternatively, the group can be selected based on physical locality, or on the lowest network delay between the new node and the members of the group.

As discussed above, it is the task of the president to assign keys to key shards and then key shards to shard groups. The president makes these decisions, and this information is then shared and managed at the council level, and then replicated to the next lower level. Distribution of keys into shards can be done as simply as lexicographically. In this case, however, it is hard to guarantee that the key ranges will contain a roughly equal number of keys, a property that is important for keeping the load of metadata operations in the system fair and controlled. Another approach is to use a hash function, where each key is assigned a number from a known finite range. Mathematically good hash functions will distribute the keys evenly in the number range, and thus a fair grouping can be done by splitting the range into equal-sized chunks. Figure 6 shows a simple range-based sharding example, with key hash values ranging from 0 to 99. In the illustrated example, with three shard groups:
- keys with hash values of 0-33 are assigned to Shard group 1;
- keys with hash values of 34-66 are assigned to Shard group 2; and
- keys with hash values of 67-99 are assigned to Shard group 3.
In the council (nodes 1, 4 and 7 in the example shown in Figure 1), this information is replicated, using the consistency protocol, along with the memberships for each group (e.g., information indicating that nodes 1, 2 and 3 are in shard group 1). Then, this information may be added to each shard group log, for replication among the members of each shard group. Having globally agreed, consistent metainformation enables other key-to-shard allocation solutions besides the detailed range-based allocation. One such solution uses the Highest Random Weight (HRW) algorithm, which minimizes the changes to existing key ranges upon adding or removing a group [9]. A brief illustrative sketch of both allocation options is given below.

Following are implementation-related details and examples for cluster, group, and shard allocation management. These are recommendations and examples; other implementations of these parts can yield functioning systems. One advantage of the techniques described herein is that they enable large database cluster sizes. When the cluster is started, it might use a configuration method or discovery protocol to start with a larger number of nodes right away. In other cases, the cluster might grow from a small set of nodes by adding new nodes gradually. For the sake of completeness, this section presents information and limitations about smaller clusters. With small cluster sizes, the logical topology requires only a single consistency group to be operational (this is the prior art). For example, having only three nodes in the cluster requires a simple 3-node group with a single leader. In these scenarios the group acts as a single shard handling the whole key space. This scenario can be seen in Figure 7, part (a). As nodes are added to the cluster, the council must at some point be created by the single leader.
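The sketch referenced above follows. It shows the range-based allocation of the Figure 6 example and an HRW-style alternative [9]; the hash function and group identifiers are assumptions made for illustration.

```python
# Illustrative key-to-shard allocation: range-based over key hashes (as in the
# Figure 6 example) and Highest Random Weight (HRW); names and hashing assumed.
import hashlib

def key_hash(key, modulus=100):
    return int(hashlib.sha256(key.encode()).hexdigest(), 16) % modulus

RANGES = {"shard-group-1": range(0, 34),     # hash values 0-33
          "shard-group-2": range(34, 67),    # hash values 34-66
          "shard-group-3": range(67, 100)}   # hash values 67-99

def range_based_group(key):
    h = key_hash(key)
    return next(g for g, r in RANGES.items() if h in r)

def hrw_group(key, groups=("shard-group-1", "shard-group-2", "shard-group-3")):
    # Each group gets a pseudo-random weight for the key; the highest weight wins.
    # Adding or removing a group only relocates the keys that scored highest for it.
    def weight(group):
        return int(hashlib.sha256((group + ":" + key).encode()).hexdigest(), 16)
    return max(groups, key=weight)

print(range_based_group("session-42"), hrw_group("session-42"))
```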
In a council of exactly two nodes, the election of a president cannot be committed according to the consistency protocol. As a solution, the system should either never have exactly two shard groups, or it should add an extra council member in such cases, at the cost of the extra complexity of that member replicating and acting on council log messages and actions despite not being a leader itself. Furthermore, consistency groups are more resilient against networking-related separation issues with an odd number of nodes, because in any possible separation scenario one of the sides will always have the majority of nodes and remain functional for running elections. As also seen in Figure 7, part (b), one opportunity for forming the council is when the single group has nine nodes. At this point the current leader can create three shard groups with three nodes in each. In the event that the cluster is being scaled in, having three shard groups with three nodes in each is also a realistic cluster size at which the president may decide to merge all key shards into one group. The right side of the figure refers back to the example shard allocation in Figure 6. When the groups are created, the nodes of the groups can simply delete key-value information for ranges they no longer manage. When the cluster is being scaled in and the council is removed, key-value information needs to be distributed to all nodes in the new single group, via the consistency protocol. Finally, Figure 7, part (c) shows an example of scaling out from three to five shard groups, with 15 nodes in the system. The advantage of this configuration is that the cluster is in its most stable form, with an odd number of groups and an odd number of nodes in each group, both before and after the change. As the illustrated example sharding suggests, group-local key-value synchronization is only required in two out of the five groups; in the remaining three, only deletion is necessary.

A straightforward application of the methodology described herein to an existing solution can be done using Raft [2]. In the following, knowledge of the original Raft protocol is assumed; only additions to it are described. The council elects its president the same way as a Raft cluster. When a new leader joins the council, the AddServer RPC is executed in a modified way compared to the original Raft algorithm (Fig. 4.1 in [8]), as follows:

AddServer RPC
Received by the president when a new leader is elected in any shard group.
Arguments:
- newServer: identity and address of the leader to be added
- inGroup: identity of the shard group the leader belongs to
- newTerm: term in which the leader was elected in its own shard group
Results:
- status: OK if the leader was added successfully
- leaderHint: address or identity of a recent president, if known
Receiver implementation:
- (same as Raft) Reply NOT_LEADER if not president.
- Check whether the shard group identified by inGroup is already known (that is, whether its leader was on the council). If yes, and newTerm is smaller than the previous term stored for inGroup, reply with an error, or optionally ignore the RPC, depending on system settings.
- (same as Raft from here) Catch up newServer (with the council log) for a fixed number of rounds…

Council members receive and process log messages from the president in the same way as followers do with leader messages, except that if at any time a server that is not currently a leader receives such a message, it should silently ignore it.
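A minimal sketch of the modified AddServer receiver check is given below, reusing the illustrative CouncilState structure sketched earlier. The status strings and field names are assumptions; only the stale-term filtering step added on top of the original Raft behavior is shown.

```python
# Sketch of the modified AddServer receiver check: reject (or ignore) calls that
# carry an older shard-group term than the one already recorded for that group.
def handle_add_server(council_state, new_server, in_group, new_term, i_am_president):
    if not i_am_president:
        return "NOT_LEADER"                              # same as original Raft
    known = council_state.leaders.get(in_group)          # (leader id, term) or None
    if known is not None and new_term < known[1]:
        return "STALE_TERM"                              # communication from an old leader
    council_state.leaders[in_group] = (new_server, new_term)
    # From here, proceed as in original Raft: catch up newServer with the council log.
    return "OK"
```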
The described method can be extended for clusters with a very large number of nodes, using more grouping tiers. With generalization of the council-versus-shard-group organization of nodes, one can introduce multiple council-like layers, all communicating topology changes from the layer above to the one below. The lowest and highest layers retain their extra functionality as described above: the lowest is the shard group that manages key metadata, and the highest is the top-level council that has the president role and is responsible for cluster membership management.

Referring back to Figure 1, it will thus be appreciated that it illustrates a system for distributed shared storage of data records, where the system comprises one or more physical nodes and comprises a plurality of logical nodes, shown as logical nodes 1-9 in the figure. Other examples may include fewer or more logical nodes, and the number of logical nodes may vary over time, as was discussed above. Each logical node is implemented on a respective one or more of the one or more physical nodes and comprises a first interface configured to communicate with others of the logical nodes and a second interface configured to communicate with one or more client applications. As seen in the figure, the plurality of logical nodes consists of multiple disjoint subgroups of logical nodes, shown in the figure as shard group 1, shard group 2, and shard group 3, where each subgroup comprises a plurality of logical nodes. Again, there can be more of these subgroups. There can be fewer as well, although three or more may be preferred, to facilitate elections and other processes. In this system, a single logical node from each of the multiple subgroups is designated as a leader of the respective subgroup – these are shown in the figure as leader 1, leader 2, and leader 3. This may be the result of an election process, as described elsewhere in this document. A single one of the leaders is further designated as leader of the leaders – this is leader 1 in the illustration, which refers to this leader of leaders as the "president." The leader of the leaders is configured to maintain a first log recording allocations to each subgroup of a respective disjoint subset of a set of shared resources (often referred to as "keys"), and is configured to distribute a copy of the first log and updates of the first log to each of the remaining leaders. The leader of each subgroup (leader 1, leader 2, leader 3) is configured to maintain a subgroup log recording metainformation for each resource in the subset of shared resources allocated to the subgroup.

In some embodiments of a system like that shown in Figure 1, the leader of each subgroup is configured to distribute a copy of the first log and updates of the first log to each logical node in the subgroup. In some of these and in some other embodiments, the leaders of the subgroups are configured to elect the leader of the leaders from among the leaders of the subgroups, in response to determining that no leader of the leaders has been elected or that a previous leader of the leaders is unreachable. In some embodiments, the leader of the leaders is configured to manage changes to the number of subgroups. The leader of the leaders may be configured to detect that a first subgroup has fewer than a threshold number of logical nodes and, in response, merge the logical nodes of the first subgroup to one or more others of the multiple subgroups, thereby eliminating the first subgroup.
The leader of the leaders may be further configured to then update the first log and distribute the updated first log or an update of the first log to each of the leaders of subgroups other than the subgroup of the leader of the leaders. Similarly, the leader of the leaders may be configured to detect that a first subgroup has more than a threshold number of logical nodes and, in response, split the first subgroup into two subgroups. The leader of the leaders may be configured to then update the first log and distribute the updated first log or an update of the first log to each of the leaders of subgroups other than the subgroup of the leader of the leaders.

In various embodiments of the system illustrated in Figure 1 and variants thereof, the leader of the leaders is configured to allocate, to each subgroup, a respective one of the disjoint subsets of the shared resources. In some of these embodiments, the leader of the leaders may be configured to perform this allocation in response to detecting that the number of subgroups has changed. In some of these and in some other embodiments, the leader of the leaders is configured to perform this allocation using any of: hash values corresponding to the shared resources; lexicographic values corresponding to the shared resources; and a highest random weight algorithm applied to values corresponding to the shared resources. In various embodiments of systems like the one illustrated in Figure 1, each leader is configured to report, to the leader of the leaders, a change in logical nodes in the respective subgroup.

Figure 8 is a block diagram illustrating an example physical node, which may be configured to participate as a leader node in distributed shared storage of data records by a plurality of nodes. This example node 800, which may be a server, for example, comprises processing circuitry 810 and interface circuitry 820 operatively coupled to the processing circuitry 810 and configured to communicate with other ones of the plurality of nodes and with one or more clients. The processing circuitry 810 may comprise one or more microprocessors 812 or the like, as well as memory 814, with the memory storing program instructions 815 for execution by the microprocessors, as well as the metainformation and data records 817 described throughout this document.

The processing circuitry 810 is configured to participate in an election of a leader of leaders, from among a group of nodes comprising the node and two or more other leader nodes, as well as to receive, from the leader of leaders, an allocation of a subset of a set of shared resources, the subset consisting of shared resources to be stored by a subgroup of nodes managed by the node 800. The processing circuitry 810 is further configured to manage a consistency algorithm to store the subset of shared resources among the subgroup of the nodes managed by the node 800, where managing the consistency algorithm comprises maintaining a subgroup log recording metainformation for each resource in the subset of shared resources allocated to the subgroup. In some embodiments, the processing circuitry 810 may be configured to receive, from the leader of leaders, a log recording allocations to each of a plurality of subgroups of a respective disjoint subset of the set of shared resources, and to distribute the log to the nodes of the subgroup of nodes managed by the node 800.
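For illustration, the following Go sketch shows one possible shape for the log that a leader node receives from the leader of the leaders and redistributes within its own subgroup. The type and field names are assumptions; the text prescribes only that each entry associate a subgroup with its disjoint subset of the shared resources.

package council

// KeyRange identifies a contiguous portion of the shared key space; a disjoint
// subset of the shared resources can be described as a list of such ranges.
type KeyRange struct {
	Start string // inclusive
	End   string // exclusive
}

// AllocationEntry is one record in the council ("first") log: which subgroup
// owns which ranges, and in which council term the allocation was committed.
type AllocationEntry struct {
	Term    uint64
	Index   uint64
	GroupID string
	Ranges  []KeyRange
}

// CouncilLog is maintained by the president, replicated to the other leaders,
// and replayed by each leader to the members of its own subgroup.
type CouncilLog struct {
	Entries []AllocationEntry
}

// Append records a new allocation and returns the entry to be distributed.
func (l *CouncilLog) Append(e AllocationEntry) AllocationEntry {
	e.Index = uint64(len(l.Entries)) + 1
	l.Entries = append(l.Entries, e)
	return e
}

// RangesFor returns the most recent allocation for a subgroup, so that a group
// can delete key-value information for ranges it no longer manages.
func (l *CouncilLog) RangesFor(group string) []KeyRange {
	for i := len(l.Entries) - 1; i >= 0; i-- {
		if l.Entries[i].GroupID == group {
			return l.Entries[i].Ranges
		}
	}
	return nil
}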
The processing circuitry may be further configured to report, to the leader of the leaders, a change in logical nodes in the respective subgroup. In some embodiments, the processing circuitry 810 may be further configured to, in response to the node 800 being elected the leader of the leaders, maintain a council log recording allocations to each subgroup of a respective disjoint subset of a set of shared resources and distribute a copy of the council log and updates of the council log to each of the remaining leader nodes.

In some embodiments, the processing circuitry 810 may be further configured to, while the node 800 is acting as leader of the leaders, detect that a first subgroup has fewer than a threshold number of logical nodes and, in response, merge the logical nodes of the first subgroup to one or more others of the multiple subgroups, thereby eliminating the first subgroup. The processing circuitry 810 in these embodiments may be further configured to update the first log and distribute the updated first log or an update of the first log to each of the leaders of subgroups other than the subgroup of the node 800. In some embodiments, the processing circuitry 810 is further configured to, while the node 800 is acting as leader of the leaders, detect that a first subgroup has more than a threshold number of logical nodes and, in response, split the first subgroup into two subgroups. The processing circuitry 810 in these embodiments may again be further configured to update the first log and distribute the updated first log or an update of the first log to each of the leaders of subgroups other than the subgroup of the node.

In some embodiments, the processing circuitry 810 is further configured to, while the node 800 is acting as leader of the leaders, allocate, to each subgroup, a respective one of the disjoint subsets of the shared resources. In some of these embodiments, the processing circuitry 810 is configured to perform said allocation in response to detecting that the number of subgroups has changed. In various embodiments, this allocation may be performed using any of: hash values corresponding to the shared resources; lexicographic values corresponding to the shared resources; and a highest random weight algorithm applied to values corresponding to the shared resources.
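As an illustration of the last of these options, the following Go sketch applies highest-random-weight (rendezvous) hashing [9] to assign a resource to a subgroup. The use of SHA-256 and the function signatures are assumptions made for the example.

package council

import (
	"crypto/sha256"
	"encoding/binary"
)

// hrwWeight derives a pseudo-random weight for a (resource, group) pair by
// hashing the two identifiers together.
func hrwWeight(resource, group string) uint64 {
	sum := sha256.Sum256([]byte(resource + "/" + group))
	return binary.BigEndian.Uint64(sum[:8])
}

// AssignResource returns the subgroup that should own the given resource: the
// group with the highest weight for that resource. When subgroups are added or
// removed, only resources whose winning group changes need to be moved.
func AssignResource(resource string, groups []string) string {
	var best string
	var bestWeight uint64
	for i, g := range groups {
		if w := hrwWeight(resource, g); i == 0 || w > bestWeight {
			best, bestWeight = g, w
		}
	}
	return best
}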
Figure 9 illustrates an example method as might be implemented by a system like that shown in Figure 1. Figure 9 should be understood as a generalized method that is intended to encompass many, if not all, of the example techniques described above. Thus, where the terminology used to describe this method differs slightly from the terminology used in the discussions above, it should be assumed that the terminology used here is meant to refer to the same thing, or to encompass a more general thing, than the corresponding term used above.

The method shown in Figure 9 is implemented in a system comprising one or more physical nodes and comprising a plurality of logical nodes, where each logical node is implemented on a respective one or more of the one or more physical nodes and comprises a first interface configured to communicate with others of the logical nodes and a second interface configured to communicate with one or more client applications. The plurality of logical nodes consists of multiple disjoint subgroups of logical nodes, each subgroup comprising a plurality of logical nodes. A single logical node from each of the multiple subgroups is designated as a leader of the respective subgroup, and a single one of the leaders is designated as leader of the leaders.

As shown at block 910, the method comprises maintaining, by the leader of the leaders, a first log recording allocations to each subgroup of a respective disjoint subset of a set of shared resources and distributing a copy of the first log and updates of the first log to each of the remaining leaders. As shown at block 920, the method comprises maintaining, by the leader of each subgroup, a subgroup log recording metainformation for each resource in the subset of shared resources allocated to the subgroup.

Figure 10 illustrates a corresponding method, as implemented in a node configured to participate as a leader node in a system like that shown in Figure 1 (and variants thereof). Again, Figure 10 should be understood as a generalized method that is intended to encompass many of the example techniques described above, so where the terminology used to describe this method differs slightly from the terminology used in the discussions above, it should be assumed that the terminology used here is meant to refer to the same thing, or to encompass a more general thing, than the corresponding term used above.

As shown at block 1010, the example method comprises participating in an election of a leader of leaders, from among a group of nodes comprising the node and two or more other leader nodes. As shown at block 1020, the method further comprises receiving, from the leader of leaders, an allocation of a subset of a set of shared resources, the subset consisting of shared resources to be stored by a subgroup of nodes managed by the node. As shown at block 1030, the method still further comprises managing a consistency algorithm to store the subset of shared resources among the subgroup of the nodes managed by the node, where managing the consistency algorithm comprises maintaining a subgroup log recording metainformation for each resource in the subset of shared resources allocated to the subgroup.

In some embodiments or instances, the method may further comprise receiving, from the leader of leaders, a log recording allocations to each of a plurality of subgroups of a respective disjoint subset of the set of shared resources and distributing the log to the nodes of the subgroup of nodes managed by the node. In some embodiments or instances, the method further comprises reporting, to the leader of the leaders, a change in logical nodes in the respective subgroup. In some embodiments or instances, the method further comprises, in response to the node being elected the leader of the leaders, maintaining a council log recording allocations to each subgroup of a respective disjoint subset of a set of shared resources and distributing a copy of the council log and updates of the council log to each of the remaining leader nodes. In some of these embodiments or instances, the method further comprises, while the node is acting as leader of the leaders, detecting that a first subgroup has fewer than a threshold number of logical nodes and, in response, merging the logical nodes of the first subgroup to one or more others of the multiple subgroups, thereby eliminating the first subgroup, and updating the first log and distributing the updated first log or an update of the first log to each of the leaders of subgroups other than the subgroup of the node.
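To make the subgroup log of blocks 920 and 1030 more concrete, the following Go sketch shows one possible form for it. The specific metainformation fields are assumptions for the example; the text requires only that metainformation be kept for each resource in the allocated subset.

package council

// Metainfo is illustrative per-resource metainformation kept by a group leader.
type Metainfo struct {
	Key      string   // the shared resource ("key")
	Version  uint64   // last committed version of the record
	Replicas []string // logical nodes in the subgroup holding a copy
}

// SubgroupLog is maintained by the leader of each subgroup and replicated to
// the group's members through the group-local consistency protocol.
type SubgroupLog struct {
	entries map[string]Metainfo
}

// NewSubgroupLog returns an empty subgroup log.
func NewSubgroupLog() *SubgroupLog {
	return &SubgroupLog{entries: make(map[string]Metainfo)}
}

// Record upserts metainformation for a resource the subgroup manages.
func (s *SubgroupLog) Record(m Metainfo) {
	s.entries[m.Key] = m
}

// DropUnmanaged deletes metainformation for keys the subgroup no longer
// manages, for example after the council log reallocates ranges elsewhere.
func (s *SubgroupLog) DropUnmanaged(noLongerManaged func(key string) bool) {
	for k := range s.entries {
		if noLongerManaged(k) {
			delete(s.entries, k)
		}
	}
}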
In some other embodiments or instances, the method may further comprise, while the node is acting as leader of the leaders, detecting that a first subgroup has more than a threshold number of logical nodes and, in response, splitting the first subgroup into two subgroups, and updating the first log and distributing the updated first log or an update of the first log to each of the leaders of subgroups other than the subgroup of the node. In some embodiments or instances, the method may comprise, while the node is acting as leader of the leaders, allocating, to each subgroup, a respective one of the disjoint subsets of the shared resources. In some of these embodiments or instances, the method may comprise performing this allocation in response to detecting that the number of subgroups has changed. In various embodiments or instances, this allocation may use any of hash values corresponding to the shared resources, lexicographic values corresponding to the shared resources, and a highest random weight algorithm applied to values corresponding to the shared resources.

The detailed examples given above are provided for illustrative purposes, and no particular example is intended to be limiting. The scope of the present invention is at least that of the claims appended hereto.

REFERENCES
1. G. Németh, D. Géhberger, and P. Mátray. "DAL: A locality-optimizing distributed shared memory system." 9th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 17), 2017.
2. D. Ongaro and J. Ousterhout. "In search of an understandable consensus algorithm." USENIX ATC, 2014.
3. F. P. Junqueira, B. C. Reed, and M. Serafini. "Zab: High-performance broadcast for primary-backup systems." 2011 IEEE/IFIP 41st International Conference on Dependable Systems & Networks (DSN), IEEE, 2011.
4. L. Lamport. "Paxos made simple." ACM SIGACT News 32.4 (2001): 18-25.
5. etcd Admin Guide on the recommended cluster size, available at etcd.io/docs/v2.3/admin_guide/
6. CockroachDB scaling Raft, available at www.cockroachlabs.com/blog/scaling-raft/
7. TiKV Multi-Raft, available at tikv.org/deep-dive/scalability/multi-raft/
8. Apache Ozone Multi-Raft, available at blog.cloudera.com/multi-raft-boost-up-write-performance-for-apache-hadoop-ozone/
9. D. Thaler and C. Ravishankar. "A Name-Based Mapping Scheme for Rendezvous." University of Michigan Technical Report CSE-TR-316-96.