Title:
LAYERED DISTRIBUTED STORAGE SYSTEM AND TECHNIQUES FOR EDGE COMPUTING SYSTEMS
Document Type and Number:
WIPO Patent Application WO/2018/217715
Kind Code:
A1
Abstract:
A two-layer erasure-coded fault-tolerant distributed storage system offering atomic access for read and write operations is described. In some embodiments, a class of erasure codes known as regenerating codes (e.g. minimum bandwidth regenerating codes) for storage of data in a backend layer is used to reduce the cost of backend bulk storage and helps reduce communication cost of read operations, when a value needs to be recreated from persistent storage in the backend layer. By separating the functionality of edge layer servers and backend servers, a modular implementation for atomicity using storage-efficient erasure-codes is provided. Such a two-layer modular architecture permits protocols needed for consistency implementation to be substantially limited to interaction between clients and the edge layer, while protocols needed to implement erasure codes are substantially limited to the interaction between edge and backend layers.

Inventors:
KONWAR KISHORI (US)
NARAYANA MOORTHY PRAKASH (US)
MEDARD MURIEL (US)
LYNCH NANCY (US)
Application Number:
PCT/US2018/033844
Publication Date:
November 29, 2018
Filing Date:
May 22, 2018
Assignee:
MASSACHUSETTS INST TECHNOLOGY (US)
International Classes:
G06F15/173; G06F17/00; G06F21/00
Foreign References:
US20140304513A12014-10-09
US20120266044A12012-10-18
US20140317206A12014-10-23
US20120096546A12012-04-19
US20130332814A12013-12-12
US201715838966A2017-12-12
Other References:
GAO ET AL.: "Application Specific Data Replication for Edge Services", WWW 2003: The Twelfth International World Wide Web Conference, Budapest, Hungary, 20 May 2003, pages 449-460, XP058239936
See also references of EP 3631641A4
Attorney, Agent or Firm:
DALY, Christopher, S. et al. (US)
Claims:
What is claimed is:

1. A layered distributed storage (LDS) system comprising:

a plurality of edge layer servers, each of said plurality of edge layer servers including an interface with which to couple to one or more client nodes, a processor for processing read and/or write requests from the client nodes and for generating tag-value pairs, and a storage for storing lists of tag-value pairs;

a plurality of backend layer servers, each of said plurality of backend layer servers comprising an edge-backend layer interface for coupling said backend layer servers to one or more of said edge layer servers, a processor for generating codes and a storage having coded versions of tag-value pairs stored therein.

2. The LDS system of claim 1 wherein the tag-value pairs are coded by at least one of:

(a) erasure codes;

(b) minimum bandwidth regeneration (MBR) codes; or

(c) random linear network codes.

3. The LDS system of claim 1 wherein:

the storage in at least some of said edge-layer servers is temporary storage; and

the storage in at least some of said backend layer servers is persistent storage.

4. An edge layer server comprising:

an interface with which to couple to one or more client nodes,

a processor for processing read and/or write requests from the client nodes and for generating tag-value pairs; and

a storage configured to store lists of tag-value pairs.

5. A backend layer server comprising:

an edge-backend layer interface configured to couple said server to one or more edge layer servers; a processor for generating codes; and

a storage configured to store coded versions of tag-value pairs stored therein.

6. In a system for coded consistent distributed storage having an edge layer L1 including n1 edge layer servers and a backend layer L2 including n2 backend layer servers, a method of writing data comprising:

receiving a write request;

each server in the edge layer L1 having access to an object value v, at an appropriate point in the execution, encoding the object value v using a code C2; and sending coded data c_n1+i to server s_n1+i in the backend layer L2, where 1 ≤ i ≤ n2.

7. In a system having a layered architecture for coded consistent distributed storage, a method of reading data comprising:

a server s_j in the edge layer L1 reconstructing coded data c_j using content from a backend layer L2, wherein the coded data c_j may be considered as part of a code C, and the coded portion c_j is reconstructed via a repair procedure invoked by the server s_j in the edge layer L1, wherein d helper servers belong to the backend layer L2.

8. The method of claim 7 wherein said servers operate at the MBR point so as to minimize the cost needed by the server s_j to reconstruct c_j.

9. The method of claim 7 wherein k servers in the edge layer L1 provide to the reader k coded data elements.

10. The method of claim 9 wherein, in response to receiving k coded data elements from k servers in the edge layer L1, the reader uses the code C1 to attempt decoding an object value v.

11. In a system having a layered architecture for coded consistent distributed storage, an edge layer server method comprising:

receiving a read request at one or more edge layer servers; determining whether there exists an overlap with a concurrent write operation or an internal write-to-L2 operation;

in response to an overlap existing, providing a reader tag-value pairs (t, v) served from temporary storage of one or more servers in the edge layer L1;

in response to an overlap not existing, regenerating one or more tag-coded element pairs (tag, coded-element) in one or more servers in the edge layer L1.

12. The method of claim 11 further comprising sending edge layer regenerated tag, coded-element pairs to the reader.

13. The method of claim 11 wherein regenerating one or more tag-coded element pairs (tag, coded-element) in one or more servers in the edge layer L1 comprises the edge layer servers utilizing information from the backend layer servers.

14. The method of claim 12 wherein utilizing information from the backend layer servers comprises a regenerate-from-L2 operation.

Description:
LAYERED DISTRIBUTED STORAGE SYSTEM AND

TECHNIQUES FOR EDGE COMPUTING SYSTEMS

BACKGROUND

[0001] As is known in the art, edge computing systems and applications which utilize such systems are an emerging distributed computing paradigm. In edge computing, users (or "clients") interact with servers on an edge of a network. Such servers are thus said to form a first layer of servers. The edge servers, in turn, interact with a second layer of servers in a backend of the edge computing system and thus are referred to as "backend layer servers". While the edge servers are typically in geographic proximity to clients, the backend layer of servers is often provided as part of a data center or a cloud center which is typically geographically distant from the clients and edge servers. The geographic proximity of the edge servers to clients permits high speed operations between clients and the edge layer, whereas communication between the edge servers and the backend is typically much slower. Such decentralized edge processing or computing is considered to be a key enabler for Internet of Things (IoT) technology.

[0002] As is also known, providing consistent access to stored data is a fundamental problem in distributed computing systems, in general, and in edge computing systems, in particular. Irrespective of the actual computation involved, application programs (also referred to simply as "applications" or more simply "apps") in edge computing systems must typically write and read data. In settings where several writers attempt to concurrently or simultaneously update stored data, there is potential confusion on the version of data that should be stored during write operations and returned during read operations. Thus, implementation of strong consistency mechanisms for data access is an important problem in edge computing systems and is particularly important in those systems which handle massive amounts of data from many users.

[0003] To reduce, and ideally minimize, potential confusion with respect to different versions of the same data, consistency policies (or rules) may be imposed and implemented to deal with problems which arise because of concurrent access of data by clients. One well-known, widely acknowledged and highly desirable form of consistency policy is known as "atomicity" or "strong consistency" which, at an application level, gives users of a distributed system (e.g. an edge computing system) the impression of a single machine executing concurrent read and write operations as if the executions take place one after another (i.e. sequentially). Thus, atomicity, in simple terms, gives the users of a data service the impression that the various concurrent read and write operations take place sequentially. An ideal consistency solution should complete client operations via interaction only with the edge layer, whenever possible, thereby incurring low latency.

[0004] This is not possible, however, in all situations, since practical edge servers have finite resources, such as finite storage capacity, and in some systems and/or uses the edge layer servers may be severely restricted in their total storage capacity as well as in other resources. For example, in situations where several thousands of files are being serviced, the edge servers typically do not have the capacity to store all the files all the time. In such situations, the edge servers rely upon the backend layer of servers for permanent storage of files that are less frequently accessed. Thus, the servers in the first layer act as virtual clients of the second layer servers.

[0005] Although various consistency policies (often weaker than strong consistency) are widely implemented and used in conventional processing systems, there is a lack of efficient implementations suitable for edge-computing systems. One important challenge in edge-computing systems, as described above, is reducing the cost of operation of the backend layer servers. Communication between the edge layer servers and backend layer servers, and persistent storage in the backend layer, contribute to the cost of operation of the backend layer. Thus, cost reduction may be accomplished by making efficient use of the edge layer servers.

SUMMARY

[0006] Described herein are concepts, systems and techniques directed toward a layered distributed storage (LDS) system and related techniques. In one embodiment, a two-layer erasure-coded fault-tolerant distributed storage system offering atomic access for read and write operations is described. Such systems and techniques find use in distributed storage systems including in edge computing systems having distributed storage.

[0007] The systems and techniques described herein address the edge computing challenges of: (1) reducing the cost of operation of backend layer servers by making efficient use of edge layer servers by: (a) controlling communication between edge layer servers and backend layer servers; and (b) controlling persistent storage in the backend layer; (2) enforcing/controlling consistency (e.g. atomic access for read and write operations); and (3) completing client operations via interaction only with the edge layer servers, whenever possible.

[0008] The described systems and techniques enable atomically consistent data storage in edge computing systems for read and write operations while maintaining a desirable level of speed for users. In embodiments, the advantages of the concepts come from the usage of erasure codes. In embodiments, minimum bandwidth regenerating (MBR) codes may be used. In embodiments, random linear network codes (RLNC) may be used.

[0009] Since the techniques and systems described herein can be specifically adapted for edge-computing systems, a number of features can be provided. For example, as may be required by some edge-computing systems, the LDS technique described herein ensures that clients interact only with the edge servers and not with backend servers. In some embodiments, this may be an important requirement for applying the LDS technique to edge-computing systems. By ensuring that clients interact only with the edge servers and not with backend servers, the LDS techniques described herein allow completion of client operations by interacting only with the edge layer (i.e. a client need only interact with one or more edge layer servers). Specifically, a client write-operation (i.e. a client writes data) stores an updated file into the edge layer and terminates. The client write-operation need not wait for the edge layer to offload the data to the backend layer. Such a characteristic may be particularly advantageous in embodiments which include high speed links (e.g. links which provide a relatively low amount of network latency) between clients and edge layer servers. For a read operation, the edge layer may effectively act as a proxy cache that holds the data corresponding to frequently updated files. In such situations, data required for a read may be directly available at the edge layer, and need not be retrieved from the backend layer.

[0010] Also, the LDS system and techniques described herein efficiently use the edge-layer to improve (and ideally optimize) the cost of operation of the backend layer. Specifically, the LDS technique may use a special class of erasure codes known as minimum bandwidth regenerating (MBR) codes to simultaneously improve (and ideally optimize) communication cost between the two layers, as well as storage cost in the backend layer.

[0011] Further still, the LDS technique is fault-tolerant. In large distributed systems, the individual servers are usually commodity servers, which are prone to failures due to a variety of reasons, such as power failures, software bugs and hardware malfunctions. Systems operating in accordance with the LDS techniques described herein, however, are able to continue to serve the clients with read and write operations despite the fact that some fraction of the servers may crash at unpredictable times during the system operation. Thus, the system is available as long as the number of crashes does not exceed a known threshold.

[0012] The underlying mechanism used for fault tolerance is a form of redundancy. Usually, simple redundancy such as replication increases storage cost, but at least some embodiments described herein use erasure codes to implement such redundancy. The LDS techniques described herein achieve fault tolerance and low storage and/or communication costs all at the same time.
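
As a rough illustration of why erasure-coded redundancy can be cheaper than replication, the sketch below compares the storage overhead of the two approaches for tolerating f server crashes. The parameter values are illustrative assumptions only and are not taken from the LDS technique itself.

```python
# Illustrative comparison of storage overhead: replication vs. an (n, k) erasure code.
# The parameter values below are assumptions chosen for illustration only.

def replication_overhead(num_failures: int) -> float:
    """To tolerate f crashes with plain replication, f + 1 full copies are needed."""
    return float(num_failures + 1)

def erasure_code_overhead(n: int, k: int) -> float:
    """An (n, k) MDS-style erasure code stores n fragments, any k of which suffice
    to decode, tolerating n - k crashes at overhead n / k."""
    return n / k

if __name__ == "__main__":
    f = 2                      # crashes to tolerate (assumption)
    n, k = 7, 5                # erasure-code parameters with n - k = f (assumption)
    print("replication overhead :", replication_overhead(f))       # 3.0x
    print("erasure-code overhead:", erasure_code_overhead(n, k))   # 1.4x
```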

[0013] In accordance with one aspect of the concepts described herein, a layered distributed storage (LDS) system includes a plurality of edge layer servers coupled to a plurality of backend layer servers. Each of the edge layer servers includes an interface with which to couple to one or more client nodes, a processor for processing read and/or write requests from the client nodes and for generating tag-value pairs, a storage for storing lists of tag-value pairs and a backend server layer interface for receiving tag-value pairs from said processor and for interfacing with one or more of the plurality of backend servers. Each of the backend layer servers includes an edge-layer interface for communicating with one or more servers in the edge layer, a processor for generating codes and a storage having stored therein coded versions of tag-value pairs. In some cases, the tag-value pairs may be coded via erasure coding, MBR coding or random linear network coding techniques. The backend layer servers are responsive to communications from the edge layer servers.

[0014] In preferred embodiments, the storage in the edge-layer servers is temporary storage and the storage in the backend layer servers is persistent storage.

[0015] With this particular arrangement, a system and technique which enables atomic consistency in edge computing systems is provided. Since users (or clients) interact only with servers in the edge layer, the system and technique becomes practical for use in edge computing systems, where the client interaction needs to be limited to the edge. By separating the functionality of the edge layer servers and backend servers, a modular implementation for atomicity using storage-efficient erasure-codes is provided. Specifically, the protocols needed for consistency implementation are largely limited to the interaction between the clients and the edge layer, while those needed to implement the erasure code are largely limited to the interaction between the edge and backend layers. Such modularity results in a system having improved performance characteristics and which can be used in applications other than in edge-computing applications.

[0016] The LDS technique described herein thus provides a means to advantageously use regeneration codes (e.g. storage-efficient erasure codes) for consistent data storage.

[0017] It should be appreciated that in prior art systems, use of regenerating codes is largely limited to storing immutable data (i.e. data that is not updated). For immutable data, these codes provide good storage efficiency and also reduce network bandwidth for operating the system.

[0018] Using the techniques described herein, however, the advantages of good storage efficiency and reduced network bandwidth possible via regenerating codes can be achieved even for data undergoing updates and where strong consistency is a requirement. Thus, the LDS techniques described herein enable the use of erasure codes for storage of frequently-updated-data. Such systems for supporting frequently-updated-data are scalable for big-data applications. Accordingly, the use of erasure codes as described herein provides edge computing systems having desirable efficiency and fault-tolerance characteristics.

[0019] It is recognized that consistent data storage implementations involving high volume data are needed in applications such as networked online gaming, and even applications in virtual reality. Thus, such applications may now be implemented via the edge-computing system and techniques described herein.

[0020] In accordance with a further aspect of the concepts described herein, it has been recognized that in systems which handle millions of files, (which may be represented as objects), edge servers in an edge computing system do not have the capacity to store all the objects for the entire duration of execution. In practice, at any given time, only a fraction of all objects (and in some cases, a very small fraction of all objects) undergo concurrent accesses; in the system described herein, the limited storage space in the edge layer may act as a temporary storage for those objects that are getting accessed. The backend layer of servers provide permanent storage for all objects for the entire duration of execution. The servers in the edge layer may thus act as virtual clients of the second layer backend.

[0021] As noted above, an important requirement in edge-computing systems is to reduce the cost of operation of the backend layer. As also noted, this may be accomplished by making efficient use of the edge layer. Communication between the edge and backend layers, and persistent storage in the backend layer contribute to the cost of operation of the second layer. These factors are addressed via the techniques described herein since the layered approach to implementing an atomic storage service carries the advantage that, during intervals of high concurrency from write operations on any object, the edge layer can be used to retain the more recent versions that are being (concurrently) written, while filtering out the outdated versions. The ability to avoid writing every version of data to the backend layer decreases the overall write communication cost between the two layers. The architecture described thus permits the edge layer to be configured as a proxy cache layer for data that are frequently read, to thereby avoid the need to read from the backend layer for such data.

[0022] In embodiments, storage of data in the backend layer may be accomplished via the use of codes including, but not limited to, erasure codes and random linear network codes. In some embodiments, a class of erasure codes known as minimum bandwidth regenerating codes may be used. From a storage cost view-point, these may be as efficient as popular erasure codes such as Reed-Solomon codes.

[0023] It has been recognized in accordance with the concept described herein that use of regenerating codes, rather than Reed-Solomon codes for example, provides the extra advantage of reducing read communication cost when desired data needs to be recreated from coded data stored in the backend layer (which may, for example, correspond to a cloud layer). It has also been recognized that minimum bandwidth regenerating (MBR) codes may be utilized for simultaneously optimizing read and storage costs.

[0024] Accordingly, the system and techniques described herein may utilize regenerating codes for consistent data storage. The layered architecture described herein naturally permits a layering of the protocols needed to implement atomicity and erasure codes (in a backend layer, e.g. a cloud layer). The protocols needed to implement atomicity are largely limited to interactions between the clients and the edge servers, while those needed to implement the erasure code are largely limited to interactions between the edge and backend (or cloud) servers. Furthermore, the modularity of the implementation described herein makes it suitable even for situations that do not necessarily require a two-layer system.

[0025] The layered distributed storage (LDS) concepts and techniques described herein enable a multi-writer, multi-reader atomic storage service over a two-layer asynchronous network.

[0026] In accordance with one aspect of the techniques described herein, a write operation completes after writing an object value (i.e. data) to the first layer. It does not wait for the first layer to store the corresponding coded data in the second layer.

[0027] For a read operation, concurrency with write operations increases the chance of content being served directly from the first layer. If the content (or data) is not served directly from the first layer, servers in the first layer regenerate coded data from the second layer, which are then relayed to the reader.

[0028] In embodiments, servers in the first layer interact with those of the second layer via so-called write-to-backend-layer ("write-to-L2") operations and regenerate-from-backend-layer ("regenerate-from-L2") operations for implementing the regenerating code in the second layer.

[0029] In a system having first and second layers, with the first layer having n1 servers and the second layer having n2 servers, the described system may tolerate f1 and f2 server failures in the first and second layers, respectively, where f1 < n1/2 and f2 < n2/3.

[0030] In a system with n1 = Θ(n2), f1 = Θ(n1) and f2 = Θ(n2), the write and read costs are respectively given by Θ(n1) and Θ(1) + n1·I(δ > 0), where δ is a parameter closely related to the number of write or internal write-to-L2 operations that are concurrent with the read operation. Note that I(δ > 0) equates to 1 if δ > 0 and 0 if δ = 0. Note that the notation a = Θ(b), for any two variable parameters a and b, is used to mean that the value of a is comparable to b and differs from it by at most a fixed multiplicative constant. The ability to reduce the read cost to Θ(1) when δ = 0 comes from the usage of minimum bandwidth regenerating (MBR) codes. In order to ascertain the contribution of temporary storage cost to the overall storage cost, a multi-object (say N) analysis may be performed, where each of the N objects is implemented by an independent instance of the LDS technique. The multi-object analysis assumes bounded latency for point-to-point channels. Conditions on the total number of concurrent write operations per unit time are identified such that the permanent storage cost in the second layer dominates the temporary storage cost in the first layer, and is given by Θ(N). Further, bounds on completion times of successful client operations, under bounded latency, may be computed.

[0031] The use of regenerating codes enables efficient repair of failed nodes in distributed storage systems. For the same storage overhead and resiliency, the communication cost for repair (also referred to as "repair bandwidth") is substantially less than what is needed by codes such as Reed-Solomon codes. In one aspect of the techniques described herein, internal read operations are cast by virtual clients in the first layer as repair operations, and this enables a reduction in the overall read cost. In one aspect of the techniques described herein, MBR codes, which offer exact repair, are used. A different class of codes known as Random Linear Network Codes (RLNC) may also be used. RLNC codes permit implementation of regenerating codes via functional repair. RLNC codes offer probabilistic guarantees, and permit near optimal operation of regenerating codes for choices of operating point.

BRIEF DESCRIPTION OF THE DRAWINGS

[0032] The foregoing features may be more fully understood from the following description of the drawings in which:

[0033] Fig. 1 is a schematic diagram of a system having a layered architecture for coded consistent distributed storage coupled to a plurality of client nodes;

[0034] Fig. 1A is a schematic diagram illustrating read and write operations in a network having a layered architecture for coded consistent distributed storage;

[0035] Fig. 2 is a block diagram of an edge layer server;

[0036] Fig. 2A is a block diagram of a backend layer server;

[0037] Fig. 3 is a schematic diagram of a writer client writing content to a backend layer server through an edge layer server;

[0038] Fig. 3A is a schematic diagram illustrating concurrent write operations from a plurality of different writer nodes to an edge layer;

[0039] Fig. 3B is a schematic diagram illustrating two phases of a write operation;

[0040] Fig. 3C is a schematic diagram illustrating an internal write-to-a backend layer server operation (aka "internal write-to-L2" operation);

[0041] Fig. 3D is a schematic diagram illustrating the role of a broadcast primitive;

[0042] Fig. 3E is a schematic diagram illustrating write operations between a writer client, edge layer servers and backend layer servers;

[0043] Figs. 3F-3H are a series of flow diagrams illustrating a write operation;

[0044] Fig. 4 is a schematic diagram of a reader client reading content from a backend layer server through an edge layer server;

[0045] Fig. 4A is a schematic diagram illustrating a read operation from coded content;

[0046] Fig. 4B is a schematic diagram illustrating two phases of a read operation from a backend layer server;

[0047] Fig. 4C is a schematic diagram illustrating four possibilities of a read operation from backend layer servers (aka "a read-from-L2" action);

[0048] Fig. 4D is a schematic diagram illustrating a third phase of a read operation;

[0049] Fig. 4E is a schematic diagram illustrating read operations between a reader client, edge layer servers and backend layer servers;

[0050] Fig. 4F is a flow diagram of a read operation; and

[0051 ] Figs. 4G-4J are a series of flow diagrams illustrating various phases of a read operation.

DETAILED DESCRIPTION

[0052] Referring now to Fig. 1, a network having a layered architecture for coded consistent distributed storage includes a two-layer erasure-coded fault-tolerant distributed storage system 11 comprising a plurality of edge layer servers 14a-14d, generally denoted 14, and a plurality of backend layer servers 16a-16e, generally denoted 16.

[0053] It should be appreciated that although only four edge layer servers 14a-14d are illustrated in this particular example, the system 11 may include any number of edge layer servers 14. Similarly, although only five backend layer servers 16a-16e are illustrated in this particular example, the system 11 may include any number of backend layer servers 16. In general, edge layer L1 may include n1 servers while backend layer L2 may include n2 servers.

[0054] A plurality of client nodes 12 (also sometimes referred to herein as "clients" or "users") are coupled to the edge layer servers 14. For clarity, writer clients (i.e., client nodes which want to write content (or data) v1 , v2 to consistent storage in the backend layer 16) are identified with reference numbers 18a, 18b and reader clients (i.e., client nodes which want to read content or data) are identified with reference numerals 20a-20d.

[0055] When system 11 is provided as part of an edge computing system, high speed communication paths (i.e. communication paths which provide low network latency between clients 12 and servers 14) may exist between clients 12 and servers 14 in the edge layer L1. Further, backend layer servers 16 may be provided as part of a data center or a cloud center and are typically coupled to the edge layer servers 14 via one or more communication paths 23 which are typically slower than high speed paths 19, 21 (in terms of network latency).

[0056] As illustrated in Fig. 1 , writer clients 18a, 18b may each independently provide two versions (v1 , v2) of the same content to the edge layer servers 14. In a manner to be described below in detail, edge layer servers 14 resolve which is the most recent version of the data and may provide the most recent version (in this case, version v2 of the content) to the backend layer servers 16 via communication path 23 for persistent storage.

[0057] Similarly, one or more of reader clients 20a-20d may each independently request the latest versions of desired content from the edge layer servers 14. In a manner to be described below in detail, edge layer servers 14 provide the most recent version of the content (in this case version v2 of the content) to appropriate ones of the reader clients 20a-20d. Such content is sometimes provided directly from one or more of the edge layer servers 14 and sometimes edge layer servers 14 communicate with backend layer servers 16 to retrieve and deliver information needed to provide the requested content to one or more of the reader clients 20a- 20d.

[0058] Referring now to Fig. 1A, it is significant that client-edge interactions (i.e., interactions between client nodes such as writer and reader nodes 18, 20 and edge layer servers 14a-14c) implement consistency in the system 11 while the edge-backend interactions (i.e., interactions between edge layer servers 14 and backend layer servers 16) largely implement erasure or other codes (e.g. RLNC). That is, the protocols needed for consistency implementation are largely limited to the interaction between the clients 12 and the edge layer servers 14, while the protocols needed to implement the erasure or other codes are largely limited to the interaction between the edge layer servers 14 and the backend layer servers 16.

[0059] Referring now to Fig. 2, a typical edge layer server 14a includes a client node interface 30 coupled to a processor 32. Processor 32 is coupled to a backend layer interface 34. Thus, the edge layer server can communicate with both client nodes 12 (Fig. 1) and backend layer nodes 16 (Fig. 1). Significantly, client nodes 12 do not communicate directly with the backend layer servers 16. Each of the edge layer servers 14 also includes storage 36 (which may, in preferred embodiments, be provided as temporary storage) in which lists of tag-value pairs (t, v) are stored.

[0060] Referring now to Fig. 2A, a typical backend layer server 40, which may be the same as or similar to backend layer servers 16 described above in conjunction with Fig. 1, includes an edge-layer interface 42 coupled to a processor 44. Processor 44 is also coupled to a storage 46 (which may, in preferred embodiments, be provided as persistent storage). Storage 46 is configured to have stored therein one or more lists of tag-value pairs (t, v) which may be stored using regenerating codes or RLNCs, for example. As will become apparent from the description herein, processor 44 aids in a regeneration process.

[0061] Before describing write and read operations which may take place in a layered distributed storage (LDS) system (in conjunction with Figs. 3-4J below), an overview as well as some introductory concepts and definitions are provided. It should be appreciated that in the illustrative system and techniques described herein, it is assumed that a distributed storage system comprises asynchronous processes of three types: writers (W), readers (R) and servers (S). The servers are organized into two logical layers L1 and L2, with Li consisting of ni, i = 1, 2, servers.

[0062] Each process has a unique id, and the ids are totally ordered. Client (reader/writer) interactions are limited to servers in L1, and the servers in L1 in turn interact with servers in L2. Further, the servers in L1 and L2 are denoted by {s1, s2, ..., s_n1} and {s_n1+1, s_n1+2, ..., s_n1+n2}, respectively.

[0063] It is also assumed that the clients are well-formed, i.e., a client issues a new operation only after completion of its previous operation, if any. As will be described in detail below, the L1-L2 interaction happens via the well-defined actions write-to-L2 and regenerate-from-L2. These actions are sometimes referred to herein as internal operations initiated by the servers in L1.

[0064] Also, a crash failure model is assumed for processes. Thus, once a process crashes, it does not execute any further steps for the rest of the execution.

[0065] The LDS technique described herein is designed to tolerate fi crash failures in layer Li, i = 1, 2, where f1 < n1/2 and f2 < n2/3. Any number of readers and writers can crash during the execution. The above bounds arise from making sure sufficient servers in each of the layers of servers are active to guarantee a sufficient number of coded elements for a tag in order to allow decoding of the corresponding value. Communication may be modeled via reliable point-to-point links between any two processes. This means that as long as the destination process is non-faulty, any message sent on the link is guaranteed to eventually reach the destination process. The model allows the sender process to fail after placing the message in the channel; message delivery depends only on whether the destination is non-faulty.

[0066] With respect to liveness and atomicity characteristics, one object, say x, is implemented via the LDS algorithm supporting read/write operations. For multiple objects, multiple instances of the LDS algorithm are executed. The object value v comes from the set V. Initially v is set to a distinguished value v0 (∈ V). Reader R requests a read operation on object x. Similarly, a write operation is requested by a writer W. Each operation at a non-faulty client begins with an invocation step and terminates with a response step. An operation π is incomplete in an execution when the invocation step of π does not have the associated response step; otherwise the operation π is complete. In an execution, an operation (read or write) π1 precedes another operation π2 if the response step for operation π1 precedes the invocation step of operation π2. Two operations are concurrent if neither precedes the other.

[0067] "Liveness," refers to the characteristic that during any well-formed execution of the LDS technique, any read or write operation initiated by a non-faulty reader or writer completes, despite the crash failure of any other clients and up to server crashes in the edge layer £1 , and up to server crashes in the backend layer £2. Atomicity of an execution refers to the characteristic that the read and write operations in the execution can be arranged in a sequential order that is consistent with the order of invocations and responses.

[0068] With respect to the use of regenerating codes, a regenerating-code framework is used in which a file F of size B symbols is encoded and stored across n nodes such that each node stores α symbols. The symbols are assumed to be drawn from a finite field F_q, for some q. The content from any k nodes (kα symbols) can be used to decode the original file F.

[0069] For repair of a failed node, the replacement node contacts any subset of d ≥ k surviving nodes in the system, and downloads β symbols from each of the d nodes. The β symbols from a helper node are possibly a function of the α symbols in that node. The parameters of the code, say C, will be denoted as {(n, k, d), (α, β)}, having a file size B upper bounded by B ≤ Σ_{i=0}^{k-1} min(α, (d − i)β).

[0070] Two extreme points of operation correspond to the minimum storage regenerating (MSR) operating point, with B = kα, and the minimum bandwidth regenerating (MBR) operating point, with α = dβ. In embodiments, codes at the MBR operating point may be used. The file size at the MBR point is given by B_MBR = Σ_{i=0}^{k-1} (d − i)β.
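
A short sketch of the regenerating-code relations just described: it evaluates the cut-set file-size bound and the MBR file size for example values of (k, d, β). The specific parameter values are illustrative assumptions, not values mandated by the LDS technique.

```python
# Sketch of the regenerating-code file-size relations at the MBR operating point.
# Parameter values are illustrative assumptions.

def file_size_bound(k: int, d: int, alpha: int, beta: int) -> int:
    """Cut-set upper bound on the file size B for an (n, k, d), (alpha, beta) code:
    B <= sum_{i=0}^{k-1} min(alpha, (d - i) * beta)."""
    return sum(min(alpha, (d - i) * beta) for i in range(k))

def mbr_file_size(k: int, d: int, beta: int) -> int:
    """At the MBR point alpha = d * beta, and B_MBR = sum_{i=0}^{k-1} (d - i) * beta."""
    return sum((d - i) * beta for i in range(k))

if __name__ == "__main__":
    k, d, beta = 3, 4, 2            # example parameters (assumption)
    alpha = d * beta                # MBR point: alpha = d * beta
    print("alpha at MBR point:", alpha)                                # 8
    print("cut-set bound     :", file_size_bound(k, d, alpha, beta))   # 18
    print("B_MBR             :", mbr_file_size(k, d, beta))            # 18 = (4+3+2)*2
```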

[0071] In some embodiments, it may be preferable to use exact-repair codes, meaning that the content of a replacement node after repair is substantially identical to what was stored in the node before crash failure. A file F corresponds to the object value v that is written. In other embodiments, it may be preferable to use codes which are not exact-repair codes, such as random linear network codes (RLNCs).

[0072] In embodiments (and as will be illustrated in conjunction with Figs. 3-3E below), an {(n = n1 + n2, k, d), (α, β)} MBR code designated as C may be used. The parameters k and d are such that n1 = 2f1 + k and n2 = 2f2 + d. Two additional codes C1 and C2 derived from the code C may be defined. The code C1 is obtained by restricting attention to the first n1 coded symbols of C, while the code C2 is obtained by restricting attention to the last n2 coded symbols of C. Thus if [c1 ... c_n1 c_n1+1 ... c_n1+n2] denotes a codeword of C, the vectors [c1 ... c_n1] and [c_n1+1 ... c_n1+n2] will be codewords of C1 and C2, respectively.

[0073] The usage of these three codes is as follows. Each server in the first edge layer L1, having access to the object value v (at an appropriate point in the execution), encodes the object value v using code C2 and sends coded data c_n1+i to server s_n1+i in L2, 1 ≤ i ≤ n2. During a read operation, a server s_j in the edge layer L1 can potentially reconstruct the coded data c_j using content from the backend layer L2. Here, coded data c_j may be considered as part of the code C, and the coded portion c_j gets reconstructed via a repair procedure (invoked by server s_j in the edge layer L1) where the d helper servers belong to the backend layer L2. By operating at the MBR point, it is possible to reduce and ideally minimize the cost needed by the server s_j to reconstruct c_j. Finally, in the LDS technique described herein, the possibility that the reader receives k coded data elements from k servers in the edge layer L1 during a read operation is permitted. In this case, the reader uses the code C1 to attempt decoding an object value v.
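
The split of a codeword of C into its C1 and C2 restrictions can be sketched as follows. The encoder here is a placeholder (it is not an actual MBR construction); it exists only to show which coded symbols are assigned to which layer.

```python
# Sketch of how a codeword of the overall code C is partitioned between the layers:
# the first n1 coded symbols form a codeword of C1 (edge layer L1), and the last n2
# coded symbols form a codeword of C2 (backend layer L2). The "encode_C" function is
# a placeholder, not a real MBR encoder.
from typing import List, Tuple

def encode_C(value: bytes, n1: int, n2: int) -> List[bytes]:
    """Placeholder encoder producing n1 + n2 coded symbols for illustration only."""
    return [bytes([i]) + value for i in range(n1 + n2)]

def split_codeword(codeword: List[bytes], n1: int) -> Tuple[List[bytes], List[bytes]]:
    """Restrict the codeword of C to its C1 part (first n1 symbols) and
    its C2 part (remaining n2 symbols)."""
    return codeword[:n1], codeword[n1:]

if __name__ == "__main__":
    n1, n2 = 5, 6                               # example layer sizes (assumption)
    cw = encode_C(b"object-value-v", n1, n2)
    c1_part, c2_part = split_codeword(cw, n1)
    # c2_part[i] plays the role of coded element c_{n1+i+1} sent to backend server s_{n1+i+1}
    print(len(c1_part), len(c2_part))           # 5 6
```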

[0074] An important property of one MBR code construction, which is needed in one embodiment of the LDS technique described herein, is the fact that a helper node only needs to know the index of the failed node while computing the helper data, and does not need to know the set of other d − 1 helpers whose helper data will be used in the repair. It should be noted that not all regenerating code constructions, including those of MBR codes, have this property. In embodiments, a server s_j in L1 requests help from all servers in the backend layer L2, and does not know a priori the subset of d servers in L2 that will form the helper nodes. In this case, it is preferred that each of the helper nodes be able to compute its β symbols without knowledge of the other d − 1 helper servers.

[0075] In embodiments, internal read operations may be cast by virtual clients in the first layer as repair operations, and this enables a reduction in the overall read cost.

[0076] With respect to storage and communication costs, the communication cost associated with a read or write operation is the (worst-case) size of the total data that gets transmitted in the messages sent as part of the operation. While calculating write cost, costs due to internal write-to-L2 operations initiated as a result of the write may be included, even though these internal write-to-L2 operations do not influence the termination point of the write operation. The storage cost at any point in the execution is the worst-case total amount of data that is stored in the servers in the edge layer L1 and the backend layer L2. The total data in the edge layer L1 contributes to temporary storage cost, while that in the backend layer L2 contributes to permanent storage cost. Costs contributed by meta-data (data for book keeping such as tags, counters, etc.) may be ignored while ascertaining either storage or communication costs. Further, the costs may be normalized by the size of the object value v; in other words, costs are expressed as though the size of the object value v is 1 unit.

[0077] A write operation will be described below in conjunction with Figs. 3-3H.

[0078] Referring now to Fig. 3, writer 18 seeks to write content (also referred to as an "object value" v) to servers 14 in an edge layer L1 via a communication path 19. Upon receiving the value v (and after performing certain operations to be described in the other figures), edge layer servers 14 perform an internal "write-to-L2" operation in which the value is written to persistent storage (e.g. storage 46 in Fig. 2A) in backend servers 16 in the backend layer L2. The write-to-L2 operation is executed over communication path 24 and is described below in conjunction with Fig. 3C.

[0079] Referring now to Fig. 3A, a plurality of writers, here three writers 18a, 18b, 18c, concurrently write three versions v1, v2, v3 of the same content to servers 14 via communication paths 19a-19c. In a manner to be described in detail below, servers 14 determine which version of the content (in this example version v3) to send to persistent storage in backend servers 16 via communication path 24. As illustrated in Fig. 3A, the content is coded with codes c_n1+1 through c_n1+n2 and distributed among ones of backend servers 16. In this example, the content is distributed among all servers 16.

[0080] In embodiments, the ideal goal is to store the respective coded content in all the backend servers. With this goal in mind, the respective coded elements are sent to all backend servers. It is satisfactory if n2 − f2 responses are received back (i.e., the internal write operation is considered complete if it is known for sure that the respective coded elements are written to at least n2 − f2 backend layer servers).

[0081] Referring now to Fig. 3B, an edge layer 14 of an edge computing system includes five edge layer servers 14a-14e (i.e. n1 = 5) and a write-client 18 (or more simply "writer" 18) which initiates a write of content v. In general, the write operation has two phases, and aims to temporarily store the object value v in the edge layer L1 such that up to f1 failures of servers in the edge layer L1 do not result in loss of the value. In the illustrative example, the value of f1 is set to one (i.e. f1 = 1) and the value of k is set to three (i.e. k = 3). The values of f1 and f2 are selected based upon the values of n1 and n2, respectively, and k and d are dependent on the parameters of the selected codes as described above.

[0082] During the first phase (also sometimes referred to as the "get-tag" phase), the writer 18 determines a new tag for the value to be written. A tag comprises a pair of values: a natural number, and an identifier, which can be simply a string of digits or characters, for example (3, "id"). One tag is considered to be larger or more recent than another if either the natural number part of the first tag is larger than that of the other, or, if they are equal, the identifier of the first tag is lexicographically larger (or later) than that of the second tag. Therefore, for any two distinct tags there is a larger one, and, in the same vein, in a given set of tags there is a tag that is the largest of all. Note that such a tag is used in lieu of an actual timestamp.
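
A minimal sketch of the tag ordering just described: a tag is a (natural number, writer identifier) pair, compared first by number and then lexicographically by identifier. The class below is an illustration, not a reference implementation from the patent.

```python
# Sketch of tag ordering used for version control: compare the integer part first,
# then the writer identifier lexicographically. Illustrative only.
from dataclasses import dataclass

@dataclass(frozen=True)
class Tag:
    z: int        # natural-number part
    writer: str   # writer identifier

    def __lt__(self, other: "Tag") -> bool:
        return (self.z, self.writer) < (other.z, other.writer)

def max_tag(tags):
    """Any finite set of distinct tags has a unique largest element."""
    return max(tags, key=lambda t: (t.z, t.writer))

if __name__ == "__main__":
    t1, t2, t3 = Tag(3, "id-a"), Tag(3, "id-b"), Tag(2, "id-z")
    assert t1 < t2 and t3 < t1
    print(max_tag([t1, t2, t3]))   # Tag(z=3, writer='id-b')
```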

[0083] In the second phase (also referred to as the "put-data" phase), the writer sends the new tag-value pair to all servers in the edge layer L1, which add the incoming pair to their respective local lists (e.g. one or more lists in temporary storage 36 as shown in Fig. 2). Any server that adds a new tag-value pair to its list also broadcasts a corresponding data-reception message (e.g. meta-data) 56a-56b to the other servers in the edge layer L1. The servers 14 each send an acknowledgment 58a-58d back to the writer 18 only after they hear broadcast messages from at least f1 + k of the servers 14, including themselves. It should be appreciated that in this example edge layer server 14e does not send an acknowledgment message - i.e. there is one failure (f1 = 1). Subsequently, each server that receives the f1 + k messages for this tag initiates the write-to-L2 action with an aim to offload the tag-value pair to permanent storage in backend layer L2.

[0084] It is important to note that the writer is not kept waiting for completion of the internal write-to-L2 operation. That is, no communication with the backend layer L2 is needed to complete a write operation. Rather, the writer terminates as soon as it receives a threshold number of acknowledgments (e.g. f1 + k acknowledgments) from the servers in the edge layer L1. Once a server (e.g. server 14a) completes the internal write-to-L2 operation, the value associated with the write operation is removed from the temporary storage of the server 14a (e.g. removed from storage 36 in Fig. 2). The server may also take this opportunity to clear any old entry from its list, which may help to eliminate entries corresponding to any failed writes from the list.

[0085] In the techniques described herein, a broadcast primitive 56 is used for certain meta-data message delivery. The primitive has the property that if the message is consumed by any one server in the edge layer L1, the same message is eventually consumed by every non-faulty server in the edge layer L1. One implementation of the primitive, on top of reliable communication channels, is described in co-pending application no. 15/838,966 filed on December 12, 2017 and incorporated herein by reference in its entirety. In this implementation, the process that invokes the broadcast protocol first sends, via point-to-point channels, the message to a fixed set of f1 + 1 servers in the edge layer L1. Each of these servers, upon reception of the message for the first time, sends the message to all the servers in the edge layer L1 before consuming the message itself. The primitive helps in the scenario when the process that invokes the broadcast protocol crashes before sending the message to all edge layer servers.
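
A sketch of the broadcast primitive as described above: the invoking process sends the message to a fixed set of f1 + 1 edge servers over point-to-point channels, and each of those relay servers forwards it to every edge server before consuming it. The in-memory "network" and class names here are assumptions used only to illustrate the idea.

```python
# Sketch of the broadcast primitive: deliver a metadata message so that if any
# non-faulty edge server consumes it, every non-faulty edge server eventually does.
# The in-memory structures below stand in for reliable point-to-point channels.
from typing import Dict, List

class EdgeServer:
    def __init__(self, name: str):
        self.name = name
        self.seen = set()              # messages already handled
        self.consumed: List[str] = []  # messages consumed by this server

    def receive(self, msg: str, network: Dict[str, "EdgeServer"], relay_set: List[str]):
        if msg in self.seen:
            return
        self.seen.add(msg)
        # Servers in the fixed relay set forward the message to all edge servers
        # before consuming it themselves.
        if self.name in relay_set:
            for server in network.values():
                server.receive(msg, network, relay_set)
        self.consumed.append(msg)

def broadcast(msg: str, network: Dict[str, EdgeServer], relay_set: List[str]):
    """Invoker sends the message to a fixed set of f1 + 1 relay servers."""
    for name in relay_set:
        network[name].receive(msg, network, relay_set)

if __name__ == "__main__":
    servers = {f"s{i}": EdgeServer(f"s{i}") for i in range(1, 6)}   # n1 = 5, f1 = 1
    broadcast("data-received(t, w)", servers, relay_set=["s1", "s2"])  # f1 + 1 relays
    print(all("data-received(t, w)" in s.consumed for s in servers.values()))  # True
```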

[0086] Referring now to Fig. 3C, an internal write-to-L2 operation is shown. Each server in the edge layer L1 uses a committed tag t_c to keep track of the latest tag-value pair that it writes to the backend layer L2. A server initiates the internal write-to-L2 operation for a new tag-value pair (t, v) only if the new tag t is more recent than the committed tag t_c (i.e. t > t_c); else the new tag-value pair (t, v) is simply ignored.

[0087] In the technique described herein, each server in the backend layer L2 stores coded data corresponding to exactly one tag at any point during the execution. A server in the backend layer L2 that receives a tag-coded-element pair (t, c) as part of an internal write-to-L2 operation replaces the local tag-coded-element pair (t_l, c_l) with the incoming pair only if the new tag t is more recent than the local tag t_l (i.e. t > t_l). The write-to-L2 operation initiated by a server s in the edge layer L1 terminates after it receives acknowledgments from f2 + d servers in the backend layer L2. It should be appreciated that with this approach no value is stored forever in any non-faulty server in the edge layer L1. The equations for selection of k and d are provided above.
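
A sketch of the backend-server side of the internal write-to-L2 operation described above: each L2 server keeps a single (tag, coded-element) pair and overwrites it only when the incoming tag is more recent. The helper names and the acknowledgment counting are illustrative assumptions.

```python
# Sketch of how a backend (L2) server handles an internal write-to-L2 message:
# it stores exactly one (tag, coded-element) pair and replaces it only if the
# incoming tag is newer. Illustrative sketch, not the patent's reference code.
from typing import Optional, Tuple

Tag = Tuple[int, str]   # (integer part, writer id); compared lexicographically

class BackendServer:
    def __init__(self):
        self.stored: Optional[Tuple[Tag, bytes]] = None   # one (tag, coded-element) pair

    def write_to_l2(self, tag: Tag, coded_element: bytes) -> str:
        if self.stored is None or tag > self.stored[0]:
            self.stored = (tag, coded_element)
        return "ack"   # acknowledge even if the incoming pair was stale

def edge_offload(backends, tag: Tag, coded_elements, f2: int, d: int) -> bool:
    """Edge server sends the coded elements to all L2 servers and treats the internal
    write as complete once f2 + d acknowledgments have been received."""
    acks = sum(1 for b, c in zip(backends, coded_elements)
               if b.write_to_l2(tag, c) == "ack")
    return acks >= f2 + d

if __name__ == "__main__":
    backends = [BackendServer() for _ in range(6)]          # n2 = 6 (assumption)
    coded = [f"c{i}".encode() for i in range(1, 7)]
    print(edge_offload(backends, (1, "w1"), coded, f2=1, d=4))   # True
```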

[0088] Referring now to Fig. 3D, the role of a broadcast primitive is described. It should be appreciated that the role of a broadcast primitive is such that either all (non-faulty) servers 14 in the edge layer L1 receive a message or no servers 14 in the edge layer L1 receive a message. In one embodiment, the broadcast primitive uses an O(n^2) communication cost protocol. In some portions of the process described herein the following mechanism is required for the process to work as desired. Suppose a client wants to send a message to all n servers; then the desired result is that either (a) all of the non-faulty servers receive the message eventually or (b) none of the servers receive it, which is applicable when the client fails while sending these messages. In other words, it is not desirable to have a scenario in which some of the non-faulty servers receive the message and some other set of non-faulty servers does not receive it. Such a guarantee is achieved by using a broadcast primitive, i.e., a protocol that can achieve this guarantee.

[0089] As illustrated in Fig. 3D, a writer transmits a tag-value pair (tw, v) to each edge layer server 14. Upon reception of the tag-value pair (tw, v), each server broadcasts a message in the edge layer indicating that it has received the tag-value pair (tw, v). For example, server 14a broadcasts to servers 14b-14e. Once each server receives a sufficient number of broadcast messages (e.g. f1 + k broadcast messages), the server sends an acknowledgment to the writer 18.

[0090] The broadcast primitive serves at least two important purposes. First, it permits servers 14 in edge layer L1 to delay an internal write-to-L2 operation until after sending an acknowledgment to the writer; and second, the broadcast primitive avoids the need for a "writing back" of values in a read operation since the system instead writes back only tags. This is important to reduce costs to O(1) while reading from servers in the backend layer L2 (since MBR codes alone are not enough). O(1) refers to a quantity that is independent of the system size parameters such as n1 or n2.

[0091] Referring now to Fig. 3E, all processes for a write operation are shown. A writer 18 determines a new tag (t) for a value (v) to be written and transmits 50 the new tag-value pair (t, v) to a server 14a. The server 14a broadcasts reception of the new tag-value pair to edge layer servers 14b-14e. Once each server receives a desired/sufficient number of broadcast messages (e.g. f1 + k broadcast messages), the server commits the new tag-value pair to storage 17 and then initiates an internal write-to-L2 operation as described above in conjunction with Fig. 3C.

[0092] In one embodiment, the LDS technique for a writer w ∈ W and reader r ∈ R includes the writer executing a "get-tag" operation, which includes sending a QUERY-TAG to servers in the edge layer L1. The writer then waits for responses from f1 + k servers, and selects the most recent (or highest) tag t. The writer also performs a "put-data" operation, which includes creating a new tag t_w = (t.z + 1, w) and sending (PUT-DATA, (t_w, v)) to servers in L1. The client then waits for responses from f1 + k servers in L1, and terminates.
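
The writer side of the protocol described in the preceding paragraph can be sketched as follows. The quorum gathering and message transport are simulated in memory, and the helper names are assumptions made for illustration.

```python
# Sketch of the writer protocol: phase 1 (get-tag) queries edge servers for their
# highest tag and picks the maximum; phase 2 (put-data) sends the new (tag, value)
# pair and terminates after f1 + k responses. In-memory stand-in for real channels.
from typing import List, Tuple

Tag = Tuple[int, str]   # (z, writer-id)

def get_tag(edge_tags: List[Tag], f1: int, k: int) -> Tag:
    """Collect QUERY-TAG responses from at least f1 + k edge servers and
    return the most recent tag seen."""
    responses = edge_tags[: f1 + k]          # assume these responses arrive first
    return max(responses)

def put_data(writer_id: str, prev: Tag, value: str,
             edge_lists: List[List[Tuple[Tag, str]]], f1: int, k: int) -> Tag:
    """Create the new tag (prev.z + 1, writer_id), send (PUT-DATA, (tag, value)) to
    the edge servers, and terminate once f1 + k of them have responded."""
    new_tag: Tag = (prev[0] + 1, writer_id)
    acks = 0
    for local_list in edge_lists:
        local_list.append((new_tag, value))  # server adds the pair to its local list
        acks += 1
        if acks >= f1 + k:
            break                            # writer may terminate here
    return new_tag

if __name__ == "__main__":
    f1, k, n1 = 1, 3, 5                                   # example parameters
    current_tags = [(2, "w0")] * n1                       # tags reported by edge servers
    lists: List[List[Tuple[Tag, str]]] = [[] for _ in range(n1)]
    t = get_tag(current_tags, f1, k)
    print(put_data("w1", t, "v-new", lists, f1, k))       # (3, 'w1')
```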

[0093] It should be appreciated that in one embodiment, tags are used for version control of the object values. A tag t is defined as a pair (z, w), where z ∈ N and w ∈ W is the ID of a writer, or a null tag which we denote by ⊥. We use T to denote the set of all possible tags. For any two tags t1, t2 ∈ T we say t2 > t1 if (i) t2.z > t1.z, or (ii) t2.z = t1.z and t2.w > t1.w, or (iii) t1 = ⊥ and t2 ≠ ⊥.

[0094] Each server s in the edge layer L1 maintains the following state variables: a) a list L ⊆ T × V, which forms a temporary storage for tag-value pairs received as part of write operations; b) Γ ⊆ R × T, which indicates the set of readers being currently served - the pair (r, t_req) ∈ Γ indicates that the reader r requested the tag t_req during the read operation; c) t_c, the committed tag at the server; and d) K, a key-value set used by the server as part of internal regenerate-from-L2 operations, in which the keys belong to R and the values belong to T × H, where H denotes the set of all possible helper data corresponding to coded data elements {c_s(v), v ∈ V}. In addition to these, the server also maintains counter variables for various operations. The state variable for a server in the backend layer L2 comprises one (tag, coded-element) pair. For any server s, the notation s.y is used to refer to its state variable y. Thus, the notation s.y|_T represents the value of s.y at point T of the execution. It should be appreciated that an execution fragment of the technique is simply an alternating sequence of (the collection of all) states and actions. An "action" refers to a block of code executed by any one process without waiting for further external inputs.

[0095] Figs. 3F-3H are a series of flow diagrams which illustrate processing that can be implemented within an edge computing system and clients coupled thereto (e.g., within the client, edge layer servers and backend layer servers shown in Fig. 1). Rectangular elements (typified by element 70 in Fig. 3F), herein denoted "processing blocks," represent computer software instructions or groups of instructions. Diamond shaped elements (typified by element 78 in Fig. 3F), herein denoted "decision blocks," represent computer software instructions, or groups of instructions, which affect the execution of the computer software instructions represented by the processing blocks.
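
A sketch of the per-server state described in paragraph [0094] above, expressed as small data structures. The field names are assumptions chosen to mirror the notation in the text, not names taken from the patent.

```python
# Sketch of the state maintained by an edge-layer (L1) server and a backend (L2)
# server, mirroring the variables described above. Field names are illustrative.
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Tag = Tuple[int, str]                 # (z, writer-id)

@dataclass
class EdgeServerState:
    L: List[Tuple[Tag, str]] = field(default_factory=list)   # temporary (tag, value) list
    readers: Dict[str, Tag] = field(default_factory=dict)    # Γ: reader-id -> requested tag
    t_c: Optional[Tag] = None                                 # committed tag
    K: Dict[str, Tuple[Tag, bytes]] = field(default_factory=dict)  # helper data per reader
    counters: Dict[str, int] = field(default_factory=lambda: {     # operation counters
        "write_to_l2": 0, "regenerate_from_l2": 0, "reads_served": 0})

@dataclass
class BackendServerState:
    stored: Optional[Tuple[Tag, bytes]] = None                # one (tag, coded-element) pair

if __name__ == "__main__":
    s = EdgeServerState()
    s.L.append(((1, "w1"), "v1"))
    s.t_c = (1, "w1")
    print(s.t_c, len(s.L), BackendServerState())
```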

[0096] Alternatively, the processing and decision blocks may represent steps performed by functionally equivalent circuits such as a digital signal processor (DSP) circuit or an application specific integrated circuit (ASIC). The flow diagrams do not depict the syntax of any particular programming language but rather illustrate the functional information one of ordinary skill in the art requires to fabricate circuits or to generate computer software to perform the processing required of the particular apparatus. It should be noted that many routine program elements, such as initialization of loops and variables and the use of temporary variables, may be omitted for clarity. The particular sequence of blocks described is illustrative only and can be varied without departing from the spirit of the concepts, structures, and techniques sought to be protected herein. Thus, unless otherwise stated, the blocks described below are unordered, meaning that, when possible, the functions represented by the blocks can be performed in any convenient or desirable order.

[0097] Referring now to Figs. 3F-3H, in an embodiment, processing begins in processing block 70 where a writer node (e.g. one of nodes 18 in Fig. 1) initiates a write request. This may be accomplished, for example, by sending a query-tag to all servers in the edge layer L1. And of course, only the servers in L1 which will not crash (i.e., non-faulty servers) will eventually receive the message. Processing then proceeds to block 72 where the writer determines a new tag (t) (e.g. a time stamp) for a value (v) (i.e. content or data) to be written. Processing blocks 70 and 72 comprise phase I of a write operation.

[0098] Phase II of the write operation begins in processing block 74 in which the writer sends the new tag-value pair (t, v) to servers in the edge layer L1. Preferably the tag-value pair is sent to all servers in the edge layer L1. Processing then proceeds to processing block 76 in which each server in the edge layer that receives the tag-value pair (t, v) sends a data reception broadcast message (e.g. meta-data) to all servers in the edge layer L1.

[0099] Processing then proceeds to decision block 78 in which it is determined whether the tag-value pair corresponds to a new tag-value pair for that server (i.e. whether the newly received tag-value pair is more recent than the already committed tag tc in that server). If a decision is made that the tag pair is not a new tag pair, then an acknowledgment is sent to the writer as shown in processing block 96.

[0100] If in decision block 78 a decision is made that the tag-value pair does correspond to a new tag-value pair, then processing proceeds to block 80 in which the servers in the edge layer add the incoming tag-value pair (t, v) to their respective local lists (e.g. as stored in each edge layer server storage 36 in Fig. 2). Processing then proceeds to decision block 82 in which a decision is made as to whether the server received a broadcast message from a predetermined number of other edge layer servers (e.g. at least f1 + k edge layer servers, including itself). The number of broadcast messages which must be received is selected such that it is "safe" for the writer to terminate the write operation before waiting for the edge layer to off-load the coded data to the backend layer. By "safe," it is meant that it is guaranteed that at least one edge server will successfully complete the write-to-L2 operation. It should be noted that in practice this is implemented as an interrupt driven procedure rather than as a decision loop. The loop implemented by decision block 82 is shown only as a matter of convenience in order to promote clarity in the description of the drawings and the explanation of the broad concepts sought to be protected herein.
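As a rough illustration of processing blocks 80-84 (including the acknowledgment threshold discussed in the next paragraph), the following Python sketch tracks, per tag, which edge servers have announced reception; the class name and the f1, k parameters are placeholders for the system's actual bookkeeping.

from typing import Dict, Set, Tuple

Tag = Tuple[int, str]

class WriteTracker:
    """Illustrative per-server bookkeeping for incoming write tag-value pairs."""

    def __init__(self, f1: int, k: int):
        self.f1, self.k = f1, k
        self.local_list: Dict[Tag, object] = {}    # temporary storage (the list L)
        self.broadcasts: Dict[Tag, Set[str]] = {}  # edge servers that announced receiving each tag

    def on_tag_value(self, my_id: str, tag: Tag, value: object) -> None:
        self.local_list[tag] = value               # block 80: add (t, v) to the local list
        self.on_broadcast(my_id, tag)              # the server counts its own reception as well

    def on_broadcast(self, sender_id: str, tag: Tag) -> None:
        self.broadcasts.setdefault(tag, set()).add(sender_id)

    def ready_to_ack(self, tag: Tag) -> bool:
        """Blocks 82/84: ACK the writer once f1 + k edge servers (self included) hold the pair."""
        return len(self.broadcasts.get(tag, set())) >= self.f1 + self.k

# Example with f1 = 1, k = 2: the ACK is sent only after three edge servers report reception.
tracker = WriteTracker(f1=1, k=2)
tracker.on_tag_value("s1", (7, "w1"), "some value")
tracker.on_broadcast("s2", (7, "w1"))
assert not tracker.ready_to_ack((7, "w1"))
tracker.on_broadcast("s3", (7, "w1"))
assert tracker.ready_to_ack((7, "w1"))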

[0101] In response to an edge layer server receiving broadcast messages from a predetermined number of servers, processing proceeds to processing block 84 in which each such edge layer server sends an acknowledgment back to the writer. The writer needs f1 + k acknowledgments (ACKs), so at least f1 + k servers must send an acknowledgment.

[0102] Processing then proceeds to decision block 86 in which a decision is made as to whether the tag is more recent than an already existing tag in the server (i.e., the committed tag tc) and whether the tag-value pair (t, v) is still in the tag-value pair list for that edge layer server.

[0103] If the tag is not more recent or if the tag is not still in the list, then processing ends.

Otherwise, processing proceeds to processing block 88 in which the committed tag tc is updated to the tag t and all outstanding read requests having a requested tag treq which is less than or equal to the committed tag value tc are served with the tag-value pair (tc, vc). As also illustrated in processing block 88, these reads are removed from the list of outstanding read requests. Further, and as also illustrated in processing block 88, the values associated with tag-value pairs in the list whose tags are less than tc are removed. Processing then proceeds to processing block 90 in which the edge layer server offloads the tag-value pair to permanent storage in the backend layer £2. This may be accomplished, for example, by the server initiating a write-to-L2 action as described above.
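A compressed sketch of the commit step of processing block 88 follows; the function signature and the data structures are illustrative assumptions, and the offload to £2 via write-to-L2 (block 90) would take place after this step.

from typing import Dict, Set, Tuple

Tag = Tuple[int, str]

def commit(tag: Tag, value: object,
           local_list: Dict[Tag, object],
           outstanding: Set[Tuple[str, Tag]]):
    """Block 88, sketched: commit (tag, value), serve waiting readers, garbage-collect old tags."""
    tc = tag
    # serve and unregister every registered reader whose requested tag treq is <= tc
    served = [(reader, tc, value) for (reader, treq) in outstanding if treq <= tc]
    still_waiting = {(reader, treq) for (reader, treq) in outstanding if treq > tc}
    # drop values for tags strictly older than the committed tag
    for old_tag in [t for t in local_list if t < tc]:
        del local_list[old_tag]
    return tc, served, still_waiting

# Example: one reader requested tag (3, "w1"); committing (5, "w2") serves it and clears old entries.
local = {(3, "w1"): "old", (5, "w2"): "new"}
tc, served, waiting = commit((5, "w2"), "new", local, {("r1", (3, "w1"))})
assert served == [("r1", (5, "w2"), "new")] and waiting == set() and (3, "w1") not in local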

[0104] Processing then proceeds to decision block 92 in which a decision is made as to whether the server completed the internal write-to-L2 operation. Although block 92 is illustrated as a decision block which implements a loop, in practice, this would be implemented as an interrupt driven procedure and thus processing block 94 would be implemented only upon completion of an internal write-to-L2 operation.

[0105] Once the server completes the internal write-to-L2 operation, processing flows to processing block 94 in which the edge layer server removes the value associated with the write operation from its temporary storage. Optionally, the server may also clear any old entries from its list. Processing then ends.

[0106] A read operation is described below in conjunction with Figs. 4-4J.

[0107] Referring now to Fig. 4, a reader node 20 wishes to retrieve content v1. In general overview, during a read operation, reader 20 is served tag-value pairs (t, v) from temporary storage in the edge layer £1 if the read overlaps with concurrent write or internal write-to-L2 operations. If not, servers in the edge layer £1 regenerate tag-coded-element pairs via regenerate-from-L2 operations (to be described below), which are then sent to the reader 20.

[0108] Referring now to Fig. 4A, in the case where edge layer servers regenerate tag-coded-element pairs, servers 16 in the backend layer provide portions of coded content (helper data) to an edge layer helper server (e.g. server 14a), and the reader 20 needs to decode the value v using the code C1. Referring now to Fig. 4B, in embodiments, a read operation comprises three phases: a "get committed tag" phase; a "get-data" phase; and a "put-tag" phase.

[0109] During the first or "get committed tag" phase, the reader identifies the minimum tag, treq, whose corresponding value it can return at the end of the operation.

[0110] During the second or "get-data" phase, the reader sends treq to all the servers in £1 and awaits responses from f1 + k distinct servers such that 1) at least one of the responses contains a tag-value pair, say (tr, vr), or 2) at least k of the responses contain coded elements corresponding to some fixed tag, say tr, such that tr ≥ treq. In the latter case, the reader uses the code C1 to decode the value vr corresponding to tag tr. A server s ∈ £1, upon reception of the get-data request, checks if either (treq, vreq) or (tc, vc) with tc > treq is in its list; in this case, s responds immediately to the reader with the corresponding pair. Otherwise, s adds the reader to its list of outstanding readers and initiates a regenerate-from-L2 operation in which s attempts to regenerate a tag-coded-element pair (t', c's), t' ≥ treq, via a repair process taking help from servers in £2. If regeneration from £2 fails, the server s simply sends (⊥, ⊥) back to the reader.

[0111] It should be noted that irrespective of whether the regeneration operation succeeds, the server does not remove the reader from its list of outstanding readers. In the LDS technique the server s is allowed to respond to a registered reader with a tag-value pair during the broadcast-resp action as well. It is possible that while the server awaits responses from £2 towards regeneration, a new tag t gets committed by s via the broadcast-resp action; in this case, if t ≥ treq, server s sends (t, v) to r, and also unregisters r from its outstanding reader list.
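The server-side handling of a get-data request described in the last two paragraphs may be sketched roughly as follows; the function and field names are hypothetical, and the actual regenerate-from-L2 repair protocol is abstracted behind a callable.

from typing import Callable, Dict, Optional, Set, Tuple

Tag = Tuple[int, str]
NULL = (None, None)  # stands in for the (bottom, bottom) null response

def handle_get_data(reader_id: str, treq: Tag,
                    local_list: Dict[Tag, object], tc: Tag,
                    registered: Set[Tuple[str, Tag]],
                    regenerate_from_l2: Callable[[Tag], Optional[Tuple[Tag, bytes]]]):
    """Sketch: answer from the local list if possible, otherwise register the reader and regenerate."""
    if treq in local_list:
        return (treq, local_list[treq])                 # the requested pair is in temporary storage
    if tc > treq and tc in local_list:
        return (tc, local_list[tc])                     # a committed pair with a higher tag is available
    registered.add((reader_id, treq))                   # the reader stays registered even if repair fails
    regenerated = regenerate_from_l2(treq)              # repair using helper data from L2 servers
    if regenerated is None or regenerated[0] < treq:
        return NULL                                     # send (bottom, bottom); reader remains registered
    return regenerated                                  # a (tag, coded element) pair with tag >= treq

# Example: nothing usable locally and the repair returns an older tag, so the server answers NULL.
regs: Set[Tuple[str, Tag]] = set()
reply = handle_get_data("r1", (5, "w1"), {}, (2, "w0"), regs, lambda treq: ((3, "w0"), b"coded"))
assert reply == NULL and ("r1", (5, "w1")) in regs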

[0112] Referring now to Fig. 4C, there are four possibilities resulting from a regenerate-from-L2 operation. A first possibility is that the server commits a tag-value pair (tc, vc) as part of a concurrent write operation such that tc > treq. In this case, the server sends the tag-value pair (tc, vc) to the reader and unregisters the reader (and does not wait for the regenerate-from-L2 response).

[0113] A second possibility is that the server regenerates a tag-coded-element pair (t, c) such that t ≥ treq. In this case the server sends the pair (t, c) to the reader and does not unregister the reader.

[0114] A third possibility is that the server regenerates a tag-coded-element pair (t, c) such that t < treq. In this case, the server sends the null tag-value pair (⊥, ⊥) to the reader and does not unregister the reader.

[0115] A fourth possibility is that the server does not regenerate any (tag, coded-element) pair due to concurrent write-to-L2 actions. In this case, the server sends the null tag-value pair (⊥, ⊥) to the reader and does not unregister the reader.
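The four cases of Fig. 4C reduce to a small decision rule, sketched below under the assumption that the concurrent-commit information and the regeneration result are available as optional (tag, payload) pairs; the identifiers are illustrative only.

from typing import Optional, Tuple

Tag = Tuple[int, str]
NULL = (None, None)  # stands in for the (bottom, bottom) null response

def respond_after_regeneration(treq: Tag,
                               committed: Optional[Tuple[Tag, object]],
                               regenerated: Optional[Tuple[Tag, bytes]]):
    """Returns (message for the reader, whether to unregister the reader)."""
    if committed is not None and committed[0] >= treq:
        return committed, True            # case 1: a concurrent write committed (tc, vc); unregister
    if regenerated is not None:
        tag, coded = regenerated
        if tag >= treq:
            return (tag, coded), False    # case 2: usable coded element; keep the reader registered
        return NULL, False                # case 3: regenerated tag is older than treq
    return NULL, False                    # case 4: nothing regenerated due to concurrent write-to-L2

# Example of case 2: no concurrent commit, regeneration produced tag (6, "w2") >= treq (5, "w1").
msg, unregister = respond_after_regeneration((5, "w1"), None, ((6, "w2"), b"coded"))
assert msg == ((6, "w2"), b"coded") and unregister is False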

[0116] It should be appreciated that the reader expects responses from a predetermined number of servers (e.g. f1 + k servers) such that either one of them is a tag-value pair (tag, value) in which tag ≥ treq, or a predetermined number of them (e.g. k of them) are tag-coded-element pairs for the same tag with tag ≥ treq (in which case the reader decodes the value).

[0117] Referring now to Fig. 4D, in the third phase, the reader writes back the tag tr corresponding to the returned value vr, and ensures that at least f1 + k servers in £1 have their committed tags at least as high as tr before the read operation completes. However, the value vr is not written back in this third phase, and this is important for decreasing the read cost. The third phase also helps in unregistering the reader from the servers in £1.

[0118] Referring now to Fig. 4E, the overall read operation is shown. A reader performs a "get-committed-tag" operation in which the reader sends QUERY-COMM-TAG to servers in the edge layer £1. The reader then awaits f1 + k responses and selects the highest tag, say treq. The reader also performs a "get-data" operation in which the reader sends (QUERY-DATA, treq) to servers in £1 and awaits responses from f1 + k servers such that at least one of them is a (tag, value) pair, or at least k of them are (tag, coded-element) pairs corresponding to some common tag. In the latter case, the reader decodes the corresponding value using the code C1. The reader selects the (tr, vr) pair corresponding to the highest tag from the available (tag, value) pairs. The reader also performs a "put-tag" operation in which the reader sends (PUT-TAG, tr) to servers in the edge layer £1, awaits responses from f1 + k servers in £1, and then returns the requested value vr.
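Putting the three phases together, a reader might be driven as in the following sketch; the server stubs (query_committed_tag, query_data, put_tag) and the decode callable are hypothetical interfaces standing in for the actual messaging and the code C1, and the responses are assumed to already satisfy one of the two termination conditions.

def read(edge_servers, f1, k, decode):
    """Illustrative three-phase reader flow: get-committed-tag, get-data, put-tag."""
    quorum = edge_servers[: f1 + k]

    # Phase 1: learn the minimum tag treq whose value may be returned.
    treq = max(s.query_committed_tag() for s in quorum)

    # Phase 2: gather f1 + k responses, each of the form (tag, payload, is_full_value).
    responses = [s.query_data(treq) for s in quorum]
    full = [(t, p) for (t, p, is_value) in responses if is_value and t >= treq]
    if full:
        t_r, v_r = max(full, key=lambda tv: tv[0])
    else:
        coded = {}
        for t, p, is_value in responses:
            if not is_value and t >= treq:
                coded.setdefault(t, []).append(p)
        t_r, elems = next((t, e) for t, e in coded.items() if len(e) >= k)
        v_r = decode(elems)                        # apply the erasure code to k coded elements

    # Phase 3: write back only the tag (not the value) to keep the read cost low;
    # this also unregisters the reader at the edge layer servers.
    for s in quorum:
        s.put_tag(t_r)
    return v_r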

[0119] Referring now to Fig. 4F, in general overview, a read operation begins as shown in processing block 140, in which a read request is received at one or more edge layer servers. Processing then proceeds to decision block 142 in which a decision is made as to whether there exists an overlap with a concurrent write operation or an internal write-to-L2 operation. If a decision is made that an overlap exists, then processing proceeds to processing block 144 in which the reader is served tag-value pairs (t, v) from temporary storage of one or more servers in the edge layer £1.

[0120] If in decision block 142 a decision is made that no overlap exists, then processing proceeds to processing block 146 in which one or more servers in the edge layer £1 regenerate (tag, coded-element) pairs. In this scenario, the edge layer servers utilize information from the backend layer servers. This may be accomplished, for example, via regenerate-from-L2 operations as described above in conjunction with Fig. 4C. The regenerated (tag, coded-element) pairs are then sent to the reader.

[0121] Processing then proceeds to processing block 150 where the reader decodes the value v using the appropriate code. Processing then ends.

[0122] Referring now to Figs. 4G-4J, a read operation comprises three phases with a first phase beginning at processing block 160 in which a read request is received at one or more edge layer servers. Processing then proceeds to processing block 162, in which the reader identifies the minimum tag treq whose corresponding value it can return at the end of the operation. A "minimum tag" refers to the smallest tag as defined by the rules of tag comparison explained above. This concludes the phase I processing.
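For concreteness only, the following sketch assumes the common convention that a tag is an (integer counter, writer identifier) pair compared lexicographically; the actual tag-comparison rules are those described earlier in this document, and the identifiers below are illustrative.

from typing import Iterable, Tuple

Tag = Tuple[int, str]  # assumed: (integer counter, writer identifier)

def tag_less_than(t1: Tag, t2: Tag) -> bool:
    """Assumed ordering: compare the integer part first, breaking ties by writer identifier."""
    return t1 < t2

def minimum_tag(tags: Iterable[Tag]) -> Tag:
    """The smallest tag under this ordering (the 'minimum tag' of block 162)."""
    return min(tags)

assert tag_less_than((3, "w1"), (3, "w2"))
assert minimum_tag([(4, "w2"), (3, "w9"), (4, "w1")]) == (3, "w9")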

[0123] The phase II processing begins in processing block 164 in which the reader sends the minimum tag value treq to all of the servers in the edge layer.

[0124] Processing then proceeds to decision block 166 in which a decision is made as to whether the reader received responses from a predetermined number of distinct edge layer servers such that at least one of the following conditions is true: (A) at least one of the responses contains a tag-value pair (tr, vr), or (B) at least k of the responses contain coded elements corresponding to some fixed tag tr. That is, a server must return some tag greater than or equal to the requested tag (which may or may not be the requested tag-value pair), which means that the tag-value pair was stored in that server's local storage, or must return coded elements, which means that no appropriate tag-value pair was stored in that server's local storage and thus the server had to communicate with £2 to obtain coded elements. In embodiments, the predetermined number of distinct edge layer servers may be at least f1 + k distinct edge layer servers.

[0125] Once one of the conditions is true, decision blocks 170 and 173 determine which of the conditions A or B is true. If in decision block 170 a decision is made that condition A is not true, then condition B must be true and processing proceeds to block 176 in which the reader uses the coded elements to decode the value vr corresponding to tag tr. Processing then proceeds to block 175 where the reader writes back the tag tr corresponding to value vr and ensures that at least f1 + k servers in £1 have their committed tags at least as high as tr before the read operation completes.

[0126] If in decision block 170 a decision is made that condition A is true, processing proceeds to block 172 where a tag-value pair (t, v) is selected corresponding to the most recent (or "maximum") tag. Processing then proceeds to decision block 173 in which a decision is made as to whether condition B is also true. If condition B is not also true, then the selected tag-value pair (t, v) is taken as the pair (tr, vr). Processing then proceeds to block 175 as above.

[0127] If in decision block 173 a decision is made that condition B is also true, then processing proceeds to block 174 where the reader uses the code C1 to decode the value vr corresponding to tag tr from the coded elements, and if the tag t of the pair selected in block 172 is more recent than the tag tr, then the tag-value pair (t, v) is renamed as (tr, vr) (i.e., if t > tr, rename (t, v) as (tr, vr)).

[0128] Referring now to Fig. 4I, an illustrative phase two server protocol begins in block 192 in which a server s receives a read request from a reader along with a requested tag treq. Processing then flows to decision block 194 in which a decision is made as to whether (treq, vreq) or (tc, vc) with tc > treq is in the server's list. If one of the conditions is true, server s responds immediately to the reader with the corresponding tag-value pair as shown in processing block 196. Processing then ends.

[0129] If the condition in decision block 194 is not true, then processing proceeds to block 198 in which the server s adds the reader to its list of outstanding readers, along with treq. Processing then proceeds to processing block 200 in which server s initiates a regenerate-from-L2 operation in which the server s attempts to regenerate a tag-coded-element pair (t', c'), t' ≥ treq, via a repair process taking help from servers in £2.

[0130] Processing then proceeds to decision block 202 in which a decision is made as to whether regeneration from the backend layer £2 failed. If the regeneration failed, then processing flows to block 204 in which the server s simply sends the null pair (⊥, ⊥) back to the reader. It should be noted that irrespective of whether regeneration succeeds, the server does not remove the reader from its list of outstanding readers. That is, even though individual regenerations may succeed, there may not be a collection of k servers in the edge layer that all regenerate the same tag. This can happen because of concurrent write operations. In such a situation, by not removing the reader from the list of outstanding readers of a server, the server is allowed to relay a value directly to the reader (even after an individually successful regeneration but a collective failure) so that the read operation eventually completes. Phase two processing then ends.

[0131] If in decision block 202 a decision is made that the regeneration did not fail, then processing flows to block 206 in which the regenerated tag-coded-element pairs are sent from the edge layer £1 to the reader. Phase two processing then ends.

[0132] Described below are several properties of the LDS technique which may be found useful in proving the liveness and atomicity properties of the algorithm. The notation Sa ⊆ £1, |Sa| = f1 + k, is used to denote a set of f1 + k servers in £1 that never crash fail during the execution. The lemmas below apply only to servers that are alive at the concerned point(s) of execution appearing in the lemmas.

[0133] For every operation π in Π corresponding to a non-faulty reader or writer, there exists an associated (tag, value) pair denoted as (tag(π), value(π)). For a write operation π, the (tag(π), value(π)) pair may be defined as the message (tw, v) which the writer sends in the put-data phase. If π is a read, the (tag(π), value(π)) pair is defined as (tr, v), where v is the value that gets returned and tr is the associated tag. In a similar manner, tags may also be defined for those failed write operations that at least managed to complete the first round of the write operation. This is simply the tag tw that the writer would use during a put-data phase, if it were alive. As described, writes that failed before completion of the first round are ignored.

[0134] For any two points T1, T2 in an execution of LDS, we say T1 < T2 if T1 occurs earlier than T2 in the execution. The following three lemmas describe properties of the committed tag tc and of the tags in the list.

[0135] Lemma IV.1 (Monotonicity of committed tag). Consider any two points T1 and T2 in an execution of LDS, such that T1 < T2. Then, for any server s ∈ £1, s.tc|T1 ≤ s.tc|T2.

[0136] Lemma IV.2 (Garbage collection of older tags). For any server s ∈ £1, at any point T in an execution of LDS, if (t, v) ∈ s.L, we have t ≥ s.tc.

[0137] Lemma IV.3 (Persistence of tags corresponding to completed operations). Consider any successful write or read operation φ in an execution of LDS, and let T be any point in the execution after φ completes. For any set S' of f1 + k servers in £1 that are active at T, there exists s ∈ S' such that s.tc|T ≥ tag(φ) and max{t : (t, *) ∈ s.L|T} ≥ tag(φ).

[0138] The following lemma shows that an internal regenerate-from-L2 operation respects previously completed internal write-to-L2 operations. The assumption that f2 < n2/3 is used in the proof of this lemma.

[0139] Lemma IV.4 (Consistency of Internal Reads with respect to Internal Writes). Let a successful internal write-to-L2(t, v) operation executed by some server in £1 be given. Next, consider an internal regenerate-from-L2 operation, initiated after the completion of that write-to-L2 operation by a server s ∈ £1, such that a tag-coded-element pair, say (t', c'), was successfully regenerated by the server s. Then t' ≥ t; i.e., the regenerated tag is at least as high as what was written before the read started.

The following three lemmas are central to proving the liveness of read operations.

[0140] Lemma IV.5 (If an internal regenerate-from-L2 operation fails). Consider an internal regenerate-from-L2 operation initiated at point T of the execution by a server s1 ∈ £1, such that s1 failed to regenerate any tag-coded-element pair based on the responses. Then, there exists a point T' in the execution such that the following statement is true: there exists a subset Sb of Sa with |Sb| = k such that, for every s' ∈ Sb, (t_max, *) ∈ s'.L, where t_max = max over s ∈ £1 of s.tc.

[0141] Lemma IV.6 (If an internal regenerate-from-L2 operation regenerates a tag older than the requested tag). Consider an internal regenerate-from-L2 operation initiated at point T of the execution by a server s1 ∈ £1 such that s1 only manages to regenerate (t, c) based on the responses, where t < treq. Here treq is the tag sent by the associated reader during the get-data phase. Then, there exists a point in the execution such that the following statement is true: there exists a subset Sb of Sa with |Sb| = k such that, for every s' ∈ Sb, (t_max, *) ∈ s'.L, where t_max = max over s ∈ £1 of s.tc.

[0142] Lemma IV.7 (If two internal regenerate-from-L2 operations regenerate differing tags). Consider internal regenerate-from-L2 operations initiated at points T1 and T2 of the execution, respectively, by servers s1 and s2 in £1. Suppose that s1 and s2 regenerate tags t and t' such that t < t'. Then, there exists a point in the execution such that the following statement is true: there exists a subset Sb of Sa with |Sb| = k such that, for every s' ∈ Sb, (t_max, *) ∈ s'.L, where t_max = max over s ∈ £1 of s.tc.

[0143] Theorem IV.8 (Liveness). Consider any well-formed execution of the LDS algorithm, where at most f1 < n1/2 servers crash fail in layer £1 and at most f2 < n2/3 servers crash fail in layer £2. Then every operation associated with a non-faulty client completes.

[0144] Theorem IV.9 (Atomicity). Every well-formed execution of the LDS algorithm is atomic.

[0145] Storage and communication costs associated with read/write operations are analyzed below, and a latency analysis of the algorithm is also carried out, in which estimates for the durations of successful client operations are provided. We also analyze a multi-object system, under bounded latency, to ascertain the contribution of temporary storage toward the overall storage cost. We calculate costs for a system in which the numbers of nodes in the two layers are of the same order, i.e., n1 = Θ(n2). We further assume that the parameters k, d of the regenerating code are such that k = Θ(n2) and d = Θ(n2). This assumption is consistent with usage of codes in practical systems.

[0146] In this analysis, we assume that corresponding to any failed write operation π, there exists a successful write operation π' such that tag(π') > tag(π). This essentially avoids pathological cases where the execution is a trail of only unsuccessful writes. Note that this restriction on the nature of the execution was not imposed while proving liveness or atomicity.

[0147] Lemma V.1 (Temporary Nature of £1 Storage). Consider a successful write operation π ∈ β. Then, there exists a point of execution Tβ(π) in β such that for all T ≥ Tβ(π) in β, we have s.tc|T ≥ tag(π) and (t, v) ∉ s.L|T for all s ∈ £1 and all t ≤ tag(π).

[0148] For a failed write operation π ∈ β, let π' be the first successful write in β such that tag(π') > tag(π). Then, it is clear that for all T ≥ Tβ(π') in β, we have (t, v) ∉ s.L|T for all s ∈ £1 and all t ≤ tag(π), and thus Lemma V.1 indirectly applies to failed writes as well. Further, for any failed write π ∈ β, we define the termination point Tend(π) of π as the point Tβ(π') obtained by applying Lemma V.1 to π'.

[0149] Definition 1 (Extended write operation). Corresponding to any write operation π ∈ β, we define a hypothetical extended write operation πe such that tag(πe) = tag(π), Tstart(πe) = Tstart(π) and Tend(πe) = max(Tend(π), Tβ(π)), where Tβ(π) is as obtained from Lemma V.1.

[0150] The set of all extended write operations in β shall be denoted by Πe.

[0151] Definition 2 (Concurrency Parameter δρ). Consider any successful read operation ρ ∈ β, and let πe denote the last extended write operation in β that completed before the start of ρ. Let Σ = {σe ∈ Πe : tag(σe) > tag(πe) and σe overlaps with ρ}. We define the concurrency parameter δρ as the cardinality of the set Σ.
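Definition 2 amounts to a simple count over operation intervals. The following sketch computes δρ under the assumption that each extended write is represented by its tag and a (start, end) interval given as real numbers; the function name and representation are illustrative only.

from typing import List, Optional, Tuple

def concurrency_parameter(read_interval: Tuple[float, float],
                          extended_writes: List[Tuple[Tuple[int, str], float, float]]) -> int:
    """Count extended writes that overlap the read and carry a tag higher than the tag of the
    last extended write completed before the read started (Definition 2)."""
    r_start, r_end = read_interval
    finished_before = [tag for (tag, _, end) in extended_writes if end < r_start]
    last_tag: Optional[Tuple[int, str]] = max(finished_before, default=None)
    overlapping = [
        tag for (tag, start, end) in extended_writes
        if start <= r_end and end >= r_start                 # overlaps with the read rho
        and (last_tag is None or tag > last_tag)             # higher tag than the last completed write
    ]
    return len(overlapping)

# Example: exactly one overlapping extended write has a higher tag, so delta_rho = 1.
writes = [((1, "w1"), 0.0, 1.0), ((2, "w2"), 2.5, 4.0), ((1, "w3"), 0.2, 0.8)]
assert concurrency_parameter((2.0, 3.0), writes) == 1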

[0152] Lemma V.2 (Write, Read Cost). The communication cost associated with any write operation in β is Θ(n1). The communication cost associated with any successful read operation ρ in β is Θ(1) + n1·I(δρ > 0). Here, I(δρ > 0) is 1 if δρ > 0, and 0 if δρ = 0.

[0153] It should be noted that the ability to reduce the read cost to Θ(1) in the absence of concurrency from extended writes comes from the usage of regenerating codes at the MBR point. Regenerating codes at other operating points are not guaranteed to give the same read cost, depending on the system parameters. For instance, in a system with an equal number of servers in either layer and with identical fault tolerance (i.e., n1 = n2 and f1 = f2), it can be shown that usage of codes at the MSR point will imply that the read cost is Ω(n1) even if δρ = 0.

[0154] Lemma V.3 (Single Object Permanent Storage Cost). The (worst case) storage cost in £2 at any point in the execution of the LDS algorithm is Θ(1).

[0155] Remark 2. Usage of MSR codes, instead of MBR codes, would also give a storage cost of Θ(1). For fixed n2, k, d, the storage cost due to MBR codes is at most twice that of MSR codes. As long as we focus on order results, MBR codes do well in terms of both storage and read costs; see Remark 1 as well.

[0156] For the bounded latency analysis, the delays on the various point-to-point links are assumed to be upper bounded as follows: 1) τ1, for any link between a client and a server in £1; 2) τ2, for any link between a server in £1 and a server in £2; and 3) τ0, for any link between two servers in £1. We also assume that local computations on any process take negligible time when compared to the delay on any of the links. In edge computing systems, τ2 is typically much higher than both τ1 and τ0.

[0157] Lemma V.4 (Write, Read Latency). A successful write operation in β completes within a duration of 4τ1 + 2τ0. The associated extended write operation completes within a duration of 4τ1 + 2τ0 + 2τ2. A successful read operation in β completes within a duration of max(6τ1 + 2τ2, 5τ1 + 2τ0 + τ2).
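As a quick numerical illustration of Lemma V.4 under assumed link delays (τ1 = τ0 = 1 time unit and a ten-times-slower edge-to-backend link, τ2 = 10), the bounds for writes and reads can be evaluated as follows; the function names are illustrative.

def write_latency_bound(tau1: float, tau0: float) -> float:
    """Lemma V.4 bound for a successful write: 4*tau1 + 2*tau0 (no L2 round trip on the critical path)."""
    return 4 * tau1 + 2 * tau0

def read_latency_bound(tau1: float, tau0: float, tau2: float) -> float:
    """Lemma V.4 bound for a successful read: max(6*tau1 + 2*tau2, 5*tau1 + 2*tau0 + tau2)."""
    return max(6 * tau1 + 2 * tau2, 5 * tau1 + 2 * tau0 + tau2)

tau1 = tau0 = 1.0   # assumed client-to-edge and intra-edge delays
tau2 = 10.0         # assumed edge-to-backend delay
assert write_latency_bound(tau1, tau0) == 6.0
assert read_latency_bound(tau1, tau0, tau2) == 26.0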

[0158] 1) Impact of the Number of Concurrent Write Operations on Temporary Storage, via Multi-Object Analysis: Consider implementing N atomic objects in the two-layer storage system described herein, via N independent instances of the LDS algorithm. The value of each of the objects is assumed to have size 1. Let Θ denote an upper bound on the total number of concurrent extended write operations experienced by the system within any duration of τ1 time units. Under appropriate conditions on Θ, it may be shown that the total storage cost is dominated by that of permanent storage in £2. The following simplifying assumptions are made: 1) the system is symmetric so that n1 = n2 and f1 = f2 (and hence k = d); 2) τ0 = τ1; and 3) all the invoked write operations are successful. It should be noted that it is possible to relax any of these assumptions and give a more involved analysis. Also, let μ = τ2/τ1.

[0159] Lemma V.5 (Relative Cost of Temporary Storage). At any point in the execution, the worst case storage cost in £1 is upper bounded by [5 + 2μ]Θn1, while the storage cost in £2 is Θ(N). Specifically, if Θ is small enough that [5 + 2μ]Θn1 is negligible compared to N, the overall storage cost is dominated by that of permanent storage in £2, and is given by Θ(N).

[0160] Described above is a two-layer model for strongly consistent data storage which supports read/write operations. The system and LDS techniques described herein were motivated by the proliferation of edge computing applications. In the system, the first layer is closer (in terms of network latency) to the clients and the second layer stores bulk data. In the presence of frequent read and write operations, most of the operations are served without the need to communicate with the backend layer, thereby decreasing the latency of operations. In that regard, the first layer behaves as a proxy cache. As described herein, in one embodiment regenerating codes are used to simultaneously optimize storage and read costs. In embodiments, it is possible to carry out repair of erasure-coded servers in the backend layer £2. The modularity of the implementation possibly makes the repair problem in the backend layer £2 simpler than in prior art systems. Furthermore, it is recognized that the modularity of the implementation could be advantageously used to implement a different consistency policy, such as regularity, without affecting the implementation of the erasure codes in the backend. Similarly, other codes from the class of regenerating codes, as well as random linear network codes (RLNCs), may also be used in the backend layer without substantially affecting client protocols.