

Title:
METHOD AND APPARATUS FOR REMOTE COPY BETWEEN ENTERPRISE STORAGE AND COMMODITY HARDWARE BASED SOFTWARE STORAGE
Document Type and Number:
WIPO Patent Application WO/2016/153497
Kind Code:
A1
Abstract:
A primary storage system manages a source storage volume. A source journal volume stores a plurality of journals each including a sequence number and data stored in the source storage volume. Plural secondary storage systems manage destination journal volumes and a destination storage volume. A computer system comprises: a memory; and a processor configured to receive, from each secondary storage system, a bitmap indicative of a status as to whether each of the journals is stored in each of the destination journal volumes in said each secondary storage system and a hash value of each journal which is stored in each of the destination journal volumes in said each secondary storage system; determine, based on the received bitmaps and hash values for the secondary storage systems, a range of one or more journals which can be reflected from the destination journal volumes to the destination storage volume without collision.

Inventors:
DEGUCHI AKIRA (US)
Application Number:
PCT/US2015/022403
Publication Date:
September 29, 2016
Filing Date:
March 25, 2015
Assignee:
HITACHI LTD (JP)
International Classes:
G06F3/06
Foreign References:
EP2759937A1 (2014-07-30)
US20070266213A1 (2007-11-15)
US20060069865A1 (2006-03-30)
US8069322B2 (2011-11-29)
US20120246429A1 (2012-09-27)
Attorney, Agent or Firm:
LEUNG, Chun-Pok et al. (US)
Claims:
WHAT IS CLAIMED IS:

1. A computer system coupled to a primary storage system and a plurality of secondary storage systems, wherein the primary storage system manages a source storage volume and a source journal volume storing a plurality of journals each including a sequence number and data stored in the source storage volume, wherein the plurality of secondary storage systems manage a plurality of destination journal volumes and a destination storage volume, the computer system comprising:

a memory; and

a processor configured to

receive, from each secondary storage system of the plurality of secondary storage systems, a bitmap indicative of a status as to whether each of the plurality of journals is stored in each of the plurality of destination journal volumes in said each secondary storage system and a hash value of each journal which is stored in each of the plurality of destination journal volumes in said each secondary storage system;

determine, based on the received bitmaps and hash values for the secondary storage systems, a range of one or more journals which can be reflected from the destination journal volumes to the destination storage volume without collision, where the one or more journals within the determined range have sequence numbers without omission and have no collision of hash values; and

instruct the plurality of secondary storage systems to reflect data in the one or more journals within the determined range to the destination storage volume.

2. The computer system according to claim 1, wherein the processor is configured to:

receive, from each secondary storage system of the plurality of secondary storage systems, operation suspend and failure information indicating whether there is operation suspend having a suspend sequence number or not and whether there is failure or not in said secondary storage system;

if there is no failure but there is operation suspend, compare the suspend sequence number with a largest sequence number in the determined range of one or more journals;

if the suspend sequence number is larger than the largest sequence number in the determined range of one or more journals, instruct the plurality of secondary storage systems to reflect data in the one or more journals within the determined range to the destination storage volume; and

if the suspend sequence number is not larger than the largest sequence number in the determined range of one or more journals, instruct the plurality of secondary storage systems to reflect, to the destination storage volume, data in one or more journals within the determined range having a sequence number smaller than the suspend sequence number.

3. The computer system according to claim 2, wherein the processor is configured to:

if there is failure in a failed secondary storage system of the plurality of secondary storage systems, determine a failure suspend sequence number which is a maximum sequence number reported by the failed secondary storage system prior to failure, and compare the failure suspend sequence number with a largest sequence number in the determined range of one or more journals;

if the failure suspend sequence number is larger than the largest sequence number in the determined range of one or more journals, instruct the plurality of secondary storage systems to reflect data in the one or more journals within the determined range to the destination storage volume; and if the failure suspend sequence number is not larger than the largest sequence number in the determined range of one or more journals, issue a failure suspend command to the plurality of secondary storage systems with the failure suspend sequence number to suspend due to failure instead of reflecting data to the destination storage volume.

4. A system which includes the computer system and the secondary storage systems of claim 3, wherein each of the secondary storage systems includes a secondary memory and a secondary processor, the secondary processor being configured to:

receive instruction from the computer system to reflect data in the one or more journals to the destination storage volume; for each journal of the one or more journals, make data according to a data protection method, determine processing target data from the data and the made data, identify a destination storage system from the plurality of secondary storage systems for the determined data, and issue a data write command to the destination storage system to write the determined data.

5. A system which includes the computer system and the secondary storage systems of claim 1, wherein each of the secondary storage systems includes a secondary memory and a secondary processor, the secondary processor being configured to:

receive instruction from the computer system to reflect data in the one or more journals to the destination storage volume; and

for each journal of the one or more journals, starting with the journal having a smallest sequence number, determine whether the destination storage volume is in the same secondary storage system which has received the instruction from the computer system, and if yes, then copy the data from the journal volume for said each journal to the destination storage volume, and if no, then issue a write command to another secondary storage system which has the destination storage volume to write the data from the journal volume for said each journal to the destination storage volume in said another secondary storage system.

6. A system which includes the computer system and the secondary storage systems of claim 1, wherein the computer system is one of the plurality of secondary storage systems.

7. A system which includes the primary storage system, the computer system, and the secondary storage systems of claim 1, wherein the primary storage system is an enterprise storage system, and wherein the computer system and the plurality of secondary storage systems are commodity hardware based software storage systems.

8. A method of operating a computer system coupled to a primary storage system and a plurality of secondary storage systems, wherein the primary storage system manages a source storage volume and a source journal volume storing a plurality of journals each including a sequence number and data stored in the source storage volume, wherein the plurality of secondary storage systems manage a plurality of destination journal volumes and a destination storage volume, the method comprising:

receiving, from each secondary storage system of the plurality of secondary storage systems, a bitmap indicative of a status as to whether each of the plurality of journals is stored in each of the plurality of destination journal volumes in said each secondary storage system and a hash value of each journal which is stored in each of the plurality of destination journal volumes in said each secondary storage system;

determining, based on the received bitmaps and hash values for the secondary storage systems, a range of one or more journals which can be reflected from the destination journal volumes to the destination storage volume without collision, where the one or more journals within the determined range have sequence numbers without omission and have no collision of hash values; and

instructing the plurality of secondary storage systems to reflect data in the one or more journals within the determined range to the destination storage volume.

9. The method according to claim 8, further comprising:

receiving, from each secondary storage system of the plurality of secondary storage systems, operation suspend and failure information indicating whether there is operation suspend having a suspend sequence number or not and whether there is failure or not in said secondary storage system;

if there is no failure but there is operation suspend, comparing the suspend sequence number with a largest sequence number in the determined range of one or more journals;

if the suspend sequence number is larger than the largest sequence number in the determined range of one or more journals, instructing the plurality of secondary storage systems to reflect data in the one or more journals within the determined range to the destination storage volume; and if the suspend sequence number is not larger than the largest sequence number in the determined range of one or more journals, instructing the plurality of secondary storage systems to reflect, to the destination storage volume, data in one or more journals within the determined range having a sequence number smaller than the suspend sequence number.

10. The method according to claim 9, further comprising:

if there is failure in a failed secondary storage system of the plurality of secondary storage systems, determining a failure suspend sequence number which is a maximum sequence number reported by the failed secondary storage system prior to failure, and comparing the failure suspend sequence number with a largest sequence number in the determined range of one or more journals;

if the failure suspend sequence number is larger than the largest sequence number in the determined range of one or more journals, instructing the plurality of secondary storage systems to reflect data in the one or more journals within the determined range to the destination storage volume; and if the failure suspend sequence number is not larger than the largest sequence number in the determined range of one or more journals, issuing a failure suspend command to the plurality of secondary storage systems with the failure suspend sequence number to suspend due to failure instead of reflecting data to the destination storage volume.

11. The method according to claim 10, further comprising:

receiving, by each secondary storage system of the plurality of secondary storage systems, instruction from the computer system to reflect data in the one or more journals to the destination storage volume;

for each journal of the one or more journals, making data according to a data protection method, determining processing target data from the data and the made data, identifying a destination storage system from the plurality of secondary storage systems for the determined data, and issuing a data write command to the destination storage system to write the determined data.

12. The method according to claim 8, further comprising:

receiving, by each secondary storage system of the plurality of secondary storage systems, instruction from the computer system to reflect data in the one or more journals to the destination storage volume; and

for each journal of the one or more journals, starting with the journal having a smallest sequence number, determining whether the destination storage volume is in the same secondary storage system which has received the instruction from the computer system, and if yes, then copying the data from the journal volume for said each journal to the destination storage volume, and if no, then issuing a write command to another secondary storage system which has the destination storage volume to write the data from the journal volume for said each journal to the destination storage volume in said another secondary storage system.

13. A non-transitory computer-readable storage medium storing a plurality of instructions for controlling a data processor to operate a computer system coupled to a primary storage system and a plurality of secondary storage systems, wherein the primary storage system manages a source storage volume and a source journal volume storing a plurality of journals each including a sequence number and data stored in the source storage volume, wherein the plurality of secondary storage systems manage a plurality of destination journal volumes and a destination storage volume, the plurality of instructions comprising:

instructions that cause the data processor to receive, from each secondary storage system of the plurality of secondary storage systems, a bitmap indicative of a status as to whether each of the plurality of journals is stored in each of the plurality of destination journal volumes in said each secondary storage system and a hash value of each journal which is stored in each of the plurality of destination journal volumes in said each secondary storage system;

instructions that cause the data processor to determine, based on the received bitmaps and hash values for the secondary storage systems, a range of one or more journals which can be reflected from the destination journal volumes to the destination storage volume without collision, where the one or more journals within the determined range have sequence numbers without omission and have no collision of hash values; and

instructions that cause the data processor to instruct the plurality of secondary storage systems to reflect data in the one or more journals within the determined range to the destination storage volume.

14. The non-transitory computer-readable storage medium according to claim 13, wherein the plurality of instructions further comprise:

instructions that cause the data processor to receive, from each secondary storage system of the plurality of secondary storage systems, operation suspend and failure information indicating whether there is operation suspend having a suspend sequence number or not and whether there is failure or not in said secondary storage system;

instructions that cause the data processor, if there is no failure but there is operation suspend, to compare the suspend sequence number with a largest sequence number in the determined range of one or more journals; instructions that cause the data processor, if the suspend sequence number is larger than the largest sequence number in the determined range of one or more journals, to instruct the plurality of secondary storage systems to reflect data in the one or more journals within the determined range to the destination storage volume; and

instructions that cause the data processor, if the suspend sequence number is not larger than the largest sequence number in the determined range of one or more journals, to instruct the plurality of secondary storage systems to reflect, to the destination storage volume, data in one or more journals within the determined range having a sequence number smaller than the suspend sequence number.

15. The non-transitory computer-readable storage medium according to claim 14, wherein the plurality of instructions further comprise:

instructions that cause the data processor, if there is failure in a failed secondary storage system of the plurality of secondary storage systems, to determine a failure suspend sequence number which is a maximum sequence number reported by the failed secondary storage system prior to failure, and compare the failure suspend sequence number with a largest sequence number in the determined range of one or more journals; instructions that cause the data processor, if the failure suspend sequence number is larger than the largest sequence number in the determined range of one or more journals, to instruct the plurality of secondary storage systems to reflect data in the one or more journals within the determined range to the destination storage volume; and

instructions that cause the data processor, if the failure suspend sequence number is not larger than the largest sequence number in the determined range of one or more journals, to issue a failure suspend command to the plurality of secondary storage systems with the failure suspend sequence number to suspend due to failure instead of reflecting data to the destination storage volume.

Description:
METHOD AND APPARATUS FOR REMOTE COPY BETWEEN ENTERPRISE STORAGE AND COMMODITY HARDWARE BASED SOFTWARE STORAGE

BACKGROUND OF THE INVENTION

[0001] The present invention relates generally to storage systems and, more particularly, to remote copy in a heterogeneous storage environment including, for example, enterprise storage systems and commodity server based storage.

[0002] Recently, commodity based storage has gained popularity. For example, commodity based storage is used as destination storage of a remote copy system. U.S. Patent No. 8,375,223 discloses a commodity server based storage system, configured from commodity servers each including a CPU, HDDs, memory, etc. Ceph, Red Hat Storage, and Swift are examples of commodity server based storage. U.S. Patent No. 7,529,950 discloses remote copy processing; more specifically, it describes journal technology to assure data consistency in a remote copy system. U.S. Patent No. 8,200,928 discloses remote copy from multiple storage systems to multiple storage systems, in which multiple one-to-one remote copy systems are tied together.

[0003] There is a performance gap or capacity gap between an enterprise storage system and a commodity hardware based software storage node. Remote copy from one enterprise storage system to multiple commodity hardware nodes is assumed in this disclosure. When the technology disclosed in U.S. Patent No. 8,200,928 is used, configuration changes are needed. With such technology, multiple one-to-one remote copy systems are tied into an extended remote copy group. Thus, a buffer area (or journal volume) to store the data to be transferred, and computing resources for the jobs executing remote copy processing, are needed. Also, an operation suspend is needed in order to allow changes to the configuration. One of the purposes of commodity hardware use is cost reduction. However, that purpose is not realized with the traditional technology, since additional resources are needed.

BRIEF SUMMARY OF THE INVENTION

[0004] Exemplary embodiments of the invention provide improved remote copy between a storage system and multiple commodity storage nodes. The commodity based storage nodes are used as destination storage nodes of the remote copy system. The remote copy processing is executed by cooperation of multiple storage nodes. The effective use of commodity storage as remote copy storage results in cost savings. Remote copy methods for different system configurations are described.

[0005] An aspect of the present invention is directed to a computer system coupled to a primary storage system and a plurality of secondary storage systems. The primary storage system manages a source storage volume and a source journal volume storing a plurality of journals each including a sequence number and data stored in the source storage volume.

The plurality of secondary storage systems manage a plurality of destination journal volumes and a destination storage volume. The computer system comprises: a memory; and a processor configured to receive, from each secondary storage system of the plurality of secondary storage systems, a bitmap indicative of a status as to whether each of the plurality of journals is stored in each of the plurality of destination journal volumes in said each secondary storage system and a hash value of each journal which is stored in each of the plurality of destination journal volumes in said each secondary storage system; determine, based on the received bitmaps and hash values for the secondary storage systems, a range of one or more journals which can be reflected from the destination journal volumes to the destination storage volume without collision, where the one or more journals within the determined range have sequence numbers without omission and have no collision of hash values; and instruct the plurality of secondary storage systems to reflect data in the one or more journals within the determined range to the destination storage volume.
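As a purely illustrative sketch (not part of the claimed subject matter; the function name and data shapes below are assumptions), the range determination described above can be expressed as follows. Each secondary node reports a bitmap of which sequence numbers it holds and a hash per held journal; the restorable range grows while each successive sequence number is held by at least one node (no omission) and every node holding it reports the same hash (no collision):

```python
def determine_restorable_range(node_reports, start_seq):
    """Return (lo, hi): the longest run of sequence numbers starting at
    start_seq such that every number is present on at least one node and
    all replicas of the same journal agree on its hash. Returns None if
    even the first journal is unavailable."""
    last_ok = start_seq - 1
    seq = start_seq
    while True:
        # Collect the hash of journal `seq` from every node that holds it.
        hashes = {r["hashes"][seq] for r in node_reports
                  if r["bitmap"].get(seq, False)}
        if not hashes:          # omission: no node holds this journal
            break
        if len(hashes) > 1:     # collision: replicas disagree
            break
        last_ok = seq
        seq += 1
    return (start_seq, last_ok) if last_ok >= start_seq else None
```

In this sketch the bitmap is modeled as a dict of sequence number to presence flag; an actual implementation would use a packed bit array exchanged between nodes.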

[0006] In some embodiments, the processor is configured to: receive, from each secondary storage system of the plurality of secondary storage systems, operation suspend and failure information indicating whether there is operation suspend having a suspend sequence number or not and whether there is failure or not in said secondary storage system; if there is no failure but there is operation suspend, compare the suspend sequence number with a largest sequence number in the determined range of one or more journals; if the suspend sequence number is larger than the largest sequence number in the determined range of one or more journals, instruct the plurality of secondary storage systems to reflect data in the one or more journals within the determined range to the destination storage volume; and if the suspend sequence number is not larger than the largest sequence number in the determined range of one or more journals, instruct the plurality of secondary storage systems to reflect, to the destination storage volume, data in one or more journals within the determined range having a sequence number smaller than the suspend sequence number.
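The suspend handling above reduces to a comparison against the top of the determined range. A minimal sketch (names are assumptions, not from the patent): if the suspend point lies beyond the range, the whole range is reflected; otherwise only journals with sequence numbers strictly smaller than the suspend point are reflected.

```python
def trim_range_for_suspend(suspend_seq, range_lo, range_hi):
    """Trim the determined restorable range [range_lo, range_hi]
    against an operation-suspend sequence number."""
    if suspend_seq > range_hi:
        # Suspend point is beyond the range: reflect everything.
        return (range_lo, range_hi)
    # Reflect only journals strictly below the suspend point.
    return (range_lo, suspend_seq - 1)
```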

[0007] In specific embodiments, the processor is configured to: if there is failure in a failed secondary storage system of the plurality of secondary storage systems, determine a failure suspend sequence number which is a maximum sequence number reported by the failed secondary storage system prior to failure, and compare the failure suspend sequence number with a largest sequence number in the determined range of one or more journals; if the failure suspend sequence number is larger than the largest sequence number in the determined range of one or more journals, instruct the plurality of secondary storage systems to reflect data in the one or more journals within the determined range to the destination storage volume; and if the failure suspend sequence number is not larger than the largest sequence number in the determined range of one or more journals, issue a failure suspend command to the plurality of secondary storage systems with the failure suspend sequence number to suspend due to failure instead of reflecting data to the destination storage volume.
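The failure case differs from a plain suspend only in the action taken when the failure point falls inside the range: instead of trimming and reflecting, a failure suspend command is issued to all secondary systems. A hedged sketch of the decision (return shape is an assumption for illustration):

```python
def failure_decision(failure_suspend_seq, range_lo, range_hi):
    """On node failure, decide whether to reflect the determined range
    or to suspend the whole remote copy group at the failure point."""
    if failure_suspend_seq > range_hi:
        # Failure point is beyond the range: safe to reflect it all.
        return ("reflect", range_lo, range_hi)
    # Otherwise suspend everywhere at the failure point; do not reflect.
    return ("failure_suspend", failure_suspend_seq)
```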

[0008] In some embodiments, each of the secondary storage systems includes a secondary memory and a secondary processor, the secondary processor being configured to: receive instruction from the computer system to reflect data in the one or more journals to the destination storage volume; for each journal of the one or more journals, make data according to a data protection method, determine processing target data from the data and the made data, identify a destination storage system from the plurality of secondary storage systems for the determined data, and issue a data write command to the destination storage system to write the determined data.

[0009] In specific embodiments, each of the secondary storage systems includes a secondary memory and a secondary processor, the secondary processor being configured to: receive instruction from the computer system to reflect data in the one or more journals to the destination storage volume; and for each journal of the one or more journals, starting with the journal having a smallest sequence number, determine whether the destination storage volume is in the same secondary storage system which has received the instruction from the computer system, and if yes, then copy the data from the journal volume for said each journal to the destination storage volume, and if no, then issue a write command to another secondary storage system which has the destination storage volume to write the data from the journal volume for said each journal to the destination storage volume in said another secondary storage system.
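The restore routing above can be sketched as follows (an illustration only; the `Node` class and ownership lookup are assumptions standing in for the secondary storage systems and their volume placement). Journals are applied in ascending sequence order, locally when the receiving node owns the destination volume and via a write command to the owning node otherwise:

```python
class Node:
    """Toy stand-in for a secondary storage system."""
    def __init__(self, name, owned_volumes):
        self.name = name
        self.owned = set(owned_volumes)
        self.written = []                      # (volume, seq, data) applied here

    def write(self, volume, seq, data):
        self.written.append((volume, seq, data))

def reflect_journals(receiving_node, nodes, journals):
    """Restore journals starting with the smallest sequence number:
    copy locally if the destination volume is on the receiving node,
    otherwise issue a write command to the node owning the volume."""
    for j in sorted(journals, key=lambda j: j["seq"]):
        if j["volume"] in receiving_node.owned:
            receiving_node.write(j["volume"], j["seq"], j["data"])
        else:
            owner = next(n for n in nodes if j["volume"] in n.owned)
            owner.write(j["volume"], j["seq"], j["data"])
```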

[0010] In some embodiments, the computer system is one of the plurality of secondary storage systems. The primary storage system is an enterprise storage system. The computer system and the plurality of secondary storage systems are commodity hardware based software storage systems.

[0011] Another aspect of the invention is directed to a method of operating a computer system coupled to a primary storage system and a plurality of secondary storage systems. The primary storage system manages a source storage volume and a source journal volume storing a plurality of journals each including a sequence number and data stored in the source storage volume. The plurality of secondary storage systems manage a plurality of destination journal volumes and a destination storage volume. The method comprises: receiving, from each secondary storage system of the plurality of secondary storage systems, a bitmap indicative of a status as to whether each of the plurality of journals is stored in each of the plurality of destination journal volumes in said each secondary storage system and a hash value of each journal which is stored in each of the plurality of destination journal volumes in said each secondary storage system;

determining, based on the received bitmaps and hash values for the secondary storage systems, a range of one or more journals which can be reflected from the destination journal volumes to the destination storage volume without collision, where the one or more journals within the determined range have sequence numbers without omission and have no collision of hash values; and instructing the plurality of secondary storage systems to reflect data in the one or more journals within the determined range to the destination storage volume.

[0012] Another aspect of this invention is directed to a non-transitory computer-readable storage medium storing a plurality of instructions for controlling a data processor to operate a computer system coupled to a primary storage system and a plurality of secondary storage systems. The primary storage system manages a source storage volume and a source journal volume storing a plurality of journals each including a sequence number and data stored in the source storage volume. The plurality of secondary storage systems manage a plurality of destination journal volumes and a destination storage volume. The plurality of instructions comprise: instructions that cause the data processor to receive, from each secondary storage system of the plurality of secondary storage systems, a bitmap indicative of a status as to whether each of the plurality of journals is stored in each of the plurality of destination journal volumes in said each secondary storage system and a hash value of each journal which is stored in each of the plurality of destination journal volumes in said each secondary storage system; instructions that cause the data processor to determine, based on the received bitmaps and hash values for the secondary storage systems, a range of one or more journals which can be reflected from the destination journal volumes to the destination storage volume without collision, where the one or more journals within the determined range have sequence numbers without omission and have no collision of hash values; and instructions that cause the data processor to instruct the plurality of secondary storage systems to reflect data in the one or more journals within the determined range to the destination storage volume.

[0013] These and other features and advantages of the present invention will become apparent to those of ordinary skill in the art in view of the following detailed description of the specific embodiments.

BRIEF DESCRIPTION OF THE DRAWINGS

[0014] FIG. 1 shows an example of an enterprise storage system.

[0015] FIG. 2 shows an example of a commodity node.

[0016] FIG. 3 shows an example of a remote copy configuration according to a first embodiment of the present invention.

[0017] FIG. 4 shows an example of the storage control information unit and storage program unit in the storage system.

[0018] FIG. 5 shows an example of the program and control information in the software storage on the commodity node.

[0019] FIG. 6 shows an example of a flow diagram illustrating the process for a write program.

[0020] FIG. 7 shows an example of a flow diagram illustrating journal transfer processing.

[0021] FIG. 8 shows an example of a received bitmap.

[0022] FIG. 9 shows an example of multiple commodity nodes to illustrate a method to decide which journal should be restored.

[0023] FIG. 10 shows an example of a flow diagram illustrating a process for a restore method.

[0024] FIG. 11 shows an example of a flow diagram illustrating a restore processing executed in all nodes which receive the restore range notification in step S309 of FIG. 10.

[0025] FIG. 12 shows another example of the journal read program which detects a special journal.

[0026] FIG. 13 shows another example of a flow diagram illustrating a process for the restore method performed by the restore control program and report program.

[0027] FIG. 14 shows another example of a flow diagram illustrating a restore processing for the journal restore program which considers the data protection method and issues a data write command to the node having the storage media.

[0028] FIG. 15 shows an example of a remote copy configuration having a remote copy node according to a second embodiment of the present invention.

[0029] FIG. 16 shows an example of a remote copy node table. This table is managed in the management server, as part of control information therein.

[0030] FIG. 17 shows an example of a flow diagram illustrating a process for a pair creation program.

[0031] FIG. 18 shows an example of a volume table to manage virtual volume and relationship between virtual volume and mapped external volume.

[0032] FIG. 19 shows an example of a flow diagram illustrating a process for a virtualization program which is executed in the storage node.

[0033] FIG. 20 shows an example of a flow diagram illustrating a process for a resync program which is executed in a storage node.

[0034] FIG. 21 shows an example of a flow diagram illustrating a process for a fail over program which changes the configuration to one with remote copy node.

DETAILED DESCRIPTION OF THE INVENTION

[0035] In the following detailed description of the invention, reference is made to the accompanying drawings which form a part of the disclosure, and in which are shown by way of illustration, and not of limitation, exemplary embodiments by which the invention may be practiced. In the drawings, like numerals describe substantially similar components throughout the several views. Further, it should be noted that while the detailed description provides various exemplary embodiments, as described below and as illustrated in the drawings, the present invention is not limited to the embodiments described and illustrated herein, but can extend to other embodiments, as would be known or as would become known to those skilled in the art. Reference in the specification to "one embodiment," "this embodiment," or "these embodiments" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the invention, and the appearances of these phrases in various places in the specification are not necessarily all referring to the same embodiment. Additionally, in the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. However, it will be apparent to one of ordinary skill in the art that these specific details may not all be needed to practice the present invention. In other circumstances, well-known structures, materials, circuits, processes and interfaces have not been described in detail, and/or may be illustrated in block diagram form, so as to not unnecessarily obscure the present invention.

[0036] Furthermore, some portions of the detailed description that follow are presented in terms of algorithms and symbolic representations of operations within a computer. These algorithmic descriptions and symbolic representations are the means used by those skilled in the data processing arts to most effectively convey the essence of their innovations to others skilled in the art. An algorithm is a series of defined steps leading to a desired end state or result. In the present invention, the steps carried out require physical manipulations of tangible quantities for achieving a tangible result. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals or instructions capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, instructions, or the like. It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise, as apparent from the following discussion, it is appreciated that throughout the description, discussions utilizing terms such as "processing," "computing," "calculating," "determining," "displaying," or the like, can include the actions and processes of a computer system or other information processing device that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system's memories or registers or other information storage, transmission or display devices.

[0037] The present invention also relates to an apparatus for performing the operations herein. This apparatus may be specially constructed for the required purposes, or it may include one or more general-purpose computers selectively activated or reconfigured by one or more computer programs. Such computer programs may be stored in a computer-readable storage medium including non-transitory medium, such as, but not limited to optical disks, magnetic disks, read-only memories, random access memories, solid state devices and drives, or any other types of media suitable for storing electronic information. The algorithms and displays presented herein are not inherently related to any particular computer or other apparatus. Various general-purpose systems may be used with programs and modules in accordance with the teachings herein, or it may prove convenient to construct a more specialized apparatus to perform desired method steps. In addition, the present invention is not described with reference to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the invention as described herein. The instructions of the programming language(s) may be executed by one or more processing devices, e.g., central processing units (CPUs), processors, or controllers.

[0038] Exemplary embodiments of the invention, as will be described in greater detail below, provide apparatuses, methods and computer programs for remote copy between enterprise storage and commodity hardware based software storage.

[0039] First Embodiment

[0040] FIG. 1 shows an example of an enterprise storage system. The enterprise storage system 200 includes cache unit 201, storage I/F 202, processor 203, disk I/F 204, volume 205, one or more disks 206, storage control information unit 207, and storage program unit 208. The storage I/F 202 is coupled to other nodes of a system of storage nodes via a network, and mediates communication with the other nodes. The processor 203 executes a wide variety of processing by executing a wide variety of programs that have been stored into the storage program unit 208. Moreover, the processor 203 executes a wide variety of processing by using a wide variety of information that has been stored into the storage control information unit 207. The disk I/F 204 is coupled to at least one disk 206 via a bus. The volume 205 that is configured to manage data is configured by at least one storage region of the disk 206. One example of disk 206 is an HDD (Hard Disk Drive). The disk 206 is not restricted to an HDD and can also be an SSD (Solid State Drive) or a DVD, for instance. Moreover, at least one disk 206 can be collected up in a unit of a parity group, and a high reliability technique such as RAID (Redundant Arrays of Independent Disks) can also be used. The volume 205 is provided as disk 206 to an operating system using the volume. The storage control information unit 207 stores a wide variety of information used by a wide variety of programs. The storage program unit 208 stores a wide variety of programs, such as a read processing program, a write processing program, and so on. The cache unit 201 caches the data stored in the disk 206 for a performance boost.

[0041] FIG. 2 shows an example of a commodity node. The commodity node 100 has program 101, memory 104, CPU 102, HDD 105, control information 103, and network interface (IF) 106. This can be a physical configuration of a management server, a compute node, and a storage node. The compute node may have a large number of processors and the storage node may have a large amount of HDD capacity. A management program may be installed in the management server.

[0042] FIG. 3 shows an example of a remote copy configuration according to a first embodiment of the present invention. In the configuration of a remote copy system, the storage system on the left side of FIG. 3 is the source storage. The source storage has three source volumes and one journal volume in the example. Each source volume is a volume storing data and is accessed from the server. A journal contains transfer data, and the journal volume is a storage area to store the journals temporarily. There are three storage nodes on the right side of FIG. 3. These three nodes comprise the destination storage system. Each node has a journal volume and a destination volume. The journals are transferred from the journal volume in the source storage to the journal volumes in the destination storage. Finally, the data included in the journals is written, or reflected, to the destination volumes. This is called restore processing. The restore processing may occur across the nodes.

[0043] FIG. 4 shows an example of the storage control information unit and storage program unit in the storage system. When the storage system is used as source storage system of remote copy, these programs and control information are needed. Sequence number, volume table, and remote copy node table are stored in the storage control information unit. Write program and journal transfer program are stored in the storage program unit. Details of the tables and programs are described herein below.

[0044] FIG. 5 shows an example of the program and control information in the software storage on the commodity node. When the software storage 400 is used as the target of remote copy, these programs and control information are needed. Received bitmap, hash value list, head sequence number, and volume table are stored in the control information 103. Journal read program, restore control program, journal restore program, report program, and resync program are stored in the storage program 101.

[0045] FIG. 6 shows an example of a flow diagram illustrating the process for a write program. The write program is executed in the storage system and receives a write command from the server or virtual machine (S100). Then, the program obtains the sequence number and increments it (S101). The program creates the journal and stores it to the journal volume (S102). After that, the program sends the completion message to the requester and terminates the processing (S103). The journal has data and metadata. The data is the written data to be transferred to the destination storage for data replication. The metadata includes information such as the destination storage ID, volume ID, address, etc. After the write data is stored to the volume in the destination storage, the journal is deleted from the journal volume.
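The write path of FIG. 6 can be sketched as follows. This is a minimal illustrative sketch, not the disclosed implementation; the function and field names are assumptions made for clarity.

```python
import itertools

# Monotonic sequence-number source for journals (S101).
_sequence = itertools.count(1)

def handle_write(journal_volume, dest_storage_id, volume_id, address, data):
    """Sketch of FIG. 6: create a journal for a write and store it."""
    seq = next(_sequence)                      # S101: obtain and increment
    journal = {                                # S102: create the journal
        "seq": seq,
        "data": data,                          # written data to replicate
        "meta": {"storage": dest_storage_id,   # destination storage ID
                 "volume": volume_id,
                 "address": address},
    }
    journal_volume.append(journal)             # store to the journal volume
    return seq                                 # S103: completion to requester
```

The journal's metadata carries enough information (destination storage, volume, address) for the destination side to reflect the data later.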

[0046] FIG. 7 shows an example of a flow diagram illustrating journal transfer processing. The journal transfer processing is realized by the journal read program executed in the destination storage and the journal transfer program executed in the source storage system. First, the journal read program issues a journal read command to the source storage system (S200) and waits for data transfer from the source storage system (S201). When communication fails for a certain (preset) period of time or number of times, a failure is recorded (S202). On the source storage side, the journal transfer program receives the journal read command (S203) and determines the journals to be transferred (S204). One or more determination methods can be considered. For example, a predetermined number of journals is chosen starting from the minimum sequence number, or a rule which can calculate the sequence numbers for processing is created by using a job ID or the like. Then, the journal transfer program reads the journals from the journal volume, transfers them to the requester (S205), and terminates the processing (S206).

[0047] Back on the destination storage side, the journal read program receives journals from the source storage system and stores them to the journal volume in the destination storage (S207). The program updates the received sequence number bitmap (S208) and the hash value calculated from the write destination address (S209). The received bitmap has a bit for each sequence number, as illustrated in FIG. 8. The program obtains the sequence numbers by checking the metadata of the journals and changes the bits corresponding to the obtained sequence numbers. A hash value is managed for each bit of the received bitmap. The program calculates the hash value from the destination address of the journal and stores the calculated hash value. Finally, the program terminates the processing (S210).

[0048] FIG. 8 shows an example of a received bitmap. The received bitmap indicates the sequence numbers of the journals which have been received by the destination storage. If the destination storage has multiple commodity nodes, all commodity nodes manage the received bitmap. FIG. 9 shows an example of multiple commodity nodes. The bitmap 901 manages the arrival status of the journal (JNL). "ON" means that the JNL is already received. The JNLs with sequence numbers "100," "101," "104," and "106" are already received, and the JNLs with sequence numbers "102," "103," etc. are not received yet. The head sequence number 900 on the left side indicates which sequence number the first bit of the received bitmap corresponds to. Since the head sequence number is "100" in the example shown, the first bit of the bitmap manages the arrival status of the JNL with sequence number "100" in this example. The hash value 902 manages hash values which are calculated from the destination address. A valid hash value (i.e., one having a value) is managed only for bits that are "ON" in the bitmap field. In this example, a hash value is managed for the sequence numbers "100," "101," "104," and "106." This hash value is used to avoid inconsistency caused by multiple nodes writing (remote copying) data to the same address. Details of this are shown in FIG. 9.
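The received bitmap of FIG. 8 can be sketched as a small data structure. The class name, the fixed size, and the hash function are illustrative assumptions; the essential points from the disclosure are the head sequence number, one bit per journal, and a hash value kept only for bits that are ON.

```python
class ReceivedBitmap:
    """Sketch of FIG. 8: per-node journal arrival status plus hashes."""

    def __init__(self, head_seq, size):
        self.head = head_seq               # sequence number of the first bit
        self.bits = [False] * size         # arrival status per sequence number
        self.hashes = [None] * size        # valid only where bits[i] is True

    def mark(self, seq, dest_address):
        """Record arrival of a journal (S208) and its address hash (S209)."""
        i = seq - self.head                # first bit corresponds to head_seq
        self.bits[i] = True
        # Stand-in for the hash of the write destination address.
        self.hashes[i] = hash(dest_address) % 1000

    def received(self, seq):
        return self.bits[seq - self.head]
```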

[0049] FIG. 9 shows an example of multiple commodity nodes to illustrate a method to decide which journals should be restored. A restore process refers to copying data to the destination volume. In this example, there are three commodity nodes. Node 1 has bitmap 903 and hash value 904, node 2 has bitmap 905 and hash value 906, and node 3 has bitmap 907 and hash value 908. First, journals should be restored in the order of the sequence number in order to avoid data inconsistency. Thus, the range for which the journals are received without omission is detected. In the example of FIG. 9, the range 907 is shown for journals received without omission. Second, the method to avoid data inconsistency caused by old data overwrite is described. The method detects a range 908 to avoid such data inconsistency. In this example, if node 2 restores the journal having sequence number "102" and hash value "eee" after node 3 has restored the journal having sequence number "103" and hash value "eee," the data will be inconsistent. Such a hash value collision occurs when two identical hash values appear in two different journals having different sequence numbers, indicating potential data inconsistency due to old data overwrite. The method detects the range 908, and all nodes can restore the journals in the range 908 in parallel without data inconsistency. Performance that scales with the number of nodes can be achieved because each node can execute restore processing in parallel.
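The range decision of FIG. 9 can be sketched as follows. The function combines the per-node reports, finds the gap-free prefix of sequence numbers (range 907), and then trims it at the first repeated destination-address hash (range 908). The function name and report format are assumptions for illustration.

```python
def decide_restore_range(head_seq, node_reports):
    """node_reports: one {sequence number: address hash} dict per node.

    Returns (first, last) restorable sequence numbers, inclusive.
    """
    merged = {}
    for report in node_reports:        # combine bitmaps/hashes from all nodes
        merged.update(report)
    # Gap-free prefix of received journals (range "907" in the figure).
    end = head_seq
    while end in merged:
        end += 1
    # Trim at the first hash collision (range "908"): a repeated hash means
    # two journals target the same address, so parallel restore could let
    # old data overwrite new data.
    seen = set()
    for seq in range(head_seq, end):
        h = merged[seq]
        if h in seen:
            return head_seq, seq - 1   # stop just before the collision
        seen.add(h)
    return head_seq, end - 1
```

With the FIG. 9 example (journals 102 and 103 both hashing to "eee"), the restorable range ends at 102, so nodes never reflect the two colliding journals in parallel.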

[0050] FIG. 10 shows an example of a flow diagram illustrating a process for the restore method mentioned above. The restore control program is executed in one of the nodes or another node which does not belong to the destination storage of remote copy. The report program is executed in all nodes which belong to the destination storage of remote copy.

[0051] First, the restore control program issues a read request for the received bitmap and hash value (S300) and waits for a response from all nodes (S303). Each node which receives the read request executes the report program. The report program obtains the received bitmap and hash value and reports them (S301). Then, the report program terminates the processing (S302). On the restore control program side, the program decides the range 907 for which the journals are received without omission by checking all bitmaps obtained from all nodes (S304). Then, the program checks the hash values in the decided range 907 (S305) and checks whether there is a hash value collision (S306). If the result is "no," which means there is no hash value collision, the program decides the range 907 as range 908 (S307). If the result is "yes," which means there is a hash value collision, the program makes the range without hash value collisions into range 908 by checking the hash values obtained from all nodes (S308). Finally, the program notifies the restore range to all nodes (S309). In particular, the biggest sequence number in the range 908 is notified.

[0052] FIG. 11 shows an example of a flow diagram illustrating a restore processing executed in all nodes which receive the restore range notification in step S309 of FIG. 10. The journal restore program receives the restore command with a sequence number from the restore control program (S400). The program determines the journal which has the smallest sequence number (S401) and checks if the restore destination volume is in the same node (S402). If the result is "yes," the program copies the data from the journal volume to the secondary volume (S403). If the result is "no," the program issues a write command to the commodity node which has the restore destination volume (S404). After step S403 or S404, the program checks if all journals are processed (S405). If the result is "no," the program returns to step S401 and executes the same steps for the next journal. If the result is "yes," the program terminates the processing (S406).
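The per-node restore loop of FIG. 11 can be sketched as follows, under the assumption that each node holds a subset of the destination volumes; the names and the `send_write` callback are illustrative, not from the disclosure.

```python
def restore(journals, max_seq, local_volumes, send_write):
    """Sketch of FIG. 11: apply journals up to the notified sequence number.

    local_volumes: {volume id: {address: data}} held by this node.
    send_write: callback issuing a write command to another node (S404).
    """
    pending = sorted((j for j in journals if j["seq"] <= max_seq),
                     key=lambda j: j["seq"])        # S401: smallest first
    for j in pending:
        vol = j["meta"]["volume"]
        if vol in local_volumes:                    # S402 yes -> S403
            local_volumes[vol][j["meta"]["address"]] = j["data"]
        else:                                       # S402 no -> S404
            send_write(j)
    return len(pending)                             # S405/S406
```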

[0053] Next, copy processing of remote copy is described. Then, the operation suspend and failure suspend processing is described. Communication or link failure between the source storage and destination storage is recorded in step S202 of FIG. 12 as described below. It is one of the failures; other failures can also be recorded. Then, operation suspend is explained. First, the source storage which receives the suspend operation from the user or management software makes a special journal specifying operation suspend. The destination storage can detect the operation suspend command and change the status of the remote copy pair by receiving the special journal. In this embodiment of the invention, since there are multiple nodes, the detection of the special journal should be notified to the other nodes.

[0054] FIG. 12 shows another example of the journal read program which detects a special journal. Steps S200-S209 in FIG. 12 are the same as those of the journal read program in FIG. 7. Steps S500-S502 are different steps. Step S500 is executed after step S209 to check if a journal specifying suspend is included. The metadata of the journal has information describing the journal type. If the result is "no," the program terminates the processing (S502). If the result is "yes," the program records the sequence number of the journal specifying suspend (S501) and then terminates the processing (S502). The following describes how to use the failure information obtained in step S202 and the sequence number of suspend.

[0055] FIG. 13 shows another example of a flow diagram illustrating a process for the restore method performed by the restore control program and report program. Steps S302-S309 in FIG. 13 are the same as those in FIG. 10. Steps S600-S606 are different steps. In step S600, the restore control program issues a read request for received operation suspend and failure information in addition to the received bitmap and hash value. The report program sends the operation suspend and failure information as well as the received bitmap and hash value (S601). The restore control program performs steps S303-S307. In step S602, executed just after step S307, the restore control program checks whether a failure or operation suspend has been detected. If the result is "no," the program proceeds to step S309 and notifies the restore range with the biggest sequence number. If the result is "yes," which means a failure or suspend is detected, the restore control program checks whether it is a failure or a suspend (S603). If the result is "no" to failure, which means operation suspend is notified, the program checks if the sequence number of suspend is bigger than the restore sequence number, which is the biggest sequence number in the range 908 (S604). If the result is "yes," the program proceeds to step S309. If the result is "no," the program decides the journals which have sequence numbers smaller than the suspend sequence number as the restore range and notifies it to all nodes with the suspend sequence number and command (S606). If the result of step S603 is "yes" to failure, the program notifies the restore range and failure suspend command to all nodes (S605) to suspend due to failure instead of reflecting data to the destination storage volume.
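The suspend/failure branching of steps S602-S606 can be sketched as a single decision function. The return convention (a maximum restorable sequence number plus a command string) is an assumption made to keep the sketch compact.

```python
def adjust_restore_range(restore_max, suspend_seq=None, failure=False):
    """Sketch of FIG. 13, steps S602-S606.

    restore_max: biggest sequence number in range 908.
    suspend_seq: sequence number of a suspend journal, if one was detected.
    Returns (max sequence to restore, command for the nodes).
    """
    if failure:                              # S603 yes -> S605
        return restore_max, "failure-suspend"
    if suspend_seq is None:                  # S602 no -> S309
        return restore_max, "restore"
    if suspend_seq > restore_max:            # S604 yes -> S309
        return restore_max, "restore"
    # S606: restore only journals below the suspend sequence number.
    return suspend_seq - 1, "suspend"
```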

[0056] In the method described above, failure suspend is executed just after failure detection. However, as long as there are journals which can be processed, it is possible to delay the status change. In that case, the restore control program records the maximum sequence number reported (prior to the failure) by the node which reported the failure. This sequence number is called the failure suspend sequence number. Steps similar to steps S604 and S605 are performed. Therefore, if the failure suspend sequence number is larger than the biggest sequence number in range 908, the program proceeds to step S309, which notifies the restore range, and does not notify the failure suspend status change. Otherwise, the program proceeds to step S605.

[0057] In restore processing, the data is read from the journal volume and written to the destination volume. This is described above in connection with steps S403 and S404. Generally, data protection technology, such as RAID (Redundant Arrays of Inexpensive Disks), triplication, or erasure coding, is used to improve data reliability in a storage system. Therefore, the data written to the destination volume can be stored in multiple storage media. In software storage based on commodity nodes, data protection is realized across multiple nodes because the reliability of a commodity node is low or the number of storage media installed in a commodity node is small. With triplication, two copies of the data are stored in other nodes. In the case with execution of step S404, the data is transferred to another node which manages the destination volume, and copied data or parity data are made. The copied data or parity data are transferred to other nodes again. The method to avoid this wasteful transfer is explained herein below.

[0058] FIG. 14 shows another example of a flow diagram illustrating a restore processing for the journal restore program which considers the data protection method and issues data write commands to the node having the storage media. Only a few steps are similar to those of FIG. 11. First, the journal restore program receives the restore command with a sequence number (S700) and determines one journal (S701). Then, the program makes data according to the data protection method (S702). If the data protection method is mirroring, the copy data is made. If the data protection method is RAID5, the parity data is made in this step. After that, the program determines the processing target data from the written data and the made data, and identifies a destination node for the determined data (S703). In the case of mirroring, step S704 is executed two times to store the written data and the mirror data. In the case of triplication, step S704 is executed three times to store the written data and two mirror copies. For example, the destination of the copied data is determined. Next, the program issues a data write command to the determined destination node (S704). This data write command includes the address of the storage media, such as an HDD, and does not include the address of a volume. In step S705, the program checks if all data are processed. In the mirror case, write completion of the original data and copy data are checked, for instance. If the result is "no," the program returns to step S703 and executes S703 and S704 for the other data. If the result is "yes," the program checks if all journals are processed (S706). If the result is "no," the program returns to step S701 and executes the same steps for the next journal. If the result is "yes," the program terminates the processing (S707).
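The inner loop of FIG. 14 can be sketched for the mirroring case as follows. The `placement` callback, which maps a journal and piece index to the node holding the media, is an assumption; a RAID5 variant would derive parity in place of extra copies (S702).

```python
def restore_with_protection(journal, replicas, placement, write_to_node):
    """Sketch of FIG. 14 for one journal with simple mirroring.

    replicas: total pieces to store (e.g., 3 for triplication).
    placement(journal, i): node holding the media for piece i (S703).
    write_to_node(node, piece): media-address write command (S704).
    """
    data = journal["data"]
    pieces = [data] * replicas                 # S702: mirror copies
    for i, piece in enumerate(pieces):         # S703/S705 loop over pieces
        node = placement(journal, i)
        write_to_node(node, piece)             # S704: write to media node
    return len(pieces)
```

Because the restoring node addresses each media node directly, the copied data is not relayed through the node that manages the destination volume, which is the wasteful transfer the paragraph above describes.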

[0059] Second Embodiment

[0060] In the method of the first embodiment, the remote copy processing is realized by cooperation of multiple nodes. In the second embodiment, one representative node which is called the remote copy node executes all remote copy processing.

[0061] FIG. 15 shows an example of a remote copy configuration having a remote copy node according to a second embodiment of the present invention. In this example, the destination storage has commodity nodes. There are three nodes having destination volumes on the right side of the figure. These nodes store the remote copy data physically, but they do not execute the remote copy processing. There is a remote copy node at the center of the figure. This remote copy node has a journal volume and executes the remote copy processing, but it does not store the remote copy data. There are three volumes in the remote copy node. These volumes are virtual volumes corresponding to the volumes in the commodity nodes on the right side.

[0062] Generally, enterprise storage has volume virtualization technology which maps a volume to another storage system. The mapped volume is defined as a virtual volume and provided to the server. If the storage receives a read or write request for the virtual volume from the server, the read or write request is issued to the storage managing the physical data. In the second embodiment, the source storage is the same as that for the first embodiment. The copy processing executed by the remote copy node is similar to the traditional technology. However, the method which builds the configuration according to the pair operation, such as pair creation, has not been disclosed. A management server manages the remote copy node and issues the remote copy pair operation.

[0063] FIG. 16 shows an example of a remote copy node table. This table is managed in the management server, as part of the control information therein. The remote copy node table manages consistency group ID and remote copy node ID as attributes. A consistency group is a unit of remote copy processing. One or more remote copy pairs are included in a consistency group. Data consistency is assured for remote copy pairs included in the same consistency group. Thus, a sequence number is managed for each consistency group. The processes described above for the first embodiment are executed for each consistency group. Consistency group ID manages the identification of the consistency group. Remote copy node ID manages the identification of the node which executes remote copy processing. In this example, the remote copy node for consistency group "0" is node "1." There is no remote copy node for consistency group "1" because consistency group "1" is not used.
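The remote copy node table of FIG. 16 can be sketched as a mapping from consistency group ID to remote copy node ID, with `None` marking an unused group. The dictionary representation and helper name are assumptions.

```python
# Sketch of FIG. 16: group "0" is handled by node "1"; group "1" is unused.
remote_copy_node_table = {0: 1, 1: None}

def node_for_group(group_id):
    """Look up the node that executes remote copy processing for a group."""
    node = remote_copy_node_table.get(group_id)
    if node is None:
        raise LookupError("no remote copy node for consistency group %d"
                          % group_id)
    return node
```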

[0064] FIG. 17 shows an example of a flow diagram illustrating a process for a pair creation program. This program manages the remote copy node table and issues the volume virtualization command and remote copy pair creation command to the storage. First, the pair creation program receives the pair creation command from a user via a management terminal or the like (S800). Then, the program checks if there is a remote copy node (S801). If the result is "no," which means there is no remote copy node, the program issues the pair creation command with the ordinary method to the node which has the volume (S803). The program records the node as the remote copy node (S804). With this method, when the first remote copy pair is created, the node having the volume of the first remote copy pair becomes the remote copy node. In this step, the remote copy node table is updated. If the result is "yes," which means there is a remote copy node, the program checks if the remote copy node has the target volume which is specified by the pair creation command (S802). If the result is "yes," which means the specified volume is in the remote copy node, the program executes ordinary pair creation processing in steps S803 and S804. If the result is "no," which means the specified volume is not in the remote copy node, the program issues the command to make a virtual volume and maps the specified volume to the virtual volume (S805). After this step, the virtual volume is in the remote copy node. A process to make a virtualized volume is described in connection with FIG. 19. Then, the program issues the pair creation command to the remote copy node (S806). The storage node executes ordinary remote copy pair creation processing for the virtual volume. Finally, the program terminates the processing (S807). Next, the table and program to virtualize the volume are described.
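The branching of FIG. 17 can be sketched as follows. The callbacks (`has_volume`, `make_virtual`, `create`) are assumed stand-ins for the volume check, the virtualization command (S805), and the ordinary pair creation command (S803/S806).

```python
def create_pair(table, group, target_node, has_volume, make_virtual, create):
    """Sketch of FIG. 17 for one consistency group.

    table: {consistency group: remote copy node or None} (FIG. 16).
    Returns the remote copy node for the group after pair creation.
    """
    rc_node = table.get(group)
    if rc_node is None:                 # S801 no: first pair in the group
        create(target_node)             # S803: ordinary pair creation
        table[group] = target_node      # S804: record the remote copy node
    elif has_volume(rc_node):           # S802 yes: volume already there
        create(rc_node)                 # S803/S804: ordinary processing
    else:                               # S802 no: volume on another node
        make_virtual(rc_node, target_node)  # S805: map as a virtual volume
        create(rc_node)                 # S806: pair creation on rc node
    return table[group]
```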

[0065] FIG. 18 shows an example of a volume table to manage virtual volumes and the relationship between a virtual volume and its mapped external volume. On the left side of FIG. 18 is a table stored in storage node 1. On the right side of FIG. 18 is a table stored in storage node 2. Each table has storage ID, volume ID, and physical area as attributes. Storage ID is an identification of the storage node. Volume ID is an identification of the volume. Physical area manages which location is used for the volume physically. In the example shown, storage node 1 has volumes 0, 1, 2 and so on. The data of volume 0 is stored physically in the storage media belonging to RAID group 1. Storage node 2 has volumes 100, 101, 0 and so on. The data of its volume 0 is stored physically in volume 0 in storage node 1. The volume 0 in storage node 1 is mapped to storage node 2. The volume ID 0 is assigned to the mapped volume which is provided to the server. Next, the method to virtualize the volume is described.
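The two volume tables of FIG. 18 can be sketched as nested dictionaries, with an `("external", storage, volume)` physical area marking a virtual volume mapped to another node. The tuple encoding and the `resolve` helper are illustrative assumptions.

```python
# Sketch of FIG. 18: {storage ID: {volume ID: physical area}}.
volume_table = {
    1: {0: ("raid", 1), 1: ("raid", 1), 2: ("raid", 2)},
    2: {100: ("raid", 1), 101: ("raid", 1),
        0: ("external", 1, 0)},    # virtual volume mapped to node 1, vol 0
}

def resolve(storage_id, volume_id):
    """Follow external mappings to the node that physically holds the data."""
    area = volume_table[storage_id][volume_id]
    while area[0] == "external":
        storage_id, volume_id = area[1], area[2]
        area = volume_table[storage_id][volume_id]
    return storage_id, area
```

A read or write to virtual volume 0 on node 2 thus resolves to volume 0 on node 1, matching the behavior described for virtual volumes above.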

[0066] FIG. 19 shows an example of a flow diagram illustrating a process for a virtualization program which is executed in the storage node. First, the virtualization program receives the mapping command from the management server (S900). Then, the program obtains external volume information by issuing a query command, such as a SCSI INQUIRY command, to the external storage (S901). After that, the program creates the virtual volume (S902). This means updating the volume table as well. Finally, the program sends the completion message to the requester and terminates the processing (S903).

[0067] The foregoing describes the method to virtualize the destination volume of remote copy when the pair creation command is issued. The journal volume can be virtualized by processing similar to FIGS. 18 and 19. When the journal volume is added to the consistency group, the virtualization program is executed.

[0068] Two configurations are described for the first and second embodiments. These two configurations can be mixed in one remote copy system because the direction of remote copy can be swapped. For example, the destination storage will become the source storage during main server maintenance or failure. The configuration without a remote copy node is used when the software storage based on commodity nodes is used as the destination storage. The configuration with a remote copy node is used when the software storage based on commodity nodes is used as the source storage, for instance.

[0069] FIG. 20 shows an example of a flow diagram illustrating a process for a resync program which is executed in a storage node. Resync processing swaps the direction of remote copy. In this example, the configuration is changed from one without a remote copy node to one with a remote copy node. First, the resync program receives a resync command from the management server (S1000) and changes the status of the node to a remote copy processing node (S1001). This means that the node which receives the resync command becomes the remote copy node. The program makes the virtual volumes and maps the journal volumes and destination volumes in other nodes to the made virtual volumes (S1002). Then, the program initializes the sequence number (S1003) and obtains remote copy information, such as pair information and differential bitmap information, from other nodes (S1004). After that, the program notifies the status of paths to the server (S1005). The server receives the notification from step S1005 and issues the command to get multipath state information (S1006). The storage node sends the current path information to the server (S1008). The server receives the current path information and updates the path information in the server (S1007). After that, the server issues an I/O request to the remote copy node. Back on the storage node side, the resync program reports the completion to the requester (S1009) and starts remote copy processing asynchronously (S1010).

[0070] Some storage systems can run both a server program and a storage program; this is called a converged system. In the case of a converged system, the node running the server program should be the remote copy node so as to minimize communication between nodes.

[0071] FIG. 21 shows an example of a flow diagram illustrating a process for a fail over program which changes the configuration to one with a remote copy node. The program is executed in a commodity node as server processing or management server processing. First, the program obtains the fail over destination of a server process (S1100). The information can be obtained from fail over software. Then, the program issues a command to change the status to remote copy node to the node which was obtained in step S1100. In step S1101, the node which receives the command executes steps S1001 to S1004 of FIG. 20 to change the status and virtualize the volumes. Steps S1005 to S1007 are not needed because the server processing and the remote copy processing are executed in the same node. Finally, the program terminates the processing (S1102).
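The fail over flow of FIG. 21 can likewise be sketched in a few lines. The function and parameter names are assumptions for illustration only; the fail over software and the command interface are represented as injected callables.

```python
def fail_over_to_remote_copy_node(get_failover_destination, issue_command):
    """Hypothetical sketch of the fail over program of FIG. 21 (S1100-S1102).

    get_failover_destination: callable standing in for the fail over
        software, returning the destination node of the server process.
    issue_command: callable standing in for the command path to a node.
    """
    # S1100: obtain the fail over destination of the server process
    destination = get_failover_destination()
    # S1101: command that node to change its status to remote copy node;
    # the node then runs steps S1001 to S1004 of FIG. 20. Steps S1005 to
    # S1007 are skipped because server processing and remote copy
    # processing run on the same node.
    issue_command(destination, "become_remote_copy_node")
    # S1102: terminate the processing
    return destination
```

Injecting the two callables keeps the sketch self-contained while preserving the division of labor in the patent: the fail over software knows the destination, and the destination node itself performs the status change and volume virtualization.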

[0072] Of course, the system configurations illustrated in FIGS. 3 and 15 are purely exemplary of information systems in which the present invention may be implemented, and the invention is not limited to a particular hardware configuration. The computers and storage systems implementing the invention can also have known I/O devices (e.g., CD and DVD drives, floppy disk drives, hard drives, etc.) which can store and read the modules, programs and data structures used to implement the above-described invention. These modules, programs and data structures can be encoded on such computer-readable media. For example, the data structures of the invention can be stored on computer-readable media independently of one or more computer-readable media on which reside the programs used in the invention. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include local area networks, wide area networks, e.g., the Internet, wireless networks, storage area networks, and the like.

[0073] In the description, numerous details are set forth for purposes of explanation in order to provide a thorough understanding of the present invention. However, it will be apparent to one skilled in the art that not all of these specific details are required in order to practice the present invention. It is also noted that the invention may be described as a process, which is usually depicted as a flowchart, a flow diagram, a structure diagram, or a block diagram. Although a flowchart may describe the operations as a sequential process, many of the operations can be performed in parallel or concurrently. In addition, the order of the operations may be re-arranged.

[0074] As is known in the art, the operations described above can be performed by hardware, software, or some combination of software and hardware. Various aspects of embodiments of the invention may be implemented using circuits and logic devices (hardware), while other aspects may be implemented using instructions stored on a machine-readable medium (software), which if executed by a processor, would cause the processor to perform a method to carry out embodiments of the invention. Furthermore, some embodiments of the invention may be performed solely in hardware, whereas other embodiments may be performed solely in software. Moreover, the various functions described can be performed in a single unit, or can be spread across a number of components in any number of ways. When performed by software, the methods may be executed by a processor, such as a general purpose computer, based on instructions stored on a computer-readable medium. If desired, the instructions can be stored on the medium in a compressed and/or encrypted format.

[0075] From the foregoing, it will be apparent that the invention provides methods, apparatuses and programs stored on computer readable media for remote copy between enterprise storage and commodity hardware based software storage. Additionally, while specific embodiments have been illustrated and described in this specification, those of ordinary skill in the art appreciate that any arrangement that is calculated to achieve the same purpose may be substituted for the specific embodiments disclosed. This disclosure is intended to cover any and all adaptations or variations of the present invention, and it is to be understood that the terms used in the following claims should not be construed to limit the invention to the specific embodiments disclosed in the specification. Rather, the scope of the invention is to be determined entirely by the following claims, which are to be construed in accordance with the established doctrines of claim interpretation, along with the full range of equivalents to which such claims are entitled.