Login| Sign Up| Help| Contact|

Patent Searching and Data


Title:
GENERATING TEST DATA BASED ON HISTOGRAM STATISTICS
Document Type and Number:
WIPO Patent Application WO/2016/195728
Kind Code:
A1
Abstract:
Some examples relate to generating test data based on histogram statistics. In an example, histogram intervals for each column of a table in a database may be obtained, and data values for each histogram interval of a column may be generated. In an example, generating data values may comprise: a) generating a respective set of random values for each histogram interval of the column, wherein number of random values generated for a respective set is equivalent to number of unique values identified within a histogram interval corresponding to the respective set; b) replicating the respective set of random values for each histogram interval until total number of data values in the respective set is equivalent to row count of the histogram interval corresponding to the respective set; and c) combining the respective set of data values for each histogram interval in a temporary data file for the column. a), b), and c) may be iterated for each column of the table to generate a respective temporary data file for the each column. Temporary data files may be merged to generate a final output file.

Inventors:
DANGE ASHISH (IN)
SATI KAILASH (IN)
MATHEW SHINE (IN)
Application Number:
PCT/US2015/044311
Publication Date:
December 08, 2016
Filing Date:
August 07, 2015
Export Citation:
Click for automatic bibliography generation   Help
Assignee:
HEWLETT PACKARD ENTPR DEV LP (US)
International Classes:
G06F17/30
Foreign References:
EP1302871A22003-04-16
US20140201129A12014-07-17
US20050108199A12005-05-19
US20080133582A12008-06-05
US20130139003A12013-05-30
Attorney, Agent or Firm:
FEBBO, Michael A. et al. (3404 E. Harmony RoadMail Stop 7, Fort Collins CO, US)
Download PDF:
Claims:
Claims:

1 . A method of generating test data based on histogram statistics, comprising: obtaining histogram intervals for each column of a table in a database; generating data values for each histogram interval of a column, wherein the generating comprises:

a) generating a respective set of random values for each histogram interval of the column, wherein number of random values generated for a respective set is equivalent to number of unique values identified within a histogram interval corresponding to the respective set;

b) replicating the respective set of random values for each histogram interval until total number of data values in the respective set is equivalent to row-count of the histogram interval corresponding to the respective set; and

c) combining the respective set of data values for each histogram interval in a temporary data file for the column;

iterating a), b), and c) for the each column of the table to generate a respective temporary data file for the each column; and

merging temporary data files to generate a final data file.

2. The method of claim 1 , wherein the histogram intervals for each column of the table in the database are obtained from a metadata table of the database.

3. The method of claim 1 , wherein the histogram intervals for each column of the table in the database are obtained from an Extensible Markup Language (XML) file.

4. The method of claim 1 , further comprising using data values in the final data file to populate a table in the database.

5. The method of claim 1 , further comprising using data values in the final data file for performing a test related to the database.

6. A system for generating test data based on histogram statistics, comprising: an extraction module to extract histogram intervals for each column of a table in a database from histogram statistics for the table;

a data generator module to:

generate data values for each histogram interval of a column, wherein the generation comprises:

a) generating a respective set of random values for each histogram interval of the column, wherein number of random values generated for a respective set is equivalent to number of unique entry counts (UECs) identified within a histogram interval corresponding to the respective set; b) replicating the respective set of random values for each histogram interval until total number of data values in the respective set is equivalent to row-count of the histogram interval corresponding to the respective set; and

c) combining the respective set of data values for each histogram interval in a temporary data file for the column; and

iterate a), b), and c) for the each column of the table to generate a respective temporary data file for the each column; and

a merge module to merge two or more of temporary data files to generate a test data file.

7. The system of claim 6, wherein the histogram statistics include one of number of records in the table, the histogram intervals in the table, row counts in each of the histogram intervals, and unique entry counts (UECs) in each of the histogram intervals.

8. The system of claim 6, wherein data values in the test data file are used for carrying out a performance benchmarking against a user's data in the database.

9. The system of claim 6, wherein the histogram statistics are obtained from an equi-height histogram and/or equi-width histogram.

10. The system of claim 6, wherein the histogram statistics for the table are present in an Extensible Markup Language (XML) file.

1 1 . A non-transitory machine-readable storage medium comprising instructions to generate test data based on histogram statistics, the instructions executable by a processor to:

obtain histogram intervals for each column of a table in a database;

generate data values for each histogram interval of a column, wherein the generating comprises:

a) generating a respective set of random values for each histogram interval of the column, wherein number of random values generated for a respective set is equivalent to number of unique values identified within a histogram interval corresponding to the respective set;

b) replicating the respective set of random values for each histogram interval until total number of data values in the respective set is equivalent to row-count of the histogram interval corresponding to the respective set; and

c) combining the respective set of data values for each histogram interval in a temporary data file for the column;

iterate a), b), and c) for the each column of the table to generate a respective temporary data file for the each column; and

merge temporary data files to generate a test data file.

12. The storage medium of claim 1 1 , further comprising instructions to load data values in the test data file into the database.

13. The storage medium of claim 1 1 , further comprising instructions to execute a query on data values in the test data file.

14. The storage medium of claim 1 1 , wherein the histogram intervals are obtained from histogram statistics for the table that are stored in a metadata table of the database.

15. The storage medium of claim 1 1 , further comprising instructions to use data values in the test data file for testing.

Description:
GENERATING TEST DATA BASED ON HISTOGRAM STATISTICS

Background

[001] Databases have become an integral part of modern day computing.

Whether it is a small start-up or a large enterprise, organizations may need to deal with a vast amount of data these days, which could range from a few terabytes to multiple petabytes of data. Databases provide a useful way of organizing data. Such data may be accessed via a database management system that may allow entry, storage and retrieval of data from a database.

Brief Description of the Drawings

[002] For a better understanding of the solution, embodiments will now be described, purely by way of example, with reference to the accompanying drawings, in which:

[003] FIG. 1 is a block diagram of a computing environment for generating test data based on histogram statistics, according to an example;

[004] FIG. 2 is a block diagram of an example computer system for generating test data based on histogram statistics;

[005] FIG. 3 is a flowchart of an example method of generating test data based on histogram statistics; and

[006] FIG. 4 is a block diagram of an example computer system for generating test data based on histogram statistics. Detailed Description

[007] Data management is vital to success of an organization. Whether it is a private company, a government undertaking, an educational institution, a hospital, or a new start-up, managing data (for example, customer data, vendor data, patient data, etc.) in an appropriate manner is crucial to existence and growth of an enterprise. Computer databases play a useful role in this regard, and a database management system (for example, a Relational Database Management System (RDBMS)) may help a user derive useful information from a database.

[008] Users or customers often do not share their data with database management system vendors since it may be confidential. This may not be an ideal scenario from a vendor's perspective and may lead to significant differences in product behavior during in-house testing and in the actual customer environment. One of the reasons for the difference could be the unavailability of large data sets and data resembling a real world customer's data during product testing. This could also be a challenge, for instance, during defect fixing and performance benchmarking. Therefore, there could be many scenarios where it may be desirable to generate data that imitates data distribution and data pattern of a customer's data.

[009] To address this issue, the present disclosure describes various examples for generating test data based on histogram statistics. In an example, histogram intervals for each column of a table in a database may be obtained, and data values for each histogram interval of a column may be generated. In an example, generating data values may comprise: a) generating a respective set of random values for each histogram interval of the column, wherein number of random values generated for a respective set is equivalent to number of unique values identified within a histogram interval corresponding to the respective set; b) replicating the respective set of random values for each histogram interval until total number of data values in the respective set is equivalent to row count of the histogram interval corresponding to the respective set; and c) combining the respective set of data values for each histogram interval in a temporary data file for the column. Stages a), b), and c) may be iterated for the each column of the table to generate a respective temporary data file for the each column. The temporary data files may be merged to generate a final output file.

[0010] FIG. 1 is a block diagram of a computing environment for generating test data based on histogram statistics, according to an example. Computing environment 100 may include computing devices 102, 104, 106, a database management system 108 and a database 1 10. Aforementioned components of the computing environment i.e. 102, 104, 106, 108 and 1 10 may be in communication with each other, for example, via a computer network 1 12. Such a computer network 1 12 may be a wireless or wired network. Computer network 1 12 may include, for example, a Local Area Network (LAN), a Wireless Local Area Network (WAN), a Metropolitan Area Network (MAN), a Storage Area Network (SAN), a Campus Area Network (CAN), or the like. Further, computer network 1 12 may be a public network (for example, the Internet) or a private network (for example, an intranet). Although only three computing devices 102, 104, 106 and one database 1 10 are shown in FIG. 1 , other examples of this disclosure may include less or more than three computing devices and more than one database.

[0011] Computing devices 102, 104, 106 may each represent any type of computing system capable of reading machine-executable instructions. Examples of computing devices 102, 104, 106 may include, without limitation, a server, a desktop computer, a notebook computer, a tablet computer, a thin client, a mobile device, a personal digital assistant (PDA), a phablet, and the like.

[0012] Database 1 10 may be a repository that stores an organized collection of data. Computing devices 102, 104, 106 may each perform a task or function related to data stored in database 1 10 through database management system 108. For example, computing devices 102, 104, 106 may each store, modify, retrieve, and delete data in database 1 10. Computing devices 102, 104, 106 may each access data from database 1 10 by using a query language such as Structured Query Language (SQL).

[0013] In an example, database 1 10 may be a relational database. In a relational database data may be organized into one or more tables of rows and columns. Each entity type described in a database may have its own table. The rows in table may represent instances of that type of entity and the columns may represent values attributed to that instance.

[0014] Database management system 108 may manage data stored in the database 1 10. In an example, database management system 108 may be present on one or more of the computing devices 102, 104 and 106. In another example, database management system 108 may be present on another computing device, which may be coupled to network 1 12.

[0015] In an example, database management system 108 may be a Relational Database Management System (RDBMS). The data stored in a RDBMS may be accessed by a user-defined query by using a query language such as Structured Query Language (SQL).

[0016] In an example, database management system 108 may derive various types of statistics from the data stored in the database 1 10. In an instance, the statistics may include histogram statistics for a customer's or user's tables in the database 1 10. These statistics may include, for example, number of records in a table, different data ranges or histogram intervals (aka boundary values i.e. low-value and high-value) in a table, row counts, and number of unique values or unique entry counts (i.e. UECs) in various data ranges i.e. number of unique values represented within any particular interval of a histogram. In an instance, a histogram may be created by sorting and dividing the records in a column into intervals (or histogram intervals).

[0017] The histogram statistics may be generated as an equi-height histogram or an equi-width histogram. In the equi-height histograms, the row count may be similar or same for different histogram intervals, whereas in the equi- width histograms the value ranges may be same. In an example, database management system 108 may store the histogram statistics in a metadata table(s) of the database 1 10. In another example, database management system may convert the histogram statistics into an Extensible Markup Language (XML) format, which may be stored in a file.

[0018] In an example, database management system 108 may include an extraction module 1 14, a data generator module 1 16, and a merge module 1 18. In another example, extraction module 1 14, data generator module 1 16, and merge module 1 18 may be present external to database management system 108, for example, on a computing device (such as 102, 104, 106) or another computing device. In such case, extraction module 1 14, data generator module 1 16, and merge module 1 18 may be capable of communicating with database management system 108. In a further example, database management system 108 may include extraction module 1 14, but data generator module 1 16, and merge module 1 18 may be present external to database management system 108.

[0019] The term "module" may refer to a software component (machine readable instructions), a hardware component or a combination thereof. A module may include, by way of example, components, such as software components, processes, tasks, co-routines, functions, attributes, procedures, drivers, firmware, data, databases, data structures, Application Specific Integrated Circuits (ASIC) and other computing devices. [0020] The functionalities performed by extraction module 1 14, data generator module 1 16, and merge module 1 18 are described in reference to FIG. 2 below.

[0021] FIG. 2 is a block diagram of an example computing system 200 for generating test data based on histogram statistics. In an example, computing system 200 may host database management system 108 of FIG.

1 , in which like reference numerals correspond to the same or similar, though perhaps not identical, components. For the sake of brevity, components or reference numerals of FIG. 2 having a same or similarly described function in FIG. 1 are not being described in connection with FIG.

2. Said components or reference numerals may be considered alike.

[0022] Computing system 200 may be a server, a desktop computer, a notebook computer, a tablet computer, a mobile phone, a personal digital assistant (PDA), and the like.

[0023] In an example, computing system 200 may include an extraction module 1 14, a data generator module 1 16, and a merge module 1 18.

[0024] Extraction module 1 14 may obtain or extract histogram statistics from the data stored in a database (for example, 1 10). In an instance, the statistics may include histogram statistics for a customer's or user's tables in the database (for example, 1 10). These statistics may include, for example, number of records in a table, different data ranges or histogram intervals in a table, row counts, and unique entry counts (i.e. UECs) in various data ranges. The histogram statistics may be generated as an equi-height histogram or an equi-width histogram. In the equi-height histograms, the row count may be similar or same for different histogram intervals, whereas in the equi-width histograms keep the value ranges may be same. In an example, extraction module 1 14 may extract histogram statistics from a metadata table(s) in the database (for example, 1 10). In another example, extraction module 1 14 may extract histogram statistics for a table from an XML file. In an example, extraction module 1 14 may extract histogram intervals for each column of a table in a database (for example, 1 10).

[0025] Data generator module 1 16 may generate data values for each histogram interval of the column extracted by extraction module 1 14. In an example, data generator module 1 16 may generate data values for each histogram interval of the column by generating a respective set of random values for each histogram interval of the column. The number of random values generated for a respective set may be equivalent to number of unique values identified within a histogram interval (i.e. UEC) corresponding to the respective set. Data generator module 1 16 may then replicate the respective set of random values for each histogram interval until total number of data values in the respective set is equivalent to row count of the histogram interval corresponding to the respective set. Data generator module 1 16 may combine the respective set of data values for each histogram interval in a temporary data file for the column.

[0026] In a like manner, data generator module 1 16 may generate data values for each column of the table to generate a respective temporary data file for each column of the table. The data in these separate temporary files may thus correspond to data distribution of each of the histogram interval for the column.

[0027] In an instance, the respective set of data values generated for each histogram interval may be stored in a cache memory before they are combined by the data generator module 1 16 in a temporary file. In another instance, data generator module 1 16 may combine the respective set of data values for each histogram interval and store these values in a storage medium such as cache, memory, and/or register.

[0028] To provide an illustration, let's consider a table wherein the range for one of the histogram intervals is 20 - 30, the row count for this interval is 10 and UEC is 4. In this case, data generator module 1 16 may generate any 4 random values (since, in this example, UEC is 4) between data range (histogram interval) 20 - 30. In an instance, instead of generating values randomly, data generator module may follow a guided pattern. In any case, data generator module 1 16 may then replicate the generated values (4 in this example) to achieve a total row count of 10. In an instance, instead generating values randomly, data generator module may follow a guided pattern. This provides the data values for the histogram interval, which matches the row count and UEC of the interval. In a similar manner, data generator module 1 16 may generate data values for each histogram interval of the column of the table.

[0029] Merge module 1 18 may merge two or more of the temporary data files for each column of the table to generate a final data file (or test data file). In an instance a 'combine' function may be used to merge the temporary data files. Merge module 1 18 may combine each of those files record-by-record while separating them with a field-delimiter (for example, a comma). The final output file may be used to load the data into a table of a database. The data values in the final data file may be used, for example, for performing a test related to the database and for carrying out a performance benchmarking against a user's data in the database. In an example, data values in the test data file may be accessed via a user-defined query by using a query language such as Structured Query Language (SQL).

[0030] In an instance, if table includes NULL values, data generator module may not generate any value to match the row count of NULL values. However, such information may be shared in the final output file. When this newly generated data will be populated in a table (for example, for testing or benchmarking), these records (with no values) may be converted to NULL by the database. In another example, NULLs may be represented by special symbols or characters. [0031] FIG. 3 is a flowchart of an example method 300 for generating a test data based on histogram statistics. The method 300, which is described below, may at least partially be executed on a computing device 200 of FIG. 2. However, other computing devices may be used as well. At block 302, histogram intervals for each column of a table in a database may be obtained. At block 304, data values for each histogram interval of a column may be generated. In an example, generating data values for each histogram interval of a column may comprise: a) generating a respective set of random values for each histogram interval of the column, wherein number of random values generated for a respective set is equivalent to number of unique values identified within a histogram interval corresponding to the respective set; b) replicating the respective set of random values for each histogram interval until total number of data values in the respective set is equivalent to row count of the histogram interval corresponding to the respective set; and c) combining the respective set of data values for each histogram interval in a temporary data file for the column. At block 306, stages a), b), and c) may be iterated for the each column of the table to generate a respective temporary data file for the each column. At block 308, temporary data files may be merged to generate a final output file.

[0032] FIG. 4 is a block diagram of an example system 400 for generating a test data based on histogram statistics. System 400 includes a processor 402 and a machine-readable storage medium 404 communicatively coupled through a system bus. Processor 402 may be any type of Central Processing Unit (CPU), microprocessor, or processing logic that interprets and executes machine-readable instructions stored in machine-readable storage medium 404. Machine-readable storage medium 404 may be a random access memory (RAM) or another type of dynamic storage device that may store information and machine-readable instructions that may be executed by processor 402. For example, machine-readable storage medium 404 may be Synchronous DRAM (SDRAM), Double Data Rate (DDR), Rambus DRAM (RDRAM), Rambus RAM, etc. or a storage memory media such as a floppy disk, a hard disk, a CD-ROM, a DVD, a pen drive, and the like. In an example, machine-readable storage medium 404 may be a non-transitory machine-readable medium. Machine-readable storage medium 404 may store instructions 406, 408, 410, and 412. In an example, instructions 406 may be executed by processor 402 to obtain histogram intervals for each column of a table in a database. Instructions 408 may be executed by processor 402 to generate data values for each histogram interval of a column. In an example, generating data values for each histogram interval of a column may comprise: a) generating a respective set of random values for each histogram interval of the column, wherein number of random values generated for a respective set is equivalent to number of unique values identified within a histogram interval corresponding to the respective set; b) replicating the respective set of random values for each histogram interval until total number of data values in the respective set is equivalent to row-count of the histogram interval corresponding to the respective set; and c) combining the respective set of data values for each histogram interval in a temporary data file for the column. Instructions 410 may be executed by processor 402 to iterate a), b), and c) for each column of the table to generate a respective temporary data file for the each column. Instructions 412 may be executed by processor 402 to merge the temporary data files to generate a test data file. 33] For the purpose of simplicity of explanation, the example method of FIG. 3 is shown as executing serially, however it is to be understood and appreciated that the present and other examples are not limited by the illustrated order. The example systems of FIGS. 1 , 2 and 4, and method of FIG. 3 may be implemented in the form of a computer program product including computer-executable instructions, such as program code, which may be run on any suitable computing device in conjunction with a suitable operating system (for example, Microsoft Windows, Linux, UNIX, and the like). Embodiments within the scope of the present solution may also include program products comprising non-transitory computer-readable media for carrying or having computer-executable instructions or data structures stored thereon. Such computer-readable media can be any available media that can be accessed by a general purpose or special purpose computer. By way of example, such computer-readable media can comprise RAM, ROM, EPROM, EEPROM, CD-ROM, magnetic disk storage or other storage devices, or any other medium which can be used to carry or store desired program code in the form of computer-executable instructions and which can be accessed by a general purpose or special purpose computer. The computer readable instructions can also be accessed from memory and executed by a processor. 34] It should be noted that the above-described examples of the present solution is for the purpose of illustration only. Although the solution has been described in conjunction with a specific embodiment thereof, numerous modifications may be possible without materially departing from the teachings and advantages of the subject matter described herein. Other substitutions, modifications and changes may be made without departing from the spirit of the present solution. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and/or all of the steps of any method or process so disclosed, may be combined in any combination, except combinations where at least some of such features and/or steps are mutually exclusive.