


Title:
DEVICE, SYSTEM AND METHOD FOR PARALLEL DATA SORTING
Document Type and Number:
WIPO Patent Application WO/2019/170961
Kind Code:
A1
Abstract:
A sorting device, system and method, in which parallel sorting circuitry (410, 412, 420, 430) performs ordering of two alternating adjacent sorting register pairs of more than a million sorting registers (410, 412) until no changes are made in either of the two alternating adjacent register pairs. A designated memory access interface (148) reads input data from an input array (112) of a memory (110) of a host computer for the parallel sorting circuitry (140) and writes sorted data to an output array (118) of the memory (110) of the host computer. A controller (144; 310) receives instructions from at least one processor (120) of the host computer and responsively controls the parallel sorting circuitry (140) to perform parallel sorting according to the received instructions.

Inventors:
KEROLA TEEMU (FI)
Application Number:
PCT/FI2019/050171
Publication Date:
September 12, 2019
Filing Date:
March 04, 2019
Assignee:
UNIV HELSINKI (FI)
International Classes:
G06F7/26; G06F7/24
Foreign References:
US 5511189 A (1996-04-23)
US 5142687 A (1992-08-25)
Other References:
IOULIIA SKLIAROVA: "Parallel Data Processing in Reconfigurable Systems - Parallel data sort High-level synthesis", 1 January 2015 (2015-01-01), XP055587266, Retrieved from the Internet [retrieved on 2019-05-09]
DINA BITTON ET AL: "A taxonomy of parallel sorting", ACM COMPUTING SURVEYS, ACM, NEW YORK, NY, US, US, vol. 16, no. 3, 2 September 1984 (1984-09-02), pages 287 - 318, XP058351548, ISSN: 0360-0300, DOI: 10.1145/2514.2516
Attorney, Agent or Firm:
ESPATENT OY (FI)
Claims:

1. A sorting device comprising:

a parallel sorting circuitry (410, 412, 420, 430) configured to perform ordering of two alternating adjacent sorting register pairs of a plurality of sorting registers (410, 412) until no changes are made in either of the two alternating adjacent register pairs of the plurality of sorting registers (410, 412); characterized in that

the plurality of sorting registers comprises more than one million sorting registers;

a designated memory access interface for reading of input data from an input array (112) of a memory (110) of a host computer for the parallel sorting circuitry (140) and for writing of sorted data to an output array (118) of the memory (110) of the host computer; and

a controller (144; 310) configured to receive instructions from at least one processor (120) of the host computer and to responsively control the parallel sorting circuitry (140) to perform parallel sorting according to the received instructions.

2. The sorting device of claim 1, characterized in that the designated memory access interface is a shared bus interface.

3. The sorting device of claim 1 or 2, characterized in that the designated memory access interface is a direct memory access interface (148).

4. The sorting device of claim 1, characterized in that the designated memory access interface is a point-to-point interconnection memory interface.

5. The sorting device of any one of preceding claims, characterized in that the parallel sorting circuitry (140) is formed in an application specific integrated circuit.

6. The sorting device of any one of preceding claims, characterized in that the parallel sorting circuitry (140) is a coprocessor for the host computer.

7. The sorting device of any one of preceding claims, characterized in that the parallel sorting circuitry (140) is integrated with at least one central processing unit for the host computer.

8. The sorting device of any one of preceding claims, characterized by a plurality of the parallel sorting circuitries (140) configured to perform parallel sorting simultaneously and independently of other parallel sorting circuitries (140).

9. The sorting device of any one of preceding claims, characterized in that the controller (310) is configured to manage operation of a plurality of the parallel sorting circuitries (140).

10. The sorting device of any one of preceding claims, characterized in that the sorting device comprises a plurality of the controllers (310) each configured to manage operation of different parallel sorting circuitries (140).

11. The sorting device of claim 9 or 10, characterized in that the controller (114; 310) is further configured to merge outputs of the parallel sorting circuitries (140).

12. The sorting device of any one of preceding claims, characterized in that the controller (144; 310) is configured to obtain the location indicators (412) associated with key values (410) from the input array (112), and to write the sorted key values with the associated location indicators to the output array (118).

13. The sorting device of any one of preceding claims, characterized in that the parallel sorting circuitry (140) comprises:

a plurality of sorting registers (410) each configured to store information comprising at least one key value;

a first set of parallel sorting units (420) each configured to order key values stored in respective first adjacent sorting register pairs; and

a second set of parallel sorting units (430) each configured to order key values stored in respective second adjacent sorting register pairs after the first set of parallel sorting units (420) have ordered the key values stored in the respective first adjacent sorting register pairs;

wherein the second adjacent sorting register pairs are interlaced with the first adjacent sorting register pairs.

14. The sorting device of any one of preceding claims, characterized in that the first and second sets of parallel sorting units (420, 430) are further configured to issue a change signal if the ordering of the key values changes the key value in any of the first or second adjacent sorting register pairs.

15. The sorting device of any one of preceding claims, characterized in that the first and second sets of parallel sorting units (420, 430) are further configured to order respective first and second adjacent sorting register pairs until no change signal is issued by any of the first and second sets of parallel sorting units.

16. The sorting device of any one of preceding claims, characterized in that the second adjacent sorting register pairs are interlaced with the first adjacent sorting register pairs such that each of the second adjacent sorting register pairs comprises one sorting register of each of the first adjacent sorting register pairs.

17. A sorting system (100), comprising:

a host computer comprising at least one processor (120) and a memory (110); characterized in that the sorting system (100) further comprises the sorting device of any one of preceding claims.

18. The sorting system (100) of claim 17, characterized in that:

the sorting system (100) comprises a plurality of the sorting circuitries (140) configured to sort different blocks of data in parallel with other sorting circuitries (140) of the sorting system (100);

at least one processor (120, 310) of the host computer is configured to merge outputs of the parallel sorting circuitries (140).

19. A sorting method (100), comprising: requesting (510) a parallel sorting circuitry (140) comprising two alternating adjacent sorting register pairs of a plurality of sorting registers (410, 412) to sort data with the sorting registers (410, 412) until no changes are made in either of the two alternating adjacent register pairs of the plurality of sorting registers (410, 412); characterized by

the plurality of sorting registers comprising more than one million sorting registers; the requesting being performed by a device driver (122) of a host computer; and performing by the parallel sorting circuitry:

using a designated memory access interface (148) for reading (530) of input data from an input array (112) of a memory (110) of a host computer for the parallel sorting circuitry (140) and for writing (550) of sorted data to an output array (118) of the memory (110) of the host computer; and

performing the parallel sorting according to the requesting (510).

Description:
DEVICE, SYSTEM AND METHOD FOR PARALLEL DATA SORTING

TECHNICAL FIELD

The present invention generally relates to parallel data sorting.

BACKGROUND ART

This section illustrates useful background information without admission that any technique described herein is representative of the state of the art. There are numerous algorithms for sorting a large data set (n items) according to some specific field (key) in it. The algorithms can be classified as comparison-based and non-comparison-based algorithms. Comparison algorithms are based on comparing keys to each other, and they have average time complexity of at least O(n log n). Such algorithms include, e.g., Quicksort, Heapsort, Mergesort, Timsort, and Bubble sort. Bubble sort is the slowest of these, with time complexity O(n²). However, it has the advantage that it can terminate quickly (in time O(n)) if the data set is already sorted. Timsort has that same advantage.
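The early-termination property mentioned above can be illustrated with a minimal sketch (illustrative only, not part of the disclosure): a pass that makes no swaps proves the data is sorted, so an already sorted input costs a single O(n) pass.

```python
def bubble_sort(items):
    """Bubble sort with early termination: O(n^2) in the worst case,
    but a single O(n) pass when the input is already sorted."""
    a = list(items)
    n = len(a)
    for i in range(n - 1):
        swapped = False
        for j in range(n - 1 - i):
            if a[j] > a[j + 1]:
                a[j], a[j + 1] = a[j + 1], a[j]
                swapped = True
        if not swapped:  # a full pass with no swaps: the array is sorted
            break
    return a
```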

Sorting has also been performed with multi-core and multiprocessor systems. However, the resulting speedup has been modest: with only a few cores or processors, the cost of splitting the work and combining the results of the different cores or processors reduces the advantage of the parallel processing achieved by using multiple cores or processors.

There is a need for faster and/or more efficient sorting devices, systems and methods.

SUMMARY

According to a first example aspect of the invention there is provided a sorting device according to appended claim 1.

In this document, ordering may refer to swapping one or more pairs of items if necessary to arrange a wanted order among said one or more pairs. Sorting may refer to bringing a plurality of adjacent items to a wanted order. The received instructions may define the input array. The received instructions may define the output array.

The designated memory access interface may be or comprise a direct memory access interface. The designated memory access interface may be or comprise a shared bus interface. The designated memory access interface may be or comprise a point-to-point interconnection memory interface. The designated memory access interface may be or comprise a QuickPath Interconnect memory interface. The designated memory access interface may be or comprise a Peripheral Component Interconnect Express memory interface.
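The notion of "ordering" a pair, and of signaling whether a change was made, can be sketched as follows (an illustrative sketch; the function name is ours, not from the disclosure):

```python
def order_pair(a, i):
    """Order one adjacent pair (a[i], a[i+1]): swap if out of order.
    Returns True if a change was made, modeling a change signal."""
    if a[i] > a[i + 1]:
        a[i], a[i + 1] = a[i + 1], a[i]
        return True
    return False
```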

The parallel sorting circuitry may be implemented as or in an application specific integrated circuit. The parallel sorting circuitry may be implemented as a coprocessor for the host computer. The parallel sorting circuitry may be integrated with at least one central processing unit for the host computer.

The sorting device may comprise a plurality of the parallel sorting circuitries configured to perform parallel sorting simultaneously and independently of other parallel sorting circuitries.

The controller may be configured to manage operation of a plurality of the parallel sorting circuitries. The sorting device may comprise a plurality of controllers each configured to manage operation of different parallel sorting circuitries.

The controller may be further configured to merge outputs of the parallel sorting circuitries.

The at least one processor of the host device may merge outputs of the parallel sorting circuitries and/or merging results produced by the controller on merging the outputs of the parallel sorting circuitries. The at least one processor may perform the merging based on computer program code comprised by a sorting application and / or a sorting device driver.

The sorting registers may be configured to store a key value and an associated location indicator. The location indicator may comprise an index. The location indicator may comprise a pointer. The location index may refer to a data item associated with the key value.

The controller may be configured to obtain the data items associated with the key values and to write the obtained data items in the output array together with the sorted key values using the associated location indexes.

The parallel sorting circuitry may comprise:

a plurality of sorting registers each configured to store information comprising at least one key value;

a first set of parallel sorting units each configured to order key values stored in respective first adjacent sorting register pairs; and

a second set of parallel sorting units each configured to order key values stored in respective second adjacent sorting register pairs after the first set of parallel sorting units have ordered the key values stored in the respective first adjacent sorting register pairs;

wherein the second adjacent sorting register pairs are interlaced with the first adjacent sorting register pairs. The plurality of sorting registers may each be configured to store information comprising at least one key value and at least one location indicator.

The first and second sets of parallel sorting units may physically be the same units that control alternate adjacent sorting register pairs. The first and second sets of parallel sorting units may be further configured to issue a change signal if the ordering of the key values changes the key value in any of the first or second adjacent sorting register pairs. The first and second sets of parallel sorting units may be further configured to order respective first and second adjacent sorting register pairs until no change signal is issued by any of the first and second sets of parallel sorting units.

The second adjacent sorting register pairs may be interlaced with the first adjacent sorting register pairs such that each of the second adjacent sorting register pairs comprises one sorting register of each of the first adjacent sorting register pairs.
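The interplay of the first and second sets of sorting units described above corresponds to odd-even transposition sorting. A software model may be sketched as follows (a hypothetical sketch of the hardware behavior; in the actual circuitry each pair within a phase is ordered by its own sorting unit simultaneously, whereas this sketch iterates over the pairs sequentially):

```python
def odd_even_transposition_sort(registers):
    """Model of the parallel sorting circuitry: alternately order the
    first (even-start) and interlaced second (odd-start) adjacent
    register pairs until neither phase raises a change signal."""
    regs = list(registers)
    n = len(regs)
    while True:
        changed = False
        for start in (0, 1):  # first set of pairs, then the second set
            for i in range(start, n - 1, 2):
                if regs[i] > regs[i + 1]:
                    regs[i], regs[i + 1] = regs[i + 1], regs[i]
                    changed = True  # a change signal was issued
        if not changed:  # no unit issued a change signal: sorted
            break
    return regs
```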

Advantageously, the device of the first example aspect may enable sorting in linear O(n) time and space, assuming that n is not larger than the number of registers in the specialized sorting processor. The device of the first example aspect may have a simple structure, and the number of sorting registers can be greater than 10^6 or greater than 10^9. The sorting registers may be co-located in a single component. The sorting registers may be located in a single chip. The chip may be a silicon chip.

According to a second example aspect of the invention there is provided a sorting device comprising:

a parallel sorting circuitry configured to perform ordering of two alternating adjacent sorting register pairs of a plurality of sorting registers until no changes are made in either of the two alternating adjacent register pairs;

a direct memory access interface for direct reading of input data from an input array of a memory of a host computer for the parallel sorting circuitry and for direct writing of sorted data to an output array of the memory of the host computer;

a controller configured to receive instructions from at least one processor of the host computer and to responsively control the parallel sorting circuitry to perform parallel sorting according to the received instructions.

According to a third example aspect of the invention there is provided a system comprising a host computer and the sorting device or the parallel sorting circuitry of the first or second example aspect. In the system of the third example aspect, merging may be implemented in a parallel sorting application, a sorting device driver, or in a sorting device controller. In the system of the third example aspect, the parallel sorting application running on the host computer may define and extract the input array from its own data set. The parallel sorting application may invoke the parallel sorting device driver to sort the input array in a wanted order using the parallel sorting circuitry. The parallel sorting application may use the parallel sorting output array to sort its original data set in a wanted order.

According to a fourth example aspect of the invention there is provided a method according to appended claim 19.

According to a fifth example aspect of the invention there is provided a method for sorting data, comprising requesting a parallel sorting circuitry comprising two alternating adjacent sorting register pairs of a plurality of sorting registers to sort data with the sorting registers until no changes are made in either of the two alternating adjacent register pairs. The requesting may be performed by a device driver of a host computer. The parallel sorting circuitry may: use a direct memory access interface for direct reading of input data from an input array of a memory of the host computer for the parallel sorting circuitry and for direct writing of sorted data to an output array of the memory of the host computer; and perform the parallel sorting according to the requesting.

Different non-binding example aspects and embodiments of the present invention have been illustrated in the foregoing. The embodiments in the foregoing are used merely to explain selected aspects or steps that may be utilized in implementations of the present invention. Some embodiments may be presented only with reference to certain example aspects of the invention. It should be appreciated that corresponding embodiments may apply to other example aspects as well.

BRIEF DESCRIPTION OF THE DRAWINGS

Some example embodiments of the invention will be described with reference to the accompanying drawings, in which:

Fig. 1 shows a schematic drawing of a system 100 according to an embodiment of the invention;

Figs. 2 and 3 show two other example embodiments of parallel bubble sorting (PBS) circuitry implementations with a plurality of PBS circuitries;

Fig. 4 shows a detail of PBS circuitry of an embodiment; and

Fig. 5 shows a flow chart according to an embodiment of the invention.

DETAILED DESCRIPTION

In the following description, like reference signs denote like elements or steps.

Fig. 1 shows a schematic drawing of a system 100 according to an embodiment of the invention. The system comprises a memory 110, a processor CPU 120, a memory bus or data channel 130 and a parallel sorting circuitry 140, also referred to as a parallel bubble sort (PBS) circuitry. The memory 110, processor 120 and memory bus 130 may be those of a normal computer and they are drawn with dashed lines in Fig. 1. Within the memory 110 there are particular memory areas allocated for a PBS input array 112 (data to be sorted), PBS parameters 114 (parameters defining e.g. in which direction the sorting should be carried out, i.e. ascending or descending), a PBS output array 118 to which a sorting result will be output by the PBS circuitry 140, and a PBS result 116 in which the number of steps needed for the sort may be placed. In an alternative embodiment, the number of steps needed is returned as a (device driver) function value. In an embodiment, the PBS output array 118 is stored partially or entirely in a mass memory. The PBS output array 118 comprises in an example embodiment keys 410 (Fig. 4) and location indicators 412 (Fig. 4) in the wanted order. Alternatively, the PBS output array 118 comprises only the keys 410 or the location indicators 412 of the data items in the wanted order. In an embodiment, the PBS input array 112 is stored partially or entirely in a mass memory.

The processor CPU 120 executes applications, including a PBS application 124 that is an application that at least sorts data, and a PBS device driver (DD) 122 that is e.g. loaded on starting up an operating system also comprised by the system 100. The PBS application 124 may be any application that calls the PBS DD 122 for sorting data. Fig. 1 is drawn to simply show an application 124. In this document, the term PBS application is simply used to help the reader understand that reference is being made to the one application that is currently running the sorting as described, out of the many applications that may be run by the CPU 120. The PBS application 124 calls the PBS circuitry 140 through the PBS device driver 122 in the manner in which applications may call functions of various hardware through their device drivers. The device driver 122 may also be implemented as an operating system service routine.

The PBS circuitry 140 comprises a plurality of PBS registers 142, a PBS controller 144 and PBS control and status registers 146 for parallel bubble sorting. The PBS circuitry 140 may further comprise a direct memory access (DMA) interface for communication with the memory 110. This operation will be described in more detail with reference to Fig. 5. Figs. 2 and 3 show two other example embodiments of PBS circuitry implementations with a plurality of PBS circuitries.

Fig. 4 shows a detail of the PBS circuitry 140, i.e. a plurality of sorting registers 142 each configured to store information comprising at least one key value 410;

a first set of parallel sorting units 420 each configured to order key values stored in respective first adjacent sorting register pairs (e.g. the adjacent sorting register pairs touching the sorting unit 1 boxes in Fig. 4); and a second set of parallel sorting units 430 each configured to order key values stored in respective second adjacent sorting register pairs (e.g. the adjacent sorting register pairs touching the sorting unit 2 boxes in Fig. 4) after the first set of parallel sorting units 420 have ordered the key values stored in the respective first adjacent sorting register pairs; and

wherein the first and second sets of parallel sorting units may physically be the same units, but controlling alternate adjacent sorting register pairs; and

wherein the second adjacent sorting register pairs are interlaced with the first adjacent sorting register pairs.

The key value 410 may be in any form, including integers of different bit lengths, floating point numbers, and strings. The sorting registers 142 for the key values 410 may be of any bit length, such as 32, 64 or 128 bits. The location field 412 may have any bit length, such as 32, 64 or 128 bits.

Fig. 5 shows a flow chart of a sorting process of an example embodiment. The PBS application 124 uses the PBS DD 122 for sorting a data set. In step 500, the PBS application 124 defines and extracts the input array 112 (with (key 410, location 412) pairs) from an original data set. The key 410 can be any field in the original data set. The location 412 may be a (multidimensional) array index, pointer, or any other location indicator for data items. The original data set may be a (multi-dimensional) array or any (distributed) data structure, in the memory 110, in a mass memory that comprises one or more hard disks and/or solid state drives, or in a cloud server. The original data set may contain much more data than just the key fields. In step 510, the PBS application 124 gives the PBS DD 122 a sorting request. The sorting request comprises, for example, a key count n, key 410 and location 412 field types, sorting order, and the memory location of the input array 112, which contains all (key 410, location 412) pairs, as well as the address of the resulting output array 118. Herein, a key 410 may refer to an integer or floating point key value of, for example, 32, 64 or 128 bits, or some particular string of characters of, for example, 32, 64 or 128 bits. In step 520, if the key count n exceeds the PBS size N of the PBS circuitry 140, the PBS DD 122 splits the input array 112 into size-N blocks (and possibly a remainder block with size < N). In step 525, the PBS DD 122 gives a sorting request with parameters 114 to the PBS circuitry 140 to sort the input array 112 into the wanted key value order, and to write the result to the output array 118. In step 530, the PBS circuitry 140 reads (e.g. with DMA) the input array 112 from the memory 110 to the PBS sorting registers 142. In step 540, the PBS circuitry 140 sorts the registers 142 according to the key values. In step 550, the PBS circuitry 140 writes (e.g. with DMA) the sorted registers 142 back to the memory 110, to the designated PBS output array 118. In step 560, the PBS circuitry 140 also returns, in an embodiment, the number of steps used in the sort to the PBS DD 122. In another embodiment, the number of steps used in the sort is written to the PBS parameter space 114 or to the PBS result 116. This number can be used, for example, by the PBS DD 122 or by the PBS application 124 to determine whether the PBS input array 112 was already in the wanted order. In an embodiment, the PBS output array 118 is written on top of the input array 112. In this case, the PBS DD 122 may signal to the PBS circuitry 140 that the output array location is the same as the input array location. Alternatively, the PBS DD 122 may signal to the PBS circuitry 140 a definition of the PBS output array 118 such that it defines the memory range of the PBS input array 112. In case the PBS circuitry 140 has sorted only one of plural blocks, the reading, sorting and writing of data and the associated signaling between the PBS DD 122 and the PBS circuitry (or circuitries) 140 may be performed for the block in question only.
In some embodiments, the PBS input array 112 is allowed to be discontinuous, i.e. formed of two or more different subranges of the memory 110. In this case, the PBS application 124 may give the PBS DD 122 the PBS input array 112 definition accordingly as a set of different subranges. Also the output array 118 may be formed as two or more different subranges of the memory 110. If in step 520 the PBS DD 122 split the input array 112 into size-N blocks, the PBS output array 118 may consist of corresponding blocks. Each block is sorted, but sorting throughout the blocks may still be required. To this end, the entire PBS output array 118 is merged in step 570 or in step 590 for merging the separate blocks therein. The number of the blocks is n/N (rounded up), and the merge time complexity is O(n log(n/N)), which is then also the time complexity for the overall sort. In an embodiment, the merging in step 570 is carried out by the PBS DD 122. In an embodiment, the PBS DD 122 controls at least one core or processor of the CPU 120 to perform the merging. In another embodiment, the merging is performed by the PBS application 124 (that e.g. controls at least one core or processor of the CPU 120 to perform the merging). In the embodiment of Fig. 3, the merging may be done by the device controller 310.

While the merging, if required, increases the sorting time and complexity, the overall sorting is sped up in the embodiments of Figs. 2 and 3 by parallelizing sorting in the PBS circuitries 140 and merging already sorted blocks in the driver. A multi-core or multiprocessor CPU may be used to speed up merge operations. In an embodiment, merging is done pairwise on two already sorted input arrays, comparing the first element from each array at a time. Whichever element would be next in line in the wanted order is removed from its input array and placed last in the new output array, and the next element in its input array is considered next. If there are more than two input arrays to be merged, they are all first merged pairwise. Merging is then (recursively) repeated for the longer output arrays until only one output array remains. The length of each output array is the sum of the lengths of its input arrays. For example, merging four input arrays of length k may comprise first merging pairwise arrays {1,2} and {3,4}, and then merging the resulting two output arrays of length 2k. Alternatively, merging may be performed similarly on more than two input arrays. After the entire PBS input array 112 is sorted (with optional merging in step 570), the PBS output array 118 is formed and written to memory in step 580. After the sorting has been completed and the PBS output array 118 data are stored in the memory 110, the PBS DD 122 sends in step 580 a sorting completion signal to the PBS application 124. In case the PBS application 124 performs the merging in step 570, the sorting completion signal is sent correspondingly earlier, indicating completion of the sorting operations that are performed by the PBS circuitry 140 and possibly by the PBS DD 122. In case the PBS circuitry 140 returned the number of steps needed for the sort to the PBS DD 122, that number may now be written to the PBS result 116.
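The pairwise merging described above can be sketched as follows (an illustrative sketch; function names are ours, not from the disclosure):

```python
def merge_two(a, b):
    """Merge two already sorted arrays by repeatedly taking the
    element that is next in line in the wanted (ascending) order."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])  # one input array is exhausted;
    out.extend(b[j:])  # append the remainder of the other
    return out


def merge_blocks(blocks):
    """Merge sorted blocks pairwise ({1,2}, {3,4}, ...) and repeat on
    the longer outputs until only one output array remains."""
    while len(blocks) > 1:
        blocks = [
            merge_two(blocks[i], blocks[i + 1]) if i + 1 < len(blocks)
            else blocks[i]  # odd block out waits for the next round
            for i in range(0, len(blocks), 2)
        ]
    return blocks[0] if blocks else []
```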
In step 590, the PBS application 124 uses the output array 118 to sort the original data set into the wanted order in linear time. In this step, any other data, i.e. associated or accompanying data fields, are retrieved based on the location data 412 and written (with the key values 410) so that an array, table or data structure of the desired form of the original data set is achieved. In another embodiment, the PBS application 124 leaves the original data set as is and uses the resulting PBS output array 118 as a new data structure through which the original data set is easily accessed in the wanted order.

Let us further review Fig. 2. If the key sets are very large, sorting can be sped up with many PBS circuitries 140. The PBS circuitries can be controlled by the same PBS DD 122 that first spreads the work by assigning to each PBS circuitry 140 its own block to sort, and then merges the output arrays from the PBS circuitries 140 before returning control to the calling PBS application.

Reviewing further the Fig. 3 alternative, multiple PBS circuitries 140 are implemented as separate devices under a higher-level PBS device controller 310. The device driver would now communicate with the PBS device controller 310, or with a plurality of PBS device controllers 310 as shown in Fig. 3, rather than with individual PBSs. In this case the device controller 310 may have more intelligence and memory, and the PBS device controller 310 may perform the merge operations for the PBSs that it controls. To this end, the PBS device controller may comprise an (optionally multi-processor and/or multicore) processing unit. Either way, it is advantageous to implement individual PBS circuitries as large as possible, because sorting in a PBS has time complexity O(n), whereas the merge has time complexity O(n log(n/N)). The embodiment of Fig. 3 expands this distributed processing to using multiple PBS device controllers 310, each with their own set of PBS circuitries 140, wherein the PBS device controllers may merge the sorted outputs of their PBS circuitries 140 and the final merge may then be left to the PBS DD 122. In one example embodiment, the PBS circuitry is integrated with the CPU 120 on the same chip. However, the number of PBS registers, i.e. the PBS size (N), is then much smaller than if a dedicated PBS chip is formed (e.g., more than 1,000,000 as opposed to more than 10,000,000), because only a portion of the chip area can be used to implement the PBS. On the other hand, access to memory may be even faster this way.

Various embodiments have been presented. It should be appreciated that in this document, the words "comprise", "include" and "contain" are each used as open-ended expressions with no intended exclusivity. The foregoing description has provided, by way of non-limiting examples of particular implementations and embodiments of the invention, a full and informative description of the best mode presently contemplated by the inventors for carrying out the invention. It is however clear to a person skilled in the art that the invention is not restricted to details of the embodiments presented in the foregoing, but that it can be implemented in other embodiments using equivalent means or in different combinations of embodiments without deviating from the characteristics of the invention.

Furthermore, some of the features of the afore-disclosed embodiments of this invention may be used to advantage without the corresponding use of other features. As such, the foregoing description shall be considered as merely illustrative of the principles of the present invention, and not in limitation thereof. Hence, the scope of the invention is only restricted by the appended patent claims.