Login| Sign Up| Help| Contact|

Patent Searching and Data


Title:
A PARALLEL MULTI-PIPELINE SYSTOLIC ARRAY FOR COMPLEX SINGULAR VALUE DECOMPOSITION ON A MULTI-PROCESSOR DEVICE
Document Type and Number:
WIPO Patent Application WO/2011/161202
Kind Code:
A2
Abstract:
The present invention refers to the field of wireless communications and more specifically to the design of a Parallel Multi-Pipeline Systolic Array (PMSA) architecture that implements Complex Singular Value Decomposition (CSVD) on a multi-processor device. CSVD is an important matrix factorization technique of a complex matrix P into a product of three matrices, i.e. a Hermitian unitary matrix of eigenvectors U, a diagonal matrix of eigenvalues Σ and a matrix of eigenvectors V, with several applications in Digital signal processing and wireless communications. The apparatus to perform the CSVD includes n sets of blocks, with n being a multiple of three, with each set of blocks comprising a first hardware block (60, 13), a second hardware block (70, 14) other than the first hardware block and a third hardware block (15) other than the first hardware block and the second hardware block; whereby a) the first hardware block (60, 13) of each of the n sets of blocks includes means to transform a 2x2 square matrix to a 2x2 upper triangular square matrix, b) the second hardware block (70, 14) of each of the n sets of blocks includes means to transform a 2x2 upper triangular matrix to a to a 2x2 diagonal square matrix c) the third hardware block (15) of each of the n sets of blocks includes means to swap and exchange elements of matrices. Further the apparatus includes means to transmit the output of the i-th set of blocks, with i equals 1 to n-1, to the i+1 set of blocks, and means to process a plurality of consecutive matrices simultaneously.

Inventors:
DIMITRIOU ATHANASIOS (GR)
PERISSAKIS STYLIANOS (GR)
PANAYIOTOPOULOS ILIAS (GR)
Application Number:
PCT/EP2011/060527
Publication Date:
December 29, 2011
Filing Date:
June 22, 2011
Export Citation:
Click for automatic bibliography generation   Help
Assignee:
INTRACOM S A TELECOM SOLUTIONS (GR)
DIMITRIOU ATHANASIOS (GR)
PERISSAKIS STYLIANOS (GR)
PANAYIOTOPOULOS ILIAS (GR)
International Classes:
G06F17/16
Domestic Patent References:
WO2011038940A12011-04-07
Other References:
N. D. HEMKUMAR, J. R. CAVALARO: "A Systolic VLSI Architecture for Complex SVD", IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS (ISCAS, May 1992 (1992-05-01), pages 1061 - 1064
Attorney, Agent or Firm:
SAMUELIDES, Emmanuel (Athens, GR)
Download PDF:
Claims:
CLAIMS

1 . Apparatus for Connplex SV Decomposition of either a single or a number of consecutive matrices, to an either a single or a number of consecutive Hermitian unitary matrices of eigenvectors, a number of consecutive diagonal matrices of eigenvalues) and a number of consecutive unitary matrices of eigenvectors, including at least one set of blocks, each set comprising a first hardware block (60, 13), a second hardware block (70, 14) other than the first hardware block and a third hardware block (15) other than the first hardware block and the second hardware block; whereby

the first hardware block (60, 13) includes means to transform a 2x2 square matrix to a 2x2 upper triangular square matrix,

the second hardware block (70, 14) includes means to transform a 2x2 upper triangular matrix to a to a 2x2 diagonal square matrix,

the third hardware block (15) includes means to swap and exchange elements of matrices

the apparatus comprises means to link the first hardware block to the

second hardware block, so that the output of the first hardware block is inputted to the second hardware block and the apparatus further comprises means to link the second hardware block to the third hardware block, so that the output of the second hardware block is inputted to the third hardware block.

2. Apparatus according to claim 1 , including n sets of blocks, with n being a multiple of three, with each set of blocks comprising a first hardware block (60, 13), a second hardware block (70, 14) other than the first hardware block and a third hardware block (15) other than the first hardware block and the second hardware block; whereby

the first hardware block (60, 13) of each of the n sets of blocks includes means to transform a 2x2 square matrix to a 2x2 upper triangular square matrix, the second hardware block (70, 14) of each of the n sets of blocks includes means to transform a 2x2 upper triangular matrix to a to a 2x2 diagonal square matrix

the third hardware block (15) of each of the n sets of blocks includes means to swap and exchange elements of matrices

the apparatus includes means to transmit the output of the i-th set of blocks, with i equals 1 to n-1 , to the i+1 set of blocks, and

the apparatus includes means to process a plurality of consecutive matrices simultaneously.

3. Apparatus according to claim 1 or claim 2, whereby the means to swap and exchange elements of matrices are wiring connections.

4. Apparatus according to claim 1 or claim 2, whereby the means to swap and exchange elements of matrices are software means.

5. Method for Complex SV Decomposition of a number of at least two consecutive matrices, to a number of consecutive Hermitian unitary matrices of eigenvectors, a number of consecutive diagonal matrices of eigenvalues and a number of consecutive unitary matrices of eigenvectors), whereby

each consecutive matrix undergoes a number of iter identical iterations, with the output of the j-th iteration with j = 1 to iter-1 being input to the j+1 iteration,

each iteration includes a first step and a second step other than the first step, with the output of the first step being input to the second step, when a matrix is processed in the second step of its i-th iteration, with /' equals 1 to iter, the following matrix of the consecutive matrices is processed in the first step of its i-th iteration and

each one of the first step and the second step are performed in distinct

hardware blocks, so that the first step is performed in a first hardware block and the second step is performed in a second hardware block, other than the first hardware block, and the output of the first step is input to the second block.

6. Method according to claim 5, whereby each iteration includes a third step other than the first step and the second step, with the output of the second step being input to the third step and the output of the third step of the l-th iteration with / being equal to 1 to iter-1 being input to the first step of iteration 1+1, and whereby /'ter is a multiple of three.

7. Method according to claim 6, whereby the third step is performed in a third block other than the first block and the second block and the output second step is input to the third block.

8. Method according to claim 6 or claim 7, whereby the third step performs a swap and exchange of matrices' elements.

9. Method according to any of the claims 5 to 8, whereby the first step performs a transformation of a 2x2 square matrix to a 2x2 upper triangular square matrix and the second step transforms a 2x2 upper triangular matrix to a to a 2x2 diagonal square matrix.

Description:
A Parallel Multi-Pipeline Systolic Array for Complex Singular Value Decomposition on a Multi-Processor Device

The present invention refers to the field of wireless communications and more specifically to the design of a Parallel Multi-Pipeline Systolic Array (PMSA) architecture that implements Complex Singular Value Decomposition (CSVD) on a multi-processor device. CSVD is an important matrix factorization technique of a complex matrix P into a product of three matrices, i.e. a Hermitian unitary matrix of eigenvectors U , a diagonal matrix of eigenvalues ∑ and a matrix of eigenvectors V , with several applications in Digital signal processing and wireless communications.

A method for CSVD decomposition is known from "N. D. Hemkumar, J. R. Cavalaro, "A Systolic VLSI Architecture for Complex SVD" IEEE International Symposium on Circuits and Systems (ISCAS), May 1992, Page(s):1061- 1064". According to this document the inversion matrix is partitioned such that a dedicated processor element is allocated to each sub-matrix thus increasing significantly the time needed to compute a single matrix without allowing for a pipelined approach to optimally process a number of consecutive matrices.

The object of the invention is to propose a method and an apparatus, which is more efficient in terms of execution speed than the conventional mesh connected systolic array architecture such that it can decompose fast and efficiently a number of consecutive matrices on a multi-processor array device.

The invention is defined in the independent claims 1 and 5.

The proposed architecture decomposes the CSVD problem into several parallel pipeline branches, where each pipeline branch is dedicated to process a sub-matrix of each one of the consecutive input matrices. Each sub-matrix of each input matrix input is supplied to the corresponding pipeline branch and is subjected to two passes or steps, Pass 1 and Pass 2, and a further swap and exchange step. Each branch at each pass implements a specific systolic array structure thus forming a pipeline which is fed by the corresponding sub- matrix of each consecutive input matrix. The rotational matrices computed in the diagonal pipelines at each pass, are supplied as known data to the off- diagonal pipelines. Pass 1 performs QR decomposition and Pass 2 completes the diagonalization on each of the diagonal sub-matrices while performing the necessary computations on the off-diagonal sub-matrices using the rotational matrices calculated at each individual pass.

The method and apparatus according to the invention which implement a PMSA architecture achieve a significant throughput advantage in comparison to known methods. In the proposed PMSA architecture the pipeline streams are implemented on a large array of interconnected processor elements. This allows for optimum design partitioning and eliminates the need for a dedicated swap & exchange unit by replacing these operations with simple signal interconnections. Using this technique it is possible to achieve optimized performance and significant increase in throughput such that when the previous matrix P(fc - l) is being processed in Pass 2, the current matrix p(&) is simultaneously processed in Pass 1 . The current embodiment of the design uses the CORDIC algorithm and achieves a CSVD for each consecutive matrix within about 5 sec (800 cycles of 160MHz clock) given a certain pipeline latency which is insignificant for a large number of matrices. The throughput can be further enhanced by applying lookup tables at the expense of memory or precision. The dependent claims define particular embodiments offering further advantages.

Preferred embodiments of the invention will be described with reference to figures 1 to 10. In particular:

Figure 1 presents the three successive steps conceptually comprising a ∑- sweep, i.e. the sequence of steps for the calculation of the diagonal matrix∑. Figure 2 presents a single∑-sweep of a∑-PMSA structure.

Figure 3 presents a single U-sweep of a U-PMSA structure, i.e. the sequence of steps for the calculation of matrix U.

Figure 4 presents a single V-sweep of a V-PMSA structure, i.e. the sequence of steps for the calculation of matrix V.

Figure 5 presents the systolic array structure corresponding to Pass 1.

Figure 6 presents the systolic array structure corresponding to Pass 2.

Figure 7 presents the signals required to perform the swap and exchange algorithm and the connection of the previous sweep to the following sweep for the computation of the eigenvalue diagonal matrix∑(&) .

Figures 8 and 9 present the signals required to perform the swap and exchange algorithm and the connection of the previous sweep to the following sweep for the computation of the eigenvector unitary matrices U(k) and

V(k) .

Figure 10 presents the complete PMSA architecture comprising in total nine sweeps.

We assume a complex NxN matrix which is partitioned into 2x2 compl sub-matrices of the form

where, superscript kl is the corresponding sub-matrix. The SVD factorization of a complex matrix P iZ into a product of three matrices where are unitary matrices of singular vectors and is a diagonal matrix of singular values which correspond to sub-matrix kl. The Jacobi algorithm is commonly used due to its numerical stability and high degree of parallelism. The two sided rotation method makes use of a sequence of plane rotations to diagonalize the matrix We calculate

Such that, where are calculated to produce a QR decomposition of P

where, and,

and are such that a real diagonal matrix is produced

where, and,

The first transformation multiplies P with a hermitian Jacobi rotation on the left and a complex Jacobi on the right to produce P iZ' such that the lower left component is nulled and the lower right component is real.

The second transformation completes the diagonalization by nulling the upper diagonal element

The method is applicable to any NxN matrix, which may be partitioned into 2x2 sub-matrix units. As an example, a sequence of 4x4 matrices Y(k) is considered, such that each matrix Y(k) is partitioned to sub- matrices P U (£),P 12 (£),P 21 (£),P 22 ( ) . A sequence of three steps namely, Pass 1 , Pass 2, Swap and Exchange as shown in Figure 1 is referred to as a∑- sweep and there exists one pipeline per sub-matrix as shown in Figure 2. As shown in Figure 10 a sweep is composed from a∑-sweep, a U-sweep and a V-sweep. The∑-sweep calculates the diagonal singular value matrix∑, the U- sweep calculates the unitary U matrix and the V-sweep calculates the unitary V matrix.

The number of sweeps must be a multiple of three and it is determined that an acceptable convergence is attained after nine sweeps for the specific embodiment 110 as shown in Figure 10.

Figure 2 presents a single∑-sweep of a∑-PMSA structure which accepts as input a number of consecutive 4x4 matrices v(k) partitioned to sub-matrices

P u (£),P 12 (£),P 21 (£),P 22 ( ) and performs CSVD to calculate the diagonal eigenvalue matrix∑(£) .

The∑ k '(k) matrix singular values can be obtained from the∑-sweep following the subsequent procedure:

PASS 1 : As shown in Figure 5 first we calculate the angles required to obtain equation 5. Figure 5 shows the implementation of Pass 1 . Figure 5 displays the connection of each Processor Array Element (PAE) of the hardware device, corresponding to the implementation of PASS 1 . The two PAE's named VEC 6 are operating in CORDIC vectoring mode. Each take as input a complex number and output the angle and magnitude of each of the complex elements corresponding to the bottom row of the 2x2 sub-matrix. These are subsequently forwarded to PAE ATAN1 7 which operates in CORDIC Vectoring mode to produce equation 9 and also calculates equation 8. The angles are forwarded to the ROT PAE 8 which operates in cordic rotational mode such that it computes the sine and cosine of the input angle. Subsequently PAE MA1 9 performs the left and right transformations in equation 7 on sub-matrix P U ( ) to produce matrix P U' ( ) and computes the left and right transformation matrices in equation 14 associated with Pass 1 . The left and right transformation matrices that correspond to each diagonal sub-matrix P U ( ) are

The same applies to sub-matrix P 22 ( ) which in parallel produces matrix P 22' ( ) using a separate structure of the form in 60 with corresponding inputs and outputs as shown in Figure 2. The left and right transformation matrices that correspond to diagonal sub-matrix P 22 ( ) are

The calculated rotation matrices together with P 12 ( ) and P 21 ( ) are subsequently forwarded to PAEs Passl MA3 (Multiply-Accumulate) 13, which in parallel perform the following transformations

PASS 2: As shown in Figure 6 first we calculate the angles required to obtain equation 6. Figure 6 shows the implementation of Pass 2. Again, the two PAE's named VEC 10 operating in CORDIC vectoring mode, each take as input a complex number and output the angle and magnitude of each of the complex elements corresponding to the top row of the 2x2 matrix from the Pass 1 output. These are subsequently forwarded to PAE ATAN2 11 which operates in CORDIC Vectoring mode to produce equation 1 1 and also calculates equations 12 and 13. The angles are forwarded to the ROT PAE's which operate in cordic rotational mode such that they compute the sine and cosine of the input angles. Subsequently PAE MA2 12 performs the left and right transformations in equation 10 on sub-matrix P u' ( ) to produce matrix P u" ( ) as shown in Figure 2. It also computes the left and right transformation matrices in equation 18 associated with Pass 2. The left and right transformation matrices corresponding to the diagonal sub-matrix P u' ( ) are

The same implementation applies to sub-matrix P 22' ( ) which in parallel with the previous produces matrix P 22" ( ) as shown in Figure 2. The left and right transformation matrices that correspond to diagonal sub-matrix P 22' ( ) are

The calculated rotation matrices are subsequently forwarded to PAE Pass 2 MA3 14 in Figure 7 corresponding to sub-matrix which

perform in parallel the next set of transformations

SWAP & EXCHANGE ELEMENTS: Subsequently, the elements within each sub-matrix are swapped together as shown in Figure 1. In the

horizontal elements are swapped amongst themselves. In the vertical elements are swapped amongst themselves and in the diagonal elements are swapped amongst themselves. Sub-matrix remains

unaffected at this stage. The elements in Figure 7, which shows the∑

sweep, where, superscript kl is the corresponding sub-matrix and subscript /) ' is the corresponding element within sub-matrix kl, are swapped and exchanged in 15 between sub-matrices according to the following rules as shown in Figure 1

and,

Thus, the method constructs the 4x4 matrix 21 illustrated in Figure 1 under "Exchange Elements". Subsequently, the singular vectors U k '(k) and \ k '(k) are obtained as shown in Figure 3 and Figure 4 correspondingly, where each figure corresponds to a single U-sweep and V-sweep respectively.

Figure 3 presents a single U-sweep of a PMSA structure with input a number of consecutive 4x4 matrices u(k) partitioned to sub-matrices U U ( ),U 12 ( ),U 21 ( ), U 22 ( ) and performs CSVD to calculate the eigenvector unitary matrix U(k) . The first U-sweep in the pipeline stream has constant input the identity matrix l 4x4 and subsequent sweeps take as input the previous sweep output. The rotational matrices calculated in 20 are

supplied to the corresponding systolic structures in 40.

Figure 4 presents a single V-sweep of a PMSA structure with input a number of consecutive 4x4 matrices \(k) partitioned to sub-matrices

\ n (k),\ l2 (k),\ 2l (k),\ 22 (k) and performs CSVD to calculate the eigenvector unitary matrix V(k) . The first V-sweep in the pipeline stream has constant input the identity matrix l 4x4 and subsequent sweeps take as input the previous sweep output. The rotational matrices R , Re calculated in 20 are supplied to the corresponding systolic structures in 50.

The U k '(k) and \ k '(k) matrices are originally both equal to the identity matrix and subdivided into sub-matrices V n (k),V l2 (k),V 2l (k), V 22 (k) and X 11 (k),X 12 (k),X 21 (k),X 22 (k) . At each pass, the complex rotational matrices in equations 14, 15, 18, 19 generated in a sweep are forwarded to the corresponding U kl (k) , X kl (k) sub-matrix PAE MA4 4, 5 correspondingly in Figure 8 and Figure 9. The following calculations are performed at each U and V sweeps.

and,

Finally the elements in Figure 7 where, superscript kl is the

corresponding sub-matrix and subscript /) ' is the corresponding element within sub-matrix kl, are exchanged in 17, 19 between sub-matrices according to the same rule shown in Figure 1.

Figure 10 displays the current SVD implementation. The precision requirements at the expense of hardware allocation allow for up to nine sweeps.