Login| Sign Up| Help| Contact|

Patent Searching and Data


Title:
CIRCUIT AND METHOD FOR COMPUTING A SUM OF ABSOLUTE DIFFERENCES FAST AND EFFECTIVELY
Document Type and Number:
WIPO Patent Application WO/2006/108912
Kind Code:
A1
Abstract:
For calculating SAD values for two regular groups of multi-bit values, such as macroblocks of different video frames, there are calculated (602) differences of multi-bit values taken in pairs from corresponding locations of said two regular groups. Thus we produce uncorrected absolute differences and correction bits, which are compressed (611) to a SAD. As a part of said compressing (611), correction bits are compressed (613) initially separately from the compression (612) of uncorrected absolute differences. The compressed correction bits are fed into a number of locations in the main process of compressing (612) absolute differences.

Inventors:
VANNE JARNO (FI)
AHO EERO (FI)
HAEMAELAEINEN TIMO D (FI)
Application Number:
PCT/FI2006/000117
Publication Date:
October 19, 2006
Filing Date:
April 13, 2006
Export Citation:
Click for automatic bibliography generation   Help
Assignee:
VANNE JARNO (FI)
AHO EERO (FI)
HAEMAELAEINEN TIMO D (FI)
International Classes:
H04N7/26; G06F7/544; G06T9/00; G06F17/16; G06T; H04N
Foreign References:
US5864372A1999-01-26
US5838392A1998-11-17
EP1313323A12003-05-21
EP1315381A12003-05-28
Other References:
VASSILIADIS S. ET AL.: "The Sum-Absolute-Difference Motion Estimation Accelerator", EUROMICRO, 1998, pages 599 - 566, Retrieved from the Internet
Attorney, Agent or Firm:
BERGGREN OY AB (Helsinki, FI)
Download PDF:
Claims:
Claims
1. A circuit for calculating sum of absolute difference, hereinafter SAD, values for two regular groups of multibit values, comprising: a number of absolute difference units (602), each of the number of absolute difference units (602) being coupled to receive one multibit value from each of the two regular groups of multibit values and being adapted to produce as outputs an uncorrected absolute difference of the two multibit values and a correction bit value indicating a correction to the uncorrected absolute difference, a compression array (611) coupled to receive uncorrected absolute differences and indications of correction bit values from the absolute difference units (602) and adapted to produce a sum vector and a carry vector that together represent a sum of the outputs of the absolute difference units; characterized in that the compression array (611) comprises: a main compression tree (612) coupled to receive uncorrected absolute differences from the absolute difference units (602), said main compression tree (612) comprising a number of interconnected adders and being adapted to compress a first number of input values to a second number of output values, and a correction tree (613) coupled to receive indications of correction bit values from the absolute difference units (602), said correction tree (613) comprising a number of interconnected adders and being adapted to compress a third number of input values to a fourth number of output values, said correction tree (613) being coupled to deliver its output values to adders in said main compression tree (612).
2. A circuit according to claim 1 , characterized in that the interconnections of said adders in said main compression tree (612) set up a configuration consisting of levels, so that adders on a first level are coupled to receive at least a part of said input values of said main compression tree (612), adders on a last level are coupled to deliver at least a part of said output values of said main compression tree (612), and adders on a second and subsequent levels have inputs coupled to outputs of adders on previous levels.
3. A circuit according to claim 2, characterized in that: said first level of said main compression tree (612) comprises six adders, said second level comprises four adders, a third level comprises three adders, a fourth level comprises two adders, a fifth level comprises one adder and a sixth level, which constitutes said last level, comprises two adders, each of said adders in the main compression tree (612) comprises three inputs, a carry output and a sum output, said adders in said correction tree (613) set up a configuration consisting of levels, so that adders on a first level are coupled to receive at least a part of said in put values of said correction tree (612) and adders on a second and subsequent levels have inputs coupled to outputs of adders on previous levels, said first level of said correction tree (613) comprises two adders, each having three inputs, a carry output and a sum output, a second level of said correction tree (613) comprises two adders, one of them having two inputs, a carry output and a sum output, and the other of them having three inputs, a carry output and a sum output, a third level of said correction tree (613) comprises one adder, having three inputs, a carry output and a sum output, and said adders of said main compression tree and said correction tree are intercon nected in the way shown in figures 8 and 9.
4. A circuit according to claim 3, characterized in that said adders of said main compression tree and said correction tree are carry save adders.
5. A circuit according to claim 1 , characterized in that correction bit values from each pair of two adjacent absolute difference units (602) are accumulated into a correction bit vector before taking them to the compression array (611 ), so that the indications of correction bit values that are coupled from the absolute difference units (602) to the compression array (611) are said correction bit vectors.
6. A circuit according to claim 5, characterized in that it comprises a half adder (721) per each pair of adjacent absolute difference units (602), which half adder (721 ) is coupled to receive said correction bit values and adapted to convert said correction bit values into said correction bit vector.
7. A circuit according to claim 1, characterized in that each of said absolute difference units comprises: an adder (701 , 711) having a direct input for one multibit operand and an inverting input for another multibit operand and a sum output of as many bits as there are bits in said multibit operands, a carry output adapted to produce a bit value "1" if an operand brought to said direct input was greater than an operand brought to said inverting input and a bit value "0" otherwise, and as many XOR gates (702, 712) as there are bits in said sum output, each of said XOR gates (702, 712) having one input coupled to receive one bit of said sum output and another input coupled to receive an inverse of the value of said carry output.
8. A circuit according to claim 1 , characterized in that it comprises a minimum SAD determinator (621 ) coupled to receive SAD values produced in said compression array (611) and adapted to compare a number of consecutively calculated SAD values in order to find the smallest of them.
9. A circuit according to claim 8, characterized in that said minimum SAD de terminator (621 ) comprises: a register (1108) adapted to store a previously found minimum SAD value, a sum vector input and a carry vector input coupled to receive first and second output values respectively from said main compression tree (612), a carry save adder (1102) having three inputs, of which one is coupled to receive a sum vector through said sum vector input, one is coupled to receive a carry vector through said carry vector input and one is coupled to receive an inverse of said previously found minimum SAD value, said carry save adder (1102) having a sum output and a carry output, a first adder (1104) coupled to receive operands from said sum and carry outputs of said carry save adder (1102) and having a onebit output adapted to indicate, as a result of summing said operands, whether a SAD represented by said first and second output values from said main compression tree (612) was smaller than said previously found minimum SAD value, a logical gate arrangement (1105, 1106) adapted to produce an indication when said onebit output of said first adder (1104) indicates a SAD represented by said first and second output values from said main compression tree (612) to be smaller than said previously found minimum SAD value at an announced appro priate time for comparison, a second adder (1107) adapted to calculate a SAD from said sum and carry vectors, a multiplexer (1109) coupled to receive said indication from said logical gate arrangement (1105, 1106) and to respond to an active indication by storing a SAD calculated by said second adder (1107) into said register (1108) and to the absence of an active indication by storing anew said previously found minimum SAD value into said register (1108).
10. A circuit according to claim 9, characterized in that said minimum SAD de terminator (621) comprises: a ZEROSAD input adapted to carry an indication of whether sum and carry vectors at said sum and carry vector inputs respectively are associated with a SAD associated with two regular groups of multibit values that involve no relative motion in a frame, and a bonus input and an enabling array (1207) of logical gates, said enabling array (1207) being responsive to an indication at said ZEROSAD input by enabling the introduction of a bonus value through said bonus input to the calculation of a SAD associated with two regular groups of multibit values that involve no relative motion in a frame.
11. A circuit according to claim 9, characterized in that said minimum SAD de terminator (621 ) comprises: a threshold input coupled to receive a threshold value, a selection multiplexer (1203) adapted to controllably replace coupling a previously found minimum SAD value with coupling a threshold value through said threshold input to an input of said carry save adder (1102) and a thresholdrelated output register (1206) coupled to receive and output an indi cation when said onebit output of said first adder (1104) indicates a SAD represented by said first and second output values from said main compression tree (612) to be greater than said threshold value.
12. A circuit according to claim 11 , characterized in that: said minimum SAD determinator (621) comprises a ZERO_SAD input adapted to carry an indication of whether sum and carry vectors at said sum and carry vector inputs respectively represent a SAD associated with two regular groups of multibit values that involve no relative motion in a frame, and said selection multiplexer (1203) is responsive to an indication at said ZERO_SAD input.
13. A circuit according to claim 9, characterized in that said minimum SAD de terminator (621) comprises: a ZEROSAD input adapted to carry an indication of whether sum and carry vec tors at said sum and carry vector inputs respectively represent a SAD associated with two regular groups of multibit values that involve no relative motion in a frame, and a minimum SAD related output register (1201) coupled to receive and output an indication when said onebit output of said first adder (1104) indicates a SAD represented by said first and second output values from said main compression tree (612) to be greater than said previously found minimum SAD value at a time when sum and carry vectors at said sum and carry vector inputs respectively do not represent a SAD associated with two regular groups of multibit values that involve no relative motion in a frame, said time being other than an announced appropriate time for comparison.
14. A digital video encoder for encoding a digital video signal, characterized in that it comprises a circuit according to claim 1.
15. A method for calculating sum of absolute difference, hereinafter SAD, values for two regular groups of multibit values, comprising: calculating (602) differences of multibit values taken in pairs from corresponding locations of said two regular groups, thus producing uncorrected absolute differences and correction bits and compressing (611) calculated uncorrected absolute differences and correction bits to a SAD; characterized in that it comprises: as a part of said compressing (611), compressing (613) correction bits initially separately from the compression (612) of uncorrected absolute differences and feeding compressed correction bits into a number of locations in the main process of compressing (612) absolute differences.
16. A method according to claim 15, characterized in that it comprises taking correction bits from the calculation (602) of two adjacent absolute differences and accumulating (721 ) said correction bits into a correction bit vector before taking them into the compression (613) of correction bits.
17. A method according to claim 15, characterized in that it comprises comparing a number of consecutively calculated SAD values in order to find the smallest of them.
18. A method according to claim 17, characterized in that it comprises promoting a SAD associated with two regular groups of multibit values that involve no relative motion in a frame.
19. A method according to claim 17, characterized in that it comprises comparing a SAD associated with two regular groups of multibit values that involve no relative motion in a frame to a threshold value and making decisions about further operation depending on whether said SAD is smaller or larger than said threshold value.
Description:
Circuit and method for computing a sum of absolute differences fast and effectively

TECHNICAL FIELD

The invention concerns the technical field of digital video encoding. Especially the invention concerns the task of determining a motion vector, which reveals, how much and into which direction a macroblock of a digital video frame has moved from the closest matching macroblock in the reference frame.

BACKGROUND OF THE INVENTION

Motion estimation is a known way of reducing temporal redundancy between successive video frames, and widely used in common digital video encoding methods. A frame of a digital video stream is divided into non-overlapping blocks commonly referred to as macroblocks. Concerning each macroblock a reference frame is studied in order to find a best matching previous macroblock, i.e. a macroblock of the reference frame that is as similar as possible to the present macroblock concerning pixel content. As a part of encoding the present frame, there is generated a motion vector that tells, how much and into which direction the present macroblock has moved from the best matching previous macroblock. Additionally a difference signal is needed, which tells, how much and in which way the present mac- roblock differs even from the best matching previous macroblock.

A common size for a macrobiock is 16 times 16 pixels. Strictly speaking the common definition of a macroblock means a 16x16 luminance block as well as two spatially corresponding 8x8 chrominance blocks, but since the luminance and chrominance components correlate strongly, it is customary to handle only luminance data in motion estimation.

Motion estimation is easily the most costly part of digital video encoding, because of the large amount of required calculational operations and the relatively large re- quirement for chip space, which is needed for the hardware elements that perform said calculational operations. An advantageous architecture for motion estimation should be possible to implement with as few calculational operations and as few logical gates as possible. Tradeoffs may be required, so that a certain increase in

chip space requirements may be acceptable if a significant gain in computation speed is achieved, and vice versa.

Fig. 1 illustrates a commonly accepted so-called three-stage pipeline solution, which is a general arrangement used for determining the SAD (sum of absolute differences) of a present macroblock and a candidate macroblock. The SAD is a measure of how closely the candidate macroblock matches the present macroblock, i.e. whether the candidate macroblock could qualify as the best matching previous macroblock. The first stage consists of a row of absolute difference units 101. A number of pixels are read from the present macroblock and a candidate macroblock, and spatially corresponding pixel values are fed in pairs to the absolute difference units. As already the name suggests, each absolute difference unit 101 calculates the absolute difference of the two pixel values it received. The summing of these is accomplished in a second stage, which is commonly desig- nated as the compression array 102. It calculates the sum of all absolute differences it receives from the first stage and delivers the result to a third stage, which is the minimum SAD determinator 103. During the process of comparing the present macroblock to a number of candidate macroblocks, the minimum SAD determinator 103 keeps track of which of them resulted in the smallest SAD, i.e. which was the most similar to the present macroblock.

Motion estimators based on the general solution of fig. 1 are known for example from the prior art publications US 5,864,372 and US 5,838,392. In the following we analyse some of their more detailed features.

Fig. 2 illustrates the so-called Chen's absolute difference unit, which is here used as an example of a known absolute difference unit 101. It adds together an operand A (which in this example is an eight-bit value a 7 ,a 6 ,...,ao) and the two's complement of an operand B (here b 7 ,b 6 ,...,bo). Two's complementation of B is achieved by bitwise inverting B and bringing a default value '1' to a carry-in input of the adder 201. If operand A is initially larger, the adder produces directly a correct positive result and asserts a carry (c 7 =Υ), which is inverted to C=O. If operand B was initially larger, the adder 201 yields a two's complement negative result, which is converted to positive in a row of XOR output gates 202 which this time receive a C 7 = C = 1 to their other input. The C value constitutes also a correction bit, which is forwarded to the next stage of the SAD determination to be taken into account there. The apostrophe in the notation 1 ABS is a reminder of the fact that the correction bit C is missing from the result.

Fig. 3 illustrates an adder tree known from a scientific publication Q. Shu and H. Chen: "An efficient implementation of motion estimation algorithms", in Proc. 4th Int. Conf. Solid-State and Integrated Circuit Technology, Oct. 1995, pp. 697-699. The adder tree was actually only a predecessor of real compression array solutions, but it provides a very intuitive way of understanding the task of the second stage illustrated in fig. 1. Absolute differences of pixels in a present macroblock and a candidate macroblock are fed in pairs to first-line adders 301. The summing proceeds through second-line adders 302 and third-line adders 303 to the output adder 304. Each of the adders 301 , 302, 303 and 304 receives also a selected one of the correction bits in order to take them into account. The output of the adder tree consists of the final sum 'SADO..15 and the remaining correction bit C15.

Fig. 4 illustrates a compression array structure known from US 5,864,372, here H- lustrated as a simplified version that only receives four absolute differences 1 ABSO to 'ABS3 and the corresponding correction bits CO to C3 as inputs. The compression array 401 comprises a number of cascaded 4:2 compression unit pairs like the one shown in the partial enlargement 402. Going towards the most significant bits there comes a limit after which compression unit pairs are replaced with somewhat simpler circuits. Each of said cascaded 4:2 compression unit pairs and said simpler circuits produces one bit of a SAD_S vector, which is a sum vector, as well as one bit of a SAD_C vector, which is a carry vector. The sum vector SAD_S and the carry vector SAD_C are temporarily stored in output registers 404 and 403 respectively and fed back as additional input information to the compres- sion array 401.

Fig. 5 illustrates the minimum SAD determinator proposed in US 5,838,392. A feature of this solution is that it never actually calculates the real SAD values, but handles only the sum vectors (the SAD_S vectors) and the carry vectors (the SAD__C vectors) separately. The minimum values of these are stored in the sum register 501 and the carry register 502 respectively. A combination of a 4:2 compression unit 503 and an adder 504 carries out a two's complement subtraction between the current and minimum SAD values, so that a carry bit obtained at a carry out output of the adder 504 represents a sign of the subtraction result. A con- trol input is needed to determine the correct moment for the comparison. If the carry bit from the adder 504 is '0' during comparison, the current SAD value is smaller than the previously stored minimum SAD, so the current SAD gets stored into the registers 501 and 502. An inverter 505 and two logic ports 506 and 507

take care appropriately controlling the storing process. In addition, motion vector input data (shown in fig. 5 as (k,l)) is stored in a register 508. A carry bit value '1' during comparison means that the current SAD has exceeded the minimum SAD. A notification "end_tag" is available to indicate such a situation and can be used e.g. to terminate SAD calculation at an early stage.

Disadvantages of the known solutions include excessive use of chip space, delays in calculation as well as suboptimal overall performance.

An objective of the present invention is to present a circuit and a method for SAD calculation and minimum SAD finding with superior performance and area efficiency.

The objective of the invention is achieved with an architecture in which correction bits are taken to specific handling quite early in the calculation process, preferably already from the absolute difference units, and only combined with the handling of absolute differences quite selectively in the compression stage.

A circuit according to the invention is characterized by the features recited in the characterizing part of the independent claim directed to a circuit.

A method according to the invention is characterized by the features recited in the characterizing part of the independent claim directed to a method.

Additionally the invention is directed to a digital video encoder, which is characterized by the features recited in the characterizing part of the independent claim directed to a digital video encoding apparatus.

A first aspect of the invention is an advanced absolute difference unit, in which one's complement representation for one of the operands is used, instead of two's complement representation, which was common in the earlier solutions. A carry bit is produced in the absolute difference unit and used to bitwise invert the result if necessary, but neither a low-performance end-around carry chain nor any constant '1' carry-in input are needed. The carry bit constitutes also a correction bit. Correction bits produced in two adjacent absolute difference units can be combined in a half adder or a corresponding circuit unit already before taking them to the second stage of the three-stage pipeline process.

A second aspect of the invention is a second stage, i.e. a compression array, in which the correction bits are initially handled in a correction tree separate from the main compression tree used to handle the uncorrected absolute differences. A preferable configuration of the compression array is based on using CSAs (carry save adders), which can be allocated and coupled in various ways. An advantageous embodiment involves six levels of CSAs in the main compression tree and three levels of CSAs in the correction tree, so that from the second and third levels of the correction tree compressed correction information is taken to suitable inputs of CSAs on the third, fourth, fifth and sixth levels of the main compression tree. The strategy of combining the compressed correction information to the main compression tree is most advantageously formulated so that in the eventual sum and carry vectors obtained as outputs from the compression array there are as few common bit positions as possible.

A third aspect of the invention is a third stage, i.e. a minimum SAD determination unit in which a variety of early termination mechanisms can be implemented according to choice. Among these are comparing the value of a current SAD to a previous minimum SAD already during calculation, when the calculation of the current SAD is still continuing, so that exceeding the previous minimum terminates the calculation before it is completed. Additionally the value of a first, no-motion SAD can be promoted with a negative bonus decreasing its value, so that it comes statistically more probable that no motion vector at all needs to be transmitted. As a third mechanism that affects the operation of the minimum SAD determination unit there are introduced the concept of thresholds. One or more thresholds for the SAD calculation can be determined, so that every time a currently calculated SAD exceeds a threshold, certain threshold-dependent decisions are made about the further proceeding of the current SAD calculation. Contrary to the first-mentioned mechanism, a threshold is a parameter value and not the result of any previous SAD calculation.

The exemplary embodiments of the invention presented in this patent application are not to be interpreted to pose limitations to the applicability of the appended claims. The verb "to comprise" is used in this patent application as an open limitation that does not exclude the existence of also unrecited features. The features recited in depending claims are mutually freely combinable unless otherwise explicitly stated.

The novel features which are considered as characteristic of the invention are set forth in particular in the appended claims. The invention itself, however, both as to its construction and its method of operation, together with additional objects and advantages thereof, will be best understood from the following description of spe- cific embodiments when read in connection with the accompanying drawings.

Fig. 1 illustrates the known concept of three stages in SAD calculation, fig. 2 illustrates the known Chen's absolute difference unit, fig. 3 illustrates a conventional summing tree, fig. 4 illustrates a known compression array concept, fig. 5 illustrates the known Chen's minimum SAD determination unit, fig. 6 illustrates a SAD calculation circuit and method according to an embodiment of the invention, fig. 7 illustrates a pair of absolute difference units according to an embodi- ment of the invention, fig. 8 illustrates a part of a compression array according to an embodiment of the invention, fig. 9 illustrates another part of a compression array according to an embodiment of the invention, figs. 10a to 1Od illustrate the composition of irregular adder units in figs. 8 and 9, fig. 11 illustrates a minimum SAD determinator according to a simple embodiment of the invention, fig. 12 illustrates a minimum SAD determinator according to a more versatile embodiment of the invention, fig. 13 illustrates the SAD calculation circuit and method of fig. 6 adapted to the embodiments presented in figs. 7, 8, 9 and 12, fig. 14 illustrates a comparison of delay and area requirements of various SAD calculation implementations and fig. 15 illustrates a video encoder according to an embodiment of the invention.

Fig. 6 is a schematic presentation, which can be understood to illustrate a circuit and a method for a SAD implementation for motion estimation, to be used as a part of digital video encoding. In conformity with the designations used earlier, we assume that there is a present macroblock for which a motion estimate should be generated, and a number of candidate macroblocks among which there should be identified the most suitable one, i.e. basically the one with least difference to the present macroblock. Straighforward application of this basic criterion may be

slightly distorted e.g. through the introduction of "zero bonus" that promotes one of the candidate macroblocks, but such details are discussed in more detail later.

A first stage 601 comprises absolute difference units 602, each of which is adapted to determine an absolute difference of two multi-bit input values. The number of the absolute difference units 602 equals that number of pixels in the present and candidate macroblocks that can be compared simultaneously. In fig. 6 we make the common assumption that the macroblocks to be compared are 16 x 16 pixels in size and that one line (or column) of pixels from each macroblock is taken to comparison simultaneously. Consequently there are sixteen absolute difference units 602, of which only four are explicitly shown. We also make the assumption that the pixel values are expressed with eight bits, so that each of the absolute difference units 602 receives two eight-bit input values and produces an eight-bit uncorrected output value and a carry bit. It would be very straightforward to change the number of absolute difference units and/or bit inputs in each absolute difference unit if some other design parameters were used, or to select the pixels to be compared according to some other selection strategy than simply considering lines or colums.

A second stage 611 is compression array that receives the uncorrected output values and correction bits (or more generally "correction values" or just "corrections") from the absolute difference units 602 of the first stage 601. As a difference to the known Chen's compression array, the second stage comprises a main compression tree 612 and a separate correction tree 613. The former is adapted to re- ceive the uncorrected absolute difference values and sum and carry vectors compressed previously by the second stage as feedback. The correction tree is adapted to receive the corrections. Correction bit compression in the correction tree 613 results in a number of compressed correction values, which are fed to suitable points in the main compression tree 612.

As a result the second stage 611 produces sum and carry vectors, which together describe the sum of absolute difference values in the row (or column, or group) of pixels that were compared simultaneously. These are taken to a third stage 621 , which comprises a minimum SAD determinator. It accumulates row-specific (or column specific, or group specific) SADs, compares the accumulated sum to a minimum value when appropriate and produces output values, among which is the found minimum SAD. The third stage 621 receives also some control inputs and produces some additional outputs, which can be used e.g. for implementing early

termination mechanisms under the control of a separate control entity (not shown in fig. 6).

Fig. 7 illustrates an advantageous arrangement of two adjacent absolute differ- ence units 602 in the first stage. The illustrated absolute difference units are those used to produce the absolute differences of the Oth and 1st pixel pairs. The basic solution of each absolute difference unit 602 resembles that of the known Chen solution (see fig. 2), but with important differences in details. Considering the right- hand side absolute difference unit, there is an 8-bit adder 711 , which operates on an operand A (a 7 ...a 0 ) and an inverted operand B (b 7 ...bo, shown before the bit- inverting). The 8-bit adder 711 applies one's complement representation for the operand B during addition, i.e. S = A + B ~ , instead of two's complement representation. If operand A was smaller than (or equal to) operand B, the produced negative result S (S 7 ...S 0 ) is bit-inverted in the XOR ports 712 to obtain the correct posi- tive value and no additional correction bit is required (c 7 = '0'). However, if operand A was greater than operand B, the 8-bit adder 711 produces a correct positive result S minus one. Since the required end-around carry addition does not change the output carry (c 7 ), immediate end-around carry addition is unnecessary for this case. Hence, the carry bit (c 7 = '1') generated by the adder 711 is forwarded as a correction bit and the one-bit error compensation is performed later.

The equations that describe the operation of an absolute difference unit 602 are

JA + B = A + (2 β -1 -B)≡ (A -B - i)mod(2 6 ) andC = 1 ,A > B ~ { A + B = 2 8 - 1 - (A + (2 8 - 1 - B))= B - A andC = 0 ,A ≤ B

As a result, the carry-in of the adder can be totally eliminated without any hardware overhead. This is an important difference to Chen's approach, where a constant carry-in of value '1' was required.

As a further difference to any known approach, the structure of the absolute difference unit 602 allows immediate correction bit accumulation without any increased delay. It should be noted that the left-hand side absolute difference unit is completely equal to that described above, with only the carry bit output shown on an opposite side of the 8-bit adder 701 to enable easier illustration of the immediate correction bit accumulation concept. The correction bit from the 8-bit adder 701 on the left and the correction bit from the 8-bit adder 711 on the right are taken, unin- verted, as inputs CO and C1 respectively to a half adder 721 , which produces a

two-bit correction bit vector C0..1. Similarly pairing up all absolute difference units in a 16-channel first stage means that the output of the first stage consists of sixteen 8-bit uncorrected absolute difference values 1 ABSO, 'ABS1 ,...,'ABS15 as well as eight 2-bit correction bit vectors C0..1 , C2..3 C14..15. Using half adders to generate the correction bit vectors destroys the information about the exact value of each single correction bit, but this has no adverse effects to the further processing; quite to the contrary, the generated 2-bit correction bit vectors are better suited for use in a compression array according to an embodiment of the invention.

Figs. 8 and 9 illustrate advantageous embodiments of a main compression tree and a correction tree respectively. The main compression tree of fig. 8 receives new input data ( 1 ABSO, 1 ABS 1 ,...,'ABS 15; C0..1) from the absolute difference units of the first stage and feedback data (SAD_S (pre) , SAD_C (pre) ) from its own output registers (not separately shown). Additionally there is an INIT input to be used for initializing the feedback values to zero at the beginning of the calculation of a new SAD. The feedback sum vector SAD_S (pre) has 16 bits, which after the initialization-enabling row of gates are designated as ssis, ssi 4 ,...,sso or ss 15 ..o for short. For reasons to be discussed in more detail below the feedback carry vector SAD_C (pre) has three bits less, i.e. only 13 bits, which are similarly designated after the initialization gates as sc-i δ , sci 4 ,...,sc 3 or SC 15 .. 3 . It should be noted that all subindex numbers like for example those in notation "SC 15 .. 3 " signify the corresponding bit positions. In all cases where a notation refers to multiple bit positions, these are given in the subindex from the most significant to the least significant bit position in question. On the contrary, the full scale numbers in correction bit vector notations like C0..1 or C12..13 only refer to the absolute difference units from which the correction bits were taken to the half adder that produced the correction bit vector in question.

In fig. 8, the input and output bit widths of the CSAs are marked with an MSB... LSB notation (most significant bit ... least significant bit). In all CSAs the left-hand output produces partial carries and the right-hand output produces a partial sum. Most of the CSAs have three inputs, with the exception of one two-input CSA in the correction tree. The CSAs 801 , 802, 803 and 804 with exceptional in- put widths have been marked with an asterisk and their composition is illustrated with the help of regular CSA and port structures in figs. 10a, 10b, 10c and 10d respectively. In these marked CSAs the input bit stages that receive bits from all three inputs are implemented with full adders, whereas half adders are used for bit

stages dealing with only two inputs. The unmarked, three-input CSAs are entirely constructed from full adders.

The topmost level in the main compression tree has six CSAs having the following inputs and outputs:

Table 1 : CSAs numbered from left to ri ht

The second level in the main compression tree has four CSAs having the following inputs and outputs:

Table 2: CSAs numbered from left to ri ht

The third level in the main compression tree has three CSAs having the following inputs and outputs:

Table 3: CSAs numbered from left to right

The fourth level in the main compression tree has two CSAs having the following inputs and outputs:

Table 4: CSAs numbered from left to right

The fifth level in the main compression tree has one CSA having the following inputs and outputs:

The sixth level in the main compression tree has two CSAs having the following inputs and outputs:

Table 6: CSAs numbered from left to right

Left input Middle input Right input Left output Right output

CSA 1 *bits 15..12: "bit 12: MSB *bits 15..12: "bits 15..13 *bits 15..12

The topmost level in the correction tree has two CSAs having the following inputs and outputs:

The second level in the correction tree has two CSAs having the following inputs and outputs:

Table 8: CSAs numbered from left to right

The third level in the correction tree has one CSA having the following inputs and outputs:

During operation, the sum and carry bits produced by the 3-stage correction tree of fig. 9 are inserted one by one into suitable, otherwise unused CSA inputs available in the main compression tree of fig. 8. Bolded letter pairs from A to G indicate connections between outputs in the correction tree and inputs in the main com- pression tree. In addition, the bit stage numbers that receive the bits from the correction tree are bolded in fig. 8. The LSBs ss 7 .. o and sc 7 .. 3 of the conditionally initialized previous sum and carry vectors respectively are inserted to the first stage of the main compression tree as is shown in fig. 8 and table 1 above. A number of intermediate bits from these vectors, namely ssn.. 8 and scn.. 8 , are individually in- serted into the main compression tree. In fig. 8 these intermediate insertions are indicated with circles (concerning the sum vector bits) and squares (concerning the carry vector bits) around the respective bit stage numbers. The MSBs SS- 15 .. 12 and SC- 15 ..12 of these vectors are accumulated in the final stage of the main compression tree.

The compressed carry vector SAD_C (cur) has 13 bits, the 3 MSBs of which come from the left-hand output of CSA 1 on the sixth level. The rest of said bits come from the left output of CSA 2 on the sixth level. The compressed sum vector SAD_S (cur) has 16 bits, of which the 4 MSBs come from the right output of CSA 1 on the sixth level, 10 subsequent bits come from the right output of CSA 2 on the sixth level, the next bit (bit stage 1 ) is the LSB of the right output of the CSA on the fifth level and the last bit (bit stage 0) is the LSB of the right output of CSA 3 on the third level.

The compression tree structure of figs. 8 and 9 has an advantageous property of reducing the number of operands at the earliest opportunity. Although the earliest possible operand reduction causes some area overhead in a circuit implementa-

tion, it also decreases the number of common bit stages involved in the compressed sum (SAD_S (cur) ) and carry (SAD_C (cur) ) vectors. In this context a common bit stage indicates a bit stage that is not only presented in one of said vectors. Without correction bit insertions, the main compression tree would be capable of decreasing the number of common bit stages from 16 to 12. Due to the separate structures developed for the correction bit accumulation, the compression tree structure of figs. 8 and 9 is capable of decreasing the number of common bit stages from 16 to 13, which falls only one bit stage short of the theoretical maximum; the common bit stages are SS15..3 and SC 15 .. 3 . The reduction in the number of common bit stages may not necessarily increase execution speed of a final sum and carry vector addition, since the delay of a fast adder is not typically a smoothly increasing function of the adder bit width. However, the reduced number of common bit stages decreases hardware cost due to narrower adder implementation and smaller number of registers in the later stages of the motion estimator.

Immediately inserting the first correction bit vector C0..1 into one of the Oth stage CSAs in the main compression tree is possible, because the three LSBs in the compressed carry vector of the previous SAD are unused, as described above. Considering a prior art compression array like that of fig. 4, where all 16 correction bits are inserted in a 0-bit stage of CSA inputs, we may note that said prior art solution causes - when compared to a traditional CSA tree that excluded correction bit addition - an area increase nearly equal to the size of additional CSA tree of the present invention. On the other hand said prior art solution only enables presenting fewer stages of the SAD result merely with the sum vector, i.e. it involves more common bit stages in the sum and carry vectors than the present invention.

Trees utilizing other adder types than CSAs could be presented, and they could be used to decrease the number of operands early and to implement a separate correction tree, from which the results are fed into the main compression tree, like in the exemplary solution above. However, it is believed that utilizing CSAs involves certain advantages because of the relative simplicity of their structure and the regularity of the resulting adder allocation and coupling scheme.

It would be possible to implement an early termination mechanism within the com- pression tree, following the approach taken in a prior art publication US 2003/0043911 A1. The idea is to interrupt an ongoing SAD computation, if the accumulated SAD value exceeds some predefined, fixed threshold value. However, excluding 1-3 MSBs in this way would only enable diminutive area savings in

the main compression array, while excluding more bits would easily cause excessive restrictions to SAD values than can be handled. Bit width reduction still maintains the height of the compression array. A full 16-bit array implementation is a very feasible solution; any desired output bit of the accumulated carry vector can be used as an interrupt that triggers early termination.

Fig. 11 illustrates an exemplary simple implementation for a third stage, i.e. a minimum SAD determinator. We should note that the proposed implementation does not include motion vector determination, which is left on a responsibility of some other suitable circuit element. The implementation of fig. 11 also does not include any early termination mechanisms, which are discussed in more detail later.

Compressed current sum and carry vectors are received through the SAD_S and SAD_C inputs respectively. The comparison to a previously stored minimum SAD value begins by bit-inverting said previously stored minimum SAD in an inverter 1101. Using a 15-bit CSA 1102, the bit-inverted minimum SAD value is accumulated with the SAD__S and SAD_C vectors. The constant one bit addition required for two's complement representation is managed with an OR gate 1103 coupled to receive the LSBs of the bit-inverted minimum SAD and the current SAD_S value and to deliver the result as an additional LSB to augment the left (carry) output of the CSA 1102. The CSA compression yields two difference vectors dif_sis..i and dif_ci 6 ..i, which are added together with a 16-bit adder 1104. The adder may be somewhat reduced, because only the most significant sum bit (s-ι 6 ) is needed at its output. This bit indicates the sign of the subtraction.

Since a single SAD value computation spans over several cycles, a control input CMPR is needed to indicate the proper time for a SAD value comparison. A gate structure consisting of a NAND gate 1105 and an AND gate 1106 is used to take into account both the CMPR signal and a ZERO_SAD input signal, which indicates whether a first SAD value computation is taking place, so that there is no previously stored minimum SAD value. If the most significant sum bit Si 6 is 1 O' or if the ZERO_SAD bit is '1' during comparison, the 13 MSBs of the current SAD value parallelly calculated in a 13-bit adder 1107 as well as the 3 LSBs taken directly from SAD_S (because corresponding bit stages were missing from SAD_C) are stored into register 1108 as the completed new minimum SAD value. A multiplexer 1109 is used to recycle an old minimum SAD or store the new depending on the

signal NEW_MIN. An inverter 1110 inverts the incoming ZERO_SAD control bit to ensure correct operation.

A new minimum SAD value activates both two output signals SAD_RDY and MIN_RDY, stored in output registers 1111 and 1112 respectively. If the new value was not the first one and not smaller than the previous minimum value, only SAD_RDY is activated. The dependencies between the CMPR, ZERO_SAD, si 6 , SAD_RDY and MIN_RDY bits are summarized in the following table.

Table 10:

Fig. 12 illustrates a minimum SAD determinator in which several early termination mechanisms have been implemented. The basic flow of minimum SAD determination is similar to that explained above in association with fig. 11. The first early termination mechanism is based on the fact that even before the current SAD value is completed (i.e. before the CMPR control signal becomes active), if the most significant sum bit si 6 becomes 1 I 1 , the currently accumulating SAD has already exceeded a previously stored minimum SAD value. Therefore the most significant sum bit si6 is output as an additional output signal SAD_EXC through output register 1201. An AND port 1202 performs an AND between the si 6 bit and the inverted ZERO_SAD control bit so that the SAD_EXC output remains at 1 O' while the first SAD value is calculated. Compared to fig. 11 , the mechanism for detecting minimum SAD exceeding causes an area overhead of only a single AND gate and a single 1-bit register, and no delay overhead at all. The SAD_EXC output assuming value '1' is meant to cause the control algorithm to terminate the calculation of the current SAD.

Comparing a current SAD value to a fixed threshold makes sense essentially only when the first SAD value is calculated, i.e. when the ZERO__SAD control bit is active. Therefore the implementation of fig. 12 includes a multiplexer 1203, controlled by the ZERO_SAD control bit, which selects either the inverted previous minimum SAD value or an inverted threshold value obtained through the TH input and the inverter 1204 as a basis for the comparison. An AND gate 1205 and an output register 1206 are used to output a THJΞXC indication similarly as was explained earlier concerning the SAD_EXC output, with the exception that the AND gate 1205 receives the non-inverted ZERO-SAD bit at one of its inputs in order to enable giv- ing THJΞXC outputs only during the calculation of the first SAD value. Additional area overhead caused by this mechanism includes 16 NOT ports, one 16-bit multiplexer, one AND port and one 1-bit register. Delay overhead equals the delay caused by the multiplexer 1203.

The implementation of the threshold-exceeding indication supports even comparison with multiple threshold values. After a currently used threshold value has been exceeded, the subsequent larger threshold value can be fed in through the TH input during the next cycle. The one cycle delay between successive threshold value comparisons prevents the examination of the remaining larger threshold values, if the current threshold was only exceeded during the final cycle before the SAD completion. However, this is not a serious drawback since threshold values are usually only applied to speed up the algorithm execution flow. Hence, an unexam- ined larger threshold value may in a worst case cause unnecessary calculations, but a proper result is still yielded.

It would be possible to perform threshold exceeding detection completely in parallel by duplicating the comparison parts of the minimum SAD determinator for each desired threshold level. Parallel threshold value detection would reduce delay due to the avoided selection multiplexer and would remove completely the delay be- tween successive threshold value examinations. However, the associated increase in circuit area would be relatively high due to the duplicated comparison logic.

A third mechanism related to early termination is based on deliberately promoting the SAD value for the first, i.e. no-motion candidate macroblock. This involves subtracting a predefined parameter input as -ZERO_BONUS from the SAD value for the zero motion vector. In practice the -ZERO_BONUS value is coupled through an enabling array 1207 of AND gates to a CSA 1208, which sums it with the cur-

rent SAD value. The AND gates of the enabling array 1207 receive their other, enabling input from the ZERO-SAD control input, which means that a -ZERO_BONUS input only materializes as a non-zero input to the CSA 1208 if the ZERO-SAD control bit is active. In known solutions, a suitable zero bonus is e.g. one hundred with a diamond search algorithm. The -ZERO_BONUS value may be a constant fixed at the design time, or it may also be a run-time parameter controlled by the control algorithm of digital video encoding. The zero-bonus mechanism of fig. 12 introduces an area overhead of 16 AND ports as well as one 16-bit CSA, and no delay overhead.

The following table summarizes the relations of the CMPR, ZERO_SAD, si 6 , SAD_RDY, MIN_RDY, SAD-EXC and TH-EXC bits in the implementation of fig. 12.

Table 11 :

Comparing to the known minimum SAD determinator of Chen (see US 5,838,392), the arrangements of figs. 11 and 12 may be said to involve the drawback of requiring one additional 15-bit adder (or 13-bit adder in fig. 11), which is the adder 1107 used to calculate the complete value of the SAD. However, on the positive side the arrangements of figs. 11 and 12 require only about one half of the registers and selection logic used to store the minimum SAD value and only about one half of the inversion logic used to invert the minimum SAD value. Minimum SAD comparison is also faster, because a CSA+OR combination is faster than a 4:2 compressor. A narrower adder is sufficient for difference vector addition, and the actual SAD value is immediately available, which is an advantage because it is required e.g. to make the decision between INTRA and INTER coding modes.

Fig. 13 illustrates the appearance of absolute difference units according to fig. 7, a compression array according to figs. 8 and 9 as well as a minimum SAD determi- nator according to fig. 12 in the general framework previously presented in fig. 6. The absolute difference units 602 come in pairs, with a half adder 721 between each pair to generate the two-bit correction bit vectors. Registers 1301 between the first stage 601 and the second stage 611 take care of transferring the uncor- rected absolute difference values as well as the correction bit vectors. Similarly there are registers 1302 between the second stage 611 and the third stage 621 to transfer the compressed SAD_S and SAD_C values. Output registers were illus- trated in figs. 11 and 12 to be parts of the minimum SAD determinator, so they are not explicitly illustrated in fig. 13. The first stage 601 receives 32 operands, which are the 16 pixel values from each of the present and candidate macroblocks. The second stage 611 receives 26 operands, which are the 16 absolute difference values, 8 correction bit vectors and two feedback values from the outputs of the sec- ond stage 611. Additionally the second stage 611 receives a control input INIT . The third stage 621 receives the compressed SAD_S and SAD_C values, a feedback value MIN_SAD from its own output as well as a threshold value TH, a zero bonus value -ZERO_BONUS and control signals CMPR and 2ERO_SAD. The outputs of the third stage 621 are the MIN-SAD value as well as indication outputs SADJRDY, MIN_RDY, SAD_EXC and TH_EXC.

Parts of the overall architecture of fig. 13 could be replaced with differently chosen solutions. For example, if Chen's absolute difference units (see fig. 2) would for some reason be preferred as such, instead of the first stage 601 according to an embodiment of the present invention, they could be used, provided that Chen's correction bits would be transformed into correction bit vectors in an additional intermediate stage before the second stage 611. Also if Chen's compression array would be preferred to substitute the second stage 611 according to an embodiment of the present invention, that would be possible by again introducing an in- termediate stage that this time should snatch the correction bits from the first stage before their combination to correction bit vectors. Also a prior art minimum SAD determinator could be used in place of the third stage 621 according to an embodiment of the present invention, by making small modifications that as such would be easy for the skilled person to make. However, replacing parts of the novel and inventive solution explained above would inevitably result in poorer performance and less advantageous solution.

Known methods are available for deriving theoretical delay and area estimates for the invented motion estimator. The area cost for n-input basic gates is assumed to be n cost-units (CUs), expect for XOR and XNOR gates having 2n CU cost. In turn, the delay cost for all the applied gates is supposed to be 1 τ. Registers, inter- connections as well as fan-out / fan-in related issues are usually excluded from the calculations. Due to these assumptions, the achieved results are better for comparison purposes than for standalone absolute metrics.

Adders typically represent a major part in delay and area costs. Therefore, three types of adders possessing different properties may be considered: RCAs (ripple carry adders), CLAs (carry look-ahead adders), and CSAs (carry save adders). In addition, 2-input gate versions (CLA(2)) besides n-input gate versions of CLAs (CLA(n)) are separately analyzed, since the impact of the available gate width is essential for CLA. The difference between CLA versions is that n-input gates used in CLA(n) are replaced with trees of 2-input gates in CLA(2). Worth noticing is that straightforwardly replacing n-input gates with trees of 2-input gates leaves space for CLA(2) area optimization. However, such optimization is not taken into account in the calculations here.

As a baseline, the delay and area costs for a single half adder are assumed to be 1 τ and 6 CU, whereas the corresponding metrics are 3 τ and 14 CU for a full adder. In the analysis, all the adders are assumed to include only the logic, which is required to produce the requested output. That is, logic related to unused input and output ports is removed.

The upper and lower charts in Fig. 14 illustrate the theoretical delay and area estimates for the different estimator units. A column with simple diagonal hatch represents an RCA/CSA implementation, a white column represents CLA(2)/CSA implementation, a cross-hatched column represents CLA(n)/CSA implementation and a column with double diagonal hatch represents CSA implementation. In each unit, adder type exchange occurs merely between RCAs and CLAs. These adders are adders 701 and 711 in the absolute difference unit of fig. 7 and adders 1104 and 1107 in the minimum SAD determination unit of fig. 11. The other adders are in each case implemented with CSAs.

Let us first consider the evaluated realizations of the absolute difference unit. In all the cases, the area cost is calculated for 16 units. The invented CLA-based unit is the most recommended solution for high-performance applications, whereas the

invented RCA-based unit provides the most area efficient implementation. In turn, the presented compression array unit including only CSAs is a very effective solution in terms of execution speed and area efficiency. Finally, comparison between RCA- and CLA-based minimum SAD determinators reveals that delay increase in RCA-based implementations is far more significant over area savings. Therefore, the CLA-based minimum SAD determinator is the recommended implementation.

Based on the analysis results, two efficient pipelining schemes are proposed for the preferred units. The first one follows the scheme presented in Fig. 13. In that 3-stage scheme, the RCA-based absolute difference units are adequately fast to be assigned to the first stage. The pipeline stages are theoretically well balanced, since each involved unit has combinational delay close to 19 τ. An average of CLA(2) and CLA(n) theoretical delay estimates is used for the CLA-based configurations. The latency of the architecture is three clock cycles. Another proposed pipelining scheme includes merely two pipeline stages. The execution speed of the second scheme is increased by utilizing CLA-based absolute difference units. According to theoretical delay metrics, the optimal location for the available intermediate pipeline stage is before the final CSA stage of the compression array. As a result, both of the pipeline stages have theoretical combinational delay of ap- proximately 23 τ.

The area and timing results based on logic synthesis are provided for the proposed 3-stage and 2-stage architectures. Synopsys Design Compiler is a known tool used in ASIC synthesis, and we may assume that the applied technology is a 0.18-micron CMOS process. Adders included in the proposed architectures are implemented with special Synopsys DesignWare components. Strictly speaking, RCAs and CLAs are realized with the known DW01_rpl and DW01_cla components, respectively. In turn, CSAs are constructed from 1-bit DW01_rpl adders. The area metrics (cell count) are based on equivalent 2-input NAND gates, whereas the delay values represent the critical path in the pipelined architectures. Registers are also included in the delay and area metrics.

With the selected technology, the proposed 3-stage SAD architecture is capable of operating at a frequency of 780 MHz. Implementing this maximum frequency ar- chitecture costs less than 5600 NAND gates. Obviously, lowering operating frequency requirements increases area efficiency of the architecture. The same architecture operating at 580 MHz costs 4200 NAND gates. In addition, 380 MHz operating frequency is still reached with the architecture having the minimum at-

tainable area of approximately 4000 NAND gates. On the other hand, the proposed 2-stage architecture can be clocked at 700 MHz at a cost of 7400 NAND gates. In turn, 520 MHz and 360 MHz operating frequencies are reached with 4900 and 3900 NAND gates, respectively. To conclude the examination, the pro- posed 3-stage architecture is a recommended implementation for high throughput and area efficient systems, in which three cycle latency is acceptable. In turn, the proposed 2-stage architecture is an advocated solution for low latency high-end systems.

Fig. 15 illustrates a video encoder according to an embodiment of the invention. An input video signal 1501 is taken to a motion estimator 1502, which performs the macroblock comparisons in order to find out the motion vectors. The last- mentioned are taken to a motion compensator 1503, which essentially outputs the appropriate macroblocks from a previous frame which a motion vector points at. The difference between the original frame and that indicated by the motion vectors is calculated in an adder 1504, the output of which thus constitutes the prediction error. DCT coefficients are calculated for the prediction error in the DCT encoder block 1505 and the calculated coefficients are quantized in the quantizer 1506. From there the quantized DCT coefficients are taken to the Huffman / Run-length encoder 1507, the output of which constitutes the encoded video signal 1508.

For the purposes of motion compensation, inverse quantizing and DCT decoding are performed in the appropriate blocks 1509 and 1510. An adder 1511 and a temporary frame store 1512 are needed to feed the feedback information appro- priately to the motion compensator 1503. There is a coupling from the output of the motion compensator 1503 to the adder 1511. A coupling from the temporary frame store 1512 to the motion estimator 1502 provides the latter with information about the contents of the reference frame.

A control unit 1513 is provided to control the operation of the video encoder. One of the main responsibilities of the control unit 1513 is making decisions about the coding mode to be selected for each frame: intra coding (also known as independent or self-sufficient coding) for frames that would be difficult to code by prediction from other frames, and various forms of inter coding for predictable frames. The control unit 1513 also executes all kinds of software- or firmware-based control routines that have an effect on the coding process. The functions of a control unit 1513 are not necessarily implemented in a single, concentrated circuit element but they can be distributed to two or more distributed control sub-units.

The present invention mainly affects the motion estimator 1502, which in a video encoder according to an embodiment of the invention comprises the absolute difference units, the compression array as well as the minimum SAD determinator that were described earlier. However, the invention has also an impact on the control unit 1513, especially if the early termination criteria described in association with figs. 12 and 13 above are utilized. Simple time-based control signals like CMPR, INIT and ZERO-SAD may come from a timer entity within the motion estimator 1502, or they may be outputs of the control unit 1513 directed to the mo- tion estimator 1502. Parameter values that can even be programmable, such as TH and -ZERO_BONUS, typically come from the control unit 1513 or from registers controlled by the control unit 1513.

The outputs SAD_RDY, MIN_RDY, SAD_EXC, TH_EXC and MIN_SAD are pref- erably directed from the motion estimator 1502 to the control unit 1513, which makes control decisions based on the values of said outputs. For example, an activated SAD_EXC output bit informs the control unit 1513 that the currently calculated SAD has already exceeded a previously calculated minimum SAD, so that the control unit 1513 may command the motion estimator 1502 to abort the calcu- lation before completion and to move on to calculating the next SAD. The control unit 1513 may also have been programmed to monitor the MIN_SAD values it receives from the motion estimator 1502, so that for example the inter/intra coding decisions are based on whether SAD minimum values are within certain limits or not. It is within the capability of a person skilled in the art to formulate the structure and operation of the control unit 1513 based on the explanations above.

It should be noted that the description above contains a number of illustrative examples which do not limit the applicability of the invention. For example, the invention is by no means limited to comparing just 16 pixels from two macroblocks of consecutive image frames. The number of pixels to be compared simultaneously may be something different, the frames to be compared may have some other relationship than being consecutive, and the "pixels" may be something else than elementary parts of a two-dimensional graphical image. As a straightforward generalization, the designation "macroblock" can be replaced with "regular group of multi-bit values", so that the invention is applicable to all kinds of cases where it should be found out, how similar one such regular group is to another. The concept of a "pixel" can be understood as describing a dimensionless entity, like a sample of a digital voice signal to be compared to another digital voice signal, or

describing an entity with three or more dimensions, like a "voxel" or "volume pixel" which is the atomic constituent of three-dimensional digital images. Pixel values in image frames are always positive, which can be taken as a postulate in an image processing application; if the suggested solution would be applied to some other purpose where the sign of the values to be compared could vary, some preprocessing would be needed in order to ensure correct operation.