


Title:
SYSTEM AND METHOD FOR DATA COMPRESSION
Document Type and Number:
WIPO Patent Application WO/2021/173874
Kind Code:
A1
Abstract:
System and method for hardware compression and decompression comprises identifying a search buffer size according to a hardware requirement and a desired compression ratio, and breaking up a search buffer into independent searchable blocks such that they can be routed on silicon and meet a cycle-time requirement. The system uses a search control and collation module comprising a pattern search module, a precode module in communication with the pattern search module, an encode module in communication with the precode module, and an output buffer in communication with the precode module.

Inventors:
NORRIE PAUL MAXWELL (NZ)
MERLIN DENISE NINA (US)
Application Number:
PCT/US2021/019734
Publication Date:
September 02, 2021
Filing Date:
February 25, 2021
Assignee:
NZIP TECH INC (US)
International Classes:
H04N11/04
Foreign References:
US20030138158A12003-07-24
US20100223237A12010-09-02
US20140132429A12014-05-15
US10224957B12019-03-05
Attorney, Agent or Firm:
STATTLER, John (US)
Claims:

1- A method for hardware implemented data compression, the method comprising: breaking a run-length search into independently searchable blocks where at least one of the number of blocks and their lengths is controllable, performing a run-length encoding scheme to extend the capability of a hardware implementation, performing a decoding of run-lengths, and performing a recoding to simplify the decoding process without compromising the compression rate.

2- A system for hardware implementation of data compression, the system comprising: a search control and collation module comprising a pattern search module, a precode module in communication with the pattern search module, an encode module in communication with the precode module, and an output buffer in communication with the precode module.

3- The system of claim 2, further comprising a recode module in communication with the precode module and the output buffer.

4- A data compression method comprising: identifying a search buffer size according to a hardware requirement and a desired compression ratio, and breaking up a search buffer to independent searchable blocks such that they are routed on silicon and meet a cycle-time requirement.

5- The method of claim 4, wherein a match buffer contains data that is compared with contents of search blocks to find the largest run-length and its corresponding location in the search block.

6- The method of claim 5, wherein match information is transmitted outside of a search block where it is aggregated with match information of other search blocks where the best overall match is selected.

7- The method of claim 6, wherein the aggregation of match information combines multiple smaller detected run-lengths into a larger single run-length.

8- The method of claim 4, wherein search blocks are aggregated into search groups where each search group consists of a power of 2 number of search blocks.

9- The method of claim 7, wherein a match buffer contains data that is compared with contents of search blocks to find a largest run-length and its corresponding location in the search block; the match information being transmitted to the search group where it is aggregated with the match information of the other search groups, and wherein the best group match is selected.

10- The method of claim 9, wherein the aggregation of match information combines multiple smaller detected run-lengths into a larger single run-length.

11- The method of claim 4, wherein data to be compressed is progressively moved through the search buffer in the manner of a shift register such that it contains a window of the most recent bytes of the data.

12- The method of claim 11, wherein the most current data unit to be compressed is contained in a match buffer used to compare with subsets of the search buffer in search blocks.

13- The method of claim 12, wherein the match buffer is equal to or less than the size of the largest run-length detectable.

14- The method of claim 4, wherein a search buffer within the search block is equal to or less than the size of the largest run-length detectable.

15- The method of claim 4, wherein a set of possible compression instructions is formed from: a largest run-length found; a number of bytes that have been repeated; a compression instruction that is a repeat of the prior compression instruction; and a literal byte.

16- The method of claim 15, wherein the selection of the preferred compression instruction from the set of possible compression instructions is the instruction that describes the largest amount of data.

17- The method of claim 16, wherein the preferred compression instruction is Huffman encoded to reduce the number of bits of the instruction.

18- The method of claim 17, wherein the size of the Huffman table is reduced by combining entries of low frequency into a single encompassing entry with higher combined frequency.

19- The method of claim 17, wherein the Huffman-encoded instructions are decoded to reconstruct the compression instructions used for reconstructing the original uncompressed data.

20- The method of claim 18, wherein the reconstructed compression instructions are recoded into a sequence of simpler instructions such that they are suitable for routing on silicon and meeting a cycle-time requirement.

Description:
Title: System and Method for Data Compression

Reference to Related Document

The present application claims priority to US Provisional Patent Application No. 62981535, filed on February 26, 2020.

Field

Present embodiments relate to lossless hardware compression and decompression of generic data.

Background

There are many compression algorithms implemented in software, such as gzip and PKZIP. Most of these are suitable for software implementation but difficult for hardware implementation, since they take advantage of modern computer programming paradigms but not hardware design constraints.

There is an increasing need to compress and decompress data in digital hardware systems so as to benefit from the dramatic speed increases offered by hardware, both in an FPGA and ASIC implementation.

Present embodiments describe efficient hardware design of lossless compression and decompression.

Summary

The present embodiments provide, in one aspect, a data compression method comprising: identifying a search buffer size according to a hardware requirement and a desired compression ratio, and breaking up a search buffer into independent searchable blocks such that they can be routed on silicon and meet a cycle-time requirement.

The present embodiments provide, in another aspect, a system for hardware implementation of data compression, the system comprising: a search control and collation module comprising a pattern search module, a precode module in communication with the pattern search module, an encode module in communication with the precode module, a recode module in communication with the precode module, and an output buffer in communication with the recode module.

Background

Paramount to the RFCs and many compression algorithms are two key concepts:

● Run-length encoding
● Huffman encoding

In run-length encoding, à la RFC 1951, a byte stream is replaced by a run-length stream. An unencoded byte may be replaced by an encoded literal-byte, or a sequence of unencoded bytes may be replaced by a run(length, distance). For example, the unencoded byte-stream 1234123 can be replaced by literal-bytes 1, 2, 3, 4 and run(3,4), where run(3,4) means "copy the 3 bytes from the unencoded stream starting 4 bytes back".
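By way of illustration only (not forming part of the claimed embodiments), the expansion of such a run-length stream can be sketched in Python; the symbol representation here is a hypothetical one chosen for the example:

```python
def expand(symbols):
    # Expand a run-length stream back into the unencoded byte stream.
    # A symbol is either a literal character or a ("run", length, distance)
    # tuple meaning "copy `length` bytes starting `distance` bytes back".
    out = []
    for s in symbols:
        if isinstance(s, tuple):
            _, length, distance = s
            for _ in range(length):
                out.append(out[-distance])  # byte-by-byte copy allows overlap
        else:
            out.append(s)
    return "".join(out)

# The example from the text: literals 1, 2, 3, 4 followed by run(3,4)
print(expand(["1", "2", "3", "4", ("run", 3, 4)]))  # 1234123
```

Copying byte-by-byte also handles the overlapping case, where the run length exceeds the distance.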

Run-lengths generally turn out to be quite common and can replace a significant sequence of unencoded bytes with a single run(length, distance) symbol. This alone can offer significant data compression. In compression systems, there is a finite set of allowed run(length, distance) symbols.

Huffman compression recodes the run-length encoding so that more frequent symbols have a shorter Huffman code. For example, ASCII "e" is an 8-bit binary symbol but may be recoded as a 4-bit symbol because of its prevalence, while "z" may be recoded as a 12-bit symbol. This exceeds its original 8 bits, but the rarity of "z" makes its longer recoded length less relevant, and the prevalence of "e" makes its shorter recoded length more relevant.
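As an illustrative sketch (the frequencies are invented for the example), the standard Huffman construction assigns code lengths by repeatedly merging the two least-frequent trees; each merge deepens its symbols by one bit:

```python
import heapq

def huffman_lengths(freq):
    # Sketch of Huffman code-length assignment: merge the two
    # least-frequent trees until one remains; every merge adds
    # one bit to the code length of each symbol it contains.
    heap = [(f, i, [s]) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    length = {s: 0 for s in freq}
    order = len(heap)  # tie-breaker so tuples never compare the symbol lists
    while len(heap) > 1:
        f1, _, s1 = heapq.heappop(heap)
        f2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:
            length[s] += 1
        heapq.heappush(heap, (f1 + f2, order, s1 + s2))
        order += 1
    return length

# A prevalent "e" receives a shorter code than a rare "z".
print(huffman_lengths({"e": 40, "t": 20, "a": 15, "z": 1}))
```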

The RFCs describe sophisticated methods for generating run-length codes, and for forming Huffman codes that can be either static or dynamic, in that the Huffman code can be changed on the fly in response to the changing character of the unencoded data. These schemes offer great variability for software implementations, where the degree of compression can be improved by allowing more computing power to be expended. While great for software implementation, this is impractical for hardware designs.

Present embodiments present novel implementations of some of the principles of the RFCs, with various novel methods, to create a new system for compressing and decompressing data that is amenable to an efficient hardware implementation.

Detailed Description of the Embodiments

The following embodiments relate to hardware implementation of data compression and decompression. Many of the embodiments according to the present disclosure are presented by way of example. It should be clear to a person skilled in the art that one can implement variations of these embodiments without departing from the spirit and the scope of the present disclosure.

The RFCs allow run-length distances to be as high as 32 kbytes, or a quarter of a Mbit. However, it is not practical for a high-speed digital circuit to rapidly search a quarter of a Mbit of data for a run-length match. Accordingly, some of the present embodiments effectively compress data even with as little as 256 bytes of data in the run-length search. Figure 1 shows some of the steps taken in the present embodiments, which are based on the following:

1. Searching data for any possible run-length match at any possible distance within the search data results in so many wires that the design cannot be reduced to working silicon. Some of the present embodiments, in contrast, after identifying a clock cycle at step 101 and receiving data at step 111, at step 121 break the searches into independently searchable blocks where the number of wires and their length are controllable, to ensure they can be routed on silicon and meet the cycle-time requirements. How these blocks find run-lengths is one aspect of the present embodiments.

2. According to some embodiments, a novel run-length encoding scheme makes it possible to easily extend the capability of a hardware implementation without any fundamental redesign of any system components.

3. According to some embodiments, at step 122, a Huffman scheme different from the RFCs allows new features to be added to the run-length encoding so as to improve upon the resulting compression. Such features, according to present embodiments, are novel and therefore not compatible with the RFCs and represent an improvement in compression.

4. According to some embodiments shown in step 123, the decoding of run-lengths includes a recoding to simplify the decoding process at step 124 without compromising the compression rate.

5. According to some embodiments, a novel manner is envisaged according to which all the above components are pieced together to form a cohesive system.

Search 1

According to some embodiments, uncompressed data is shifted into a match buffer, conceptually one byte at a time. The width of the match buffer (in bytes) determines the largest run-length allowed in a particular implementation. The remainder of the process does not start until the match buffer is full.

According to some embodiments, shown in figure 2, when new data is received and once the match buffer is full at step 221, two operations occur:

1. A search is performed as described in the following.

2. As new data is shifted in, the data shifted out of the match buffer is shifted into the search buffer so as to avail itself as historic data for future searches.

The search buffer is logically a single large shift register whose length is a tradeoff between compression results and practicality for hardware implementation. For example, in some embodiments 512 bytes is a reasonable length, and in some embodiments 2048 bytes will lead to a larger design with better compression.

In some embodiments, the search buffer is partitioned into a power-of-2 number of groups, for example 8, called search groups. Each search group is also partitioned into a power-of-2 number of blocks, for example 8, called search blocks. Accordingly, in this embodiment, there would be 64 search blocks. The number of bytes in a search block represents the largest match length obtainable for that block. For example, if a particular design is chosen to support a maximum run-length of 8, then the match buffer and search blocks would all be 8 bytes. In this embodiment, 64 search blocks of 8 bytes represent 512 bytes of search buffer.
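The partition arithmetic of this example embodiment can be checked with a short sketch; the constants mirror the 8 x 8 x 8 configuration described above:

```python
GROUPS = 8            # power-of-2 number of search groups
BLOCKS_PER_GROUP = 8  # power-of-2 number of search blocks per group
BLOCK_BYTES = 8       # bytes per search block = maximum run-length here

search_blocks = GROUPS * BLOCKS_PER_GROUP
search_buffer_bytes = search_blocks * BLOCK_BYTES
print(search_blocks, search_buffer_bytes)  # 64 512
```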

First the system performs the search at step 222, and all blocks do this (search the data in the block with respect to the data in the match buffer to find the longest run-length) and pass their results to the next step 223, which chooses the best search result. In the embodiment with search groups, each search group chooses the best run-length of all the search blocks in its group, then the best run-length between all the groups is determined. This is a divide-and-conquer approach: if there are 64 search blocks, it is not practical for hardware to choose the best run-length between all 64 search blocks. Instead, each search group only has to choose between 8 search blocks, then the system chooses the best of the 8 search groups.
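The divide-and-conquer selection above can be sketched as follows (illustrative only; the block identifiers are hypothetical labels):

```python
def best(results):
    # results: (run_length, identifier) pairs; the longest run wins
    return max(results, key=lambda r: r[0])

def search_tree(groups):
    # Each search group first picks its own best block result; the
    # search top then only chooses among the few group winners.
    return best([best(blocks) for blocks in groups])

# Two groups of three blocks each:
groups = [[(2, "g0b0"), (5, "g0b1"), (3, "g0b2")],
          [(4, "g1b0"), (1, "g1b1"), (0, "g1b2")]]
print(search_tree(groups))  # (5, 'g0b1')
```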

In some embodiments, it is preferable for the last search group to not implement its last search block, since this may lead to a more efficient Huffman encoding scheme. In the given example, this would mean there are only 63 search blocks, or 504 bytes of search history. In some embodiments, it is beneficial for the width of the match block to be less than the width of the individual search blocks. This allows additional run-length codes beyond the aforementioned literal-byte and run(length, distance) symbols. The process continues until all the data to be compressed has been passed through the match buffer, and at step 224 the data is shifted out.

The match buffer has been described as a single entity. However, as seen in figure 3, some embodiments allow it to be replicated. At step 311, each search block or search group may contain its own local copy of the match buffer, and all the match buffers are identical, as seen in step 313. Thus, when a byte of data to be compressed is received, it is sent to all the match buffers. The reason for this is that a single match buffer cannot be connected to the logic of all the search blocks without creating problems for a hardware implementation. If each search group maintains a single or multiple local copies of the match buffer, the length of wires and the number of loads on each wire is kept to within reasonable hardware design tolerances. The incoming uncompressed data stream is fanned out to all the match buffers. The fan-out tree does not need to be limited to a single clock cycle: adding cycle points for physical design reasons is only a latency impact, not a data bandwidth impact nor a compressibility impact. Most significant bytes are shifted out at step 315 to the next search block.

In these embodiments, a search is performed accordingly. The search is shown in step 314. The proximal match buffer is compared with all the search blocks within the group. If the largest allowed run-length is 8, then the searches would simultaneously seek matches of 0, 1, 2, 3, 4, 5, 6, 7, and 8 consecutive bytes at every possible starting position within each search block. For a given run-length, r, being sought, only the most significant r bytes of the match buffer are involved in the searches. This is because the most significant bytes of the match buffer are the next to be moved out and, hence, are the next to be run-length encoded.
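One plausible byte-string rendering of a single search block's search (a simplified sketch, assuming match[0] is the most significant byte of the match buffer):

```python
def block_search(match, block):
    # For each starting position in the search block, count how many of
    # the most significant match-buffer bytes agree in sequence; keep
    # the longest run found and where it starts.
    best_len, best_start = 0, 0
    for start in range(len(block)):
        r = 0
        while r < len(match) and start + r < len(block) and match[r] == block[start + r]:
            r += 1
        if r > best_len:
            best_len, best_start = r, start
    return best_len, best_start  # (run_length, starting index in block)

print(block_search("abcd", "xxabcayz"))  # (3, 2)
```

In hardware, all of these comparisons run simultaneously rather than in a loop; the sketch only shows the selection rule.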

In some embodiments, each search selects the largest run-length detected and transmits that result, along with the byte index in the search block that identifies either the start or end of the run, to the search group. If there are multiple largest matches (for example, search block bytes 0-2 and 4-6 both match and have a run-length of 3), then it is not functionally important which result is transmitted. However, one result may be favored over another due to a simplified hardware design, which includes the decoding of these symbols when the data is decompressed. There is also the question of whether the starting index or the ending index of the search buffer, called the offset of the run-length, is transmitted. In some embodiments, there are some optional compression improvements, discussed later, that are simplified by transmitting the starting index. For example, if the 0-2 run-length is chosen over the 4-6 run-length, the starting index is 0.

In some embodiments, a search group receives the search results from each of its associated search blocks. From those results, it selects the largest run-length and transmits it, along with the offset and a block id, to the top of the search tree. The transmitted offset is a concatenation of two identifiers, a modified block id and the offset within the winning search block. The search blocks are numbered 0 through b, where a lower number is given to a search block closer to the start of the search chain. This number, from the winning search block, forms the least significant bits of the modified block id. Similarly, the search groups are numbered 0 through g, where a lower number is given to the search group closer to the start of the search chain. This forms the most significant bits of the modified block id. The modified block id and offset identify the "distance" component of the run(length, distance) symbol.
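Interpreting that concatenation numerically (an assumption of this sketch; the patent describes it only as bit concatenation), the distance can be formed as:

```python
def distance(group_id, block_id, offset, blocks_per_group=8, block_bytes=8):
    # Modified block id: group number in the most significant bits,
    # block number in the least significant bits; the offset within
    # the winning block occupies the lowest bits of the distance.
    modified_block_id = group_id * blocks_per_group + block_id
    return modified_block_id * block_bytes + offset

# Group 2, block 3, offset 5 in the example 8x8x8 configuration:
print(distance(2, 3, 5))  # 157
```

With power-of-2 group, block, and byte counts, the multiplications are pure bit-shifts, which is why the fields concatenate cleanly in hardware.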

In some embodiments, the top of the search tree, called the search top, receives the search results from each of its associated search groups. From those results, it selects the largest run-length and its associated offset and block id. These represent the winning run(length, distance), although it is yet to be encoded into a symbol and may not represent the winning symbol that then goes onto the Huffman encoding. Nevertheless, this winning information is passed along to the run-length encoder and then onto the Huffman encoder.

In some embodiments, an optional feature of the search group is to analyze ways of combining run(length, distance) symbols from two adjacent search blocks. For example, the last two bytes from one search block combined with the first two bytes from the next search block may provide a run-length of 4 rather than 2 adjacent run-lengths of 2. This optional feature is another reason that it is better to indicate the starting index of a match, because the ending index may be in a different search block. Thus the search group can combine two run(length, distance) symbols by just adjusting the length and not modifying the starting index or associated block id. This method can increase the compression ratio at the expense of logic that is only trivially more complex. But part of this complexity is that the search blocks also need to provide partial run(length, distance) results to the search group. In the example given, the second search block was also required to look for matches at the start of its buffer against bytes in the match buffer that were not the most significant - in this case, it would have detected a 2-length match of bytes 0 and 1 of its search buffer with bytes 2 and 3 of the match buffer. This can then be combined with the 2-length match that the previous search block was already doing when it compared the upper two bytes of its search buffer with bytes 0 and 1 of the match buffer.
By combining these two 2-length matches, it is as if the first matching search block had a copy of the first two bytes of the next search block and could directly detect the 4-length run. In contrast, the case of each search block having a copy of, or access to, all or some initial subset of the next search block's buffer, and including this in its comparisons, may lead to a more difficult design. The approach in the present embodiments keeps all the matching logic in each search block without wires crossing from one search block to another or additional flip-flops to extend the width of the local search buffer. Instead, the combining operation can be performed at a different level of aggregation, and ideally in a different clock cycle, which makes it easier to meet cycle-time constraints.

There are a few additional points regarding the search, according to some embodiments shown in figure 4:

1. At step 411 the search starts, and the buffer is filled in step 412. When the match buffer is full, there is a point in time when there are no valid bytes of data in the search buffer, and the system checks for valid bytes at step 413. At step 414 the system checks whether there are valid bytes. As the clock cycles go by, the search buffer has more and more valid bytes until it is full at step 415. At that point, data shifted out of the search buffer at step 416 is simply discarded.

2. It is important to note that a non-valid byte in the search buffer causes a no-match for any and all comparisons in which it is a participant.

3. The search does not have to operate in one single cycle, as seen in figure 5. It can be a pipelined operation. For example, there can be a cycle point in the search blocks, search groups, and search top:

● Cycle 1, step 511: Byte 1 moves into the distributed match buffers, creating match-buffer state 1.
● Cycle 2, step 512: The search blocks analyze match-buffer state 1. Byte 2 moves into the match buffers, creating match-buffer state 2.
● Cycle 3, step 513: The search groups analyze search block state 1, the search blocks analyze match buffer state 2, and the next byte creates match buffer state 3.
● Cycle 4, step 514: The search top analyzes search group state 1, the search groups analyze search block state 2, the search blocks analyze match buffer state 3, and the next byte creates match buffer state 4.

● Cycle 5, step 515: The search machine's pipeline is now full, and the pipeline continues as understood in the art.

4. The pipeline does not degrade processing bandwidth; it only adds the latency of a small number of clock cycles. But it allows a faster clock cycle and a far easier place-and-route of the FPGA or ASIC.
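The latency-versus-bandwidth behavior of the four-stage pipeline above can be modeled with a toy sketch (the integer "states" are placeholders for match-buffer states):

```python
from collections import deque

def run_pipeline(states, stages=4):
    # Model the 4-stage pipeline (match buffer -> search blocks ->
    # search groups -> search top): each state advances one stage per
    # cycle, so a result emerges `stages` cycles after its byte entered,
    # yet one new state still enters every cycle (full bandwidth).
    pipe = deque([None] * stages, maxlen=stages)
    results = []
    for s in states:
        pipe.appendleft(s)        # new match-buffer state enters stage 1
        results.append(pipe[-1])  # what the search top emits this cycle
    return results

print(run_pipeline([1, 2, 3, 4, 5]))  # [None, None, None, 1, 2]
```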

Search 2

The method described in embodiments of "search 1" above is generally suitable for cases of 8 or fewer search blocks and run-lengths of 4 or less. For these cases, search-1 is a smaller design than search-2. Search-1 can have a reduced compression ratio compared to search-2, but when a small design is important, search-1 can generate impressive results when used within the present embodiments.

However, search-1 may not scale well as the search blocks carry more than 8 bytes or the run-lengths exceed 4 bytes. This is due to the geometrically increasing number of comparators required. At some point this leads to a lower maximum clock frequency, and then to a design that cannot be successfully placed and routed on silicon.

Search-2 is for designs seeking higher compression ratios via larger search-blocks and/or longer run-lengths.

As an example showing how some embodiments of search-2 work, assume the maximum run-length is 3 and the block size is 5 bytes. Every search block will have a buffer of run-length plus block-bytes = 3 + 5 = 8 bytes, numbered 0 through 7, where 0 holds the most recent byte entering the compression circuit. Using non-powers-of-2 simplifies the example without departing from the spirit and the scope of the present disclosure.

The buffer is right-shifted on every cycle a new byte of uncompressed data arrives. Two bytes from two different sources are shifted into different parts of a search block's buffer, and 2 bytes are shifted out from two different parts of the buffer. In particular, a newly arriving uncompressed byte is shifted into buffer[0] of all the search blocks, and the data right-shifted out of buffer[2] (the run-length - 1 position) is discarded. Daisy-chain data from the prior search block's buffer byte, buffer[7], is shifted out and becomes shifted-in data for buffer[3] (the run-length position) of the next sequential search block, thus creating a daisy chain linking all the search blocks' buffer[3..7]'s together. For the very first search block, its shift-in data for buffer[3] is the shift-out data of its own buffer[2]. How this shift scheme operates will become more evident as we continue.

In the following diagram, c2 (cycle 2) is the first cycle in which the full run-length has entered the lower 3 bytes of the buffer. Let's call that subset of the buffer the match buffer. In c0, C arrived; in c1, D arrived; and in c2, E arrived. So in c2 we have the most recent arriving bytes CDE, and A and B have been finished with. In the upper 5 bytes of the buffer is data that has been daisy-chained in. Let's call that subset of the buffer the search buffer. The lower-case letters of the search buffer are from different parts of the incoming data stream than the upper-case letters of the match buffer. The upper-case letters of the match buffer are the most recently arriving data, and the lower-case letters are older data, even older data for the next search block in the daisy chain. Here is the starting configuration for this diagram:

    012 34567
c0  CBA edcba
c1  DCB fedcb
c2  EDC gfedc  <= cycle 2, most recently arriving data is CDE
c3  FED hgfed
c4  GFE ihgfe
c5  HGF jihgf
c6  IHG kjihg
c7  JIH lkjih

Cycle 2 has a full run-length of arriving data, so our example will start from there. The objective of cycle 2 is to find all the possible run-lengths of 2 or more that include E, the most recent byte, anywhere in the search buffer, which holds whatever bytes cdefg represent.

Here is the same diagram showing what match buffer bytes are compared with what search buffer bytes in order to find these run-lengths:

    012 34567
0   CBA edcba
1   DCB fedcb
2   EDC gfedc   Eg Df Ce : Eg Df : Ef De Cd : Ef De : Ee Dd Cc : Ee Dd : Ed Dc
3   FED hgfed
4   GFE ihgfe
5   HGF jihgf

"Eg of Ce" means E is compared to g, D is compared to f, and C is compared to e. If they all match then the run-length is 3. If only "Eg of" match then the run-length is 2. The diagram does not show a "Eg" only case since a run-fength of 1 is not usable.

In c3, a new set of comparisons has been added:

Rewriting the comparisons in a shorter manner and adding the comparisons for other cycles:

So in cycle 2, "E[gfed]" means E is compared with g, f, e, and d, etc. This is the same information as in the previous diagram for cycle 2, except it only shows what comparisons are done and not how the run-lengths are formed.

An important point according to these embodiments is that the matches for c2 were formed during c0, c1, and c2. During c0 the matches for C are performed. During c1 the matches for D are performed and the C results are copied down. During c2 the matches for E are performed and the C and D results are copied down:

Thus, c2 has all the match data required for run-length analysis by only having the first set of comparators.

Replacing these with the byte numbers, we see the only comparators are 0[3456]. As it gets copied down to the next cycle, it naturally becomes the 1[4567] match, then the 2[567] match:

According to these embodiments, only element 0 of the match buffer is required, since elements 1-2 have their implied matches staged from the previous cycle. Therefore, the physical match buffer is actually only 1 element. Removing the unneeded elements saves a lot of flip-flops, especially since the match buffer started out being as many bytes as the longest run-length, and this is required for every search block. For 64 search blocks and a run-length of 16 bytes, this reduces the flip-flop count from 8192 to 512.

However, the comparison results need to be staged to the cycle that determines the corresponding run-length, c2 in this example. This number of cycles is the run-length, 3 in this example, and the number of comparators is the block-size minus 1, 5 - 1 = 4 in this case. The numbers of flip-flops required for c0, c1, and c2 are 4 + 8 + 12 respectively, or 24 in total. Essentially it is the size of a triangle, which can be calculated using the arithmetic mean, r * (r+1)/2, where r is the run-length. But every one of these elements of the triangle is b - 1 comparisons, where b is the block-size. For a run-length of 16 and a block-size of 32, this is 4216 flip-flops per search block. That is the reason search-2 is larger than search-1. However, search-2 only has b - 1 comparators, whereas search-1 would have on the order of 500 comparisons whose results would need to be registered; but each comparison is 8 bits, and that leads to too many wires for a single clock cycle - on the order of 4,000.
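A quick check of the staging arithmetic (note that the per-cycle counts 4 + 8 + 12 total 24):

```python
def staging_flipflops(r, b):
    # r*(r+1)/2 triangle elements, each holding (b - 1) comparison results
    return (r * (r + 1) // 2) * (b - 1)

print(staging_flipflops(3, 5))    # 24, the 4 + 8 + 12 example
print(staging_flipflops(16, 32))  # 4216, the larger configuration
```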

Once in the c2 cycle of this example, all the comparison results can be analyzed. Actually, the comparison results of c2 can be registered and c3 can do the run-length analysis, which leads to a design with a higher clock frequency. For every byte position in the search buffer, an associated vector of match results has been accumulated, starting from c0 and going to c2. The largest run-length detected for each byte of the search buffer is equal to the longest unbroken sequence of matches in the vector, starting at the vector position that was originally formed in c0 and staged down to the analysis cycle, c3.
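The run-length analysis of one such match vector reduces to finding the longest unbroken prefix of matches, sketched as:

```python
def longest_run(match_vector):
    # match_vector[k] is True if the comparison formed at cycle c0+k
    # matched; the run-length is the longest unbroken sequence of
    # matches starting at the front (the c0 position).
    r = 0
    for matched in match_vector:
        if not matched:
            break
        r += 1
    return r

print(longest_run([True, True, False]))  # 2
```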

Once these have been determined, the longest run-length for the search block is selected, and that length, along with the search block byte number, is sent to the search group as part of the run(length, distance) determination.

The search group for search-2 is the same as that for search-1, except the match buffer has been eliminated from search-1's search group. The search top for search-2 is the same as that for search-1.

Run-length encoding

According to some embodiments, as seen in figure 6, the run-length encoder 614 receives the winning run(length, distance) 612 from the search top 611, where the search top contains the search groups and search blocks, which is what the uncompressed data runs through. The run-length encoder also receives a copy of the uncompressed data stream 613 as it arrives byte-by-byte and maintains its own version of the match buffer, which may be wider or narrower than those in the search machine; their width is a function of the largest run-length sought. The run-length encoder does not do run-length matching and therefore can have few bytes. It conceptually needs at least as many bytes as are required to cover the search machine latency, so that the most significant match byte corresponds to the result currently arriving from the search block.

The run-length encoder will generate one of the following rcode symbols and pass it onto the Huffman encoder 615:

1. A literal byte, called a text symbol

2. A run(length, distance), called an rlen symbol

3. A repeat-byte sequence, called an rbyt symbol

4. A repeated command, called an rcmd symbol

5. A special command, called an spcl symbol

6. A null command, which indicates no symbol is valid for the current cycle.

The run-length encoder will disqualify a run(length, distance) whose length is 0 or 1. In this case, the most significant byte of the match buffer has to be encoded as one of the other options above. A run-length of 0 is obviously not a run-length match at all. A run-length of 1 is effectively encoded as an rcode for a text symbol and therefore not required as a run-length rcode. The run-length encoder also analyzes for repeat-bytes in the match buffer. If the most significant byte of the match buffer is consecutively repeated, the length of that sequence can be encoded as an rbyt symbol. For this reason, the match buffer in the run-length encoder can be wider than what is minimally required, so as to accommodate a chosen maximum rbyt length. This feature is optional and may increase compressibility with minimal hardware impact. If the length of the rlen exceeds that of the rbyt, it is chosen in preference to the rbyt. If neither the rlen nor the rbyt qualify, then the text symbol is chosen instead. The winning rcode symbol encodes a certain number of bytes of the uncompressed data stream. This number is called the skip count and is determined as follows:

1. text symbol: 1

2. rlen symbol: the length from run(length, distance)

3. rbyt symbol: the number of repeated bytes

The skip count determines how many subsequent bytes of uncompressed data egressing the match buffer are discarded, along with any associated run(length, distance) and rbyt symbols, since they all represent bytes of the uncompressed data that have already been processed into a pending run-length symbol. A null symbol takes the place of the discarded symbols and indicates no new valid symbol for the current cycle.
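The symbol-selection policy and skip counts described above can be sketched as follows. This is an illustrative model only; the function and symbol names are hypothetical, not taken from the hardware design.

```python
# Hypothetical sketch of the winner-selection and skip-count policy:
# rlen wins if it beats the repeat-byte run, else rbyt, else a text symbol.

def select_symbol(rlen_length, rbyt_length):
    """Pick the winning rcode symbol kind and its skip count.

    rlen_length: length of the best run(length, distance) found (0 if none).
    rbyt_length: length of the repeat-byte sequence at the head of the
                 match buffer (0 or 1 when the byte is not repeated).
    """
    # Run-lengths of 0 or 1 are disqualified: 0 is no match at all, and
    # 1 is better expressed as a plain text symbol.
    if rlen_length <= 1:
        rlen_length = 0
    if rlen_length > rbyt_length:
        return ("rlen", rlen_length)       # skip count = run length
    if rbyt_length > 1:
        return ("rbyt", rbyt_length)       # skip count = repeated bytes
    return ("text", 1)                     # skip count = 1 literal byte
```

The skip count returned here is the number of uncompressed bytes consumed, matching the list above; subsequent cycles covered by it would emit null symbols.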

The pending rcode symbol is then compared against the prior emitted rcode symbol. If it is a duplicate symbol, it is withheld and may be replaced by a repeat-command (rcmd) rcode symbol instead. The rcmd symbol encodes how many times the same pending code would have been emitted. When the first non-duplicate code occurs, the rcode symbol emitted is an rcmd symbol along with the count. This feature is optional and may increase compressibility with minimal hardware impact.

The rcode is yet to be encoded into its final form. The null symbol has no encoding. The remaining codes (text, rlen, rbyt, rcmd, spc) are encoded with a unique static-width identifier, called an rcode, and possible additional extra bits. The rcode is what will be Huffman encoded and therefore its width is mostly irrelevant. The eof symbol is emitted after the final byte of the uncompressed data stream has been processed. The special rcode symbol can be used to indicate special actions, such as:

● Indicating an eof after the final byte of the uncompressed data stream has been processed.

● Indicating a change to the Huffman encoding scheme based on statistical analysis of the emitted run-length codes.

● Indicating whether an optional CRC of the uncompressed data is to be appended to the compressed stream and used as a decompression integrity check.

● etc.

The rcode is intended to simplify the process of the Huffman encoding. An example 10-bit encode scheme for rcodes is as follows.

● Codes 0-255: text symbol. The 8-bit byte is the encoded run-length code.

● Codes 256-263: special symbols. 256 means eof; others have optional meanings.

● Codes 264-273: rbyt symbols

● Codes 274-283: rcmd symbols

● Codes 284-613: rlen symbols

● Codes 614-1023: unused

The rbyt and rcmd symbols have an associated count. The count is broken into 10 ranges and the range id, 0 through 9, is added to the starting static identifier for the symbol, 264 and 274 for the rbyt and rcmd respectively. The bits of the count from rbyt and rcmd, with the most significant set bit removed, comprise the extra bits that will accompany the rcode symbol when transmitted to the Huffman encoder. The range table is:

For example, in this embodiment, an "rcmd 34" has a range id of 5, so its run-length symbol is 279. Bits 3:0 of 34 are 2, so the full run-length symbol is "279 and 4 extra bits with the value 2". A run-length code of 279 means rcmd <n> where n = 32 + <the next 4 bits>, which is rcmd 32+2, or rcmd 34.
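One consistent reading of this count encoding is that the range id is the position of the count's most significant set bit, and the extra bits are the remaining lower bits. The sketch below assumes that reading; the bases 264 (rbyt) and 274 (rcmd) come from the example table, while the exact extra-bit width transmitted per range is an inference here, not a statement of the patented design.

```python
# Hedged sketch: range id = position of the count's most significant set bit,
# extra bits = count with that bit stripped. Bases are from the example table.

RBYT_BASE, RCMD_BASE = 264, 274

def encode_count(base, n):
    range_id = n.bit_length() - 1          # e.g. 34 = 0b100010 -> range id 5
    extra = n - (1 << range_id)            # bits below the MSB, e.g. 2
    return base + range_id, extra

def decode_count(base, rcode, extra):
    range_id = rcode - base
    return (1 << range_id) + extra         # restore the stripped MSB
```

Under this reading, "rcmd 34" encodes to rcode 279 with extra-bit value 2, matching the worked example.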

It may not necessarily be a beneficial tradeoff to implement all ranges shown above since each range id will require a new Huffman code. A given implementation may choose to support only range ids 0-3 and limit rcmds to 15 or less. rlen symbols have an associated length, block id and offset. A given length is assigned a set of 10 ranges for encoding the block id. Therefore, an rlen of length 3 will be encoded as the starting rlen code, 284, plus 10 times the length of the run, 3, plus the block id encoding; thus 284+30 through 284+39. If the maximum run-length encoded is 32, the number of codes assigned to rlen symbols is 33*10 = 330, although the first 20 codes will never be used since rlen lengths of 0 and 1 are never encoded.

The block id is encoded similarly but not exactly the same as the count for rbyt and rcmd symbols. The table is:

For example, in this embodiment, rlen 5 block-id 8 offset 2 will be encoded as 284 + (5*10) + 3 = 337, since 284 is the base rlen code, 5*10 is the start of the code for length 5, and 3 is the range-id for a block-id of 8. The extra bits for the block-id are bits 2:0 of (8+1) = 001. The extra bits for the offset are the 3-bit offset value 2, or 010 in binary. So rlen 5 block-id 8 offset 2 is encoded as rcode 337 and extra bits 001 and 010 for the block-id and offset respectively.

As another example, suppose an rcode is 326 with extra bits 11 for the block and 100 for the offset. 326-284 = 42, so the run length is 4 and the range-id is 2, corresponding to block-ids 3+<extra-block-id>, or 3+(binary 11), or 3+3 = 6. It is worth noting that a multiplication n * 10 is simply an addition: (n<<3) + (n<<1), since this is 8n + 2n and the shift requires no logic gates.
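The example rlen layout above can be sketched as follows. It assumes the 10-range block-id table implied by the worked examples: the range id and extra bits come from (block_id + 1) with its most significant set bit stripped, and the 3-bit offset travels as its own extra-bit field. Function names are illustrative.

```python
# Sketch of the example rlen layout: rcode = 284 + 10*length + range id.

RLEN_BASE = 284

def encode_rlen(length, block_id, offset):
    b = block_id + 1
    range_id = b.bit_length() - 1
    block_extra = b - (1 << range_id)      # bits below the MSB of (block_id+1)
    # n * 10 costs only two shifted adds: (n << 3) + (n << 1) = 8n + 2n
    rcode = RLEN_BASE + (length << 3) + (length << 1) + range_id
    return rcode, block_extra, offset

def decode_rlen(rcode, block_extra, offset):
    n = rcode - RLEN_BASE
    length, range_id = divmod(n, 10)
    block_id = (1 << range_id) - 1 + block_extra
    return length, block_id, offset
```

This reproduces both worked examples: rlen 5 block-id 8 offset 2 encodes to rcode 337, and rcode 326 with block extra bits 11 decodes to run length 4, block-id 6.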

The code table is an example of how 1024 entries can be partitioned. There are many ways this can be done that give variations in the maximum rlen lengths that can be encoded, the maximum number of search blocks that can be encoded, etc.

The run-length encoder can be pipelined, and the search block pipeline can be thought of as running into the run-length pipeline to form a longer pipeline. Similarly for the Huffman encoder. Thus the entire compression is a pipeline of pipelined sub-units, thereby having a high throughput rate and a fast cycle time.

Huffman Encoder

In some embodiments, the result of the run-length encoder is a 10-bit rcode derived from its symbol table plus one or two optional extra-bit fields. The Huffman encoder will recode the rcode into a shorter-width code and append the extra bits. The number of extra bits is directly related to, and can be deduced from, the rcode (and therefore also from the rcode's associated Huffman code).

In some embodiments, the run-length encoder may use any element from its symbol table, and every one of these elements must be processed by the Huffman encoder. However, the rcode set-size can be varied depending on the objectives of a particular implementation. All implementations will support the 256 literal symbols, and these plus an eof symbol are all that is required. (In this case, the compression consists of just converting the literal symbols of all possible 8-bit text bytes into a set of Huffman encoded symbols that generally have a smaller average number of bits per byte.) But most implementations will include additional rcode symbols, such as various run-lengths, repeat bytes, etc. How many of these additional symbols will be included depends on the objectives of the implementation. (For example, higher compression rates may be achieved by including symbols that detect longer run-lengths or run-lengths from a further distance in the search history.) The Huffman encoder must be able to process all the possible rcode symbols that will be generated by the run-length encoder but no more - it does not need to support the entire 1024 rcode-symbol-space of the run-length encoder.

In these embodiments, for any appropriate subset of rcode symbols that are included in a particular implementation, the symbols in that subset have the same value as the same symbols in any other subset. In other words, if a new implementation is created by reducing the capability of an existing implementation (for example, making it smaller with possible loss in compression efficiency), the values of the rcode symbols that have to be processed by the Huffman encoder are not changed - there are just some that are no longer used. From the perspective of the Huffman encoder, nothing actually has to change - the Huffman encoder design can remain as is.

Therefore, in some embodiments, it is optional to redesign the Huffman encoder. However, certain advantages may be obtained by the effort of a redesign. These include:

1. A smaller Huffman encoder

2. A Huffman encoder that leads to better compression ratios.

The Huffman encoder converts the rcode with the appended bits from the run-length encoder into an hcode with the appended bits, where the hcode is a Huffman encoding of the rcode.

The formation of Huffman codes from an alphabet of symbols and their associated frequencies is well documented in the art, as are variations on the theme. In some embodiments, the following variations lend themselves to best mode but are not required:

1. hcodes have a range for the number of bits in each code, with shorter lengths being assigned to more frequent codes. But too many widths make high-speed hardware encoding and decoding difficult. It is advantageous to bundle groups of low-frequency codes and assign them a common frequency so that their assigned hcode lengths are the same. Since these codes are low frequency, the effect on compression ratios can be negligible.

2. The values of the assigned Huffman codes can be re-ordered or re-assigned among codes that are decoded together (perhaps because they have a common length). For example, all 5-bit hcodes may have two particular bits always set to 1 and only the other 3 bits will vary from one another. Furthermore, these same two bits would not be set for hcodes of lengths other than 5 bits.

As the hcodes, along with any appended extra bits, are generated, they are concatenated into a contiguous bit stream that is then segmented into unit-size chunks according to the width of the output interface of the compression circuit. For example, the output may be a byte-oriented interface, and it may include protocol signals that allow the external logic to control the data-rate from the compression circuit, etc. The need for buffering and control is well understood in the art.

According to these embodiments, even though the uncompressed data may be an integral number of bytes, the compressed data stream can take on any bit length, and this is problematic in a system that deals with data in units of bytes, or words, etc. Therefore, the end of the compressed stream may need to be padded with appended zeros to make it an integral number of the desired data-unit size. The purpose of an eof rcode is to indicate that the padded data that follows should be discarded and not decoded as if it were a Huffman code.
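The concatenate-and-pad step can be illustrated with a minimal sketch. The function name pack_bits and the MSB-first bit order are assumptions for illustration; the hardware may pack bits in either order.

```python
# Minimal sketch of packing a variable-length hcode stream into bytes,
# zero-padding the tail to a byte boundary; the eof symbol, emitted before
# the pad, tells the decoder to discard the padding.

def pack_bits(codes):
    """codes: list of (value, width) pairs; returns the padded bytes."""
    bits = "".join(format(v, "0{}b".format(w)) for v, w in codes)
    pad = (-len(bits)) % 8                  # zeros needed to reach a byte edge
    bits += "0" * pad
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
```

For example, a 3-bit code 101 followed by a 1-bit code 1 packs to the single byte 10110000.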

The compressed stream may also include an appended CRC of the uncompressed data, used as an integrity check for the decompressed data.

Some embodiments address what hcode values are assigned to each of the rcode symbols. Methods for doing this are well understood in the art and their genesis is the seminal paper by David Huffman. It is a method whereby codes of varying bit-lengths are assigned to symbols so that the recoded data stream is smaller than the original data stream by the greatest possible amount, by assigning shorter codes to more frequently occurring symbols. The frequency of symbols is clearly dependent on the particular data stream. However, as understood in the art, good compression can generally be achieved across a broad range of data by using the frequency of symbols in a broad range of data. It may not be the optimal Huffman encoding for a particular segment of data but on average will work well over many types of data sets. This leads to the solution of a static Huffman encoding scheme for all data to be compressed. It may not be as good as a Huffman scheme that is optimized for each segment of data to be compressed, but it only requires a single Huffman code, and that leads to smaller and lower latency hardware implementations.

As to what constitutes a broad range of data to analyze symbol frequency for creating a static Huffman table, work was done on this at the University of Calgary, Canada and the University of Canterbury, New Zealand. This work resulted in what are known in the industry as the Calgary Corpus and the Canterbury Corpus, both standard references for comparing the effectiveness of different compression schemes.

Some of the present embodiments generate the frequencies of each rcode in a given rcode set from these corpora, and a Huffman encoding is generated for each of the rcodes. It should be noted that even if an rcode has a frequency of zero, it must still be assigned an hcode mapping - just because it may not have occurred in the reference corpora doesn't mean it won't occur in some other data set that is to be compressed.
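Deriving code lengths for one static table from corpus frequencies can be sketched with the standard Huffman construction. This is a software illustration, not the hardware method; the detail it captures from the passage above is that every rcode receives a nonzero weight so zero-frequency symbols still get an hcode.

```python
import heapq

def huffman_lengths(freqs):
    """freqs: {symbol: corpus count}; returns {symbol: hcode length in bits}."""
    # Every symbol gets a weight of at least 1 so zero-frequency rcodes
    # are still assigned an hcode mapping.
    heap = [(max(f, 1), i, {s: 0}) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        fa, _, a = heapq.heappop(heap)      # two least-frequent subtrees
        fb, _, b = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**a, **b}.items()}  # one level deeper
        heapq.heappush(heap, (fa + fb, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]
```

The resulting lengths satisfy the Kraft equality, and more frequent symbols never receive longer codes than rarer ones.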

Run-Length Decode Recoder

According to some embodiments, decompression follows a similar flow to compression but in reverse. The input data is a sequence of Huffman codes whose values and lengths are decoded and associated extra bits extracted. They are converted into their equivalent rcode and passed on to a run-length decoder.

In some embodiments, the run-length decoder executes the instruction of the rcode. For example, an rcode value less than 256 is simply the literal value of the next byte to be output. An rbyt code indicates the value of a byte and how many times it is to be replicated in the output stream. An rcmd code tells the run-length decoder how many times to repeat the last command it executed. An eof code may tell the decoder to treat the next incoming 4 bytes as a reference CRC value that should match the CRC regenerated from the output data, then terminate the decompression and discard any padding data that follows.
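For the simpler symbol kinds, the decoder's dispatch loop can be illustrated as follows. The tuple-based symbol representation and the name rl_decode are hypothetical; rlen handling, which needs the history ram discussed next, is omitted from this sketch.

```python
# Illustrative dispatch for literal, repeat-byte and repeat-command symbols.

def rl_decode(symbols):
    out = bytearray()
    last = None                             # last executed command, for rcmd
    def execute(sym):
        kind = sym[0]
        if kind == "lit":                   # literal byte: emit as-is
            out.append(sym[1])
        elif kind == "rbyt":                # replicate the previous byte n times
            out.extend(out[-1:] * sym[1])
        return sym
    for sym in symbols:
        if sym[0] == "rcmd":                # re-execute the last command n times
            for _ in range(sym[1]):
                execute(last)
        else:
            last = execute(sym)
    return bytes(out)
```

For instance, a literal "A", an rbyt of count 2, and an rcmd of count 1 expand to five "A" bytes.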

An rlen code is a little more difficult in that it refers to replicating data that was output some time back. In some embodiments, this requires a history ram to cache the output data. But the history ram does not need to maintain any data further back than the number of bytes in the compression unit's search buffer, since that defines the largest distance in the rlen lookback. This is made more complex by the fact that some rcodes can specify long run lengths that may be cumbersome and inefficient to implement, or may have a significant adverse effect on the achievable clock frequency and slow the design. Recoding is a solution to this problem. For example, a run(length, distance) symbol of run(32,100) can be recoded into 8 smaller rcodes which are then executed in sequence.

Consider the current state of the uncompressed data stream:

012345678

The next rcode is run(4,3), meaning copy the 4 bytes starting from the 3rd byte ago. The 0th byte ago is 0, so this instruction will result in the uncompressed data stream becoming:

3456012345678

However, run(4,3) can be recoded as run(2,5) run(2,5), which results in:

012345678

=> 56012345678

=> 3456012345678

In general, run(len,dist) can be decomposed into a sequence of n copies of: run(len/n, dist+len-(len/n))

Thus, run(32,100) can be recoded as 8 lots of run(32/8, 100+32-(32/8)), or 8 * run(4,128).
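The decomposition formula and the run semantics of the worked examples can be sketched and checked as follows. apply_run models the examples' newest-byte-first stream with zero-indexed distances; it is a checking aid only, not a hardware description, and the function names are illustrative.

```python
# Sketch of run(len, dist) -> n copies of run(len/n, dist + len - len/n),
# plus a software model of the example's run semantics to check equivalence.

def decompose(length, dist, n):
    sub = length // n                       # assumes n evenly divides length
    return [(sub, dist + length - sub)] * n

def apply_run(stream, length, dist):
    """Newest byte at index 0; copy bytes at distances dist..dist+length-1."""
    copied = stream[dist:dist + length]
    return copied + stream                  # prepend: newest data on the left

def apply_all(stream, runs):
    for length, dist in runs:
        stream = apply_run(stream, length, dist)
    return stream
```

Running run(4,3) directly on the stream 012345678, or its recoding run(2,5) run(2,5), yields the same result, and run(32,100) decomposes into 8 copies of run(4,128) as stated above.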

For example, run(8,2) transforms the data as follows:

0123456789

=> 234567890123456789

But run(8,2) can be recoded as a sequence of 4 run(2,8), which gives the following identical transform:

0123456789

=> 890123456789

=> 67890123456789

=> 4567890123456789

=> 234567890123456789

A decode of run(32,100) is difficult to perform at high speed and requires very wide data-path transforms of 32 bytes, or 256 bits. But recoding this as a sequence of 8 run(4,128) requires only 4-byte data-path transforms.

The recoding can have zero effect on performance. If the outgoing data-path is 4 bytes wide, nothing is gained by decoding run(32,100) versus a sequence of run(4,128), yet the difference in data-path complexity and maximum cycle-time is dramatic.

There are other ways of decomposing a run-length rcode into a sequence of rcodes using the same principles outlined here without departing from the scope of present embodiments. For demonstration purposes, powers of 2 were used in the examples. But similar means of decomposition can be achieved with non-powers of 2 as well, which is important for decomposing run(15, dist) into smaller codes. In this case, create a sequence of 3 run(4,dist') followed by a final run(3,dist''). For example, run(7,2) performs the following transform:

0123456789

=> 23456780123456789

But run(7,2) can be recoded as a sequence of 3 run(2,7) and a run(1,8), which gives the following identical transform:

0123456789

=> 780123456789

=> 56780123456789

=> 3456780123456789

=> 23456780123456789

The mathematics of the recoding can be mostly statically predetermined and hard-coded into a table. The table identifies, for every run length, how long the new sequence is, what the new length of each sequence element is, and what the addition term for the distance of each element of the sequence is. This table can be constructed so that all elements of the sequence are identical except possibly the last, as shown in the previous example. For the example of run(7,2), the 7th table entry, because it's a run(7,x), would indicate:

3 * run(2,d) where d = dist+5 (where dist is 2 in the example)

1 * run(1,e) where e = dist+6

Thus, a recoding table for any range of run lengths and desired degree of decomposition can be statically generated and hard-coded into a hardware design.
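The static table generation described above can be sketched as follows. Each entry is a sequence of (sub-length, distance addition term) pairs; identical elements except possibly a shorter tail, with the addition term for a sub-run of length s being length - s, which reproduces the run(7,2) entry above. The function name and table shape are illustrative.

```python
# Hedged sketch of statically generating a recoding table: for each run
# length, a list of (sub_length, dist_add) pairs such that run(len, dist)
# becomes the sequence of run(sub_length, dist + dist_add) elements.

def recode_table(max_len, max_sub):
    table = {}
    for length in range(2, max_len + 1):    # lengths 0 and 1 are never encoded
        seq, consumed = [], 0
        while consumed < length:
            sub = min(max_sub, length - consumed)
            # each sub-run of length s copies from distance dist + (length - s)
            seq.append((sub, length - sub))
            consumed += sub
        table[length] = seq
    return table
```

For a maximum sub-length of 2, entry 7 is three run(2, dist+5) elements and a final run(1, dist+6); for a maximum sub-length of 4, entry 32 is eight run(4, dist+28) elements, matching run(32,100) -> 8 * run(4,128).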

Alternatively, in some embodiments, the recoding parameters can be determined on the fly as a mathematical transform.

Some embodiments have been structured to allow each functional operation to be pipelined, thereby allowing high throughput bandwidth and a high clock cycle frequency. For example, the Huffman decode, the rcode recode, and the rcode decode can all be part of a single pipeline, and each of these subunits can themselves be pipelined.

The flowcharts shown in figure 7 and figure 8 illustrate the compression and decompression according to some of the present embodiments.

Figure 7 is an illustration of the compression scheme according to some embodiments. The Input Buffer 711 indicates when it is ready to receive new data using the "ready" signal. This allows a new byte of uncompressed data to be written into Input Buffer 711 and sent to CRC Generation 721, where it is used to generate a CRC checksum over the uncompressed data, as well understood in the art. Input Buffer 711 decouples any paths between the logic outside Figure 7 from paths inside Figure 7, and this makes it easier to meet cycle-time constraints. Input Buffer 711 sends the received data to Search Control and Collation 712, which coordinates the run-length search operation between all the Pattern Search blocks 722, receives all the results from the Pattern Search blocks 722, and determines the best run-length that has been found. Pattern Search 722 contains the Search Blocks that have been described earlier and the Search Groups that may encapsulate groups of Search Blocks, as has been described earlier. The best run-length found by the search is sent to the block of Precode 713.

Precode 713 determines what symbol will be Huffman encoded by Encode 714. The symbol to be encoded may be determined from the best run-length found by Search Control and Collation 712, a repeat-byte symbol (along with the length of the repeated byte sequence), a literal byte, a special code, or a null symbol, as described earlier. The symbol to be encoded, as determined by Precode 713, is sent to Encode 714.

Encode block 714 translates the symbol from Precode 713 into a Huffman encoded symbol, as described earlier. The Huffman encoded symbol is sent to block Recode 715. Recode 715 may provide further symbol reduction by compressing a duplicated sequence of symbols from Encode 714 into fewer symbols. For example, if Encode 714 generates the same Huffman encoded symbol 5 consecutive times, this may be reduced to the first symbol from Encode 714, and the subsequent 4 symbols from Encode 714 are recoded as one symbol that indicates "replicate the prior symbol 4 times." The resulting symbol of Recode 715 is sent to Output Buffer 716.

Output Buffer 716 indicates when it is ready to send out compressed data using the "ready" signal. This allows logic external to the compression circuit to consume the compressed data at its own rate and also decouples the timing in paths internal to the compression circuit from those outside the compression circuit. Output Buffer 716 may also perform any required padding at the end of the compressed data stream to make it an integral number of bytes and append the CRC checksum from CRC Generation 721.

Figure 8 illustrates the decompression scheme according to some embodiments. The Input Buffer 811 indicates when it is ready to receive new data using the "ready" signal. This allows a new byte of compressed data to be written into Input Buffer 811. Input Buffer 811 decouples any paths between the logic outside Figure 8 from paths inside Figure 8, and this makes it easier to meet cycle-time constraints. Input Buffer 811 sends the compressed data to Precode 812.

Precode 812 separates the compressed bit stream into a sequence of Huffman symbols. The Huffman symbols are sent to Decode 813.

Decode 813 decodes the Huffman symbols into their non-Huffman encoded counterparts. This reverses the operations performed by Encode 714 of Figure 7. The non-Huffman symbols, i.e., the compression symbols, are sent to Recode 814.

Recode 814 optionally performs simplification of the compression symbols. For example, a run-length-8 symbol may be broken into 2 equivalent-function run-length-4 symbols so as to simplify the inflate function of Inflate 815. The resulting compression symbols are sent to Inflate 815.

Inflate 815 performs the operation of the compression symbol to regenerate the original uncompressed data stream. For example, a repeat-byte-16 command will be inflated to 16 duplicated bytes specified by the repeat-byte command. The inflating of a run-length command requires a copy of the decompressed data, and this may be maintained in a ram. Run-length commands are limited in how far they can look back into the decompressed data because that limit is enforced by the size of the search buffer contained in the blocks of Pattern Search 722 of Figure 7. This limits the size of the ram inside Inflate 815, and its size is not related to the total amount of decompressed data. A run-length code will indicate, for example: copy the 8 bytes starting from 55 bytes back into the next 8 bytes of the decompressed data stream. Depending on the structure of the ram, this will be converted into the ram location or locations that will be read to extract the data, and these types of processes are well understood in the art. The resulting decompressed data is sent to Output Buffer 816.

Integrity Check 821 receives inputs from Precode 812, Decode 813, and Inflate 815 and sends integrity status information to Output Buffer 816. Data from Precode 812 indicates whether the received compressed data stream can be broken into valid Huffman codes. Decode 813 indicates whether the compression symbols can be correctly decoded. For example, if a run-length code specifies to copy 4 bytes of data from 300 bytes back but only 200 bytes have been presently decompressed, then the compression symbol is the result of corruption in the compressed data. Inflate 815 provides a copy of the decompressed data to Integrity Check 821 so it can generate the CRC of the decompressed data and compare it to the CRC of the original uncompressed data that was appended to the end of the compressed data by Output Buffer 716 of Figure 7. If Integrity Check 821 detects any condition that indicates the decompressed data may not match the original uncompressed data, it informs Output Buffer 816.

Output Buffer 816 indicates when it is ready to send out decompressed data using the "ready" signal. This allows logic external to the decompression circuit to consume the decompressed data at its own rate and also decouples the timing in paths internal to the decompression circuit from those outside it. Output Buffer 816 also indicates when the last byte of decompressed data is being sent out and whether any integrity errors were detected by the decompression process.

The following code illustrates the compression and decompression schemes according to some of the present embodiments.

// COMPRESSION FLOW CHART
fill all bytes of match buffer with data to be compressed
set skip_count to 1
while data is still in the match buffer

// SEARCH PROCESS
repeat skip_count times
    move new data byte into match buffer (if there is any)
    move oldest byte from match buffer to start of search buffer
    discard oldest byte of search buffer
end_repeat
if oldest match_buffer byte has same value as the prior oldest match_buffer byte
    inc repeat_byte_count
end_if
for each block in the search buffer
    find the longest contiguous match of data in the search block and the match buffer that includes the oldest match_buffer byte
    if match found that exceeds a predetermined minimum run_length
        if run_length is the longest so far found in this loop
            record the (b,r,o) - block number, run length, offset position in the block where the run starts
        end_if
    end_if
end_for

// ENCODE PROCESS
if the best (b,r,o) run_length is greater than the repeat_byte_count
    encode the (b,r,o) as a run_length winner
    set skip_count to the run_length, r, minus 1
else if repeat_byte_count > 0
    encode repeat_byte_count as winner
    set skip_count to the repeat_byte_count minus 1
else
    encode the oldest byte in match buffer as a literal winner
    set skip_count to 1
end_if

// EMIT PROCESS
if the encoded winner is a repeat_byte
    if the repeat_byte count exceeds a predetermined maximum
        emit a repeat_byte with a value of 1 less than the current count
        encode the new byte in match buffer as the literal winner
        emit the encode
        reset the repeat_byte_count to 0
    end_if
else if this encoded winner matches the prior emitted encoded winner
    inc repeat_command_count
    if repeat_command_count exceeds a predetermined maximum
        emit a repeat_command with a value of 1 less than the current count
        emit the encode
        reset the repeat_command_count to 0
    end_if
else
    if repeat_byte_count > 0
        emit a repeat_byte with a value of the current count
        reset the repeat_byte_count to 0
    end_if
    if repeat_command_count > 0
        emit a repeat_command with a value of the current count
        reset the repeat_command_count to 0
    end_if
    emit the encode
end_if
end_while
if repeat_byte_count > 0
    emit a repeat_byte with a value of the current count
end_if
if repeat_command_count > 0
    emit a repeat_command with a value of the current count
end_if
emit eof

// EMIT FLOW CHART
receive one of the following encodes:

- repeat byte with count

- repeat command with count

- run_length with block, length, offset

- literal byte with value of byte

- eof indicator

generate a huffman encoding of the encode
concatenate the huffman code to the compressed result

// DECOMPRESSION FLOW CHART
move the compressed data into a decode buffer
set ptr to 0
while not end_of_decoding

// EXTRACT PROCESS
find the smallest sequence of bits starting at position ptr of the decode buffer that form a valid huffman code
determine the length of the huffman code and the associated parameters for the code
increment ptr by this length
decode the huffman code and extract any associated parameters

// DECODE PROCESS
if huffman code is eof
    set end_of_decoding to true
end_if
if huffman code is a literal
    output the decoded literal byte
    insert the decoded literal byte to the start of the history data
end_if
if huffman code is a repeat byte
    output the previous literal byte n times, n being the extracted parameter
    insert the previous literal byte n times to the start of the history data
end_if
if huffman code is a repeat command
    repeat n times, n being the extracted parameter
        execute the "DECODE PROCESS" using the prior decoded huffman code
    end_repeat
end_if
if huffman code is a run_length
    associated parameters are block_number, run_length, and offset
    delta = block_number * block_size + offset
    extract run_length number of bytes from history data starting delta bytes back
    output the extracted bytes
    insert the extracted bytes to the start of the history data
end_if
end_while

There are many other systems and methods that could be implemented according to, and without departing from the scope and spirit of, the presently disclosed embodiments, some of the scope of which is disclosed by the following claims.