

Title:
ARCHITECTURE AND METHOD OF A FLEXIBLE PIPELINE PROCESSOR
Document Type and Number:
WIPO Patent Application WO/2017/120619
Kind Code:
A1
Abstract:
The speed of many computer algorithms could be significantly increased by a pipeline architecture processor, except for the limitation that pipelines cannot branch. This invention discloses a method and architecture for a pipelined processing system which can be implemented N deep by M wide and has branching capability within each pipeline column independent of the other columns while continuously providing one result per column every clock cycle. The system also provides for selecting bypass paths which can allow a given program to configure the N x M matrix into subsets such that (N*x) rows can be utilized with (M/x) columns or (N/x) rows and (M*x) columns where x is often set to but not limited to a value of 1, 2, 4, or 8.

Inventors:
RUSSELL DAVID WAYNE (US)
Application Number:
PCT/US2017/017570
Publication Date:
July 13, 2017
Filing Date:
February 11, 2017
Assignee:
RUSSELL DAVID WAYNE (US)
International Classes:
G06F7/50; G06F7/38; G06F7/72; H03M13/03
Foreign References:
US20150091927A12015-04-02
US20090300336A12009-12-03
US20150200671A12015-07-16
US20070234187A12007-10-04
US7069372B12006-06-27
US20090106342A12009-04-23
US20060015553A12006-01-19
Claims:
CLAIMS

The invention claimed is:

1) A hardware pipeline processing unit capable of utilizing the results of pipeline stage N to determine the functional operation for processing stage N+1.

2) The system of 1 where bypass paths allow the pipeline to be reconfigured from one column by N*M rows to one row by N*M columns and any combination in between.

3) The system of 1 where bypass paths allow data from any given stage to be redirected to another stage input register in addition to the next stage in its column.

4) The system of 1 where a single integrated circuit package is used to implement the Flexible Pipeline Processor (FPP) but additional chips or co-processors are also tied to the FPP as arithmetic functional blocks or data fetch or storage processors.

5) The system of 1 where multiple integrated circuits, either custom designed or off-the-shelf, are interconnected to implement the FPP.

6) The system of 1 where each stage contains some number of functional blocks and some number of register multiplexers.

7) The system of 1 where the FPP is coupled to one or more standard CPUs such that each shares in access to the memory system and data and algorithms not able to be performed by the FPP are performed by the standard CPU.

8) The system of 1 where data is transferred from another system into the FPP storage memory to be processed and then transferred back to the originating system.

9) The system of 1 where multiple FPPs are defined within the same system.

10) The system of 1 where a real-time data stream such as an imaging system is input to the FPP such that the processing speed of the FPP is at least partially dependent on the data rate of the real-time system and no data fetch engine is required.

11) The system of 1 where the data output of the FPP is driven directly to a real-time display or process system such that storage of the data results is not required.

12) The system of 1 where functional operations proceed at a secondary clock rate which may be different from the stage transfer operations which proceed at a primary clock rate.

13) The system of 1 where functional operations within a stage may include complex operations implemented by some combination of elements such as but not limited to a CPU (such as but not limited to a Von Neumann architecture, Harvard architecture, or Butterfly processor), floating point processor, rendering engine, pixel processor, hardware accelerator, deep learning system, CPU array, neural network, Kalman filter, and/or Digital Signal Processor.

14) The system of 1 where pipeline interlocks may halt the execution of some number of pipelines to synchronize data availability with other pipelines.

Description:
SPECIFICATION

TITLE OF INVENTION

Architecture and Method of a Flexible Pipeline Processor

INVENTORS

David Wayne Russell, Winter Garden, Florida, USA

CROSS-REFERENCE TO RELATED APPLICATIONS

US 62/275,914, filed 01/07/2016

STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT

Not Applicable

REFERENCE TO SEQUENCE LISTING, A TABLE, OR A COMPUTER PROGRAM LISTING COMPACT DISK APPENDIX

Not Applicable

FIELD

[0001] The field of this invention relates generally to computer architectures and more specifically to a flexible pipeline processor which can branch within a pipeline architecture.

BACKGROUND

[0002] The speed of some computer algorithms could be significantly increased by a pipeline architecture processor. In a pipelined processor, data flows down a column of simultaneous processing elements, and often each processing element is designed specifically for that step in the computation process. Once the pipeline fills with data (the fill time is referred to as the latency of the pipeline), it thereafter produces one computation result every clock cycle. In simplistic terms this produces a speed advantage in processing equal to the depth of the pipeline.
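
As a minimal illustration of this fill-then-stream behavior, the following software sketch models an N-deep pipeline (illustrative only; the stage functions, register model, and input values are assumptions, not part of this disclosure):

    # Toy model of a 4-deep pipeline: after 4 cycles of fill latency it
    # emits one finished result per simulated clock.
    stages = [lambda x, k=k: x + k for k in range(4)]  # N = 4 stage functions
    regs = [None] * len(stages)                        # inter-stage registers
    inputs = list(range(10))
    results = []
    cycles = 0
    while inputs or any(r is not None for r in regs):
        out = regs[-1]                                 # value leaving the pipe
        for i in range(len(stages) - 1, 0, -1):        # shift, last stage first
            regs[i] = stages[i](regs[i - 1]) if regs[i - 1] is not None else None
        regs[0] = stages[0](inputs.pop(0)) if inputs else None
        if out is not None:
            results.append(out)
        cycles += 1
    print(cycles, results)  # 14 cycles for 10 inputs: latency 4, then 1/clock

Ten inputs finish in 14 cycles rather than the 40 a purely sequential 4-step computation would take, reflecting the depth-of-pipeline speedup described above.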

[0003] The problem is that the average number of cycles a program executes before it branches is about 3, and pipelines cannot branch, so an algorithm has to fit the pipeline paradigm exactly, or be very short, to be applicable. Most commercial CPUs at this point have pipelines within them for the initial steps of load opcode, execute, prefetch next opcode, and so on, but these are usually only 3 to 5 steps deep. The number of algorithms that can fit a long pipeline without branching is very small, and when only one algorithm matches the pipeline it is often referred to as a hardware accelerator rather than a pipeline processor.

[0004] An architecture needs to be developed that allows for generic algorithmic pipelines of increased length N, which implies including branching capability. In addition, if one pipeline is good, eight are often better. Many algorithms can benefit from simultaneous computation of a number of variables, so as long as there is room in the IC, any number M of parallel pipeline columns can be implemented, resulting in speed advantages of N x M over conventional processors.

BRIEF SUMMARY OF THE INVENTION

[0005] The Flexible Pipeline Processor (FPP) is implemented by developing an architecture that simulates most of the processes a standard Von Neumann machine architecture would perform to reach the same result. In a Von Neumann or Harvard architecture, for example, the Central Processing Unit (CPU) contains basically one Arithmetic Logic Unit (ALU) capable of some number of arithmetic and logical operations such as ADD, AND, OR, XOR, and NOT. All other functions such as subtract, multiply, divide and more complex operations are implemented programmatically out of these simpler building blocks, but at the expense of longer computation times.

[0006] It has been said that general purpose computers are designed to do everything poorly. The "poorly" is not the original intent but the consequence of having to build all higher functions in software with very simplistic hardware resources. These machines can do virtually any algorithm, but the inevitable consequence of this flexibility is that they do not do any one thing particularly well. Hardware accelerators are machines specifically designed to implement one algorithm very fast, but each works only for the single algorithm for which it was designed. The FPP provides a middle ground where a much wider range of algorithms can be run, still faster than on a standard CPU.

BRIEF DESCRIPTION OF DRAWINGS

[0007] The following detailed description illustrates embodiments of the invention by way of example and not by way of limitation. The description clearly enables one skilled in the art to make and use the disclosure, describes several embodiments, adaptations, variations, alternatives, and uses of the disclosure, including what is currently believed to be the best mode of carrying out the disclosure. The disclosure is described as applied to an exemplary embodiment, namely, a flexible pipeline processor. However, it is contemplated that this disclosure has general application to computing systems in industrial, commercial, military, and residential applications.

[0008] As used herein, an element or step recited in the singular and preceded with the word "a" or "an" should be understood as not excluding plural elements or steps, unless such exclusion is explicitly recited. Furthermore, references to "one embodiment" of the present invention are not intended to be interpreted as excluding the existence of additional embodiments that also incorporate the recited features.

[0009] The detailed description is described with reference to the accompanying figures. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The use of the same reference numbers in different figures indicates similar or identical components or features.

[0010] The novel features believed characteristic of the illustrative embodiments are set forth in the appended claims. The illustrative embodiments, however, as well as a preferred mode of use, further objectives, and features thereof will best be understood by reference to the following detailed description of illustrative embodiments of the present disclosure when read in conjunction with the accompanying drawings, wherein:

[0011] FIG. 1 shows an overall architecture of a Flexible Pipeline Processor.

[0012] FIG. 2 depicts, without limitation, an exemplary FPP stage.

DETAILED DESCRIPTION OF INVENTION

[0013] In the FPP each pipeline row or stage is composed of some number of ALUs and potentially other higher-level functions such as but not limited to Multiply, Divide, Shift, and even Fast Fourier Transform (FFT). The tradeoff in the FPP is hardware space versus software speed. As one cannot predict which function will be necessary at any particular stage, and iterative operations within the pipeline consume stages, the goal is to provide as many direct hardware functions as will fit.

[0014] Functions which take more than one clock cycle to complete, the FFT being one example, can be handled in two ways. First, one high speed clock (the secondary clock) can be used to cycle arithmetic units within the stage, and a slower clock (the primary clock) used to cycle the results from one FPP stage to the next. The pipeline still produces one result per primary clock, but the primary clock is the slower of the two and therefore sets the overall rate.
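
The division of labor between the two clocks can be sketched as follows (an illustrative model; the tick counts are assumed example values, not from this disclosure):

    # The primary (stage-transfer) period must cover the slowest function
    # actually cycled by the secondary clock within any stage.
    SECONDARY_TICKS = {"ADD": 1, "MUL": 3, "FFT": 8}  # assumed latencies

    def primary_period(functions_in_use):
        """Primary clock period, in secondary ticks, for one stage transfer."""
        return max(SECONDARY_TICKS[f] for f in functions_in_use)

    print(primary_period({"ADD", "MUL"}))         # -> 3
    print(primary_period({"ADD", "MUL", "FFT"}))  # -> 8

One result per column still appears every primary clock; the cost of the multi-cycle function shows up as a longer primary period.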

[0015] In addition, the probability is high that only one or two arithmetic functions are used per stage, which leaves most of the ALUs in the design inactive. This is inefficient in terms of space but necessary to achieve high speeds. This opens the possibility of an architecture compromise where the ALU within each stage is itself a Von Neumann architecture machine and contains software, either customized or permanently in place, which performs the higher arithmetic functions. This significantly reduces the amount of hardware required in each stage, but now the primary clock must be significantly slower: it must be as slow as the longest arithmetic function, even if that function is not used.

[0016] In another embodiment, some number of ALUs and other function types can be mixed among the pipeline stages such that fewer are wasted overall, although one might have to skip a pipeline stage (No Operation, or NOP) to reach the desired function in the sequence.

[0017] Secondly, bypasses can be implemented which shunt the data result from stage N to stage N+D instead of N+1, where D is the latency of the operation. If an FFT, for example, takes an external processor 4 clock cycles to complete, then the data from N can be shunted to N+4, keeping the results of all columns in relative position. The intervening stages can still be utilized to calculate other data values.
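
The timing of such a shunt can be checked with a short illustrative calculation (the column length, bypass position, and latency are assumed example values):

    # With a D-cycle operation at stage N shunted to stage N+D, a datum
    # entering an L-stage column still exits after exactly L primary clocks.
    def exit_cycle(enter_cycle, L=8, N=2, D=4):
        # stages 0..N-1 take one clock each, the bypassed operation takes
        # D clocks while skipping the D-1 intervening stages, and stages
        # N+D..L-1 take one clock each
        return enter_cycle + N + D + (L - (N + D))

    for c in range(5):
        print("enter", c, "-> exit", exit_cycle(c))  # exits at c + 8, no bubbles

Because the shunted result arrives in step with single-cycle traffic, the column keeps producing one result per primary clock.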

[0018] In another embodiment, if the FPP is defined as N rows by M columns, many algorithms make use of "neighboring" data points, and this data may be bypassed from one stage to another in both dimensions to make it available when required. A "Manhattan Nearest Neighbor" algorithm, for example, requires the matrix neighbors of (N,M), which are (N-1,M), (N+1,M), (N,M+1), and (N,M-1), to complete the calculation. If the pipeline is arranged such that all of these values are within pipeline stages at a particular time, they can be processed without refetching the data multiple times, resulting in greater efficiency.
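
A minimal sketch of that neighbor set (illustrative code, not the claimed hardware):

    # Manhattan neighbors of matrix point (n, m), clipped to the matrix
    # bounds; these are the values the text notes can all be live in
    # nearby pipeline stages at once instead of being refetched.
    def manhattan_neighbors(n, m, rows, cols):
        cand = [(n - 1, m), (n + 1, m), (n, m + 1), (n, m - 1)]
        return [(r, c) for r, c in cand if 0 <= r < rows and 0 <= c < cols]

    print(manhattan_neighbors(2, 3, rows=4, cols=8))
    # -> [(1, 3), (3, 3), (2, 4), (2, 2)]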

[0019] Once the stages are defined, the final point of the invention is to implement branching. There are three significant types of branches: first, branches that determine the next step in the computation based on the result of the current computation; second, branches that iterate a series of instructions X times; and third, loops which iterate over a primary section of the program.

[0020] In this embodiment of the FPP a compromise is proposed where loops are iterations of the entire pipeline and short iterative operations are not allowed. In another embodiment the algorithm can be broken into multiple steps where the pipeline represents one iterative loop, whether it be long or short. The intermediate results are then stored, the pipeline is reprogrammed for the next step, and computation restarts. This is improved further by the ability to reorganize the N x M pipeline into fewer, longer pipes of (N*2) by (M/2) or more, shorter pipes of (N/2) by (M*2); one familiar with the art would recognize that other organizations are viable as well without altering the intention of the invention.
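
The valid regroupings of a fixed pool of stages can be enumerated with a short sketch (the 8 x 4 starting shape is an assumed example):

    # Bypass paths let N*M stages be regrouped into any rows-by-columns
    # shape whose product is still N*M.
    def configurations(n, m):
        total = n * m
        return [(r, total // r) for r in range(1, total + 1) if total % r == 0]

    print(configurations(8, 4))
    # -> [(1, 32), (2, 16), (4, 8), (8, 4), (16, 2), (32, 1)]

This matches the abstract's (N*x) by (M/x) and (N/x) by (M*x) subsets for x = 1, 2, 4, or 8.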

[0021] For short iterative computations the width of the data path can be expanded to the maximum width that the data fetch mechanism feeding the pipeline can manage. Since memory may be faster than the stage clocking frequency, multiple memory fetches may be possible to fill a wide pipeline. Pipeline columns may also be stitched together end to end by diverting the result of column M to the input of column M+1 to provide very long computation loops as required.

[0022] Pipeline Stage 0 is the first stage of the pipeline. If it performs any arithmetic operation, it will produce result flags such as but not limited to EQUAL, LESS THAN, or ZERO. The setting of a flag indicates the condition and the reset of the flag indicates its logical NOT, such as NOT EQUAL, GREATER THAN OR EQUAL TO, or NOT ZERO.

[0023] This is only true if some validity flag indicates the functions were actually tested in the stage cycle. As a result of these flags, the program may desire to branch to a different operation, altering what STAGE 1 would perform. This is accomplished by storing the two programmatic alternatives at STAGE 1 and having STAGE 0 determine which operation code (opcode) is executed at STAGE 1 via its result flags. Similarly, for each stage of the pipeline, STAGE N determines the opcode that will be utilized at STAGE N+1.
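
One possible software model of this flag-driven selection (the opcode names and encoding are illustrative assumptions, not the disclosed hardware):

    # STAGE N executes and produces flags; the flags index into the two
    # candidate opcodes stored for STAGE N+1.
    def alu(op, a, b):
        val = {"SUB": a - b, "ADD": a + b, "NOP": a}[op]
        flags = {"ZERO": val == 0, "LESS_THAN": val < 0}
        return val, flags

    candidates_next = {True: "NOP", False: "ADD"}  # taken / not-taken opcodes
    val, flags = alu("SUB", 7, 7)                  # STAGE N compares 7 and 7
    next_op = candidates_next[flags["ZERO"]]
    print(next_op)                                 # -> NOP: the branch is taken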

[0024] As STAGE 0 traverses to STAGE 1, only two possible opcodes are selectable: the result of STAGE 0 was either TRUE or FALSE. At STAGE 2, however, two results are available for each of the possible results of STAGE 1 as well. Following this paradigm, at STAGE n there are 2**n (2 to the nth power, where n = 0..N) possible combinations of opcodes that could be chosen. This can be dealt with in a number of ways, such as but not limited to: first, adding a prefetch phase to each pipeline stage such that only two possible outcomes are prefetched, knowing the combinatorics in play at STAGE (N-1). Second, in reality branches are usually short and converge back to the same stream relatively quickly, so a local storage of all opcodes actually likely to be used (which can be determined from the program data flow graph) can be much smaller than 2**N (2 to the maximum number of pipeline stages N). In another embodiment, as there may be more than one flag that can be tested, combinations of flags in the logic may relate to more than two possible outcomes of the test.
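
A small illustrative comparison (the toy flow graph is an assumed example) shows why reconvergent branches keep the stored-opcode count far below 2**N:

    # Naive worst case: the opcode choices double at every stage.
    print([2 ** n for n in range(8)])   # -> [1, 2, 4, 8, 16, 32, 64, 128]

    # Toy program flow graph: a two-way branch at stage 1 that
    # reconverges at stage 3, as derived from the data flow graph.
    reachable = [{"CMP"}, {"ADD", "SUB"}, {"MUL", "SHL"}, {"STORE"}]
    print([len(s) for s in reachable])  # -> [1, 2, 2, 1], far below 2**n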

[0025] In one embodiment the programming of the pipeline is accomplished by a Very Long Instruction Word (VLIW) architecture. At each pipeline stage the set of instructions that could be branched to is stored. Each instruction contains all of the information flow possibilities that the pipe stage is designed to handle: for example, but not limited to, all combinations of moving the input register data to the output registers, input registers through some number of ALUs, the functionality of one or more ALUs, and the flags that will be tested at each ALU to determine the final result.
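
One way such a stage instruction might be modeled in software (the field names are illustrative assumptions; the actual bit-level encoding is implementation dependent):

    from dataclasses import dataclass

    # One candidate VLIW word for one stage: every routing and function
    # choice the stage hardware can make, bundled together.
    @dataclass
    class StageInstruction:
        input_mux: dict    # input register -> ALU port routing
        alu_ops: list      # function selected for each ALU in the stage
        flag_tests: list   # flags examined to pick the next stage's opcode
        output_mux: dict   # result -> next-stage register routing

    insn = StageInstruction(
        input_mux={"R0": "ALU0.A", "R1": "ALU0.B"},
        alu_ops=["SUB"],
        flag_tests=["ZERO"],
        output_mux={"ALU0": "next.R0"},
    )
    print(insn.alu_ops, insn.flag_tests)  # -> ['SUB'] ['ZERO']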

[0026] Referring now to the invention in more detail, in Fig. 1 there is shown an overall diagram of a Flexible Pipeline Processor. The different illustrative embodiments recognize and take into account a number of different considerations. "A number", as used herein with reference to items, means one or more items. For example, "a number of different considerations" means one or more different considerations. "Some number", as used herein with reference to items, may mean zero or more items.

[0027] In Figure 1, data first enters M pipeline columns via a parallel data fetch engine 100. In one embodiment this engine is implemented as a finite state machine and provides the required data for the defined number of pipeline stages every primary clock cycle (stage clock cycle). In one embodiment this can be accomplished by a data fetch mechanism that is wide enough to supply all data at once; alternatively, if the memory fetch cycle is shorter than the primary clock cycle, multiple fetches may be made in one primary cycle, utilizing without limitation the secondary clock for example. The number of columns may vary from 1 to M depending on how the pipeline is configured for this segment of code.
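
The fetch-bandwidth requirement can be sketched as follows (the clock periods and fetch widths are assumed example values):

    # If the memory cycle is shorter than the primary (stage) clock,
    # several fetches fit into one primary cycle and together must
    # supply all M active columns.
    def can_feed(columns, words_per_fetch, primary_ns, memory_ns):
        fetches_per_primary = primary_ns // memory_ns
        return fetches_per_primary * words_per_fetch >= columns

    print(can_feed(columns=8, words_per_fetch=4, primary_ns=10, memory_ns=5))  # True
    print(can_feed(columns=8, words_per_fetch=2, primary_ns=10, memory_ns=5))  # False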

[0028] In one embodiment a very simple operation such as but not limited to MATRIX CLEAR could be configured to clear one element of the input data set for each stage N*M of the pipeline if the fetch and store mechanisms are able to keep up. In another embodiment only one data point might be required of the fetch engine if the pipeline is configured as 1 column by M*N stages.

[0029] In another embodiment, additional bypass paths can be created such that information from one stage is available at the previous or next stage. This allows for complex matrix operations such as nearest neighbor searches where one point requires some number of adjacent points to evaluate the result. Data bypasses can be organized horizontally, vertically, or diagonally depending on the implementation.

[0030] Data is gated into one or more registers of a pipeline stage 110. The operation of each stage is controlled by an opcode fetch machine 130 which determines the operation of stage N+1 from the results of stage N, and this is duplicated across all M columns because each column's operations are now data dependent and independent of the other columns. In one embodiment all of the possible opcodes are stored at each stage. In another embodiment it can be seen that once the flags for stage N are set, the opcode choices are limited to 2 at stage N+1 and 4 at stage N+2. If the prefetch mechanism is fast enough to provide the opcodes in 2 cycles, only 4 opcodes need to be stored at any given stage N+2.

[0031] If the opcode fetch is very fast, only the required opcode needs to be fetched. It can be seen by one practiced in the art that various implementations, speeds, and widths of the opcode fetch mechanism could be implemented without altering the fundamental intent of the invention.

[0032] At each primary clock, the results and data from stage N 110 are transferred to stage N+1 120. This continues until the end of the pipeline, where the final results are clocked into a parallel result storage engine 140 which stores the results back into primary data storage. The storage engine must be implemented fast enough to store any number of results that the pipeline is able to produce, which may be greater than, equal to, or less than the data input to the pipeline.

[0033] In Figure 2, one embodiment of a pipeline stage is depicted. In this embodiment six input registers 200 are available for storing data and intermediate results along the pipeline. The number of registers is arbitrary and in this embodiment designed from analysis of the algorithms to be run. In other embodiments the number of registers might be decided simply from how many gates are available in the implementation or other factors.

[0034] At the beginning of the primary clock cycle, a register multiplexer 210 directs the register data to one or more of the Arithmetic Logic Units (ALUs) 240 or other processing functions such as but not limited to a hardware multiply/divide unit 230, barrel shifter, or Fast Fourier Transform (FFT) co-processor. The register multiplexer and the functional units are controlled by the opcode 220, which could be set to a fixed code, such as is found in STAGE 0,0, or selected from the results of the previous stage. Note that if bypasses are used to redefine the NxM pipeline stages, STAGE 0,1 might have more than one opcode choice. Other paths from the register multiplexer might provide bypass paths from one register to another through the stage.

[0035] In another embodiment the use of bypasses from one pipeline stage (A) to another (B) may interfere with the intended progression of the pipeline, where it produces one row of results each primary clock cycle. In this case a pipeline interlock circuit may be constructed which delays the input and output data clocks for some number of cycles. For example, without limitation, if there are 4 columns of pipelines denoted A, B, C, and D, and data from stage 3 of B is bypassed such that it is an input to pipeline A stage 2, then all pipelines would proceed normally at stage 1. B, C, and D would continue normally until stage 3 completes, but pipeline A would be interlocked until the data is available at stage 2, whereupon pipelines B, C, and D would interlock until A caught up, and all pipelines would continue again. This implies more complex data sourcing and storage mechanisms that can hold and release different values into each pipeline as necessary. In this example scenario, however, once the original interlock has synchronized the data streams, all pipelines can continue at full speed.
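
A highly simplified software model of this interlock behavior (the column names follow the example above; the stall logic is an illustrative assumption):

    # Column A waits for the bypassed value from B stage 3; once it
    # arrives, B, C, and D hold for a cycle while A catches up.
    stalled = {"A": True, "B": False, "C": False, "D": False}

    def step(bypass_ready):
        if bypass_ready:
            stalled["A"] = False                                # A may consume the bypass
            stalled["B"] = stalled["C"] = stalled["D"] = True   # let A catch up
        return {c: ("advance" if not s else "hold") for c, s in stalled.items()}

    print(step(bypass_ready=False))  # A holds at stage 2; B, C, D advance
    print(step(bypass_ready=True))   # A advances; B, C, D hold one cycle

Once the streams are synchronized, all four columns resume at full speed, as described above.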

[0036] The input registers and output results may be multiplexed at the output of the stage 250. In another embodiment, careful control of the data paths through the stage may make this second multiplexer unnecessary. In either case, at the end of the primary cycle the data from this stage 250 is routed to the registers of the next stage 270 and the result flags of the functional units are latched 260 to provide the correct opcode to the next stage or the result storage engine.

[0037] While the foregoing written description of the invention enables one of ordinary skill to make and use what is considered presently to be the best mode thereof, those of ordinary skill will understand and appreciate the existence of variations, combinations, and equivalents of the specific embodiment, method, and examples herein. The invention should therefore not be limited by the above described embodiment, method, and examples, but by all embodiments and methods within the scope and spirit of the invention. Further, different illustrative embodiments may provide different benefits as compared to other illustrative embodiments. The embodiment or embodiments selected are chosen and described in order to best explain the principles of the embodiments, the practical application, and to enable others of ordinary skill in the art to understand the disclosure for various embodiments with various modifications as are suited to the particular use contemplated.

[0038] The flowcharts and block diagrams described herein illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various illustrative embodiments. In this regard, each block in the flowcharts or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function or functions. It should also be noted that, in some alternative implementations, the functions noted in a block may occur out of the order noted in the figures. For example, the functions of two blocks shown in succession may be executed substantially concurrently, or the functions of the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.