

Title:
METHOD AND APPARATUS FOR PROCESSOR PIPELINE SEGMENTATION AND RE-ASSEMBLY
Document Type and Number:
WIPO Patent Application WO/2000/070483
Kind Code:
A2
Abstract:
An improved method and apparatus for implementing instructions in a pipelined central processing unit (CPU) or user-customizable microprocessor. In a first aspect of the invention, an improved method of controlling the operation of the pipeline in situations where one stage has been stalled or interrupted is disclosed. In one embodiment, a method of pipeline segmentation ('tearing') is disclosed where the later, non-stalled stages of the pipeline are permitted to continue despite the stall of the earlier stage. Similarly, a method which permits instructions present at earlier stages in the pipeline to be re-assembled ('catch-up') to later stalled stages is also described. A method of synthesizing a processor design incorporating the aforementioned segmentation and re-assembly methods, and a computer system capable of implementing this synthesis method, are also described.

Inventors:
HAKEWILL JAMES ROBERT HOWARD
SANDERS JON
Application Number:
PCT/US2000/013221
Publication Date:
November 23, 2000
Filing Date:
May 12, 2000
Assignee:
ARC INTERNAT U S HOLDINGS INC (US)
International Classes:
G06F9/38; G06F17/50; (IPC1-7): G06F15/80
Foreign References:
EP0352103A21990-01-24
EP0649085A11995-04-19
US5809320A1998-09-15
Other References:
"CONDITION REGISTER COHERENCY LOOK-AHEAD" RESEARCH DISCLOSURE,GB,INDUSTRIAL OPPORTUNITIES LTD. HAVANT, no. 348, 1 April 1993 (1993-04-01), page 243 XP000304185 ISSN: 0374-4353
DIEFENDORFF K ET AL: "ORGANIZATION OF THE MOTOROLA 88110 SUPERSCALAR RISC MICROPROCESSOR" IEEE MICRO,US,IEEE INC. NEW YORK, vol. 12, no. 2, 1 April 1992 (1992-04-01), pages 40-63, XP000266192 ISSN: 0272-1732
Attorney, Agent or Firm:
Nataupsky, Steven J. (Martens Olson & Bea, LLP Sixteenth Floor 620 Newport Center Drive Newport Beach CA, US)
Hunt, Dale C. (Martens Olson & Bea, LLP 16th floor 620 Newport Center Drive Newport Beach CA, US)
Claims:
WHAT IS CLAIMED IS:
1. A method of operating a processor having a pipeline, comprising: providing a first pipeline stage capable of processing a first instruction; providing a second pipeline stage, said second pipeline stage being downstream of said first pipeline stage and further being adapted to process a second instruction; stalling said first instruction at said first pipeline stage; and processing said second instruction in said second stage after said first pipeline stage has stalled.
2. The method of Claim 1, wherein said pipeline comprises a three stage pipeline, and the acts of providing said first and second pipeline stages comprise providing an instruction decode stage and an instruction execution stage, respectively.
3. The method of Claim 1, wherein the act of stalling comprises: detecting an interlock condition; and generating an interlock signal, said signal being adapted to stall said first pipeline stage.
4. The method of Claim 3, further comprising determining the validity of said instruction in said second pipeline stage prior to processing said second instruction therein.
5. A method of operating a processor having a pipeline, said pipeline comprising at least a first stage, a second stage, and a third stage, comprising: providing an instruction within each stage of said pipeline; stalling an instruction in said first stage; processing an instruction within said second stage after said first stage has stalled; moving the processed instruction within said second stage to said third stage; and inserting a blank slot into said second stage of said pipeline to prevent the processed instruction present in said second stage from being executed multiple times.
6. The method of Claim 5, wherein said first stage comprises an instruction fetch stage, said second stage comprises an instruction decode stage, and said third stage comprises an instruction execution stage.
7. The method of Claim 6, wherein the act of stalling said first stage comprises: detecting an interlock condition between said first stage and at least one other stage within said pipeline; and stalling said first stage in response to said interlock condition.
8. The method of Claim 5, wherein said pipeline further comprises a fourth stage following said third stage.
9. The method of Claim 8, further comprising: processing an instruction within said third stage after said first stage has stalled; and moving the processed instruction within said third stage to said fourth stage when the processed instruction within said second stage is moved to said third stage.
10. The method of Claim 7, further comprising: providing a flag setting instruction within said second stage, and a jump instruction within said first stage; detecting at least one instance when one or more flags set by said at least one flag setting instruction may affect the subsequent execution of said at least one jump instruction; and stalling the execution of said at least one jump instruction within said first stage of said pipeline at least until all flags to be set by said at least one flag setting instruction have been set.
11. A method of synthesizing the design of a processor, comprising: generating a first file specific to said design to include a plurality of instruction words; inputting information to said first file to include an instruction set, whereby the execution of at least one instruction word within a first pipeline stage of said processor continues after another one of said plurality of instruction words has been stalled in an earlier pipeline stage; defining the location of at least one library file; generating a script using said first file, said library file, and user input information; running said script to create a customized description language model; and synthesizing said design based on said description language model.
12. The method of Claim 11, wherein the act of synthesizing comprises running synthesis scripts based on said customized description language model.
13. The method of Claim 12, further comprising the act of generating a second file for use with a simulation, and simulating said design using said second file.
14. The method of Claim 13, further comprising the act of evaluating the acceptability of the design based on said simulation.
15. The method of Claim 14, further comprising the acts of revising the design to produce a revised design, and resynthesizing said revised design.
16. The method of Claim 11, wherein the act of inputting comprises selecting a plurality of input parameters associated with said design, said parameters comprising: (i) a cache configuration; and (ii) a memory interface configuration.
17. A machine readable data storage device comprising: a data storage medium adapted to store a plurality of data bits; and a computer program rendered as a plurality of data bits and stored on said data storage medium, said program being adapted to run on the processor of a computer system and synthesize integrated circuit logic for use in a processor having a pipeline, said processor logic further adapted to: detect a stalled instruction in a first stage of said pipeline; detect a valid instruction in a second stage of said pipeline; and continue execution of said valid instruction in said second stage while said first stage remains stalled.
18. A processor comprising: at least one pipeline having at least a first and second stage; means for detecting a stalled instruction in said first stage; means for detecting a valid instruction in said second stage; and means for executing said valid instruction in said second stage while said first stage remains stalled.
19. A digital processor comprising: a processor core having a multistage instruction pipeline having at least first, second, and third stages, said core being adapted to decode and execute an instruction set comprising a plurality of instruction words; a data interface between said processor core and an information storage device; and an instruction set comprising a plurality of instruction words, said processor and said instruction set further being adapted to: (i) detect a first instruction stalled in said second stage of said pipeline; (ii) detect when a valid instruction is present in said third stage of said pipeline; and (iii) execute said valid instruction in said third stage after said second stage has stalled.
20. The processor of Claim 19, said processor and said instruction set being further adapted to: (iv) detect a stalled instruction present in said third stage of said pipeline; (v) detect an unused slot between said third stage and an instruction present in said first stage of said pipeline; and (vi) process said instruction present in said first stage and advance said instruction to said second stage in order to eliminate said unused slot.
21. A digital processor comprising: a processor core having a multistage instruction pipeline having at least first, second, and third stages, said core being adapted to decode and execute an instruction set comprising a plurality of instruction words; a data interface between said processor core and an information storage device; and an instruction set comprising a plurality of instruction words, said processor and said instruction set further being adapted to: (i) detect a stalled instruction present in said third stage of said pipeline; (ii) detect an unused slot between said third stage and an instruction present in said first stage of said pipeline; (iii) process said instruction present in said first stage while said third stage remains stalled, to thereby eliminate said unused slot.
22. The processor of Claim 21, wherein said unused slot comprises a slot selected from the group comprising: (i) an empty slot; (ii) a slot containing a killed instruction; and (iii) a slot containing a long immediate word.
23. A digital processor having an associated data storage device and at least one pipeline including at least first, second, and third stages, wherein the execution of instructions within said at least one pipeline is controlled by the method comprising: providing an instruction set comprising a plurality of instruction words, storing at least a portion of said instruction set within said storage device; running at least a portion of said instruction set on said processor; detecting a first instruction stalled in said second stage of said pipeline; detecting when a valid instruction is present in said third stage of said pipeline; and executing said valid instruction in said third stage while maintaining said first instruction stalled in said second stage.
24. A method of operating a processor having a pipeline, said pipeline comprising at least a first stage, a second stage, and a third stage, comprising: providing an instruction within each stage of said pipeline; stalling an instruction in said second stage; processing an instruction within said third stage after said second stage has stalled; moving the processed instruction out of said third stage; and inserting a blank slot into said third stage of said pipeline to prevent the processed instruction present in said third stage from being executed multiple times.
Description:
METHOD AND APPARATUS FOR PROCESSOR PIPELINE SEGMENTATION AND RE-ASSEMBLY

This application claims priority to U.S. Provisional Patent Application Serial No. 60/134,253 filed May 13, 1999, entitled "Method And Apparatus For Synthesizing And Implementing Integrated Circuit Designs," and to co-pending U.S. Patent Application No. 09/418,663 filed October 14, 1999, entitled "Method And Apparatus For Managing The Configuration And Functionality Of A Semiconductor Design," which claims priority to U.S. Provisional Patent Application Serial No. 60/104,271 filed October 14, 1998, of the same title.

Background of the Invention

1. Field of the Invention

The present invention relates to the field of integrated circuit design, specifically to the use of a hardware description language (HDL) for implementing instructions in a pipelined central processing unit (CPU) or user-customizable microprocessor.

2. Description of Related Technology

RISC (or reduced instruction set computer) processors are well known in the computing arts. RISC processors generally have the fundamental characteristic of utilizing a substantially reduced instruction set as compared to non-RISC (commonly known as "CISC") processors. Typically, RISC processor machine instructions are not all microcoded, but rather may be executed immediately without decoding, thereby affording significant economies in terms of processing speed. This "streamlined" instruction handling capability furthermore allows greater simplicity in the design of the processor (as compared to non-RISC devices), thereby allowing smaller silicon and reduced cost of fabrication.

RISC processors are also typically characterized by (i) load/store memory architecture (i.e., only the load and store instructions have access to memory; other instructions operate via internal registers within the processor); (ii) unity of processor and compiler; and (iii) pipelining.

Pipelining is a technique for increasing the performance of a processor by dividing the sequence of operations within the processor into discrete components which are effectively executed in parallel when possible. In the typical pipelined processor, the arithmetic units associated with processor arithmetic operations (such as ADD, MULTIPLY, DIVIDE, etc.) are usually "segmented", so that a specific portion of the operation is performed in a given component of the unit during any clock cycle. Fig. 1 illustrates a typical processor architecture having such segmented arithmetic units. Hence, these units can operate on the results of a different calculation at any given clock cycle. As an example, in the first clock cycle two numbers A and B are fed to the multiplier unit 10 and partially processed by the first segment 12 of the unit. In the second clock cycle, the partial results from multiplying A and B are passed to the second segment 14 while the first segment 12 receives two new numbers (say C and D) to start processing. The net result is that after an initial startup period, one multiplication operation is performed by the arithmetic unit 10 every clock cycle.
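The throughput behavior described above can be sketched in a few lines of Python. This is only an illustrative model, not the circuit of Fig. 1: the function name `run_segmented_multiplier` is hypothetical, and the division of work between the two segments (the first latches the operands, the second produces the product) is an assumption made for simplicity.

```python
def run_segmented_multiplier(operand_pairs):
    """Model a two-segment multiplier, one loop iteration per clock cycle.

    Assumption: segment 1 only latches the operands and segment 2 finishes
    the multiplication. Returns (cycle, product) tuples, cycles from 1.
    """
    pending = list(operand_pairs)
    seg1 = None  # operands currently held in the first segment
    completed = []
    cycle = 0
    while pending or seg1 is not None:
        cycle += 1
        # On each clock edge the first segment hands its operands to the
        # second segment and accepts a new pair, if one is waiting.
        seg2 = seg1
        seg1 = pending.pop(0) if pending else None
        if seg2 is not None:
            a, b = seg2
            completed.append((cycle, a * b))
    return completed


if __name__ == "__main__":
    for cycle, product in run_segmented_multiplier([(2, 3), (4, 5), (6, 7)]):
        print(f"cycle {cycle}: product {product}")
```

After the one-cycle startup period, a product emerges on every cycle, so n multiplications take n + 1 cycles in this model rather than 2n.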

The depth of the pipeline may vary from one architecture to another. In the present context, the term "depth" refers to the number of discrete stages present in the pipeline. In general, a pipeline with more stages executes programs faster but may be more difficult to program if the pipeline effects are visible to the programmer. Most pipelined processors are either three stage (instruction fetch, decode, and execute) or four stage (such as instruction fetch, decode, operand fetch, and execute, or alternatively instruction fetch, decode/operand fetch, execute, and writeback), although more or fewer stages may be used.

Despite the aforementioned "segmentation" of operations within the processor, the instructions within the pipelines of prior art processors are generally contiguous. Specifically, instructions in one stage generally follow immediately after instructions in a later stage with a minimum of blank slots, NOP codes, or the like. Furthermore, when an instruction at a later stage is stalled (such as when an instruction in the execution stage is awaiting information from a fetch operation), the earlier and later stages of the pipeline are also stalled. In this manner, the pipeline tends to operate largely in "lock-step" fashion.

When developing the instruction set of a pipelined processor, several different types of "hazards" must be considered. For example, so-called "structural" or "resource contention" hazards arise from overlapping instructions competing for the same resources (such as busses, registers, or other functional units) which are typically resolved using one or more pipeline stalls. So-called "data" pipeline hazards occur in the case of read/write conflicts which may change the order of memory or register accesses. "Control" hazards are generally produced by branches or similar changes in program flow.

Interlocks are generally necessary with pipelined architectures to address many of these hazards. For example, consider the case where a following instruction (n+1) in an earlier pipeline stage needs the result of the instruction n from a later stage. A simple solution to the aforementioned problem is to delay the operand calculation in the instruction decoding phase by one or more clock cycles. A result of such delay, however, is that the execution time of a given instruction on the processor is in part determined by the instructions surrounding it within the pipeline. This complicates optimization of the code for the processor, since it is often difficult for the programmer to spot interlock situations within the code.

"Scoreboarding" may be used in the processor to implement interlocks; in this approach, a bit is attached to each processor register to act as an indicator of the register content; specifically, whether (i) the contents of the register have been updated and are therefore ready for use, or (ii) the contents are undergoing modification such as being written to by another process. This scoreboard is also used in some architectures to generate interlocks which prevent instructions which are dependent upon the contents of the scoreboarded register from executing until the scoreboard indicates that the register is ready. This type of approach is referred to as "hardware" interlocking, since the interlock is invoked purely through examination of the scoreboard via hardware within the processor. Such interlocks generate "stalls" which preclude the data dependent instruction from executing (thereby stalling the pipeline) until the register is ready.
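As a rough software analogy of the scoreboard just described, the following Python sketch keeps one bit per register and reports whether an instruction reading a given set of registers would have to stall. The class and method names (`Scoreboard`, `mark_pending`, `mark_ready`, `must_stall`) are hypothetical and chosen only for illustration; an actual processor implements this check in hardware.

```python
class Scoreboard:
    """One bit per register: True while the register awaits a new value."""

    def __init__(self, num_registers):
        self.pending = [False] * num_registers

    def mark_pending(self, reg):
        # Called when an instruction that will write `reg` is issued,
        # e.g. a load whose data has not yet returned from memory.
        self.pending[reg] = True

    def mark_ready(self, reg):
        # Called when the new value has been written to `reg`.
        self.pending[reg] = False

    def must_stall(self, source_regs):
        # A dependent instruction stalls while any register it reads
        # is still marked on the scoreboard.
        return any(self.pending[r] for r in source_regs)


if __name__ == "__main__":
    sb = Scoreboard(8)
    sb.mark_pending(3)            # load into r3 has been issued
    print(sb.must_stall([1, 3]))  # instruction reading r1 and r3
    sb.mark_ready(3)              # load data arrives
    print(sb.must_stall([1, 3]))
```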

Alternatively, NOPs (no-operation opcodes) may be inserted in the code so as to delay the appropriate pipeline stage when desired. This latter approach has been referred to as "software" interlocking, and has the disadvantage of increasing the code size and complexity of programs that employ instructions that require interlocking. Heavily software interlocked designs also tend not to be fully optimized in terms of their code structures.

Another important consideration in processor design is program branching or "jumps". All processors support some type of branching instructions. Simply stated, branching refers to the condition where program flow is interrupted or altered. Other operations such as loop setup and subroutine call instructions also interrupt or alter program flow in a similar fashion. The term "jump delay slot" is often used to refer to the slot within a pipeline subsequent to a branching or jump instruction being decoded. The instruction after the branch (or load) is executed while awaiting completion of the branch/load instruction. Branching may be conditional (i.e., based on the truth or value of one or more parameters) or unconditional. It may also be absolute (e.g., based on an absolute memory address), or relative (e.g., based on relative addresses and independent of any particular memory address).

Branching can have a profound effect on pipelined systems. By the time a branch instruction is inserted and decoded by the processor's instruction decode stage (indicating that the processor must begin executing a different address), the next instruction word in the instruction sequence has been fetched and inserted into the pipeline. One solution to this problem is to purge the fetched instruction word and halt or stall further fetch operations until the branch instruction has been executed, as illustrated in Fig. 2. This approach, however, by necessity results in the execution of the branch instruction in several instruction cycles, typically equal to the depth of the pipeline employed in the processor design. This result is deleterious to processor speed and efficiency, since other operations cannot be conducted by the processor during this period.

Alternatively, a delayed branch approach may be employed. In this approach, the pipeline is not purged when a branch instruction reaches the decode stage, but rather subsequent instructions present in the earlier stages of the pipeline are executed normally before the branch is executed. Hence, the branch appears to be delayed by the number of instruction cycles necessary to execute all subsequent instructions in the pipeline at the time the branch instruction is decoded. This approach increases the efficiency of the pipeline as compared to the multi-cycle branching described above, yet also increases the complexity (and reduces the ease of understanding by the programmer) of the underlying code.

Based on the foregoing, processor designers and programmers must carefully weigh the tradeoffs associated with utilizing hardware or software interlocks as opposed to a non-interlock architecture. Furthermore, the interaction of branching instructions (and delayed or multi-cycle branching) in the instruction set with the selected interlock scheme must be considered.

What is needed is an improved approach to pipeline operation and interlocking which optimizes processor pipeline performance while providing the programmer with additional flexibility of coding. Furthermore, as more pipeline stages (and even multiple multi-stage pipelines) are added to processor designs, the benefits of enhanced pipeline performance and code optimization within the processor increase manifold. Additionally, the ability to readily synthesize such improved pipelined processor designs in an application-specific manner, and using available synthesis tools, is of significant utility to the designer and programmer.

Summary of the Invention

The present invention satisfies the aforementioned needs by providing an improved method and apparatus for executing instructions within a pipelined processor architecture.

In a first aspect of the invention, an improved method of controlling the operation of one or more pipelines within a processor is disclosed. In one embodiment, a method of pipeline segmentation ("tearing") is disclosed whereby (i) instructions in stages prior to a stalled stage are also stalled, and (ii) instructions in stages subsequent to the stalled instruction are permitted to complete. Hence, a discontinuity or "tear" in the pipeline is purposely created. A blank slot (or NOP) is inserted into the subsequent stage of the pipeline to preclude the executed instruction present in the torn stage from being executed multiple times. Similarly, a method is disclosed which permits instructions otherwise stalled at earlier stages in the pipeline to be re-assembled with ("catch up" to) later stalled stages, thereby effectively repairing any tear or existing pipeline discontinuity.

In a second aspect of the invention, an improved method of synthesizing the design of an integrated circuit incorporating the aforementioned pipeline tearing and catch-up methods is disclosed. In one exemplary embodiment, the method comprises obtaining user input regarding the design configuration; creating customized HDL functional blocks based on the user's input and existing library of functions; determining the design hierarchy based on the user's input and the library and generating a hierarchy file, new library file, and makefile; running the makefile to create the structural HDL and scripts; running the generated scripts to create a makefile for the simulator and a synthesis script; and synthesizing the design based on the generated design and synthesis script.

In a third aspect of the invention, an improved computer program useful for synthesizing processor designs and embodying the aforementioned methods is disclosed. In one exemplary embodiment, the computer program comprises an object code representation stored on the magnetic storage device of a microcomputer, and adapted to run on the central processing unit thereof. The computer program further comprises an interactive, menu-driven graphical user interface (GUI), thereby facilitating ease of use.

In a fourth aspect of the invention, an improved apparatus for running the aforementioned computer program used for synthesizing logic associated with pipelined processors is disclosed. In one exemplary embodiment, the system comprises a stand-alone microcomputer system having a display, central processing unit, data storage device(s), and input device.

In a fifth aspect of the invention, an improved processor architecture utilizing the foregoing pipeline tearing and catch-up methodologies is disclosed. In one exemplary embodiment, the processor comprises a reduced instruction set computer (RISC) having a three stage pipeline comprising instruction fetch, decode, and execute stages which are controlled in part by the aforementioned pipeline tearing/catch-up modes. Synthesized gate logic, both constrained and unconstrained, is also disclosed.

Brief Description of the Drawings

Fig. 1 is a block diagram of a typical prior art processor architecture employing "segmented" arithmetic units.

Fig. 2 illustrates graphically the operation of a prior art four stage pipelined processor undergoing a multi-cycle branch operation.

Fig. 3 is a pipeline flow diagram illustrating the concept of "tearing" in a multi-stage pipeline according to the present invention.

Fig. 4 is a logical flow diagram illustrating the generalized methodology of controlling a pipeline using "tearing" according to the present invention.

Fig. 5 is a pipeline flow diagram illustrating the concept of "catch-up" in a multi-stage pipeline according to the present invention.

Fig. 6 is a logical flow diagram illustrating the generalized methodology of controlling a pipeline using "catch-up" according to the present invention.

Fig. 7 is a logical flow diagram illustrating the generalized methodology of synthesizing processor logic which incorporates pipeline tearing/catch-up modes according to the present invention.

Figs. 8a-8b are schematic diagrams illustrating one exemplary embodiment of gate logic implementing the pipeline "tearing" functionality of the invention (unconstrained and constrained, respectively), synthesized using the method of Fig. 7.

Figs. 8c-8d are schematic diagrams illustrating one exemplary embodiment of gate logic implementing the pipeline "catch-up" functionality of the invention (unconstrained and constrained, respectively), synthesized using the method of Fig. 7.

Fig. 9 is a block diagram of a processor design incorporating pipeline tearing/catch-up modes according to the present invention.

Fig. 10 is a functional block diagram of a computing device using a computer program incorporating the methodology of Fig. 7 to synthesize a pipelined processor design.

Detailed Description of the Invention

Reference is now made to the drawings wherein like numerals refer to like parts throughout.

As used herein, the term "processor" is meant to include any integrated circuit or other electronic device capable of performing an operation on at least one instruction word including, without limitation, reduced instruction set core (RISC) processors such as the ARC user-configurable core manufactured by the Assignee hereof, central processing units (CPUs), and digital signal processors (DSPs).

Additionally, it will be recognized by those of ordinary skill in the art that the term "stage" as used herein refers to various successive stages within a pipelined processor; i.e., stage 1 refers to the first pipelined stage, stage 2 to the second pipelined stage, etc. While the following discussion is cast in terms of a three stage pipeline (i.e., instruction fetch, decode, and execution stages), it will be appreciated that the methodology and apparatus disclosed herein are broadly applicable to processor architectures with one or more pipelines having more or fewer than three stages.

It is also noted that while the following description is cast in terms of VHSIC hardware description language (VHDL), other hardware description languages such as Verilog may be used to describe various embodiments of the invention with equal success.

Furthermore, while an exemplary Synopsys® synthesis engine such as the Design Compiler 1999.05 (DC99) is used to synthesize the various embodiments set forth herein, other synthesis engines such as Buildgates available from Cadence Design Systems, Inc., may be used. IEEE Std. 1076.3-1997, "IEEE Standard VHDL Synthesis Packages," specifies an industry-accepted language for specifying a Hardware Definition Language-based design and the synthesis capabilities that may be expected to be available to one of ordinary skill in the art.

Lastly, it will be recognized that while the following description illustrates specific embodiments of logic synthesized by the Assignee hereof using the aforementioned synthesis engine and VHSIC hardware description language, such specific embodiments being constrained in different ways, these embodiments are only exemplary and illustrative of the design process of the present invention.

Pipeline Segmentation ("Tearing")

The architecture of the present invention includes a generally free-flowing pipeline. If a stage in the pipeline is stalled, then the previous stages will also be stalled if they contain instructions. However, despite the stalling of these previous stages, there are several advantages to having later (i.e., "downstream") stages in the pipeline continue, if no interlocks are otherwise applied. These advantages include, inter alia, (i) continued processing of some instructions within the pipeline thereby leading to better processor performance as compared to "stalling" the entire pipeline; (ii) the ability to continue processing flag setting instructions located at later stages in the pipeline, thereby ensuring flags are set prior to execution of a jump or branch instruction whose execution may be affected by the status of the flags; and (iii) allowing a scoreboarded load instruction to issue a request to memory at a later stage of the pipeline whilst an instruction dependent on the result of the load is held at an earlier stage of the pipeline. The load must be allowed to issue, otherwise a deadlock situation would arise.

It is noted that with respect to the continued processing of flag setting instructions, Applicant's co-pending U.S. Patent Application entitled "Method And Apparatus For Jump Control In A Pipelined Processor" filed contemporaneously herewith, discloses a method and apparatus for interlocking flag setting instructions with subsequent jump/branch instructions which may be affected by flags set by the flag setting instruction.

As an example of the foregoing method, consider a processor with a three stage pipeline (fetch, decode, execute) in which an instruction is stalled at stage 2, but the instruction at stage 3 is allowed to "tear away" from the earlier stages and continue its journey down through the remaining stages of the pipeline. Fig. 3 graphically illustrates this principle (assuming no interlocks are applied).
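The three stage example can be mimicked with a short Python model. This is a behavioral sketch only and does not reproduce Fig. 3: the function name `clock`, the list representation of the pipeline, and the use of `None` for a blank slot are all assumptions introduced for illustration, and stage 3 is assumed never to be interlocked itself.

```python
def clock(pipeline, stage2_stalled, fetch=None):
    """Advance a three stage pipeline [fetch, decode, execute] by one cycle.

    `None` represents a blank slot. Returns (new_pipeline, retired), where
    `retired` is the instruction that completed in stage 3 this cycle.
    """
    s1, s2, s3 = pipeline
    retired = s3  # stage 3 is assumed free to complete on every cycle
    if stage2_stalled:
        # Tear: stages 1 and 2 hold their instructions, while stage 3
        # empties and is left holding a blank slot.
        return [s1, s2, None], retired
    # Normal flow: every instruction advances one stage.
    return [fetch, s1, s2], retired


if __name__ == "__main__":
    pipe = ["i3", "i2", "i1"]
    pipe, done = clock(pipe, stage2_stalled=True)
    print(pipe, done)  # i1 tears away and completes; i2 and i3 are held
    pipe, done = clock(pipe, stage2_stalled=True)
    print(pipe, done)  # nothing further retires while the stall persists
    pipe, done = clock(pipe, stage2_stalled=False, fetch="i4")
    print(pipe, done)  # the stall clears and the held instructions advance
```

The blank slot left in stage 3 is what keeps the torn-away instruction from being retired a second time on the following cycle.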

Referring now to Fig. 4, the method of controlling a multi-stage pipeline using the pipeline tearing concept of the present invention is described. The first step 402 of the method 400 comprises generating an instruction set comprising a plurality of instruction words to be run on the processor. This instruction set is typically stored in an on-chip program storage device (such as a program RAM or ROM memory) of the type well known in the art, although other types of device, including off-chip memory, may be used. The generation of the instruction set itself is also well known in the art, except to the extent that it is modified to include the pipeline tearing functionality, such modification being described in greater detail below.

Next, in step 404, the instruction set (program) is sequentially fetched from the storage device in the designated sequence by, inter alia, the program counter (PC) and run on the processor, with the fetched instructions being sequentially processed within the various stages of the pipeline. It is noted that in the context of a RISC processor, only load/store instructions may access program memory space, hence, a plurality of intermediate registers are employed in such processor to physically receive and hold instruction information fetched from the program memory. Such load/store architecture and use of register structures within a processor are well known in the art, and accordingly will not be described further herein.

In step 406, a stall condition in one stage of the pipeline is detected by logic blocks which combine signals to determine if a conflict is taking place, typically for access to a data value or other resource. An example of this is the detection of the condition when a register being read by an instruction register is marked as "scoreboarded", meaning that the processor must wait until the register is updated with a new value. Another example is when stall cycles are generated by a state machine whilst a multicycle operation (for example, a shift-and-add multiply) is carried out.

In step 408, the existence of a valid instruction in the N+1 stage of the pipeline (where N is the number of the stage in which the stall was invoked per step 406) is verified. In the present context, a "valid instruction" is one which is not marked as "invalid" for any reason (step 410), and which has successfully completed processing in the prior (Nth) stage (step 412).

For example, in one embodiment relevant to Applicant's ARC Core, the "p3iv" signal (i.e., "stage 3 instruction valid") is used to indicate that stage 3 of the pipeline contains a valid instruction. The instruction in stage 3 may not be valid for a number of reasons, including:

1. The instruction was marked as invalid when it moved into stage 2 (i.e., p2iv = '0') and therefore continues to be invalid when it has moved into stage 3; or

2. The instruction in stage 3 has been marked as invalid by the pipeline tearing logic on a previous cycle, but has not subsequently been replaced by an instruction moving into stage 3 from stage 2.

It is noted that the "STOP" condition resulting from step 410 arises where invalid = yes, since tearing only takes place when there are valid instructions in stages 2 and 3 at the same time.

Note that in the instance where the instruction present at stage 2 is determined in step 412 not to have been able to complete processing (Item 2. above), and the instruction at stage 3 is able to complete processing, it is necessary to allow the instruction at stage 3 to leave the pipeline (or move to the next stage) and mark stage 3 as being invalid to fill in the gap per step 414. An alternative method is to insert a NOP or other blank instruction into stage 3, and mark stage 3 as valid. If this blank is not inserted or the stage marked invalid, the instruction which was executed in stage 3 at the time the instruction in stage 2 could not complete processing will be executed again on the next instruction cycle, which is not desired.
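The need for the invalidation in step 414 can be illustrated with a small behavioral model. The following Python sketch is not part of the disclosed design; the function name `run_cycles`, the dictionary layout, and the instruction labels are hypothetical and chosen only for illustration. It assumes stage 2 remains stalled for every modeled cycle, so nothing new enters stage 3.

```python
def run_cycles(invalidate_on_tear, cycles=3):
    """Model stage 3 while stage 2 is stalled for `cycles` clock cycles.

    Stage 3 starts with a valid instruction 'A' that is able to complete.
    Returns the list of instructions executed by stage 3.
    """
    stage3 = {"instr": "A", "valid": True}
    executed = []
    for _ in range(cycles):
        if stage3["valid"]:
            executed.append(stage3["instr"])  # stage 3 completes its instruction
        # Stage 2 is stalled, so no instruction moves into stage 3 this cycle.
        if invalidate_on_tear:
            stage3["valid"] = False           # mark the vacated slot invalid (step 414)
    return executed

print(run_cycles(True))   # ['A']
print(run_cycles(False))  # ['A', 'A', 'A']
```

Without the invalidation, the model re-executes the same stage 3 instruction on every cycle of the stall, which is the undesired behavior described above.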

It is further noted that, with respect to the interlocks associated with the "v6" embodiment of Applicant's ARC Core, which is described in detail in Applicant's co-pending U.S. Patent Application entitled "Method And Apparatus For Jump Control In A Pipelined Processor" (referenced below), stage 2 of the pipeline could be stalled if a jump instruction was present and stage 3 contained a flag-setting instruction. Hence, the pipeline tearing functionality of the present invention is required for v6 jump interlocks.

Lastly, in step 418, the valid instruction present in stage 3 (and subsequent stages in a pipeline having four or more stages) is executed on the next clock cycle while maintaining the instruction present in stage 2 stalled in that stage. Note that on subsequent clock cycles, processing of the stalled instruction in stage 2 may occur, dependent on the status of the stall/interlock signal causing the stall. Once the stall/interlock signal is disabled, processing of the stalled instruction in that stage will proceed at the leading edge of the next instruction cycle.

The following exemplary code, extracted from Appendix I hereto, is used in conjunction with Applicant's ARC Core (three stage pipeline variant) to implement the "tearing" functionality previously described:

    n_p3iv <= ip3iv WHEN ien3 = '0' ELSE
              '0'   WHEN ien2 = '0' AND ien3 = '1' ELSE
              ip2iv;

    p3ivreg : PROCESS (ck, clr)
    BEGIN
        IF clr = '1' THEN
            ip3iv <= '0';
        ELSIF (ck'EVENT AND ck = '1') THEN
            ip3iv <= n_p3iv;
        END IF;
    END PROCESS;

It will be recognized, however, that coding schemes other than that presented herein, whether for the same or other processors, may be used to effectuate the pipeline tearing function of the present invention.
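The next-state selection performed by the exemplary tearing code can also be read as a three-way decision. The Python sketch below is a behavioral restatement only, not a replacement for the VHDL; the function name `next_p3iv` is hypothetical, and the reading of `ien2` and `ien3` as "stage 2 may advance" and "stage 3 may advance" is an assumption drawn from the surrounding description.

```python
def next_p3iv(ip3iv, ien2, ien3, ip2iv):
    """Return the value the stage 3 valid flag takes at the next clock edge.

    All arguments are 0 or 1 and mirror the signal names in the exemplary code.
    """
    if ien3 == 0:
        return ip3iv   # stage 3 is held, so its valid flag is unchanged
    if ien2 == 0:
        return 0       # stage 3 advances while stage 2 is stalled: tear, slot becomes invalid
    return ip2iv       # both stages advance: stage 3 inherits stage 2's valid flag

# Stage 2 stalled, stage 3 able to advance: the vacated slot is marked invalid.
print(next_p3iv(ip3iv=1, ien2=0, ien3=1, ip2iv=1))  # 0
```

The asynchronous clear and the clocked register of the original code are omitted here; only the combinational selection is modeled.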

Pipeline Re-Assembly ("Catch-up") When Stalled

In addition to the pipeline tearing concept described above, the present invention also employs mechanisms to address the reverse situation; i.e., allowing earlier stages of the pipeline to continue processing or "catch up" to the later stages when empty slots or spaces are present between the stages, or the pipeline has otherwise been "torn". This function is also known as "pipeline transition enable."

As an example of the foregoing concept, consider the instance in the aforementioned three stage pipeline where an instruction is stalled at stage 3, and stage 2 is empty or contains a killed instruction/long immediate word (hereinafter referred to as an "unused slot"). Using the catch-up function of the present invention, stage 1 is permitted to catch up to stage 2 on the clock edge by allowing continued processing of the stage 1 instruction until completion, at which point it is advanced into stage 2, and a new instruction is advanced into stage 1. Using this process, any empty slots or spaces between the stalled stage 3 and stage 1 are removed. Fig. 5 illustrates this concept graphically.
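The three stage example above can be traced with a short model. This Python sketch is illustrative only; the function name `catch_up_step`, the list representation of the pipeline, the use of `None` for an unused slot, and the instruction labels are all assumptions made for the example. It models a single clock edge during which stage 3 is stalled.

```python
def catch_up_step(stages, fetch_queue):
    """Advance one clock edge with stage 3 stalled.

    stages is [stage1, stage2, stage3]; None marks an unused slot.
    fetch_queue holds instructions waiting to enter stage 1.
    """
    s1, s2, s3 = stages
    if s2 is None:                                        # unused slot ahead of stage 1
        s2 = s1                                           # stage 1 catches up into stage 2
        s1 = fetch_queue.pop(0) if fetch_queue else None  # new instruction enters stage 1
    return [s1, s2, s3]                                   # stage 3 stays put while stalled

print(catch_up_step(["C", None, "A"], ["D"]))  # ['D', 'C', 'A']
print(catch_up_step(["C", "B", "A"], ["D"]))   # ['C', 'B', 'A']
```

In the first call the gap between stage 1 and the stalled stage 3 is closed; in the second there is no gap, so nothing moves.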

Referring now to Fig. 6, the method of controlling a multi-stage processor pipeline using the "catch-up" technique of the present invention is described. In a first step 602 of the method 600, the validity of the instruction at a first stage (stage 2 in the illustrated example) is determined. In the context of pipeline catch-up, a valid instruction is defined simply as one which has not been marked as invalid when it moved into its current stage (stage 2 in the present example). If the instruction is not valid per step 602, the pipeline transition enable signal is set "true" per step 610 as discussed in greater detail below. The pipeline transition enable signal controls the transition of an instruction word from stage 1 into stage 2. A pipeline "catch-up" would occur in this event if the instruction in stage 3 were not able to complete processing. The invalid slot in stage 2 would be replaced by an advancing instruction from stage 1, whilst the instruction at stage 3 would remain at stage 3.

If the instruction in stage 2 is valid per step 602, the ability of the valid instruction to complete processing in stage 2 is then determined in step 604. If the valid instruction cannot complete processing and move out of stage 2 on the next cycle, the transition enable signal is set "false" per step 606, thereby disabling the pipeline transition. This prevents the valid, pending instruction from being replaced by the advancing instruction from the prior stage (stage 1). If the valid instruction in stage 2 is capable of completing processing, it is next determined in step 608 whether an interrupt pseudo-instruction in stage 2 is waiting for a pending instruction fetch to complete. If so, then the transition enable signal is again set "false", thereby again precluding the valid instruction in stage 2 from being replaced, since the valid (yet uncompleted) instruction will not advance to stage 3 upon the next cycle. If the valid instruction in stage 2 is capable of completing on the next cycle, and is not waiting for a pending fetch, the transition enable signal is set to "true" per step 610, thereby permitting the stage 1 instruction to advance to stage 2 at the same time as the instruction in stage 2 moves into stage 3.

Hence, according to the foregoing logic, the pipeline transition enable signal is set "true" at all times when the processor is running except when: (i) a valid instruction in stage 2 cannot complete for some reason; or (ii) an interrupt in stage 2 is waiting for a pending instruction fetch to complete. It is noted that if an invalid instruction in stage 2 is held (due to, inter alia, a stall at stage 3) then the transition enable signal will be set "true" and allow the instruction in stage 1 to move into stage 2. Hence, the invalid stage 2 instruction will be replaced by the valid stage 1 instruction.
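The conditions summarized above reduce to a small truth function. The Python sketch below is a behavioral restatement under stated assumptions, not the disclosed implementation: the function name `transition_enable` is hypothetical, the argument names are borrowed from the signal names used in the document's exemplary VHDL, and their interpretation (`en` as "processor running", `p2int` as "interrupt in stage 2", `ip2iv` as "stage 2 instruction valid", `ien2` as "stage 2 may advance") is assumed from the surrounding text.

```python
def transition_enable(en, p2int, ip2iv, ien2):
    """Return 1 if the stage 1 instruction may move into stage 2, else 0."""
    if en == 0:
        return 0                   # processor is not running
    if p2int == 1 and ien2 == 0:
        return 0                   # interrupt in stage 2 is held waiting on a fetch
    if ip2iv == 1 and ien2 == 0:
        return 0                   # valid stage 2 instruction cannot complete
    return 1                       # otherwise the transition is enabled

# Invalid instruction held in stage 2 (for example, stage 3 is stalled):
# the slot may be overwritten, so the transition stays enabled.
print(transition_enable(en=1, p2int=0, ip2iv=0, ien2=0))  # 1
```

The last call corresponds to the catch-up case: because the held stage 2 slot is invalid, nothing of value is lost when the stage 1 instruction replaces it.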

The "catch-up" or pipeline transition enable signal (en1) of the present invention may be, in one embodiment, generated using the following exemplary code (extracted from Appendix II hereto):

    ien1 <= '0' WHEN en = '0' OR
                     (p2int = '1' AND ien2 = '0') OR
                     (ip2iv = '1' AND ien2 = '0') ELSE
            '1';

It is also noted that the pipeline tearing and catch-up methods of the present invention may be used in conjunction with (either alone or collectively) other methods of pipeline control and interlock including, inter alia, those disclosed in Applicant's co-pending U.S. Patent Application entitled "Method And Apparatus For Jump Control In A Pipelined Processor," as well as those disclosed in Applicant's co-pending U.S. Patent Application "Method And Apparatus For Jump Delay Slot Control In A Pipelined Processor," both filed contemporaneously herewith, both being incorporated by reference herein in their entirety.

Furthermore, various register encoding schemes, such as the "loose" register encoding described in Applicant's co-pending U.S. Patent Application entitled "Method and Apparatus for Loose Register Encoding Within a Pipelined Processor", filed contemporaneously herewith and incorporated by reference in its entirety herein, may be used in conjunction with the pipeline tearing and/or catch-up inventions described herein.

Method of Synthesizing

Referring now to Fig. 7, the method 700 of synthesizing logic incorporating the pipeline tearing and/or catch-up functionality previously discussed is described. The generalized method of synthesizing integrated circuit logic having a user-customized (i.e., "soft") instruction set is disclosed in Applicant's co-pending U.S. Patent Application Serial No. 09/418,663 entitled "Method And Apparatus For Managing The Configuration And Functionality Of A Semiconductor Design", filed October 14, 1999, which is incorporated herein by reference in its entirety.

While the following description is presented in terms of an algorithm or computer program running on a microcomputer or other similar processing device, it can be appreciated that other hardware environments (including minicomputers, workstations, networked computers,"supercomputers", and mainframes) may be used to practice the method. Additionally, one or more portions of the computer program may be embodied in hardware or firmware as opposed to software if desired, such alternate embodiments being well within the skill of the computer artisan.

Initially, user input is obtained regarding the design configuration in the first step 702. Specifically, desired modules or functions for the design are selected by the user, and instructions relating to the design are added, subtracted, or generated as necessary. For example, in signal processing applications, it is often advantageous for CPUs to include a single"multiply and accumulate" (MAC) instruction. In the present invention, the instruction set of the synthesized design is modified so as to incorporate the foregoing pipeline tearing and/or catch-up modes (or another comparable pipeline control architecture) therein. The technology library location for each VHDL file is also defined by the user in step 702. The technology library files in the present invention store all of the information related to cells necessary for the synthesis process, including for example logical function, input/output timing, and any associated constraints. In the present invention, each user can define his/her own library name and location (s), thereby adding further flexibility.

Next, in step 703, the user creates customized HDL functional blocks based on the user's input and the existing library of functions specified in step 702.

In step 704, the design hierarchy is determined based on the user's input and the aforementioned library files. A hierarchy file, new library file, and makefile are subsequently generated based on the design hierarchy. The term "makefile" as used herein refers to the commonly used UNIX makefile function or similar function of a computer system well known to those of skill in the computer programming arts. The makefile function causes other programs or algorithms resident in the computer system to be executed in the specified order. In addition, it further specifies the names or locations of data files and other information necessary to the successful operation of the specified programs. It is noted, however, that the invention disclosed herein may utilize file structures other than the "makefile" type to produce the desired functionality.

In one embodiment of the makefile generation process of the present invention, the user is interactively asked via display prompts to input information relating to the desired design such as the type of "build" (e.g., overall device or system configuration), width of the external memory system data bus, different types of extensions, cache type/size, etc. Many other configurations and sources of input information may be used, however, consistent with the invention.

In step 706, the user runs the makefile generated in step 704 to create the structural HDL. This structural HDL ties the discrete functional blocks in the design together so as to make a complete design.

Next, in step 708, the script generated in step 706 is run to create a makefile for the simulator. The user also runs the script to generate a synthesis script in step 708.

At this point in the program, a decision is made whether to synthesize or simulate the design (step 710). If simulation is chosen, the user runs the simulation using the generated design and simulation makefile (and user program) in step 712. Alternatively, if synthesis is chosen, the user runs the synthesis using the synthesis script(s) and generated design.

After completion of the synthesis/simulation scripts, the adequacy of the design is evaluated in step 716. For example, a synthesis engine may create a specific physical layout of the design that meets the performance criteria of the overall design process yet does not meet the die size requirements. In this case, the designer will make changes to the control files, libraries, or other elements that can affect the die size. The resulting set of design information is then used to re-run the synthesis script.

If the generated design is acceptable, the design process is completed. If the design is not acceptable, the process steps beginning with step 702 are re-performed until an acceptable design is achieved. In this fashion, the method 700 is iterative.

Referring now to Figs. 8a-8b, one embodiment of exemplary gate logic (including the "p3iv" signal referenced in the VHDL of Appendix I) synthesized using the aforementioned Synopsys Design Compiler and methodology of Fig. 7 is illustrated.

Note that during the synthesis process used to generate the logic of Fig. 8a, an LSI 10k 1.0 um process was specified, and no constraints were placed on the design. With respect to the logic of Fig. 8b, the same process was used; however, the design was constrained on the path from ien3 to the clock. Appendix III contains the coding used to generate the exemplary logic of Figs. 8a-8b.

Referring to Figs. 8c-8d, one embodiment of exemplary gate logic (including the "ien1" signal referenced in the VHDL of Appendix II) synthesized using the methodology of Fig. 7 is illustrated. Note that during the synthesis process used to generate the logic of Fig. 8c, an LSI 10k 1.0 um process was specified, and no constraints were placed on the design. With respect to the logic of Fig. 8d, the same process was used; however, the design was constrained to preclude the use of AND-OR gates. Appendix IV contains the coding used to generate the exemplary logic of Figs. 8c-8d.

Fig. 9 illustrates an exemplary pipelined processor fabricated using a 1.0 um process and incorporating the pipeline tearing and catch-up modes previously described herein. As shown in Fig. 9, the processor 900 is an ARC microprocessor-like CPU device having, inter alia, a processor core 902, on-chip memory 904, and an external interface 906.

The device is fabricated using the customized VHDL design obtained using the method 700 of the present invention, which is subsequently synthesized into a logic level representation, and then reduced to a physical device using compilation, layout and fabrication techniques well known in the semiconductor arts.

It will be appreciated by one skilled in the art that the processor of Figure 9 may contain any commonly available peripheral such as serial communications devices, parallel ports, timers, counters, high current drivers, analog to digital (A/D) converters, digital to analog converters (D/A), interrupt processors, LCD drivers, memories and other similar devices. Further, the processor may also include custom or application specific circuitry.

The present invention is not limited to the type, number or complexity of peripherals and other circuitry that may be combined using the method and apparatus. Rather, any limitations are imposed by the physical capacity of the extant semiconductor processes which improve over time. Therefore it is anticipated that the complexity and degree of integration possible employing the present invention will further increase as semiconductor processes improve.

It is also noted that many IC designs currently use a microprocessor core and a DSP core. The DSP however, might only be required for a limited number of DSP functions (such as for finite impulse response analysis or speech encoding), or for the IC's fast DMA architecture. The invention disclosed herein can support many DSP instruction functions, and its fast local RAM system gives immediate access to data. Appreciable cost savings may be realized by using the methods disclosed herein for both the CPU & DSP functions of the IC.

Additionally, it will be noted that the methodology (and associated computer program) as previously described herein can readily be adapted to newer manufacturing technologies, such as 0.18 or 0.1 micron processes, with a comparatively simple re-synthesis instead of the lengthy and expensive process typically required to adapt such technologies using "hard" macro prior art systems.

Referring now to Fig. 10, one embodiment of a computing device capable of synthesizing the logic associated with the tearing/catch-up signals disclosed herein is described. The computing device 1000 comprises a motherboard 1001 having a central processing unit (CPU) 1002, random access memory (RAM) 1004, and memory controller 1005. A storage device 1006 (such as a hard disk drive or CD-ROM), input device 1007 (such as a keyboard or mouse), and display device 1008 (such as a CRT, plasma, or TFT display), as well as buses necessary to support the operation of the host and peripheral components, are also provided. The aforementioned VHDL descriptions and synthesis engine are stored in the form of an object code representation of a computer program in the RAM 1004 and/or storage device 1006 for use by the CPU 1002 during design synthesis, the latter being well known in the computing arts. The user (not shown) synthesizes logic designs by inputting design configuration specifications into the synthesis program via the program displays and the input device 1007 during system operation. Synthesized designs generated by the program are stored in the storage device 1006 for later retrieval, displayed on the graphic display device 1008, or output to an external device such as a printer, data storage unit, or other peripheral component via a serial or parallel port 1012 if desired.

While the above detailed description has shown, described, and pointed out novel features of the invention as applied to various embodiments, it will be understood that various omissions, substitutions, and changes in the form and details of the device or process illustrated may be made by those skilled in the art without departing from the invention. The foregoing description is of the best mode presently contemplated of carrying out the invention. This description is in no way meant to be limiting, but rather should be taken as illustrative of the general principles of the invention. The scope of the invention should be determined with reference to the claims.

APPENDIX I - VHDL USED TO CREATE SYNTHESISED LOGIC FOR PIPELINE TEARING

    library ieee;
    use ieee.std_logic_1164.all;

    entity v007a is
        port (ck    : in  std_ulogic;
              clr   : in  std_ulogic;
              ien2  : in  std_ulogic;
              ien3  : in  std_ulogic;
              ip2iv : in  std_ulogic;
              p3iv  : out std_ulogic);
    end v007a;

    architecture synthesis of v007a is
        signal n_p3iv : std_ulogic;
        signal ip3iv  : std_ulogic;
    begin
        n_p3iv <= ip3iv WHEN ien3 = '0' ELSE
                  '0'   WHEN ien2 = '0' AND ien3 = '1' ELSE
                  ip2iv;

        p3ivreg : PROCESS (ck, clr)
        BEGIN
            IF clr = '1' THEN
                ip3iv <= '0';
            ELSIF (ck'EVENT AND ck = '1') THEN
                ip3iv <= n_p3iv;
            END IF;
        END PROCESS;

        p3iv <= ip3iv;
    end synthesis;

APPENDIX II - VHDL USED TO CREATE SYNTHESISED LOGIC FOR PIPELINE CATCH-UP

    library ieee;
    use ieee.std_logic_1164.all;

    entity v007b is
        port (en    : in  std_ulogic;
              p2int : in  std_ulogic;
              ien2  : in  std_ulogic;
              ip2iv : in  std_ulogic;
              ien1  : out std_ulogic);
    end v007b;

    architecture synthesis of v007b is
    begin
        ien1 <= '0' WHEN en = '0' OR
                         (p2int = '1' AND ien2 = '0') OR
                         (ip2iv = '1' AND ien2 = '0') ELSE
                '1';
    end synthesis;

APPENDIX III - SYNTHESIS SCRIPT USED TO CREATE EXEMPLARY SCHEMATICS FOR TEARING LOGIC

    /* Analyze VHDL */
    analyze -library user -format vhdl vhdl/v007a.vhdl

    /* Unconstrained logic */
    elaborate -library user v007a
    compile
    write -format db -hierarchy -output db/v007auc.db
    create_schematic -schematic_view
    plot -output v007auc.ps
    remove_design -all

    /* Constrained logic */
    elaborate -library user v007a
    create_clock -name "ck" -period 10 -waveform {0 5} ck
    set_input_delay -clock ck 8 ien3
    compile
    write -format db -hierarchy -output db/v007ac.db
    create_schematic -schematic_view
    plot -output v007ac.ps

APPENDIX IV - SYNTHESIS SCRIPT USED TO CREATE EXEMPLARY SCHEMATICS FOR CATCH-UP LOGIC

    /* Analyze VHDL */
    analyze -library user -format vhdl vhdl/v007b.vhdl

    /* Unconstrained logic */
    elaborate -library user v007b
    compile
    write -format db -hierarchy -output db/v007buc.db
    create_schematic -schematic_view
    plot -output v007buc.ps
    remove_design -all

    /* Constrained logic */
    elaborate -library user v007b
    set_max_area 0
    set_dont_use find(cell, lsi_10k/AO*)
    compile -map_effort high
    write -format db -hierarchy -output db/v007bc.db
    create_schematic -schematic_view
    plot -output v007bc.ps