

Title:
METHOD AND SYSTEM FOR EMULATING A FLOATING-POINT UNIT
Document Type and Number:
WIPO Patent Application WO/2023/035053
Kind Code:
A1
Abstract:
Systems and methods for emulating a floating-point unit are disclosed. The method receives one or more floating-point operands having a first floating-point format. Each of the one or more floating-point operands having the first floating-point format is converted into a first set of integers having the first floating-point format. Further, each of the first set of integers is converted into a second set of integers having a second floating-point format that is different from the first floating-point format. The first set of integers and the second set of integers each has a defined bit length depending on the respective floating-point format. Lastly, the method performs computations for a task using each of the second set of integers to emulate computations performed by the floating-point unit using the one or more floating-point operands having the second floating-point format.

Inventors:
GHAFFARI SEYED ALIREZA (CA)
WU WEI HSIANG (CN)
PARTOVI NIA VAHID (CA)
Application Number:
PCT/CA2021/051241
Publication Date:
March 16, 2023
Filing Date:
September 08, 2021
Assignee:
GHAFFARI SEYED ALIREZA (CA)
WU WEI HSIANG (CN)
PARTOVI NIA VAHID (CA)
HUAWEI TECH CO LTD (CN)
International Classes:
G06F9/455; G06F7/483; G06F9/302; G06N3/08
Foreign References:
US20190340499A12019-11-07
US10574260B22020-02-25
US11043962B22021-06-22
CN110555508A2019-12-10
US20210109709A12021-04-15
US20210208881A12021-07-08
JP2006318382A2006-11-24
Other References:
BENOIT JACOB ET AL: "Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference", ARXIV, 15 December 2017 (2017-12-15), pages 1 - 14, XP002798211, Retrieved from the Internet [retrieved on 20200310]
Attorney, Agent or Firm:
RIDOUT & MAYBEE LLP et al. (CA)
Claims:
CLAIMS

1- A computer-implemented method for emulating a floating-point unit, comprising: receiving one or more floating-point operands having a first floating-point format; converting each of the one or more floating-point operands having the first floating-point format into a first set of integers having the first floating-point format; converting each of the first set of integers into a second set of integers having a second floating-point format that is different from the first floating-point format, wherein the first set of integers and the second set of integers each has a defined bit length depending on respective floating-point format; and performing computations for a task using each of the second set of integers to emulate computations performed by the floating-point unit using the one or more floating-point operands having the second floating-point format.

2- The method of claim 1, wherein a total bit length defined by the first floating-point format and the second floating-point format is different.

3- The method of any of claims 1 or 2, wherein the task is for training a deep learning model.

4- The method of claim 3, further comprising: repeating the converting of each of the one or more floating-point operands into the first set of integers, converting of each of the first set of integers into the second set of integers, and performing computations for a plurality of deep learning sessions of training, wherein for each session of training a different second floating-point format is used; selecting one of the different second floating-point formats as a final second floating-point format; and emulating a further floating-point unit for processing operands that are formatted according to the final second floating-point format.

5- The method of claim 4, further comprising evaluating numerical stability of the training of the deep learning model using the one or more floating-point operands having the different second floating-point formats.

6- The method of any of claims 1 to 5, wherein computations are performed in parallel.

7- The method of any of claims 1 to 6, further comprising fabricating into a hardware component the emulated floating-point unit for performing operations using the second floating-point format.

8- The method of any of claims 1 to 7, wherein converting each of the first set of integers into the respective second set of integers comprises a rounding operation using one of round truncate, round to odd, round to even, round toward zero, round away from zero, round toward infinity, and stochastic rounding.

9- The method of any of claims 1 to 8, wherein the first floating-point format is based on one of the formats described in the IEEE754 standard.

10- The method of any of claims 1 to 9, wherein each of the second set of integers comprises: a first integer value representing a sign value of the respective floating-point operand; a second integer value representing an exponent value of the respective floating-point operand; and a third integer value representing a fraction value of the respective floating-point operand.

11- A system for emulating a floating-point unit comprising: a processor; and a memory storing instructions which, when executed by the processor, cause the system to: receive one or more floating-point operands having a first floating-point format; convert each of the one or more floating-point operands having the first floating-point format into a first set of integers having the first floating-point format; convert each of the first set of integers into a second set of integers having a second floating-point format that is different from the first floating-point format, wherein the first set of integers and the second set of integers each has a defined bit length depending on respective floating-point format; and perform computations for a task using each of the second set of integers to emulate computations performed by the floating-point unit using the one or more floating-point operands having the second floating-point format.

12- The system of claim 11, wherein a total bit length defined by the first floating-point format and the second floating-point format is different.

13- The system of any of claims 11 or 12, wherein the task is for training a deep learning model.

14- The system of claim 13, further comprising instructions which, when executed by the processor, cause the system to: repeat the converting of each of the one or more floating-point operands into the first set of integers, converting of each of the first set of integers into the second set of integers, and performing computations for a plurality of deep learning sessions of training, wherein for each session of training a different second floating-point format is used; select one of the different second floating-point formats as a final second floating-point format; and emulate a further floating-point unit for processing operands that are formatted according to the final second floating-point format.

15- The system of claim 14, further comprising instructions which, when executed by the processor, cause the system to evaluate numerical stability of the training of the deep learning model using the one or more floating-point operands having the different second floating-point formats.

16- The system of any of claims 11 to 15, further comprising fabricating into a hardware component the emulated floating-point unit for performing operations using the second floating-point format.

17- The system of any of claims 11 to 16, wherein converting each of the first set of integers into the respective second set of integers comprises a rounding operation using one of round truncate, round to odd, round to even, round toward zero, round away from zero, round toward infinity, and stochastic rounding.

18- The system of any of claims 11 to 17, wherein the first floating-point format is based on one of the formats described in the IEEE754 standard.

19- The system of any of claims 11 to 18, wherein each of the second set of integers comprises: a first integer value representing a sign value of the respective floating-point operand; a second integer value representing an exponent value of the respective floating-point operand; and a third integer value representing a fraction value of the respective floating-point operand.

20- A non-transitory machine readable medium having tangibly stored thereon executable instructions for execution by a processor, wherein the executable instructions, when executed by the processor, cause the processor to perform the method of any one of claims 1 to 10.


Description:
METHOD AND SYSTEM FOR EMULATING A FLOATING-POINT UNIT

Technical Field

[0001] The present disclosure relates to emulating hardware in software, in particular methods and systems for emulating a floating-point unit.

Background

[0002] Personal computers, cars, mobile devices, IoT devices, and electrical appliances that process data are a few examples of devices that use processing units. Processing units perform a multitude of computations on operands for the task they are used for. Central Processing Units (CPUs) and Microcontroller Units (MCUs) are examples of processing units. CPUs are the backbone of modern data centers, which provide essential services such as the Internet, cloud services, and mobile networks. MCUs are critical for processing data on Internet of Things (IoT) devices and smart sensors.

Other types of processing units exist that provide specific functionalities to the CPU. For instance, Graphical Processing Units (GPUs) are widely used to provide extra computing power to CPUs. Other notable processing units are Digital Signal Processors (DSPs), which are used in the telecommunication industry.

[0003] Designing processing units that perform computations efficiently is essential for today's industrial and research tasks. However, processing units are usually generic and used for various tasks, including machine learning algorithms. Also, the processing units may have high processing power, which may be correlated with high energy consumption.

[0004] Processing units include several arithmetic units responsible for performing computations on operands, such as arithmetic and logic units (ALUs) and floating-point units (FPUs). However, some industrial and research tasks, including the ones that use machine learning algorithms, may not need the full processing power of these processing units. In particular, these industrial and research tasks may not require the full computing power of the arithmetic units. As such, customized arithmetic units may be more efficient for performing operations than using the full power of the arithmetic units. Consider a machine learning task as an example: different machine learning algorithms for different tasks (e.g., face detection, face recognition, image segmentation) may require different customized arithmetic units.

[0005] Therefore, verifying the performance of customized arithmetic units before fabricating them in hardware (i.e., a processing unit) is advantageous and can be essential to hardware designers. For instance, hardware designers can study the customized arithmetic behavior and properties before fabrication.

[0006] As data processing becomes more ubiquitous, there is a need for more energy-efficient processing units, which can be useful for battery-powered data centers and mobile devices such as cell phones, laptops, and tablets, where the energy efficiency of the processor can directly affect the battery life.

[0007] Thus, there is a need for systems and methods for emulating arithmetic units, such as floating-point units, in software.

Summary

[0008] In various examples, the present disclosure describes methods and systems for emulating a floating-point unit that operates on data having a custom floating-point format that departs from the generally available floating-point formats described in the IEEE754 standard. Upon successful experimentation of the emulated floating-point unit for a task, the emulated floating-point unit may be fabricated in a processing unit to operate on data having the custom floating-point format.

[0009] A floating-point unit is emulated using an emulation engine comprising a software library, a control unit, and at least one computation module. The computation module includes at least one computation unit configured for performing computations using data having the custom floating-point format. Further, the control unit controls the sequence of computations needed to be performed by the computation unit.

Additionally, the software library implements processes of a high-level language that performs various computations, such as training a deep neural network. Thus, the software library interfaces with the control unit, which controls the computation units that perform emulated floating-point computations for the tasks.

[0010] An example embodiment is a computer-implemented method for emulating a floating-point unit. The method may receive one or more floating-point operands having a first floating-point format. The method may convert each of the one or more floating-point operands having the first floating-point format into a first set of integers having the first floating-point format. Also, the method may convert each of the first set of integers into a second set of integers having a second floating-point format that is different from the first floating-point format. The first set of integers and the second set of integers each has a defined bit length depending on the respective floating-point format. Additionally, the method may perform computations for a task using each of the second set of integers to emulate computations performed by the floating-point unit using the one or more floating-point operands having the second floating-point format.

[0011] In another example embodiment of the method, a total bit length defined by the first floating-point format and the second floating-point format is different. In another example embodiment of the method, the task is for training a deep learning model.

[0012] In another example embodiment, the method further comprises repeating the converting of each of the one or more floating-point operands into the first set of integers, converting of each of the first set of integers into the second set of integers, and performing computations for a plurality of deep learning sessions of training. For each session of training a different second floating-point format may be used. Also, the method may select one of the different second floating-point formats as a final second floating-point format and may emulate a further floating-point unit for processing operands that are formatted according to the final second floating-point format.

[0013] In another example embodiment, the method further comprises evaluating numerical stability of the training of the deep learning model using the one or more floating-point operands having the different second floating-point formats. In another example embodiment of the method, computations may be performed in parallel.

[0014] In another example embodiment, the method further comprises fabricating into a hardware component the emulated floating-point unit for performing operations using the second floating-point format.

[0015] In another example embodiment, the method may convert each of the first set of integers into the respective second set of integers using a rounding operation of one of round truncate, round to odd, round to even, round toward zero, round away from zero, round toward infinity, and stochastic rounding.

[0016] In another example embodiment, the first floating-point format may be based on one of the formats described in the IEEE754 standard. In another example embodiment, each of the second set of integers may comprise a first integer value representing a sign value of the respective floating-point operand, a second integer value representing an exponent value of the respective floating-point operand, and a third integer value representing a fraction value of the respective floating-point operand.

[0017] Another example embodiment is a system for emulating a floating-point unit. The system comprises a processor, and a memory storing instructions which, when executed by the processor, cause the system to receive one or more floating-point operands having a first floating-point format. Further, the instructions may cause the system to convert each of the one or more floating-point operands having the first floating-point format into a first set of integers having the first floating-point format. The instructions may also cause the system to convert each of the first set of integers into a second set of integers having a second floating-point format that is different from the first floating-point format. The first set of integers and the second set of integers each has a defined bit length depending on the respective floating-point format. The instructions may also cause the system to perform computations for a task using each of the second set of integers to emulate computations performed by the floating-point unit using the one or more floating-point operands having the second floating-point format.

[0018] In another example embodiment, a total bit length defined by the first floating-point format and the second floating-point format is different. In another example embodiment of the system, the task is for training a deep learning model.

[0019] In another example embodiment, the system may comprise instructions which, when executed by the processor, cause the system to repeat the converting of each of the one or more floating-point operands into the first set of integers, converting of each of the first set of integers into the second set of integers, and performing computations for a plurality of deep learning sessions of training. For each session of training, a different second floating-point format is used. Also, the instructions may cause the system to select one of the different second floating-point formats as a final second floating-point format, and emulate a further floating-point unit for processing operands that are formatted according to the final second floating-point format.

[0020] In another example embodiment, the system may comprise instructions which, when executed by the processor, cause the system to evaluate numerical stability of the training of the deep learning model using the one or more floating-point operands having the different second floating-point formats.

[0021] In another example embodiment, the emulated floating-point unit for performing operations using the second floating-point format may be fabricated into a hardware component.

[0022] In another example embodiment, the system may convert each of the first set of integers into the respective second set of integers using a rounding operation of one of round truncate, round to odd, round to even, round toward zero, round away from zero, round toward infinity, and stochastic rounding.

[0023] In another example embodiment, the first floating-point format may be based on one of the formats described in the IEEE754 standard.

[0024] In another example embodiment, each of the second set of integers comprises a first integer value representing a sign value of the respective floating-point operand, a second integer value representing an exponent value of the respective floating-point operand, and a third integer value representing a fraction value of the respective floating-point operand.

[0025] Another example embodiment is a non-transitory machine readable medium having tangibly stored thereon executable instructions for execution by a processor, wherein the executable instructions, when executed by the processor, cause the processor to perform any one of the method embodiments above.

[0026] Brief Description of the Drawings

[0027] FIG. 1 is an illustrative example of one structure described in the IEEE754 standard commonly used in computing devices, specifically with floating-point processing units, to represent floating-point numbers, in accordance with an example embodiment.

[0028] FIG. 2 is a block diagram illustrating an example computing device that can be employed to implement the methods and systems disclosed herein in accordance with an example embodiment.

[0029] FIG. 3 is a block diagram for an emulation engine illustrating its modules in accordance with an example embodiment.

[0030] FIG. 4 is a schematic of an example emulation engine illustrating operation and data flow in the emulation engine in accordance with an example embodiment.

[0031] FIG. 5 is an example algorithm illustrating an addition computation between two sets of integer values unpacked from floating-point values using a single computation unit.

[0032] FIG. 6 is a flowchart of an example deep neural network training method using the emulation engine in accordance with an example embodiment.

[0033] FIG. 7 is a flowchart of an example method for emulating a floating-point unit in accordance with an example embodiment.

[0034] Similar reference numerals may have been used in different figures to denote similar components.

Description of Example Embodiments

[0035] A floating-point unit (FPU), which is a hardware component in a processing unit, performs arithmetic computations on floating-point operands. The computations performed by an FPU on floating-point operands are traditionally based on one of the floating-point formats of the IEEE754 standard, described in IEEE Computer Society, "IEEE Standard for Floating-Point Arithmetic," IEEE Std 754-2008 (2008): 1-70. The floating-point formats of the IEEE754 standard usually represent a floating-point number with high precision using, for example, 32 bits. However, many tasks, including tasks for machine learning, which perform computations on floating-point operands, do not need high precision. Instead, a custom FPU designed to operate on operands having a custom floating-point format, which may result in less precision than floating-point numbers having a format of the IEEE754 standard, may be necessary.

[0036] The following terms are used in the disclosure. The term "FPU" refers to a hardware component of a processing unit, while the "emulated FPU" refers to a software version of the FPU.

[0037] The present disclosure describes methods and systems for emulating an FPU in software. The emulated FPU performs computations on data having a custom floating-point format that deviates from the floating-point formats of the IEEE754 standard. After the emulated FPU has been successfully evaluated while performing the computations of a task, the emulated FPU is fabricated into a processing unit as a hardware component. In example embodiments, the FPU fabricated from the emulated FPU may replace the FPU that operates on a floating-point format of the IEEE754 standard.

[0038] The accuracy of some of the floating-point formats in the IEEE754 standard may not be needed. For example, to perform inference of a deep learning model, data represented using values in 8-bit integer format are accurate enough for most applications, as described in Naigang Wang, Jungwook Choi, Daniel Brand, Chia-Yu Chen, and Kailash Gopalakrishnan, "Training deep neural networks with 8-bit floating point numbers," in Advances in Neural Information Processing Systems, pages 7675-7684, 2018.

[0039] Some well-known deep learning frameworks used to train deep learning models, such as PyTorch, described in Paszke, Adam, et al., "PyTorch: An imperative style, high-performance deep learning library," arXiv preprint arXiv:1912.01703 (2019), support data represented using values in a 16-bit floating-point format such as bfloat16, discussed in Kalamkar, Dhiraj, et al., "A study of BFLOAT16 for deep learning training," arXiv preprint arXiv:1905.12322 (2019).

[0040] Furthermore, IBM has also proposed using data represented by an 8-bit floating-point format to train neural networks, as described in Naigang Wang, Jungwook Choi, Daniel Brand, Chia-Yu Chen, and Kailash Gopalakrishnan, "Training deep neural networks with 8-bit floating-point numbers," in Advances in Neural Information Processing Systems, pages 7675-7684, 2018, and in Sun, Xiao, et al., "Hybrid 8-bit floating point (HFP8) training and inference for deep neural networks," Advances in Neural Information Processing Systems 32 (2019): 4900-4909.

[0041] Low-precision training (i.e., using 8-bit formats instead of the formats described in the IEEE754 standard) of deep learning models was proposed in prior works to address the energy efficiency problem of deep learning models. More specifically, when the floating-point format representing a floating-point operand becomes smaller (e.g., a 32-bit single-precision number may be reduced to a floating-point format of 16 bits or 8 bits), the amount of data read from and written to memory by the processing unit decreases. Thus, reducing the number of operations and the amount of data moved saves energy and yields a more efficient deep learning model. As a result, custom FPUs can be designed and fabricated to consume less energy when performing computations on data having a custom floating-point format.

[0042] FIG. 1 is an illustrative example of a format of the IEEE754 standard commonly used to represent floating-point numbers in computing devices having processing units that include an FPU. This example embodiment illustrates a single-precision floating-point format of the IEEE754 standard. This example is one of the structures described in the IEEE754 standard. The single-precision floating-point format representation includes 1 bit reserved for the sign value of a floating-point number 102, 8 bits reserved for the exponent value of the floating-point number 106, and 23 bits reserved for the fraction value of the floating-point number. The fraction value is also referred to as mantissa value 108. The terms fraction value and mantissa value may be used interchangeably throughout this disclosure.

[0043] Furthermore, there is an exponent bias, which is a number considered in the arithmetic calculations by the processor. As a result, a floating-point number having the floating-point format of FIG. 1 comprises three binary strings: fraction, exponent, and sign. It is apparent to a person skilled in the art that a string of binary numbers can be represented as a decimal number. Therefore, each binary string (104, 106, and 108) can be represented as a decimal number. Hence, a floating-point number can be represented by three integer values.

[0044] For example, the floating-point number 1.984 can be represented by a set of integers comprising: a sign value representing a positive number (perhaps +1), exponent value of -3, and fraction value of 1984.
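To make the bit-level view of paragraph [0043] concrete, the following Python sketch (an illustration only, not part of the disclosed method) unpacks an IEEE754 single-precision value into the three integers of FIG. 1: the sign bit 102, the biased exponent 106, and the fraction (mantissa) bits 108.

```python
import struct

def unpack_float32(x: float):
    """Unpack an IEEE754 single-precision value into (sign, exponent, fraction) integers.

    The layout follows FIG. 1: 1 sign bit, 8 exponent bits (biased by 127),
    and 23 fraction (mantissa) bits.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # raw 32-bit pattern
    sign = (bits >> 31) & 0x1          # 1 bit
    exponent = (bits >> 23) & 0xFF     # 8 bits, stored with a bias of 127
    fraction = bits & 0x7FFFFF         # 23 bits
    return sign, exponent, fraction

# Example: sign 0, biased exponent 127 (i.e. 2**0), fraction bits encoding 0.984
print(unpack_float32(1.984))
```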

[0045] Any deviation from the described structures (formats) specified in the IEEE754 standard (i.e., bits reserved for a sign, exponent, and fraction) is considered a custom floating-point format, and for such a format, a custom FPU may be used. This custom FPU is designed to perform arithmetic based on the custom floating-point format. The custom floating-point format may change the precision and range of the floating value since it is represented differently. Thus, this disclosure provides a computer-aided design (CAD) tool to emulate an FPU that performs computations on a custom floating-point format before designing such an FPU as hardware.

[0046] FIG. 2 is a block diagram illustrating a computing device 200 in which an FPU may be emulated. The computing device 200 may be an individual physical computer, multiple physical computers such as a server, a virtual machine, or multiple virtual machines. Dashed blocks represent optional components. The computing device 200 is configured for FPU emulation. Other computing devices suitable for implementing examples described in the present disclosure may be used, which may include components different from those discussed below. Although FIG. 2 shows a single instance of each component, there may be multiple instances of each component in the computing device 200. Also, the computing device 200 could be implemented using a parallel and/or distributed architecture.

[0047] In this example, the computing device 200 includes one or more processing units 202, such as a CPU, a GPU, an MCU, an ASIC, a field-programmable gate array (FPGA), dedicated logic circuitry, or combinations thereof. Each of the aforementioned processing units may include various hardware components, whether fabricated on-chip or separate. For instance, the CPU may include one or more accumulators, registers, multipliers, decoders, a floating-point unit 218, and an arithmetic and logic unit. While the arithmetic and logic unit performs bitwise operations on integer binary numbers, the floating-point unit 218, described further below, operates on floating-point numbers. It is to be understood that other processing units, such as a GPU, may include similar components.

[0048] A processing unit may include a floating-point unit (FPU) 218 for performing arithmetic computations. The FPU may be fabricated on the same chip as the computing unit or as a separate unit within the computing device 200. The FPU 218 is usually a hardware component enabling fast computations. The FPU 218 performs primitive computations such as addition, subtraction, multiplication, division, square root, etc. With instructions from the processing unit 202, complex operations may be performed by combining the primitive computations, including training deep learning algorithms. The FPU 218 is usually designed to perform computations based on a specific floating-point format, most commonly a format of the IEEE754 standard described in FIG. 1.

[0049] The computing device 200 may also include one or more optional input/output (I/O) interfaces 204, enabling interfacing with one or more optional input devices 212 and/or output devices 214. The computing device 200 may include one or more network interfaces 206 for wired or wireless communication with a network (e.g., an intranet, the Internet, a P2P network, a WAN and/or a LAN). The network interface(s) 206 may include wired links (e.g., Ethernet cable) and/or wireless links (e.g., one or more antennas) for intra-network and/or inter-network communications for receiving parameters or sending results.

[0050] The computing device 200 includes one or more storage units 208, which may include a mass storage unit such as a solid-state drive, a hard disk drive, a magnetic disk drive and/or an optical disk drive. The computing device 200 also includes one or more memories 210, which may have a volatile or non-volatile memory (e.g., a flash memory, a random access memory (RAM), and/or a read-only memory (ROM)). The memory(ies) 210 (as well as storage unit 208) may store instructions for execution by the processing unit(s) 202. The memory(ies) 210 may include software instructions for implementing an operating system (OS) and other applications/functions. In some examples, instructions may also be provided by an external memory (e.g., an external drive in communication with the computing device 200) or may be provided by a transitory or non-transitory computer-readable medium. Examples of non-transitory computer-readable media include a RAM, a ROM, an erasable programmable ROM (EPROM), an electrically erasable programmable ROM (EEPROM), a flash memory, a CD-ROM, or other portable memory storage.

[0051] The computing device 200 also includes a module for emulating an FPU, referred to as an emulation engine 216. A "module" can refer to a combination of a hardware processing circuit and machine-readable instructions (software and/or firmware) executable on the hardware processing circuit. A hardware processing circuit can include any or some combination of a microprocessor, a core of a multi-core microprocessor, a microcontroller, a programmable integrated circuit, a programmable gate array, a digital signal processor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a GPU (Graphical Processing Unit), or a system on a chip (SoC) or another hardware processing circuit. The computing device 200 shows the emulation engine 216 as instructions in memory 210 which, when executed by the processing unit 202, cause the processing unit 202 to perform arithmetic computations otherwise performed by the FPU 218. Other example embodiments may have the emulation engine 216 as a hardware component connected with bus 220, which facilitates communication between various computing device 200 components. The emulation engine 216 may be implemented in components of the computing device 200 or may be offered as software as a service (SaaS) by a cloud computing provider. The emulation engine 216 may also be available on servers accessed by the computing device 200 through the network interface 206.

[0052] Example embodiments describe the emulation engine 216 as being parameterized, customizable, heterogeneous and/or parallelized. The emulation engine 216 is parameterized because it can receive floating-point operands as a set of integers. Also, the emulation engine 216 may be controlled by users; such users can choose different rounding operations and enter custom floating-point formats (explained below). Further, the emulation engine 216 may be customizable such that it can be modified for emulating different custom floating-point formats. The emulation engine 216 can be implemented with CPUs, GPUs, other processing units 202 explained above, or a combination thereof; therefore, it is heterogeneous. Lastly, computations performed in the emulation engine 216 may be parallelized, allowing for parallel computations.

[0053] Optional input device(s) 212 (e.g., a keyboard, a mouse, a microphone, a touchscreen, and/or a keypad) and optional output device(s) 214 (e.g., a display, a speaker and/or a printer) are shown as external to the computing device 200 and connected to the optional I/O interface 204. In other examples, one or more of the input device(s) 212 and/or the output device(s) 214 may be included as a component of the computing device 200.

[0054] FIG. 3 is a block diagram of modules of the emulation engine 216. The disclosure departs from traditional computing devices where hardware, specifically the FPU 218, performs floating-point computations. Instead, the computations that would otherwise be performed by the FPU 218 are performed by the emulation engine 216. The emulation engine 216 may contain three modules - a software library 302, a control unit 304, and one or more computation modules (306-1, 306-2, ..., 306-N), each computation module performing computations on floating-point operands having a custom floating-point format.

[0055] It is to be understood that FIG. 3 describes an embodiment of the emulation engine 216 emulating operations of multiple FPUs 218. Each FPU 218 can be emulated using a computation module (306-1, 306-2, ..., or 306-N). However, example embodiments may use only one computation module for a single FPU 218, referred to simply as 306. Therefore, a computation module 306 could be any of the computation modules 306-1, 306-2, ..., 306-N. Each computation module 306 comprises a plurality of computation units 308, where each computation unit 308 is responsible for performing arithmetic computations. Each computation unit 308 is configured to perform primitive computations such as addition, subtraction, multiplication, square root, absolute, etc., but the combination of such primitive computations can compute complex operations. The sequence of the combination performed by the computation units 308 to compute a more complex operation, e.g., an inner product, is controlled by the control unit 304.

[0056] The control unit 304 is a module that sends instructions to the computation modules (306-1, 306-2, ..., 306-N) to schedule and instruct the computation units 308 to perform various computations of a broad spectrum. The computations may be as simple as computing the inner product of vectors or as complicated as deep learning training algorithms. Therefore, the control unit 304 ensures the sequential consistency of the computations. For example, suppose the task is to compute the inner product between two vectors. In that case, the control unit 304 sends instructions to the computation modules (306-1, 306-2, ..., 306-N) to use one or more computation units 308 to perform the inner product computation. The control unit 304 also sends the sequence of the computations, which is first multiplication, then addition. The addition and multiplication are computations understood by the computation units 308 of each computation module (306-1, 306-2, ..., or 306-N).
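As a rough sketch only (the class names and the two-primitive interface are assumptions made for illustration, not the disclosed API), the following Python shows how a control unit might decompose an inner product into the multiply-then-add sequence described above and dispatch the primitives to a computation unit.

```python
class ComputationUnit:
    """Performs primitive computations (here only multiply and add)."""
    def multiply(self, a, b):
        return a * b

    def add(self, a, b):
        return a + b

class ControlUnit:
    """Schedules the sequence of primitive computations for a higher-level operation."""
    def __init__(self, unit: ComputationUnit):
        self.unit = unit

    def inner_product(self, x, y):
        # For each element pair: first multiplication, then addition into the running sum.
        acc = 0
        for xi, yi in zip(x, y):
            prod = self.unit.multiply(xi, yi)
            acc = self.unit.add(acc, prod)
        return acc

print(ControlUnit(ComputationUnit()).inner_product([1, 2, 3], [4, 5, 6]))  # 32
```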

[0057] Although the control unit 304 ensures sequential consistency of computations, for high-level computations, such as training a deep neural network, a software library 302 is used. The software library 302 administers the control unit 304. The software library 302 may be an application programming interface (API) of a high-level language configured to send instructions to and control the control unit 304. The software library 302 may be an API that modifies computationally expensive software packages, such as PyTorch™, TensorFlow™ and scikit-learn™, and other machine learning libraries, to use the control unit 304 and the computation modules (306-1, 306-2, ..., and 306-N) to perform the floating-point computations instead of using the FPU 218.

[0058] For example, if the task is to train a deep neural network, a sequence of complex operations is performed for forward propagation and backpropagation (discussed in detail below). These operations are usually implemented in the software library 302, which sends instructions to control unit 304 containing the steps that need to be performed, e.g., a sequence of inner products. Then, the control unit 304 assigns the number of computation units 308 to participate in the operations' computations required by the software library 302. Further, the control unit 304 sends the sequence of primitive computations needed to be performed by each computation unit 308.

[0059] Therefore, the emulation engine 216 consists of a hierarchy of controlling modules, starting as high-level operations in the software library 302, which are then interpreted by the control unit 304 into primitive computations to be performed by the computation units 308.

[0060] Example embodiments may include more than one computation module (306-1, 306-2, ..., 306-N) for a task, and each computation module (306-1, ..., 306-N) has computation units 308 configured for a custom floating-point format. The task may be performed using all computation modules (306-1, 306-2, ..., 306-N). Further, the custom floating-point format that achieves a desired performance, according to a performance measure, is selected. The selected custom floating-point format is used in designing and fabricating an FPU 218 that is based on processing values formatted according to the selected custom floating-point format.

[0061] FIG. 4 is a schematic diagram of an emulation engine 216 illustrating operations and data flow in the emulation engine 216. The emulation engine 216 receives data, which is one or more floating-point operands 402-1, ..., 402-N, each having a floating-point value (a floating-point number). The received floating-point operands (402-1, ..., 402-N) may be formatted according to one example of the IEEE754 standard described in FIG. 1 or any other format. For simplicity and consistency, hereinafter, the disclosure refers to the format of the floating-point operands 402-1, ..., 402-N as a first floating-point format. This first floating-point format may be a format of the IEEE754 standard such as the one described in FIG. 1. In the memory 210 of the computing device 200, with a floating-point processing unit 202, the floating values 402-1, ..., 402-N are represented as strings of binary bits, similar to FIG. 1. The module convert to integer 404 converts each floating-point operand (402-1, ..., 402-N) to a respective set of integers (406-1, ..., 406-N) having the custom floating-point format in two steps. A person of ordinary skill in the art understands the method of converting a floating-point value to a set of integers. Basically, a floating-point value is represented as a set of integers, as illustrated in FIG. 1.

[0062] In the first step, convert to integer 404 converts the floating-point operands having a first floating-point format to a set of integers also having the first floating-point format. An example is shown in FIG. 1. In the second step, convert to integer 404 converts, via rounding (explained below), the set of integers having the first floating-point format to the set of integers (406-1, ..., 406-N) having a custom floating-point format. Therefore, at the output of convert to integer 404, each floating-point operand's value is represented as a respective set of integers (406-1, ..., 406-N). Each set of integers, whether having the first floating-point format or the custom floating-point format, contains three integer values. Each set of integers (406-1, ..., 406-N) has a sign value, an exponent value, and a fraction (or mantissa) value.

[0063] Example embodiments describe the custom floating-point format representing the floating-point operands with a different number of mantissa bits 108 than the first floating-point format. Example embodiments also describe the custom floating-point format having a different exponent bias. For instance, an exponent bias value is applied when determining the exponent value of floating-point operands having a format of the IEEE754 standard. For a single-precision number, the exponent value is stored in the range between 1 and 254. The actual exponent corresponds to the stored exponent value minus 127 (the exponent bias value), which gives an exponent value in the range -126 to +127. This exponent bias value may be different in a custom floating-point format when representing the sets of integers 406-1, ..., 406-N.

[0064] The second step of the convert to integer 404 operations includes a rounding module (not shown) that converts the set of integers of the floating-point operands having the first floating-point format into the second set of integers 406-1, ..., 406-N having a custom floating-point format. Example embodiments describe the custom floating-point format as having fewer bits than the first floating-point format. For example, the number of mantissa bits 108 of each set of integers 406-1, ..., 406-N is fewer than the number of mantissa bits 108 of the floating-point operands having the first floating-point format. There are several methods for rounding a floating-point number, including round truncate, round to odd, round to even, round toward zero, round away from zero, round toward infinity, stochastic rounding, etc.

[0065] It is understood that there are several rounding methods, and custom rounding methods may be implemented as well. Two of the rounding methods are explained for completeness. Round truncate returns the fraction value (mantissa value 108) of a floating-point operand truncated to a specific number of decimal places. In round to odd, the method first truncates the fraction value 108 of a floating-point operand to the number of bits used to represent the fraction value in the custom floating-point format. Further, if any of the removed (truncated) bits has a value of binary 1, then the last bit of the retained fraction value is assigned binary 1.
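The following Python sketch illustrates the two rounding methods just described, applied here at the bit level when reducing a 23-bit fraction to a narrower custom width; the 7-bit target width is an assumption chosen only for this example.

```python
def round_truncate(fraction: int, src_bits: int = 23, dst_bits: int = 7) -> int:
    """Keep the dst_bits most significant fraction bits; drop the rest."""
    return fraction >> (src_bits - dst_bits)

def round_to_odd(fraction: int, src_bits: int = 23, dst_bits: int = 7) -> int:
    """Truncate, then force the last kept bit to 1 if any dropped bit was 1."""
    dropped = fraction & ((1 << (src_bits - dst_bits)) - 1)
    kept = fraction >> (src_bits - dst_bits)
    if dropped != 0:
        kept |= 1  # set the least significant kept bit
    return kept

# Example 23-bit fraction: the kept bits end in 0, and one dropped bit is 1.
frac = (0b1011000 << 16) | 0b1
print(bin(round_truncate(frac)))  # 0b1011000
print(bin(round_to_odd(frac)))    # 0b1011001
```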

[0066] The computation module 306 receives the sets of integer values (406-1, ..., 406-N) of each floating-point operand and performs computations following instructions from the control unit 304. It is worth mentioning again that the sets of integer values (406-1, ..., 406-N) have the custom floating-point format. The computation module 306 has computation units 308 configured to perform computations according to the custom floating-point format on the sets of integer values (406-1, 406-2, ..., 406-N). Some example embodiments describe the emulation engine 216 with a plurality of computation modules, such as computation modules 306-1, 306-2, ..., 306-N of FIG. 3, each computation module configured for operating on a custom floating-point format different from the others. In such a scenario, the emulation engine 216 has a convert to integer 404 module responsible for converting the floating-point operands 402-1, ..., 402-N to a respective set of integers 406-1, ..., 406-N for each custom floating-point format.

[0067] The computation module 306 comprises a plurality of computation units 308 responsible for performing computations on one or more sets of integer values (406-1, ..., 406-N). Each computation unit 308 comprises a plurality of modules, including a sign engine 408, an exponent engine 410, a fraction engine 412, rounding 414, and alignment 416. As discussed, each computation unit 308 can perform primitive computations. Primitive computations include addition, subtraction, multiplication, absolute, square root, etc. The combination of such primitive computations can compute complex operations. Each computation unit 308 is configured to perform the primitive computation for the custom floating-point format of the respective computation module 306 using the sign engine 408, exponent engine 410, fraction engine 412, rounding 414 and alignment 416.

[0068] The sign engine 408 is configured to perform the primitive computations on the sign values of the sets of integer values (406-1, ..., 406-N). The sign engine defines the behaviour of the computation unit 308 when computing the sign value that results from the computation instructed by the control unit 304. Similarly, the exponent engine 410 and the fraction engine 412 are configured to perform primitive computations on the exponent value and the fraction value of the sets of integer values (406-1, ..., 406-N), respectively. Therefore, the exponent engine 410 and the fraction engine 412 define the behaviour of the computation units 308 when computing the exponent value and fraction value that result from the computation instructed by the control unit 304.

[0069] When performing computations, the module rounding 414 is used. Rounding 414 performs operations similar to the rounding module of the convert to integer 404. Rounding 414 is configured to ensure that the result of the computation instructed by the control unit 304 and performed by the fraction engine 412 is within the designated number of bits of the custom floating-point format.

[0070] The module alignment 416 is configured to ensure that computations instructed by the control unit 304 and performed by the exponent engine 410 and the fraction engine 412 yield an aligned set of integers. Alignment (normalization) is performed on the result of the computations performed by the computation units 308. Therefore, the alignment 416 may generate a normalized floating-point result 418. A normalized floating-point result 418 is an integer set with a fraction value that starts with binary 1. This normalization is achieved by shifting the fraction value (in binary) to the left until the most significant bit is 1 (binary). For every shift to the left, the exponent is reduced by 1. For illustration, if the fraction value is 5 bits wide with a value of 5, i.e., 00101, then the fraction value is shifted to the left twice to become 10100, and accordingly, the exponent value is reduced by 2 (a scaling of 2^-2). The generated floating-point result 418 (a normalized floating value), after alignment 416, would have a fraction value of 10100 with an exponent reduced by 2 (i.e., by a factor of 2^-2). If alignment 416 cannot align (normalize) the set of integers resulting from the computation, the floating-point result 418 may be a subnormal floating value. Subnormality occurs when the adjustment to the exponent value would fall outside the range of values that can be represented, e.g., an exponent of less than -127. In this situation, the subnormal floating value is carried over for the next computations.
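A minimal Python sketch of the alignment (normalization) just described, assuming the 5-bit fraction of the example above; the min_exponent guard stands in for the format's lower exponent limit and is an assumption, not a value taken from the disclosure.

```python
def normalize(sign: int, exponent: int, fraction: int, frac_bits: int = 5,
              min_exponent: int = -126):
    """Align (normalize) an integer set so the fraction's most significant bit is 1.

    Returns the adjusted (sign, exponent, fraction); if the exponent would fall
    below min_exponent, the value is left subnormal and carried over as-is.
    """
    if fraction == 0:
        return sign, exponent, fraction  # zero cannot be normalized
    msb = 1 << (frac_bits - 1)
    while not (fraction & msb):
        if exponent <= min_exponent:     # would leave the representable range
            break                        # keep as a subnormal value
        fraction <<= 1                   # shift left ...
        exponent -= 1                    # ... and reduce the exponent by 1
    return sign, exponent, fraction

# Example from the text: fraction 00101 (5) becomes 10100 (20), exponent reduced by 2.
print(normalize(sign=0, exponent=0, fraction=0b00101))  # -> (0, -2, 20), i.e. 0b10100
```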

[0071] Example embodiments may describe the computation unit 308 as including other modules for controlling behaviour in the event of a floating-point computation error. One example is catastrophic cancellation, a phenomenon that may result from subtracting two rounded numbers and that yields a bad approximation, i.e., one that may add the approximation error of both rounded values.

[0072] The computation units 308 receive instructions from the control unit 304 on what computations to perform. The control unit 304 may be responsible for four tasks: controlling sequential consistency, controlling rounding, controlling the custom floating-point format, and controlling the number of computation units 308. As for controlling sequential consistency, the control unit 304 sends instructions for organizing the computation sequence that needs to be performed by the computation units 308. Controlling sequential consistency also includes controlling the number of computation units 308 participating in the computations and deciding whether the computations are performed in parallel or serially. In other words, the control unit 304 controls the level of parallelism of the computation units 308; the control unit 304 sends computation sequence instructions using common means of parallel processing synchronization such as mutexes and semaphores. Therefore, the control unit 304 acts as a scheduler for the computation units 308 by arranging the order of computations, i.e., add, multiply, accumulate, etc., of each participating computation unit 308.

[0073] Further, the control unit 304 also instructs the module convert to integer 404 regarding the custom floating-point format. The instruction may include information about the number of bits for the sign, fraction, and exponent values, and the value of the exponent bias.

[0074] Example embodiments describe users of the computing device 200 dynamically changing the custom floating-point format. Similarly, the rounding operation performed in the convert to integer 404 and in the computation units 308 may be controlled by the control unit 304 to perform one of the rounding methods described above.

Further, the user may select different rounding operations and observe the effect on the performance of a task. Suppose the emulation engine 216 is used to train a deep neural network; the user may then observe the effect of a custom floating-point format and rounding method on performance.

[0075] Depending on the computations, some series of computations may require one computation unit 308, while more complex ones may require a plurality of computation units 308. Therefore, the control unit 304 sends instructions to the computation module 306, deciding the number of computation units 308 participating in performing desired computations.

[0076] Example embodiments describe a computing device 200 having an emulation engine 216 with multiple computation modules 306, as in 306-1, 306-2, ..., 306-N in FIG. 3. In such example embodiments, the control unit 304 may switch between computation modules 306 in sessions of training, each session of training having a different custom floating-point format. In other words, each session of training includes training a deep learning model using a different custom floating-point format. Such a feature enables the computing device 200 to observe task performance for other custom floating-point formats. For instance, example embodiments describe implementing a deep neural network using custom floating-point format 1 in computation module 306-1, another time using custom floating-point format 2 in computation module 306-2, etc.

[0077] In other example embodiments, a computing device 200 having an emulation engine 216 with multiple computation modules 306-1, 306-2, ..., 306-N, each with a custom floating-point format, may be used for parts of a single task. For instance, if the task is to implement a deep learning neural network, then a custom floating-point format may be used for training, and another custom floating-point format, which is different from the first one, may be used for inference making.

[0078] While the control unit 304 schedules the sequence of computations, the control unit 304 is controlled by the software library 302, which is another module in the emulation engine 216. Therefore, the control unit 304 receives instructions from a high-level language, the software library 302, and configures the computation units 308 accordingly based on the received instructions.

[0079] The software library 302 includes a high-level language library that drives the control unit 304 to perform complex arithmetic operations, for example, training a deep learning model using the custom floating-point format without using the FPU 218 of the computing device 200. For example, a high-level library for a deep learning task may include functions that implement a convolutional layer, fully connected layers, gradient computation, backpropagation, etc.

[0080] The interaction between the software library 302, control unit 304, and computation units 308 generates a floating-point result 418 in memory 210. The floating-point result 418 is a set of integers. Although FIG. 4 shows the output of the computation module 306 as a single floating-point result 418, it is understood that this depends on the computations performed in the computation units 308, as there could be more than one floating-point result. For example, if the computations add two numbers, the result is the sum, which is a single number; hence, a single floating-point result. However, if the computations generate a matrix, then there are multiple floating-point results 418.

[0081] The set of integers of the floating-point result 418 includes a sign value, an exponent value, and a fraction value. Also, the set of integers of the floating-point result 418 has the custom floating-point format. Further, the set of integers of the floating-point result 418 may be in the same format as the sets of integers (406-1, ..., 406-N) received by the computation module 306.

[0082] The emulation engine 216 may include a module, convert to float 420, responsible for converting the set of integers 418 into a floating-point output having the first floating-point format. Example embodiments describe the convert to float 420 to convert the set of integers 418 to a custom floating-point format. The output of the convert to float 420 is the output of the emulation engine 216.

[0083] It is understood that while the discussed embodiments described the computation module 306 to include a plurality of computation units 308, it may be possible to have a computation module 306 with a single computation unit 308. Also, input to emulation engine 216 is described as floating-point operands 402-1 , ... , 402-N having a format of the IEEE754 standard; however, the IEEE754 standard is an example and not a limiting factor - floating-point operands 402-1 , ... , 402-N in other formats may be received.

[0084] Although the sets of integers 406-1, ..., 406-N, which are the output of the convert to integer 404, and the set of integers of the floating-point result 418 are illustrated as having three integer values, other representations are equally valid. For instance, the above-discussed processing unit 202 for FIG. 4 may be a floating-point processing unit 202, representing a floating-point number as illustrated in FIG. 1; however, a fixed-point processing unit 202 represents a floating-point number differently from FIG. 1. For a fixed-point processing unit 202, the floating-point number is represented as three integer values: a sign value, an integer part value, and a fraction part value, where these values may be determined differently from how the sign value, exponent value, and fraction value are determined in a floating-point processing unit. The exponent value does not exist; instead, the position of the decimal point remains fixed, independent of the floating value it is representing.

[0085] Example embodiments may describe the floating-point number being represented for a fixed-point processing unit. In such a case, the set of integers representing the floating-point number in a fixed-point processing unit contains two integers: integer part value and fractional part value. The most significant bit of the integer part value is the sign value bit.
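As an illustration only, the sketch below splits a floating value into the two-integer fixed-point representation described above (an integer part whose most significant bit carries the sign, and a fractional part); the eight fractional bits are an assumed width chosen for the example, not a value from the disclosure.

```python
def to_fixed_point(x: float, frac_bits: int = 8):
    """Split x into (integer_part, fractional_part) integers for a fixed-point unit.

    The integer part is a signed value (its most significant bit carries the sign,
    as in two's complement); the fractional part holds frac_bits bits after the
    binary point.
    """
    scaled = round(x * (1 << frac_bits))        # fixed-point value with an implied binary point
    integer_part = scaled >> frac_bits          # arithmetic shift keeps the sign
    fractional_part = scaled & ((1 << frac_bits) - 1)
    return integer_part, fractional_part

# Example: 1.984 with 8 fractional bits -> (1, 252), since round(1.984 * 256) = 508 = 1*256 + 252
print(to_fixed_point(1.984))
```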

[0086] FIG. 5 is an example algorithm illustrating an addition computation between two sets of integers 406-1 and 406-2 converted from floating-point operands 402-1 and 402-2 using a single computation unit 308. At step 502, the algorithm illustrates the computation module 306 receiving two sets of integer values (406-1 and 406-2), f_a = (s_a, e_a, m_a) and f_b = (s_b, e_b, m_b), where s_a, s_b are the sign values of the sets of integers f_a and f_b, respectively, e_a, e_b are the exponent values of the sets of integers f_a and f_b, and m_a, m_b are the fraction values (also referred to as mantissa values) of the sets of integers f_a and f_b. The result of adding f_a = (s_a, e_a, m_a) to f_b = (s_b, e_b, m_b) is f_c = (s_c, e_c, m_c), which is the floating-point result 418, where s_c, e_c, and m_c are the sign value, exponent value, and mantissa value of f_c 418, respectively.

[0087] The computation unit 308 is configured to perform steps 504 - 518 to carry out an addition computation. At step 504, the computation unit 308, via the exponent engine 410, computes the common exponent value e_tmp between e_a and e_b. After completing step 504, step 506 begins. At step 506, the computation unit 308 right-shifts m_a by a number of bits equal to e_tmp - e_a and right-shifts m_b by a number of bits equal to e_tmp - e_b. For instance, if the fraction value m_a is 90, its binary representation is 1011010. If e_tmp - e_a = 2, then 1011010 is shifted to the right by two bits to become 0010110, which is 22 in decimal. After completing step 506, step 508 begins. At step 508, the fraction engine 412 computes the signed mantissas such that sm_a = s_a × m_a and sm_b = s_b × m_b, where the value of s_a and s_b is -1 for a negative sign value and +1 for a positive sign value.

[0088] When step 508 is completed, step 510 starts, at which the fraction engine 412 computes an intermediate mantissa sum such that sm_tmp = sm_a + sm_b. When step 510 is completed, step 512 starts. At step 512, the sign value s_c of f_c is extracted from the intermediate mantissa sum sm_tmp computed in step 510. Step 512 also determines a temporary mantissa value (m_tmp).

[0089] Step 512 is completed, and step 514 starts, where the common exponent value e_tmp is used to align the temporary mantissa m_tmp, determining e_c and a second temporary mantissa value m_tmp2, where e_c is the exponent value of f_c. The alignment at step 514, also referred to as normalization 514, is performed by alignment 416, as explained above.

[0090] After completing step 514, step 516 begins. At step 516, the second temporary mantissa value m_tmp2 undergoes a rounding operation in rounding 414 to determine m_c of f_c. Several rounding operations may be performed, as described above. Example embodiments also implement an optional step 518, which starts after completing step 516, for error checking. Error checking may include detecting a subnormal set of integers or the existence of a catastrophic cancellation.
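
As a non-limiting sketch of steps 504 - 516, the following Python code operates on two (sign, exponent, mantissa) integer triples. It assumes a hypothetical custom format in which the mantissa already carries its leading 1 and is kept within MANT_BITS bits, and it folds rounding into simple truncation; it illustrates the flow above rather than the patented implementation.

MANT_BITS = 8  # hypothetical fraction width of the emulated custom format

def emulated_add(fa, fb):
    # fa and fb are (sign, exponent, mantissa) triples with sign in {-1, +1}.
    sa, ea, ma = fa
    sb, eb, mb = fb

    # Step 504: common exponent value e_tmp between e_a and e_b.
    e_tmp = max(ea, eb)

    # Step 506: right-shift each mantissa by (e_tmp - e) bits to align them.
    ma >>= (e_tmp - ea)
    mb >>= (e_tmp - eb)

    # Step 508: signed mantissas sm_a = s_a × m_a and sm_b = s_b × m_b.
    sm_a = sa * ma
    sm_b = sb * mb

    # Step 510: intermediate mantissa sum.
    sm_tmp = sm_a + sm_b

    # Step 512: extract the sign s_c and a temporary mantissa m_tmp.
    sc = -1 if sm_tmp < 0 else 1
    m_tmp = abs(sm_tmp)

    # Steps 514/516: align (normalize) the mantissa against the common exponent,
    # truncating low bits in place of a configurable rounding operation.
    ec = e_tmp
    while m_tmp >= (1 << MANT_BITS):
        m_tmp >>= 1
        ec += 1
    while m_tmp and m_tmp < (1 << (MANT_BITS - 1)):
        m_tmp <<= 1
        ec -= 1
    return (sc, ec, m_tmp)

# Example: under value = s * m * 2**e, 1.5 is (1, -7, 192) and 2.5 is (1, -6, 160);
# their emulated sum is (1, -5, 128), i.e. 4.0.
print(emulated_add((1, -7, 192), (1, -6, 160)))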

[0091] It is to be understood that this example embodiment is performed for a simple addition task; therefore, the result of the addition is the floating-point result 418. However, more complex computations that include multiple operations may store preliminary results in an accumulator (not shown). For example, performing a vector multiplication requires multiple multiplications and additions, and the sets of integers produced by these preliminary computations are stored in the accumulator (not shown). The final result (e.g., the vector multiplication result) determined from these further computations is the floating-point result 418.
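
For illustration only, and reusing the emulated_add() sketch and the MANT_BITS constant above, the following code shows how the preliminary products of a vector multiplication might be accumulated before only the final value becomes the floating-point result 418. The emulated_mul() shown here is a simplistic stand-in for an emulated multiply, not the patented implementation.

def emulated_mul(fa, fb):
    # Simplistic multiply in the same (sign, exponent, mantissa) convention:
    # signs multiply, exponents add, and the product mantissa is renormalized
    # back into MANT_BITS bits by truncation.
    sa, ea, ma = fa
    sb, eb, mb = fb
    sc, ec, mc = sa * sb, ea + eb, ma * mb
    while mc >= (1 << MANT_BITS):
        mc >>= 1
        ec += 1
    return (sc, ec, mc)

def emulated_dot(xs, ys):
    # Multiply-accumulate: preliminary products stay in an accumulator of integer
    # triples; only the final accumulated value is converted to floating point.
    acc = None
    for x, y in zip(xs, ys):
        prod = emulated_mul(x, y)
        acc = prod if acc is None else emulated_add(acc, prod)
    return acc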

[0092] While the example in FIG. 5 is for a primitive addition computation, complex computations may also be performed, whether with one computation unit 308 or a plurality of computation units 308, in serial or in parallel. For instance, the computation module 306 may emulate the arithmetic operations of MAC (multiply-accumulate) units, which are widely used for matrix multiplication. For example, Hickmann, Brian, and Dennis Bradford, "Experimental analysis of matrix multiplication functional units," 2019 IEEE 26th Symposium on Computer Arithmetic (ARITH), IEEE 2019, studied five common architectures of MAC units using a format of the IEEE754 standard. For this and similar prior works, example embodiments may describe using the emulation engine 216 with a custom floating-point format instead of a format of the IEEE754 standard. The emulation engine 216 for MAC units may also accept parameters that control the accumulator bitwidth and the data-path of the MAC unit. Example embodiments may describe using the emulation engine 216 for emulating computations of the MAC unit, which may be used for deep learning applications.

[0093] FIG. 6 is a flowchart of a training method 600 for using the emulation engine 216 in training a deep neural network. The emulation engine 216 may be applied in the context of deep learning to perform modelling, extraction, preprocessing, training, and the like on training data. For example, training a deep neural network model may use the emulation engine 216 instead of the FPU 218 to optimize the deep neural network model.

[0094] Generally, examples disclosed herein relate to a large number of neural network applications. For ease of understanding, the following describes some concepts relevant to neural networks and some relevant terms that may be related to examples disclosed herein.

[0095] A neural network consists of neurons structured in layers. A neuron is a module that uses values x_s as its inputs. An output from the module may be:

output = a(W_1·x_1 + W_2·x_2 + ... + W_n·x_n + b)     (1)

[0096] where s = 1, 2, ..., n, n is a natural number greater than 1, W_s is the weight of x_s, b is an offset (i.e., bias) of the neuron, and a is an activation function of the neuron, used to introduce a nonlinear feature into the neural network. It is to be appreciated that most of the values of W_s, x_s, and b are floating-point values, and the computation of equation (1) may be performed in the FPU 218. However, example embodiments of the present disclosure utilize the emulation engine 216 instead.
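
By way of illustration only, a minimal Python sketch of equation (1), with a sigmoid chosen as the activation function a (one possible choice, as noted below):

import math

def neuron_output(x, W, b):
    # Weighted sum of the inputs x_s with weights W_s, plus the offset (bias) b.
    z = sum(W_s * x_s for W_s, x_s in zip(W, x)) + b
    # Activation function a, here a sigmoid introducing the nonlinear feature.
    return 1.0 / (1.0 + math.exp(-z))

print(neuron_output([1.0, 2.0], [0.5, -0.25], 0.1))  # a(0.5 - 0.5 + 0.1), approximately 0.525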

[0097] The output of the activation function may be used as an input to a neuron of a following layer in the neural network. The activation function may be a sigmoid function, for example. The neural network is formed by joining a plurality of the foregoing single neurons.

[0098] A deep neural network (DNN) is also referred to as a multi-layer neural network and may be understood as a neural network that includes a first layer (generally referred to as an input layer), a plurality of hidden layers, and a final layer (generally referred to as an output layer). The term "plurality" herein does not imply a particular number. A layer is considered to be a fully connected layer when there is a full connection between two adjacent layers of the neural network. To be specific, for two adjacent layers (e.g., the i-th layer and the (i+1)-th layer) to be fully connected, each and every neuron in the i-th layer must be connected to each and every neuron in the (i+1)-th layer.

[0099] Processing at each layer of the DNN may be relatively straightforward. Briefly, the operation at each layer is indicated by the following linear relational expression: y = a(Wx + b), where x is an input vector, y is an output vector, b is an offset vector, W is a weight (also referred to as a coefficient), and a(.) is an activation function. At each layer, the operation is performed on an input vector x to obtain an output vector y.

[00100] Because there is a large number of layers in the DNN, there is also a large number of weights W and offset vectors b. Definitions of these parameters in the DNN are as follows: The weight W is used as an example. In this example, in an extremely simplified three-layer DNN (i.e., a DNN with three hidden layers), a linear weight from a fourth neuron at a second layer to a second neuron at a third layer is denoted as W^3_24. The superscript 3 indicates the layer (i.e., the third layer (or layer-3) in this example) of the weight W, and the subscript indicates that the output is at layer-3 index 2 (i.e., the second neuron of the third layer) and the input is at layer-2 index 4 (i.e., the fourth neuron of the second layer). Generally, a weight from a k-th neuron at an (L-1)-th layer to a j-th neuron at an L-th layer may be denoted as W^L_jk. It should be noted that there is no W parameter at the input layer.

[00101] More hidden layers in a DNN may enable the DNN to better model a complex situation (e.g., a real-world situation). In theory, a DNN with more parameters is more complex, has a larger capacity (which may refer to the ability of a learned model to fit a variety of possible scenarios), and indicates that the DNN can complete a more complex learning objective. Training of the DNN is a process of learning the weight matrix. A purpose of the training is to obtain a trained deep neural network model, which consists of the learned weights W and biases b of all layers of the DNN.

[00102] Training is the process of generating a DNN model. All model parameter values are initialized at step 602. The parameters include the values W and b. After completing step 602, method 600 proceeds to step 604. At step 604, the DNN model is to be trained over multiple epochs. First, the epoch number is initialized to 0. During each epoch, a full corpus of training data is split into multiple batches (as well as a validation dataset). Method 600 then proceeds to step 606, where method 600 compares the epoch number to a target number of epochs. If the target number of epochs has not been reached, method 600 proceeds to step 608, where the DNN model is optimized. The model optimization at step 608 includes two primary steps: forward propagation at step 610 and backpropagation at step 612. When performing forward propagation at step 610, method 600 sends each batch of training data through forward propagation to generate outputs of the DNN model. The outputs of the DNN model, which are the predicted values, are compared to desired target values (e.g., ground-truth values), and an error (loss) is computed. The loss quantitatively represents how close the predicted values are to the target values. Method 600 then proceeds to step 612, at which the loss is backpropagated to adjust the weights W and biases b of the DNN model before receiving the next batch of training data.
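
As a non-limiting sketch, the loop formed by steps 602 - 614 may be expressed as follows; forward(), backward(), and update() are hypothetical stand-ins for the operations described above, which the emulation engine 216 would carry out within step 608.

def train(model_params, batches, target_epochs, forward, backward, update):
    epoch = 0                                      # step 604: initialize the epoch number
    while epoch < target_epochs:                   # step 606: compare against the target
        for batch, targets in batches:             # step 608: model optimization
            outputs = forward(model_params, batch)                   # step 610: forward propagation
            loss, grads = backward(model_params, outputs, targets)   # compute loss, backpropagate (step 612)
            model_params = update(model_params, grads)               # adjust the weights W and biases b
        epoch += 1                                 # step 614: increment the epoch number
    return model_params                            # steps 616/618: output the optimized DNN model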

[00103] For example, suppose a loss function is defined and calculated from the forward propagation at step 610 of an input batch through to an output of the DNN model. Backpropagation at step 612 calculates a gradient of the loss function with respect to the parameters (W and b) of the DNN, and a gradient algorithm (e.g., gradient descent) is used to update the parameters to reduce the loss function. Backpropagation is performed iteratively until the loss function converges or is minimized. The model optimization at step 608 repeats the forward propagation and backpropagation of batches until all batches of the epoch are processed. All computations performed in the model optimization at step 608 use the emulation engine 216.
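
For example, the update() helper assumed in the sketch above could apply one plain gradient-descent step; the fixed learning rate is an assumption made here for illustration, and the disclosure does not limit the gradient algorithm used.

def gradient_descent_update(params, grads, lr=0.01):
    # Move each parameter (a weight W or bias b) a small step against its gradient,
    # which reduces the loss computed from the forward propagation at step 610.
    return [p - lr * g for p, g in zip(params, grads)]

print(gradient_descent_update([0.5, -1.2], [0.1, -0.4]))  # [0.499, -1.196]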

[00104] Once all batches of the epoch are processed, method 600 proceeds to step 614, at which the epoch number is incremented, and another epoch starts with different batches from the same training data. After incrementing the number of epochs at step 614, method 600 compares the epoch number to the target number of epochs at step 606. If the target number of epochs is reached, method 600 terminates at step 616, and an optimized DNN model is generated and outputted at step 618.

[00105] Through model optimization at step 608, and over multiple epochs, the weights and biases of the DNN model should converge to an equilibrium state, indicating that the DNN model has been optimally trained relative to the full set of training data.

[00106] Method 600, specifically the model optimization at step 608, is performed by the emulation engine 216. A user may enter parameters such as the custom floating-point format and the rounding operation to use. At step 608, the forward propagation at step 610 and the backpropagation at step 612 are implemented in a high-level language in the software library 302. The software library 302 sends instructions to the control unit 304 containing the computations required to perform the forward propagation at step 610 and the backpropagation at step 612. The control unit 304 sends instructions to the computation module 306 to perform the computations.

[00107] For instance, computing the output of a layer for the forward propagation at step 610 using y = a(Wx + b) includes: matrix multiplication Wx, addition Wx + b, and mapping through the activation function a. The activation function may itself involve multiple operations, e.g., when implementing a sigmoid function. Therefore, the control unit 304 sends instructions to the computation units 308 to perform the computations. At step 608, the emulation engine 216 converts floating-point operands to sets of integers, the floating-point operands including parameters of the DNN, such as W and b, and the batch of training data provided as input to the emulation engine 216.
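
By way of illustration only, the following sketch shows how computing y = a(Wx + b) for one layer decomposes into the primitive multiplications, additions, and activation mappings dispatched to the computation units 308. The emu_mul, emu_add, and emu_activation callables are hypothetical stand-ins for emulated routines; native Python operations are passed at the end only to make the sketch runnable.

import math

def emulated_layer(W, x, b, emu_mul, emu_add, emu_activation):
    y = []
    for row, bias in zip(W, b):
        # Matrix multiplication Wx: multiply and accumulate one row at a time.
        acc = None
        for w_i, x_i in zip(row, x):
            prod = emu_mul(w_i, x_i)
            acc = prod if acc is None else emu_add(acc, prod)
        # Addition Wx + b, then mapping through the activation function a.
        y.append(emu_activation(emu_add(acc, bias)))
    return y

# Stand-in usage with native floats; the emulation engine 216 would instead supply
# routines operating on sets of integers in the custom floating-point format.
print(emulated_layer([[0.2, -0.5], [0.1, 0.3]], [1.0, 2.0], [0.05, -0.1],
                     lambda a, b: a * b, lambda a, b: a + b,
                     lambda z: 1.0 / (1.0 + math.exp(-z))))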

[00108] After performing the computations in the forward propagation 610 and backpropagation 612, the emulation engine 216 converts the integer values of the floating-point result 418 to floating-point outputs 422, which become floating-point operands for the forward propagation at step 610 in a subsequent iteration of method 600.

[00109] It is to be understood that what is performed by the emulation engine 216 depends on the software library 302. If the processing unit 202 processes a library function that is part of the software library 302, then the emulation engine 216 is involved. In this case, the computations are performed in software rather than in hardware (the FPU 218). On the other hand, if the processing unit 202 processes a library function that is not part of the software library 302, then the processing unit 202 uses an alternative method for computing, such as the FPU 218.
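
For illustration only, this dispatch behaviour may be sketched as follows; emulated_ops and fpu_ops are hypothetical registries standing in for the software library 302 and the hardware path through the FPU 218, respectively.

import operator

emulated_ops = {"add": lambda a, b: a + b}            # placeholder for an emulated routine
fpu_ops = {"add": operator.add, "mul": operator.mul}  # native (hardware) floating point

def dispatch(op_name, *operands):
    # Library functions that are part of the software library 302 run through the
    # emulation engine 216 (software); anything else falls back to the FPU 218.
    if op_name in emulated_ops:
        return emulated_ops[op_name](*operands)
    return fpu_ops[op_name](*operands)

print(dispatch("add", 1.5, 2.25))  # emulated (software) path
print(dispatch("mul", 1.5, 2.25))  # hardware FPU path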

[00110] In this example embodiment of the training method 600, the forward propagation at step 610 and the backpropagation at step 612 are part of the software library 302. Other example embodiments may describe method 600 as including other steps as part of the emulation engine 216, or fewer steps as part of the software library 302, e.g., just the forward propagation at step 610 or just the backpropagation at step 612.

[00111] Example embodiments disclose using the emulation engine 216 to maintain and examine the computational stability of deep learning models, such as deep neural network models, for deep learning tasks. Stability is the study of how the performance of a model, based on a performance measure, is affected by small changes in its parameters. In this case, the small change in parameters may result from converting the parameters (W, b, and x) from a first floating-point format, such as a format of the IEEE754 standard, to a custom floating-point format, as described in detail in FIG. 3.
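
By way of illustration only, one simple way to probe such stability is to perturb the parameters in the same spirit as a narrower fraction would and measure the size of the change. The sketch below uses a hypothetical truncation of a float32 value to KEPT_FRAC_BITS fraction bits; it is not the conversion performed by the emulation engine 216, only an indicative stand-in.

import struct

KEPT_FRAC_BITS = 4  # hypothetical fraction width retained by the custom format

def round_trip(value):
    # Truncate a float32 value to KEPT_FRAC_BITS fraction bits and convert it back.
    bits = struct.unpack('<I', struct.pack('<f', value))[0]
    bits &= ~((1 << (23 - KEPT_FRAC_BITS)) - 1)   # drop the discarded fraction bits
    return struct.unpack('<f', struct.pack('<I', bits))[0]

params = [0.3, -1.7, 2.5]
print(max(abs(p - round_trip(p)) for p in params))  # worst-case parameter perturbation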

[00112] It is to be appreciated that, if experimentation using floating-point operands having the custom floating-point format succeeds, an FPU that performs computations using the custom floating-point format may then be fabricated in hardware and deployed for the experimented task.

[00113] Advantages that may arise from using an FPU based on the custom floating-point format, rather than a format of the IEEE754 standard, include improvements in speed, accuracy, and energy consumption.

[00114] Thus, the emulation engine 216 is an essential tool for hardware designers to study the behaviour of a custom floating-point format for various applications before fabricating the custom floating-point format into a hardware device such as FPU 218.

[00115] FIG. 7 is a flowchart of the emulation engine method 700 for a floating-point unit. Method 700 starts at step 702, where one or more floating-point operands having a first floating-point format are received. Afterwards, step 704 begins. At step 704, the one or more floating-point operands having the first floating-point format are converted into a first set of integers having the first floating-point format. Method 700 then proceeds to step 706. At step 706, each of the first set of integers is converted into a second set of integers having a second floating-point format that is different from the first floating-point format. The first set of integers and the second set of integers each has a defined bit length depending on the respective floating-point format. The first floating-point format may be a floating-point format according to one or more formats of the IEEE754 standard, whereas the second floating-point format may be a custom floating-point format different from the first floating-point format. Step 706 ends and step 708 begins. At step 708, computations for a task are performed on the second set of integers to emulate computations performed by the floating-point unit using the one or more floating-point operands having the second floating-point format.
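
As a non-limiting sketch, the following code traces the four steps of method 700, assuming float32 (an IEEE754 format) as the first floating-point format and a hypothetical custom format with EXP_BITS exponent bits and FRAC_BITS fraction bits as the second floating-point format; rounding and overflow handling are omitted for brevity.

import struct

EXP_BITS, FRAC_BITS = 5, 2   # hypothetical widths of the custom floating-point format

def to_first_format_integers(value):
    # Step 704: split a float32 operand into (sign, exponent, fraction) integers.
    bits = struct.unpack('<I', struct.pack('<f', value))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

def to_second_format_integers(s, e, m):
    # Step 706: re-express the integers in the custom format by re-biasing the
    # exponent and truncating the fraction to FRAC_BITS bits.
    e2 = e - 127 + (2 ** (EXP_BITS - 1) - 1)
    m2 = m >> (23 - FRAC_BITS)
    return s, max(0, min(e2, 2 ** EXP_BITS - 1)), m2

# Steps 702 and 708: receive the operands, convert them, then compute on the
# resulting integer triples to emulate the custom-format floating-point unit.
operands = [1.5, -2.25]
print([to_second_format_integers(*to_first_format_integers(v)) for v in operands])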

[00116] The disclosed methods may be carried out by modules, routines, or subroutines of software executed by the computing device 200. Coding of software for carrying out the steps of the methods is well within the scope of a person of ordinary skill in the art having regard to the described methods of emulating a floating-point unit using an emulation engine. The emulation engine method 700 may contain additional or fewer steps than shown and described, and the steps may be performed in a different order. Computer-readable instructions, executable by the processor(s) of the computing device 200, may be stored in the memory 210 of the computing device 200 or in a computer-readable medium. It is to be emphasized that the steps of the emulation engine method need not be performed in the exact sequence shown unless otherwise indicated. Likewise, various steps of the methods may be performed in parallel rather than in sequence.

[00117] It can be appreciated that the emulation engine method 700 of the present disclosure, once implemented, can be performed by the computing device 200 in a fully automatic manner, which is convenient for users as no manual interaction is needed.

[00118] It should be understood by a person skilled in the art that, for the purpose of convenient and brief description, for a detailed working process of the foregoing system, apparatus, and unit, refer to a corresponding process in the foregoing method embodiments, and details are not described herein again.

[00119] In the several embodiments described, it should be understood that the disclosed systems and methods may be implemented in other manners. For example, the described system embodiments are merely examples. Further, units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the displayed or discussed mutual couplings or direct couplings or communication connections may be implemented by using some interfaces. The indirect couplings or communication connections between the systems or units may be implemented in electronic, mechanical, or other forms.

[00120] The present disclosure may be embodied in other specific forms without departing from the subject matter of the claims. The described example embodiments are to be considered in all respects as being only illustrative and not restrictive. Selected features from one or more of the above-described embodiments may be combined to create alternative embodiments not explicitly described, features suitable for such combinations being understood within the scope of this disclosure.

[00121] Also, although the systems, devices, and processes disclosed herein for emulating an FPU may comprise a specific number of elements/components, the systems, devices, and assemblies could be modified to include additional or fewer of such elements/components. For example, although any of the elements/components disclosed may be referenced as being singular, the embodiments disclosed herein could be modified to include a plurality of such elements/components. The subject matter described herein is intended to cover and embrace all suitable changes in technology.

[00122] The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual requirements to achieve the objectives of the solutions of the embodiments.

[00123] In addition, functional units in the example embodiments may be integrated into one computing device 200, or each of the units may exist alone physically, or two or more units are integrated into one unit.

[00124] When the functions are implemented in the form of a software functional unit and sold or used as an independent product, they may be stored in a storage medium and include several instructions for instructing a computer device (which may be a personal computer, a server, or a network device) to perform all or some of the steps of the methods described in the embodiments of this application. The foregoing storage medium includes any medium that can store program code, such as a universal serial bus (USB) flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc, among others.

[00125] The foregoing descriptions are merely specific implementations but are not intended to limit the scope of protection. Any variation or replacement readily figured out by a person skilled in the art within the technical scope shall fall within the scope of protection. Therefore, the scope of protection shall be subject to the protection scope of the claims.