Title:
GLOBAL FILTER PRUNING OF NEURAL NETWORKS USING HIGH RANK FEATURE MAPS
Document Type and Number:
WIPO Patent Application WO/2021/195644
Kind Code:
A1
Abstract:
This application is directed to network pruning. A computer system obtains a neural network model having a plurality of layers each of which has a respective number of filters. The neural network model is pruned to a plurality of pruned neural network models. Specifically, for each pruned neural network model, the computer system assigns a respective distinct set of importance coefficients for the plurality of layers, and determines an importance score of each filter based on a respective importance coefficient of a respective layer to which the respective filter belongs. For each pruned neural network model, the computer system ranks the filters based on the importance score of each filter, and prunes the neural network model to a respective pruned neural network model by removing a respective subset of filters. A target neural network model is selected from the plurality of pruned neural network models.

Inventors:
GUAN BOCHEN (US)
XU QINWEN (US)
LI WEIYI (US)
Application Number:
PCT/US2021/030481
Publication Date:
September 30, 2021
Filing Date:
May 03, 2021
Assignee:
INNOPEAK TECH INC (US)
International Classes:
G06F7/523; G06K9/46; G06N3/04; G06N3/08
Other References:
SARA ELKERDAWY, MOSTAFA ELHOUSHI, ABHINEET SINGH, HONG ZHANG, NILANJAN RAY: "To filter prune, or to layer prune, that is the question", arXiv, 11 July 2020, XP081719668
LI GUAN, WANG JUNPENG, SHEN HAN-WEI, CHEN KAIXIN, SHAN GUIHUA, LU ZHONGHUA: "CNNPruner: Pruning Convolutional Neural Networks with Visual Analytics", IEEE Transactions on Visualization and Computer Graphics, vol. 27, no. 2, 13 October 2020, pages 1364-1373, XP011834031, ISSN 1077-2626, DOI: 10.1109/TVCG.2020.3030461
YU RUICHI, LI ANG, CHEN CHUN-FU, LAI JUI-HSIN, MORARIU VLAD I., HAN XINTONG, GAO MINGFEI, LIN CHING-YUNG, DAVIS LARRY S.: "NISP: Pruning Networks Using Neuron Importance Score Propagation", 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 18 June 2018, pages 9194-9203, XP033473845, DOI: 10.1109/CVPR.2018.00958
MINGBAO LIN, RONGRONG JI, YAN WANG, YICHEN ZHANG, BAOCHANG ZHANG, YONGHONG TIAN, LING SHAO: "HRank: Filter Pruning using High-Rank Feature Map", arXiv, 24 February 2020, XP081606539
PAVLO MOLCHANOV, STEPHEN TYREE, TERO KARRAS, TIMO AILA, JAN KAUTZ: "Pruning Convolutional Neural Networks for Resource Efficient Inference", 8 June 2017, pages 1-17, XP055507236
SEUL-KI YEOM, PHILIPP SEEGERER, SEBASTIAN LAPUSCHKIN, ALEXANDER BINDER, SIMON WIEDEMANN, KLAUS-ROBERT MÜLLER, WOJCIECH SAMEK: "Pruning by Explaining: A Novel Criterion for Deep Neural Network Pruning", arXiv, 12 March 2021, XP081895140, DOI: 10.1016/j.patcog.2021.107899
Attorney, Agent or Firm:
WANG, Jianbai et al. (US)
Claims:
What is claimed is:

1. A method for network pruning, comprising: obtaining a neural network model having a plurality of layers, each layer having a respective number of filters; pruning the neural network model to a plurality of pruned neural network models, including for each pruned neural network model: assigning a respective distinct set of importance coefficients for the plurality of layers; determining an importance score of each filter based on the respective distinct set of importance coefficients, the respective distinct set including a subset of importance coefficients for a respective layer to which the respective filter belongs; ranking the filters based on the importance score of each filter; and in accordance with ranking of the filters, pruning the neural network model to the respective pruned neural network model by removing a respective subset of filters; and selecting a target neural network model from the plurality of pruned neural network models based on a model selection criterion.

2. The method of claim 1, further comprising: determining an average rank value for each filter of the neural network model; wherein for each pruned neural network model, the importance score of each filter is determined by combining the average rank value of the respective filter and the subset of importance coefficients of the respective layer to which the respective filter belongs.

3. The method of claim 2, wherein for each pruned neural network model, the distinct set of importance coefficients includes a first importance coefficient ai and a second importance coefficient bi for each layer, the method further comprising, for each layer: selecting the first importance coefficient ai from a first set of importance coefficients in a first range; and selecting the second importance coefficient bi from a second set of importance coefficients in a second range.

4. The method of claim 3, wherein the first range is equal to the second range.

5. The method of claim 3, wherein the first range is distinct from the second range.

6. The method of claim 3, wherein for each layer, the average rank value Ri for each filter in the respective layer is modified to ai·Ri + bi using the first and second importance coefficients to generate the importance score of each filter.

7. The method of claim 3, wherein for each layer, the average rank value Ri for each filter in the respective layer is modified to ai·Ri^2 + bi using the first and second importance coefficients to generate the importance score of each filter.

8. The method of claim 3, wherein at least one of the first and second importance coefficients for at least one layer is distinct for every two pruning settings of two distinct pruned neural network models, each pruning setting corresponding to the respective distinct set of importance coefficients for the plurality of layers of a respective pruned neural network model.

9. The method of claim 2, wherein the average rank value for each filter of the neural network model is determined using a batch of predefined images, and corresponds to a depth map determined based on the batch of predefined images by the respective filter.

10. The method of any of claims 1-9, wherein selecting the target neural network model based on the model selection criterion further comprises: training each of the plurality of pruned neural network models for a predefined number of cycles; and selecting the target neural network model that has a loss function result better than any other pruned neural network models.

11. The method of any of claims 1-9, wherein selecting the target neural network model based on the model selection criterion further comprises: training each of the plurality of pruned neural network models completely; and selecting the target neural network model that uses the least number of training cycles.

12. The method of any of the preceding claims, wherein in accordance with the model selection criterion, the target neural network model corresponds to the least number of floating point operations per second (FLOPS) among the plurality of pruned neural network models.

13. The method of any of the preceding claims, wherein the target neural network model includes a plurality of weights associated with the respective number of filters of each layer, the method further comprising: maintaining a float32 format for the plurality of weights while pruning the neural network models; and quantizing the plurality of weights.

14. The method of claim 13, wherein all of the plurality of weights are quantized to an int8, uint8, int16, or uint16 format.

15. The method of claim 13, wherein the plurality of weights are quantized based on a precision setting of an electronic device, the method further comprising: providing the target neural network model to the electronic device.

16. A computer system, comprising: one or more processors; and memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform a method of any of claims 1-15.

17. A non-transitory computer-readable medium, having instructions stored thereon, which when executed by one or more processors cause the processors to perform a method of any of claims 1-15.

Description:
Global Filter Pruning of Neural Networks Using High Rank

Feature Maps

TECHNICAL FIELD

[0001] This application relates generally to deep learning technology including, but not limited to, methods, systems, and non-transitory computer-readable media for modifying neural networks to reduce computational resource usage and improve efficiency of the neural networks.

BACKGROUND

[0002] Convolutional neural networks (CNNs) have been applied in many areas, e.g., natural language processing, computer vision, image recognition, object detection, and image processing. Deployment of deep CNNs is often costly because such CNNs use many filters involving a large number of trainable parameters. Pruning techniques have been developed to remove unimportant filters in CNNs according to certain metrics. For example, weight decay is used to increase a sparsity level of connections in the CNNs, and a structured sparsity can also be applied to regularize weights. Most pruning techniques focus on the entire model and popular public datasets, require extended pruning time, and fail to converge when applied to prune practical models that contain several networks and have complicated functions. Even if a pruning process converges, operation of the resulting neural networks normally requires computation and memory resources that cannot be afforded by many computer systems, particularly by many mobile devices. Additionally, in many situations, the pruned neural networks are not hardware-friendly during deployment and have weights distributed over a large dynamic range that is not amenable to quantization. It would be beneficial to develop systems and methods to efficiently prune a deep neural network such that the pruned network can be efficiently applied for data inference.

SUMMARY

[0003] Various implementations of this application are directed to improving efficiency of a neural network by pruning filters, thereby reducing model storage usage and computation resource usage in subsequent data inference. The core of filter pruning is a search problem: identifying a subset of filters to remove that improves the compression level of the filters while limiting the loss in computational accuracy. In some embodiments, a neural network is pruned gradually in a sequence of pruning operations to achieve a target model size, rather than being pruned once via a single pruning operation. Particularly, in some situations, the neural network is dilated prior to being pruned with the sequence of pruning operations. In some embodiments, each layer of filters is assigned different importance coefficients, such that each filter is associated with an importance score determined based on the layer-based importance coefficients and ranked accordingly for filter pruning. In some embodiments, a neural network is divided into a plurality of subsets, and each subset is pruned in the context of the entire neural network. The pruned subsets are then combined to form a target neural network.

[0004] In one aspect, a method is implemented at a computer system for network pruning. The method includes obtaining a neural network model having a plurality of layers, each layer including a respective number of filters, and identifying a target model size to which the neural network model is compressed. The method further includes deriving one or more intermediate model sizes from the target model size of the neural network model. The one or more intermediate model sizes and the target model size form an ordered sequence of model sizes. The method further includes implementing a sequence of pruning operations, where each pruning operation corresponds to a respective model size in the ordered sequence of model sizes. The method further includes, for each pruning operation, identifying a respective subset of filters of the neural network model to be removed based on the respective model size and updating the neural network model by pruning the respective subset of filters, thereby reducing a size of the neural network model to the respective model size. In some embodiments, the updated neural network model of each pruning operation is trained according to a predefined loss function.
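For illustration only (not part of the application), the multistep schedule described in this aspect can be sketched in Python as follows. The callbacks prune_to_size and fine_tune are hypothetical placeholders for the pruning operation and the loss-driven training pass; only the derivation of the ordered sequence of model sizes and the loop over that sequence follow the text above.

```python
def derive_model_sizes(original_size, target_size, num_steps):
    """Derive intermediate model sizes between the original and target sizes.

    Returns an ordered (decreasing) sequence that ends at the target size.
    """
    step = (original_size - target_size) / num_steps
    return [original_size - step * i for i in range(1, num_steps + 1)]


def multistep_prune(model, original_size, target_size, num_steps,
                    prune_to_size, fine_tune):
    """Prune `model` gradually through an ordered sequence of model sizes.

    `prune_to_size(model, size)` and `fine_tune(model)` are hypothetical
    callbacks for the pruning operation and the loss-driven training pass.
    """
    for size in derive_model_sizes(original_size, target_size, num_steps):
        model = prune_to_size(model, size)   # remove a subset of filters
        model = fine_tune(model)             # train against the predefined loss
    return model
```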

[0005] In another aspect, a method is implemented at a computer system for network pruning. The method includes obtaining a neural network model having a plurality of layers. Each layer has a respective number of filters. The method further includes pruning the neural network model to a plurality of pruned neural network models. Specifically, the method includes, for each pruned neural network model, assigning a respective distinct set of importance coefficients for the plurality of layers, determining an importance score of each filter based on a respective subset of importance coefficients of a respective layer to which the respective filter belongs, ranking the filters based on the importance score of each filter, and in accordance with ranking of the filters, pruning the neural network model to a respective pruned neural network model by removing a respective subset of filters. The method further includes selecting a target neural network model from the plurality of pruned neural network models based on a model selection criterion.
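A minimal sketch of this aspect, assuming the linear score form ai·Ri + bi suggested by the claims (the exact scoring formula and the pruning ratio are assumptions): each pruning setting assigns one (ai, bi) pair per layer, filters are scored and globally ranked, and the lowest-ranked filters form the removal proposal for that pruned model.

```python
import numpy as np

def prune_candidates(avg_ranks, coefficient_sets, prune_ratio=0.5):
    """Generate one pruned-filter proposal per pruning setting.

    `avg_ranks[l]` holds the average feature-map rank Ri of each filter in
    layer l; `coefficient_sets` is a list of pruning settings, each mapping a
    layer index to an (ai, bi) pair.
    """
    proposals = []
    for setting in coefficient_sets:
        scores = []  # (score, layer index, filter index)
        for layer, ranks in enumerate(avg_ranks):
            a, b = setting[layer]
            for idx, r in enumerate(ranks):
                scores.append((a * r + b, layer, idx))
        scores.sort(key=lambda s: s[0])                 # rank filters globally
        n_remove = int(len(scores) * prune_ratio)
        proposals.append({(l, i) for _, l, i in scores[:n_remove]})
    return proposals

# Usage: two layers with 4 and 3 filters, two distinct pruning settings.
avg_ranks = [np.array([3.0, 1.2, 2.5, 0.8]), np.array([2.0, 0.5, 1.7])]
settings = [{0: (1.0, 0.0), 1: (0.5, 0.1)}, {0: (0.2, 0.0), 1: (1.5, 0.0)}]
print(prune_candidates(avg_ranks, settings))
```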

[0006] In yet another aspect, a method is implemented at a computer system for pruning a neural network. The method includes obtaining a neural network model having a plurality of layers. Each layer has a respective number of filters. The method further includes dividing the neural network model into a plurality of neural network subsets. Each neural network subset includes a subset of distinct and consecutive layers of the neural network model. The method further includes separately pruning each neural network subset while maintaining remaining neural network subsets in the neural network model and combining each pruned neural network subset to generate a target neural network model.
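As a brief illustration, grouping consecutive layers into disjoint subsets can be sketched as follows; the group size of two layers per subset is an assumption used only for the example.

```python
def divide_into_subsets(layers, layers_per_subset=2):
    """Group distinct, consecutive layers into disjoint neural network subsets."""
    return [layers[i:i + layers_per_subset]
            for i in range(0, len(layers), layers_per_subset)]

# Usage with symbolic layer names: eight layers become four subsets.
print(divide_into_subsets([f"conv{i}" for i in range(1, 9)]))
# [['conv1', 'conv2'], ['conv3', 'conv4'], ['conv5', 'conv6'], ['conv7', 'conv8']]
```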

[0007] In another aspect, some implementations include a computer system including one or more processors and memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform any of the above methods.

[0008] In yet another aspect, some implementations include a non-transitory computer-readable medium, having instructions stored thereon, which when executed by one or more processors cause the processors to perform any of the above methods.

BRIEF DESCRIPTION OF THE DRAWINGS

[0009] For a better understanding of the various described implementations, reference should be made to the Detailed Description below, in conjunction with the following drawings in which like reference numerals refer to corresponding parts throughout the figures.

[0010] Figure 1 is an example data processing environment having one or more servers communicatively coupled to one or more client devices, in accordance with some embodiments.

[0011] Figure 2 is a block diagram illustrating a data processing system, in accordance with some embodiments.

[0012] Figure 3 is an example data processing environment for training and applying a neural network based (NN-based) data processing model for processing visual and/or audio data, in accordance with some embodiments.

[0013] Figure 4A is an example neural network (NN) applied to process content data in an NN-based data processing model, in accordance with some embodiments, and Figure 4B is an example node in the neural network (NN), in accordance with some embodiments.

[0014] Figure 5 is a flow diagram of a comprehensive process for simplifying a first neural network (NN) model, in accordance with some embodiments.

[0015] Figure 6 is a flow diagram of a subset-based filter pruning process for simplifying a neural network model, in accordance with some embodiments.

[0016] Figure 7 is a flow diagram of a pruning pipeline applied to simplify each NN subset of a first NN model shown in Figure 6, in accordance with some embodiments.

[0017] Figure 8 is a flow diagram of a post-pruning process for improving model performance of an NN model (e.g., the second NN model in Figure 6) based on a precision setting of a client device 104, in accordance with some embodiments.

[0018] Figure 9A is a flow diagram of an importance-based filter pruning process for simplifying an NN model, in accordance with some embodiments, and Figure 9B is a table 950 of two pruning settings defining importance coefficients for a plurality of layers of the NN model shown in Figure 9A, in accordance with some embodiments.

[0019] Figure 10A is a flow diagram of a multistep filter pruning process for simplifying a first NN model to a target NN model, in accordance with some embodiments.

[0020] Figure 10B is a flow diagram of another multistep filter pruning process involving model dilation, in accordance with some embodiments.

[0021] Figures 11-13 are three flow diagrams of three filter pruning methods, in accordance with some embodiments.

[0022] Like reference numerals refer to corresponding parts throughout the several views of the drawings.

DETAILED DESCRIPTION

[0023] Reference will now be made in detail to specific embodiments, examples of which are illustrated in the accompanying drawings. In the following detailed description, numerous non-limiting specific details are set forth in order to assist in understanding the subject matter presented herein. But it will be apparent to one of ordinary skill in the art that various alternatives may be used without departing from the scope of claims and the subject matter may be practiced without these specific details. For example, it will be apparent to one of ordinary skill in the art that the subject matter presented herein can be implemented on many types of electronic devices with digital video capabilities.

[0024] Figure 1 is an example data processing environment 100 having one or more servers 102 communicatively coupled to one or more client devices 104, in accordance with some embodiments. The one or more client devices 104 may be, for example, desktop computers 104A, tablet computers 104B, mobile phones 104C, or intelligent, multi-sensing, network-connected home devices (e.g., a camera). Each client device 104 can collect data or user inputs, execute user applications, and present outputs on its user interface. The collected data or user inputs can be processed locally at the client device 104 and/or remotely by the server(s) 102. The one or more servers 102 provide system data (e.g., boot files, operating system images, and user applications) to the client devices 104, and in some embodiments, process the data and user inputs received from the client device(s) 104 when the user applications are executed on the client devices 104. In some embodiments, the data processing environment 100 further includes a storage 106 for storing data related to the servers 102, client devices 104, and applications executed on the client devices 104.

[0025] The one or more servers 102 can enable real-time data communication with the client devices 104 that are remote from each other or from the one or more servers 102. Further, in some embodiments, the one or more servers 102 can implement data processing tasks that cannot be or are preferably not completed locally by the client devices 104. For example, the client devices 104 include a game console that executes an interactive online gaming application. The game console receives a user instruction and sends it to a game server 102 with user data. The game server 102 generates a stream of video data based on the user instruction and user data and provides the stream of video data for display on the game console and other client devices that are engaged in the same game session with the game console. In another example, the client devices 104 include a networked surveillance camera and a mobile phone 104C. The networked surveillance camera collects video data and streams the video data to a surveillance camera server 102 in real time. While the video data is optionally pre-processed on the surveillance camera, the surveillance camera server 102 processes the video data to identify motion or audio events in the video data and shares information of these events with the mobile phone 104C, thereby allowing a user of the mobile phone 104C to monitor the events occurring near the networked surveillance camera in real time and remotely.

[0026] The one or more servers 102, one or more client devices 104, and storage 106 are communicatively coupled to each other via one or more communication networks 108, which are the medium used to provide communications links between these devices and computers connected together within the data processing environment 100. The one or more communication networks 108 may include connections, such as wire, wireless communication links, or fiber optic cables. Examples of the one or more communication networks 108 include local area networks (LAN), wide area networks (WAN) such as the Internet, or a combination thereof. The one or more communication networks 108 are, optionally, implemented using any known network protocol, including various wired or wireless protocols, such as Ethernet, Universal Serial Bus (USB), FIREWIRE, Long Term Evolution (LTE), Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), code division multiple access (CDMA), time division multiple access (TDMA), Bluetooth, Wi-Fi, voice over Internet Protocol (VoIP), Wi-MAX, or any other suitable communication protocol. A connection to the one or more communication networks 108 may be established either directly (e.g., using 3G/4G connectivity to a wireless carrier), or through a network interface 110 (e.g., a router, switch, gateway, hub, or an intelligent, dedicated whole-home control node), or through any combination thereof. As such, the one or more communication networks 108 can represent the Internet, a worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to communicate with one another. At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, governmental, educational and other computer systems that route data and messages.

[0027] Deep learning techniques are applied in the data processing environment 100 to process content data (e.g., video, image, audio, or textual data) obtained by an application executed at a client device 104 to identify information contained in the content data, match the content data with other data, categorize the content data, or synthesize related content data. In these deep learning techniques, data processing models are created based on one or more neural networks to process the content data. These data processing models are trained with training data before they are applied to process the content data. In some embodiments, both model training and data processing are implemented locally at each individual client device 104 (e.g., the client device 104C). The client device 104C obtains the training data from the one or more servers 102 or storage 106 and applies the training data to train the data processing models. Subsequently to model training, the client device 104C obtains the content data (e.g., captures video data via an internal camera) and processes the content data using the trained data processing models locally. Alternatively, in some embodiments, both model training and data processing are implemented remotely at a server 102 (e.g., the server 102A) associated with a client device 104 (e.g., the client device 104A). The server 102A obtains the training data from itself, another server 102 or the storage 106 and applies the training data to train the data processing models. The client device 104A obtains the content data, sends the content data to the server 102A (e.g., in an application) for data processing using the trained data processing models, receives data processing results from the server 102A, and presents the results on a user interface (e.g., associated with the application). The client device 104A itself implements no or little data processing on the content data prior to sending them to the server 102A. Additionally, in some embodiments, data processing is implemented locally at a client device 104 (e.g., the client device 104B), while model training is implemented remotely at a server 102 (e.g., the server 102B) associated with the client device 104B. The server 102B obtains the training data from itself, another server 102 or the storage 106 and applies the training data to train the data processing models. The trained data processing models are optionally stored in the server 102B or storage 106. The client device 104B imports the trained data processing models from the server 102B or storage 106, processes the content data using the data processing models, and generates data processing results to be presented on a user interface locally.

[0028] Figure 2 is a block diagram illustrating a data processing system 200, in accordance with some embodiments. The data processing system 200 includes a server 102, a client device 104, a storage 106, or a combination thereof. The data processing system 200, typically, includes one or more processing units (CPUs) 202, one or more network interfaces 204, memory 206, and one or more communication buses 208 for interconnecting these components (sometimes called a chipset). The data processing system 200 includes one or more input devices 210 that facilitate user input, such as a keyboard, a mouse, a voice-command input unit or microphone, a touch screen display, a touch-sensitive input pad, a gesture capturing camera, or other input buttons or controls. Furthermore, in some embodiments, the client device 104 of the data processing system 200 uses a microphone and voice recognition or a camera and gesture recognition to supplement or replace the keyboard. In some embodiments, the client device 104 includes one or more cameras, scanners, or photo sensor units for capturing images, for example, of graphic serial codes printed on the electronic devices. The data processing system 200 also includes one or more output devices 212 that enable presentation of user interfaces and display content, including one or more speakers and/or one or more visual displays. Optionally, the client device 104 includes a location detection device, such as a GPS (global positioning satellite) or other geo-location receiver, for determining the location of the client device 104.

[0029] Memory 206 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices; and, optionally, includes non-volatile memory, such as one or more magnetic disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other non-volatile solid state storage devices. Memory 206, optionally, includes one or more storage devices remotely located from one or more processing units 202. Memory 206, or alternatively the non-volatile memory within memory 206, includes a non-transitory computer readable storage medium. In some embodiments, memory 206, or the non-transitory computer readable storage medium of memory 206, stores the following programs, modules, and data structures, or a subset or superset thereof:

• Operating system 214 including procedures for handling various basic system services and for performing hardware dependent tasks;

• Network communication module 216 for connecting each server 102 or client device 104 to other devices (e.g., server 102, client device 104, or storage 106) via one or more network interfaces 204 (wired or wireless) and one or more communication networks 108, such as the Internet, other wide area networks, local area networks, metropolitan area networks, and so on;

• User interface module 218 for enabling presentation of information (e.g., a graphical user interface for application(s) 224, widgets, websites and web pages thereof, and/or games, audio and/or video content, text, etc.) at each client device 104 via one or more output devices 212 (e.g., displays, speakers, etc.);

• Input processing module 220 for detecting one or more user inputs or interactions from one of the one or more input devices 210 and interpreting the detected input or interaction;

• Web browser module 222 for navigating, requesting (e.g., via HTTP), and displaying websites and web pages thereof, including a web interface for logging into a user account associated with a client device 104 or another electronic device, controlling the client or electronic device if associated with the user account, and editing and reviewing settings and data that are associated with the user account;

• One or more user applications 224 for execution by the data processing system 200 (e.g., games, social network applications, smart home applications, and/or other web or non-web based applications for controlling another electronic device and reviewing data captured by such devices);

• Model training module 226 for receiving training data and establishing a data processing model for processing content data (e.g., video, image, audio, or textual data) to be collected or obtained by a client device 104;

• Data processing module 228 for processing content data using data processing models 240, thereby identifying information contained in the content data, matching the content data with other data, categorizing the content data, or synthesizing related content data, where in some embodiments, the data processing module 228 is associated with one of the user applications 224 to process the content data in response to a user instruction received from the user application 224;

• One or more databases 230 for storing at least data including one or more of:
o Device settings 232 including common device settings (e.g., service tier, device model, storage capacity, processing capabilities, communication capabilities, etc.) of the one or more servers 102 or client devices 104;
o User account information 234 for the one or more user applications 224, e.g., user names, security questions, account history data, user preferences, and predefined account settings;
o Network parameters 236 for the one or more communication networks 108, e.g., IP address, subnet mask, default gateway, DNS server and host name;
o Training data 238 for training one or more data processing models 240;
o Data processing model(s) 240 for processing content data (e.g., video, image, audio, or textual data) using deep learning techniques; and
o Content data and results 242 that are obtained by and outputted to the client device 104 of the data processing system 200, respectively, where the content data is processed by the data processing models 240 locally at the client device 104 or remotely at the server 102 to provide the associated results 242 to be presented on the client device 104.

[0030] Optionally, the one or more databases 230 are stored in one of the server 102, client device 104, and storage 106 of the data processing system 200. Optionally, the one or more databases 230 are distributed in more than one of the server 102, client device 104, and storage 106 of the data processing system 200. In some embodiments, more than one copy of the above data is stored at distinct devices, e.g., two copies of the data processing models 240 are stored at the server 102 and storage 106, respectively.

[0031] Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above. The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures, modules or data structures, and thus various subsets of these modules may be combined or otherwise re-arranged in various embodiments. In some embodiments, memory 206, optionally, stores a subset of the modules and data structures identified above. Furthermore, memory 206, optionally, stores additional modules and data structures not described above.

[0032] Figure 3 is another example data processing system 300 for training and applying a neural network based (NN-based) data processing model 240 for processing content data (e.g., video, image, audio, or textual data), in accordance with some embodiments. The data processing system 300 includes a model training module 226 for establishing the data processing model 240 and a data processing module 228 for processing the content data using the data processing model 240. In some embodiments, both of the model training module 226 and the data processing module 228 are located on a client device 104 of the data processing system 300, while a training data source 304 distinct from the client device 104 provides training data 306 to the client device 104. The training data source 304 is optionally a server 102 or storage 106. Alternatively, in some embodiments, both of the model training module 226 and the data processing module 228 are located on a server 102 of the data processing system 300. The training data source 304 providing the training data 306 is optionally the server 102 itself, another server 102, or the storage 106. Additionally, in some embodiments, the model training module 226 and the data processing module 228 are separately located on a server 102 and client device 104, and the server 102 provides the trained data processing model 240 to the client device 104.

[0033] The model training module 226 includes one or more data pre-processing modules 308, a model training engine 310, and a loss control module 312. The data processing model 240 is trained according to a type of the content data to be processed. The training data 306 is consistent with the type of the content data, as is the data pre-processing module 308 applied to process the training data 306. For example, an image pre-processing module 308A is configured to process image training data 306 to a predefined image format, e.g., extract a region of interest (ROI) in each training image, and crop each training image to a predefined image size. Alternatively, an audio pre-processing module 308B is configured to process audio training data 306 to a predefined audio format, e.g., converting each training sequence to a frequency domain using a Fourier transform. The model training engine 310 receives pre-processed training data provided by the data pre-processing modules 308, further processes the pre-processed training data using an existing data processing model 240, and generates an output from each training data item. During this course, the loss control module 312 can monitor a loss function comparing the output associated with the respective training data item and a ground truth of the respective training data item. The model training engine 310 modifies the data processing model 240 to reduce the loss function, until the loss function satisfies a loss criterion (e.g., a comparison result of the loss function is minimized or reduced below a loss threshold). The modified data processing model 240 is provided to the data processing module 228 to process the content data.

[0034] In some embodiments, the model training module 226 offers supervised learning in which the training data is entirely labelled and includes a desired output for each training data item (also called the ground truth in some situations). Conversely, in some embodiments, the model training module 226 offers unsupervised learning in which the training data are not labelled. The model training module 226 is configured to identify previously undetected patterns in the training data without pre-existing labels and with no or little human supervision. Additionally, in some embodiments, the model training module 226 offers partially supervised learning in which the training data are partially labelled.

[0035] The data processing module 228 includes a data pre-processing module 314, a model-based processing module 316, and a data post-processing module 318. The data pre-processing module 314 pre-processes the content data based on the type of the content data. Functions of the data pre-processing module 314 are consistent with those of the pre-processing modules 308 and convert the content data to a predefined content format that is acceptable by inputs of the model-based processing module 316. Examples of the content data include one or more of: video, image, audio, textual, and other types of data. For example, each image is pre-processed to extract an ROI or cropped to a predefined image size, and an audio clip is pre-processed to convert to a frequency domain using a Fourier transform. In some situations, the content data includes two or more types, e.g., video data and textual data. The model-based processing module 316 applies the trained data processing model 240 provided by the model training module 226 to process the pre-processed content data. The model-based processing module 316 can also monitor an error indicator to determine whether the content data has been properly processed in the data processing model 240. In some embodiments, the processed content data is further processed by the data post-processing module 318 to present the processed content data in a preferred format or to provide other related information that can be derived from the processed content data.
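A simple sketch of the two pre-processing paths mentioned above, assuming a single-channel image with an ROI crop and nearest-neighbour resizing, and a magnitude spectrum for audio; the 224x224 default size and the specific transforms are assumptions, not taken from the application.

```python
import numpy as np

def preprocess_image(image, roi, size=(224, 224)):
    """Extract a region of interest and resize to a predefined image size.

    `image` is a single-channel (H x W) array and `roi` is
    (top, left, height, width); nearest-neighbour sampling keeps the sketch
    dependency-free.
    """
    top, left, h, w = roi
    crop = image[top:top + h, left:left + w]
    rows = (np.arange(size[0]) * h // size[0]).astype(int)
    cols = (np.arange(size[1]) * w // size[1]).astype(int)
    return crop[np.ix_(rows, cols)]

def preprocess_audio(waveform):
    """Convert a training sequence to the frequency domain with a Fourier transform."""
    return np.abs(np.fft.rfft(waveform))

# Usage.
img = np.arange(100.0).reshape(10, 10)
print(preprocess_image(img, roi=(2, 2, 6, 6), size=(3, 3)).shape)   # (3, 3)
print(preprocess_audio(np.sin(np.linspace(0, 10, 256))).shape)      # (129,)
```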

[0036] Figure 4A is an example neural network (NN) 400 applied to process content data in an NN-based data processing model 240, in accordance with some embodiments, and Figure 4B is an example node 420 in the neural network (NN) 400, in accordance with some embodiments. In this application, each node 420 of the NN 400 corresponds to a filter. “Channel”, “filter”, “neuron”, and “node” are used interchangeably in the context of pruning of the NN 400. The data processing model 240 is established based on the neural network 400. A corresponding model-based processing module 316 applies the data processing model 240 including the neural network 400 to process content data that has been converted to a predefined content format. The neural network 400 includes a collection of nodes 420 that are connected by links 412. Each node 420 receives one or more node inputs and applies a propagation function to generate a node output from the one or more node inputs. As the node output is provided via one or more links 412 to one or more other nodes 420, a weight w associated with each link 412 is applied to the node output. Likewise, the one or more node inputs are combined based on corresponding weights w1, w2, w3, and w4 according to the propagation function. In an example, the propagation function is a product of a non-linear activation function and a linear weighted combination of the one or more node inputs.
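Read as applying a non-linear activation to the linear weighted combination of inputs, a single node's propagation function can be sketched as below; the sigmoid activation is an arbitrary example, not one specified in the application.

```python
import numpy as np

def node_output(inputs, weights, bias=0.0):
    """Propagation function of one node: a non-linear activation (sigmoid here)
    applied to the linear weighted combination of the node inputs."""
    z = np.dot(weights, inputs) + bias
    return 1.0 / (1.0 + np.exp(-z))   # sigmoid activation

# Usage: four node inputs combined with weights w1..w4, as in Figure 4B.
print(node_output(np.array([0.2, 0.5, 0.1, 0.9]),
                  np.array([0.4, -0.3, 0.8, 0.1])))
```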

[0037] The collection of nodes 420 is organized into one or more layers in the neural network 400. Optionally, the one or more layers include a single layer acting as both an input layer and an output layer. Optionally, the one or more layers include an input layer 402 for receiving inputs, an output layer 406 for providing outputs, and zero or more hidden layers 404 (e.g., 404A and 404B) between the input and output layers 402 and 406. A deep neural network has more than one hidden layer 404 between the input and output layers 402 and 406. In the neural network 400, each layer is only connected with its immediately preceding and/or immediately following layer. In some embodiments, a layer 402 or 404B is a fully connected layer because each node 420 in the layer 402 or 404B is connected to every node 420 in its immediately following layer. In some embodiments, one of the one or more hidden layers 404 includes two or more nodes that are connected to the same node in its immediately following layer for down-sampling or pooling the nodes 420 between these two layers. Particularly, max pooling uses a maximum value of the two or more nodes in the layer 404B for generating the node of the immediately following layer 406 connected to the two or more nodes.

[0038] In some embodiments, a convolutional neural network (CNN) is applied in a data processing model 240 to process content data (particularly, video and image data). The CNN employs convolution operations and belongs to a class of deep neural networks 400, i.e., a feedforward neural network that only moves data forward from the input layer 402 through the hidden layers to the output layer 406. The one or more hidden layers of the CNN are convolutional layers convolving with a multiplication or dot product. Each node in a convolutional layer receives inputs from a receptive area associated with a previous layer (e.g., five nodes), and the receptive area is smaller than the entire previous layer and may vary based on a location of the convolutional layer in the convolutional neural network. Video or image data is pre-processed to a predefined video/image format corresponding to the inputs of the CNN. The pre-processed video or image data is abstracted by each layer of the CNN to a respective feature map. By these means, video and image data can be processed by the CNN for video and image recognition, classification, analysis, inpainting, or synthesis.

[0039] Alternatively and additionally, in some embodiments, a recurrent neural network (RNN) is applied in the data processing model 240 to process content data (particularly, textual and audio data). Nodes in successive layers of the RNN follow a temporal sequence, such that the RNN exhibits a temporal dynamic behavior. In an example, each node 420 of the RNN has a time-varying real-valued activation. Examples of the RNN include, but are not limited to, a long short-term memory (LSTM) network, a fully recurrent network, an Elman network, a Jordan network, a Hopfield network, a bidirectional associative memory (BAM) network, an echo state network, an independently recurrent neural network (IndRNN), a recursive neural network, and a neural history compressor. In some embodiments, the RNN can be used for handwriting or speech recognition. It is noted that in some embodiments, two or more types of content data are processed by the data processing module 228, and two or more types of neural networks (e.g., both CNN and RNN) are applied to process the content data jointly.

[0040] The training process is a process for calibrating all of the weights wi for each layer of the learning model using a training data set that is provided in the input layer 402. The training process typically includes two steps, forward propagation and backward propagation, which are repeated multiple times until a predefined convergence condition is satisfied. In the forward propagation, the set of weights for different layers are applied to the input data and intermediate results from the previous layers. In the backward propagation, a margin of error of the output (e.g., a loss function) is measured, and the weights are adjusted accordingly to decrease the error. The activation function is optionally linear, rectified linear unit, sigmoid, hyperbolic tangent, or of other types. In some embodiments, a network bias term b is added to the sum of the weighted outputs from the previous layer before the activation function is applied. The network bias b provides a perturbation that helps the NN 400 avoid overfitting the training data. The result of the training includes the network bias parameter b for each layer.
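A didactic sketch of one forward/backward iteration for a single sigmoid node with a bias term b and a squared-error loss; it only illustrates the two-step training loop described above and is not the application's training procedure.

```python
import numpy as np

def training_step(w, b, x, y_true, lr=0.1):
    """One forward/backward pass for a single sigmoid node with bias term b."""
    # Forward propagation: apply the weights and bias to the input.
    y = 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))
    loss = 0.5 * (y - y_true) ** 2
    # Backward propagation: measure the error margin and adjust w and b.
    grad = (y - y_true) * y * (1.0 - y)       # chain rule for the sigmoid node
    w = w - lr * grad * x
    b = b - lr * grad
    return w, b, loss

w, b = np.array([0.5, -0.2]), 0.0
for _ in range(100):                          # repeat until a convergence condition
    w, b, loss = training_step(w, b, np.array([1.0, 2.0]), 1.0)
print(float(loss))
```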

[0041] Figure 5 is a flow diagram of a comprehensive process 500 for simplifying a first neural network (NN) model 502, in accordance with some embodiments. The first NN model 502 includes a plurality of layers 504, and each layer has a plurality of filters 506. A model compression module 508 is configured to simplify the first NN model 502 to a second NN model 510. In some embodiments, the model compression module 508 applies one or more of: pruning, distillation, or quantization. For example, weights deemed as unnecessary to the first NN model 502 are removed via pruning. The first NN model 502 has a first number of filters 506 associated with a second number of weights, and the second NN model 510 has a third number of filters 506 associated with a fourth number of weights. In an example, the third number is less than the first number. In another example, the first and third numbers are equal; however, the fourth number is less than the second number. Alternatively, in some situations, floating-point numbers of the first NN model 502 are approximated with lower bit width numbers by quantization. Alternatively, in some situations, distillation is applied to transfer knowledge from the first NN model 502 to the second NN model 510 by narrowing a difference between outputs of the first and second NN models 502 and 510. It is noted that in some embodiments, two or three of pruning, quantization, and distillation are implemented jointly to simplify the first NN model 502.

[0042] The model compression module 508 is part of a model training module 226 (e.g., in Figures 2 and 3). In some embodiments, the model compression module 508 is implemented on a server system 102, e.g., using training data provided by the server system 102 or by a storage 106. The server system 102 generates the first NN model 502 by itself or obtains the first NN model 502 from a distinct server 102, storage 106, or client device 104. The second NN model 510 is provided to a client device 104 to be applied for data inference.

[0043] Particularly, in some implementations, the client device 104 provided with the second NN model 510 includes a mobile device having a limited computational and/or storage capability. The first NN model 502 cannot operate efficiently on the client device 104. In some situations, the server system 102 is configured to simplify the first NN model 502 to the second NN model 510 in response to a model simplification request received from the client device 104, and the model simplification request includes information associated with the limited computational and/or storage capability of the client device 104. Alternatively, in some situations, the server system 102 is configured to pre-simplify the first NN model 502 to one or more NN models including the second NN model 510 and select the second NN model 510 in response to receiving a model simplification request from the client device 104.
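For the distillation option mentioned in this passage, one common way to narrow the difference between the outputs of the first (teacher) and second (student) NN models is a softened-softmax cross-entropy; the temperature value and the exact loss form below are assumptions, not taken from the application.

```python
import numpy as np

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Penalty that shrinks as the student's outputs approach the teacher's."""
    def softmax(z):
        e = np.exp(z - np.max(z))
        return e / e.sum()

    p_teacher = softmax(np.asarray(teacher_logits) / temperature)
    p_student = softmax(np.asarray(student_logits) / temperature)
    return -np.sum(p_teacher * np.log(p_student + 1e-12))   # cross-entropy

print(distillation_loss([2.0, 0.5, -1.0], [1.8, 0.7, -0.9]))
```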

[0044] Figure 6 is a flow diagram of a subset-based filter pruning process 600 for simplifying a neural network model 502, in accordance with some embodiments. The first NN model 502 is divided into a plurality of NN subsets 602 (e.g., 602-1, 602-2, ... 602-N), and each NN subset 602 includes a subset of distinct and consecutive layers of the first NN model 502. Each layer 504 only belongs to a single NN subset 602, and no layer or filter belongs to two NN subsets 602. In some embodiments, the plurality of NN subsets 602 cover less than all layers of the first NN model 502 (e.g., 7 of all 8 layers). Conversely, in some embodiments, the plurality of NN subsets 602 cover all layers of the first NN model 502. For example, if the first NN model 502 includes eight layers 504 and every two layers are grouped to a respective NN subset 602, the eight layers 504 are grouped to four NN subsets in total. In some embodiments, each of the NN subsets 602 is separately pruned while the remaining NN subsets 602 in the first NN model 502 remain unchanged. During the pruning operation, each NN subset 602 is trained to minimize a loss function of the first NN model 502 that combines the pruned respective NN subset 602 and the unchanged remaining NN subsets 602. After all of the NN subsets 602 are separately pruned, the pruned NN subsets 602 are extracted from the first NN models 502 of the plurality of pruning pipelines 606 and combined to generate the second NN model 510 as a target NN model 604. That is, the pruned NN subsets 602 are connected into an end-to-end one-stage network that is optionally trained jointly and again based on the loss function. Such a target NN model 604 is provided to and applied by a client device 104 for data inference.
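A minimal sketch of the subset-based pipelines, with prune_layers and train_model as hypothetical callbacks standing in for the pruning method and the loss-driven training: each pipeline copies the full model, prunes only its own subset while the other subsets stay frozen, and the separately pruned subsets are then concatenated into the target model.

```python
def subset_pruning_pipelines(layers, subset_slices, prune_layers, train_model):
    """Prune each NN subset in its own pipeline and combine the results.

    `layers` is the list of layer objects of the full model; `subset_slices`
    holds (start, end) index pairs for the disjoint, consecutive subsets.
    Filter pruning keeps the layer count of a subset unchanged, so slice
    lengths stay valid.  The pipelines are independent and could run in
    parallel on distinct processing units.
    """
    pruned = {}
    for start, end in subset_slices:
        candidate = list(layers)                   # full model; other subsets frozen
        candidate[start:end] = prune_layers(candidate[start:end])
        train_model(candidate)                     # train in the context of the whole model
        pruned[(start, end)] = candidate[start:end]
    # Combine the separately pruned subsets into the target model.
    target = []
    for start, end in subset_slices:
        target.extend(pruned[(start, end)])
    return target
```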

[0045] In an example, the first NN model 502 is divided into a first NN subset 602-1, a second NN subset 602-2, ..., and an N-th NN subset 602-N. These NN subsets 602 are pruned and trained in the context of the first NN model 502 by a plurality of pruning pipelines 606 that are executed separately and independently of each other. In some situations, the plurality of pruning pipelines 606 are executed concurrently and in parallel to one another, e.g., by a plurality of distinct processing units. Specifically, in a first pruning pipeline 606A, the first NN model 502A is trained based on a loss function, while the first NN subset 602-1 is being modified (e.g., to remove a first subset of filters 506 in the first NN subset 602-1 and/or set to zero a subset of weights of each of a second subset of filters 506 in the first NN subset 602-1). In the first pruning pipeline 606A, the second NN subset 602-2 and any other remaining NN subset 602 (e.g., the N-th NN subset 602-N) are not modified. In a second pruning pipeline 606B, the first NN model 502B is trained based on the same loss function, while the second NN subset 602-2 in the first NN model 502B is being modified (e.g., to remove a third subset of filters 506 in the second NN subset 602-2 and/or set to zero a subset of weights of each of a fourth subset of filters 506 in the second NN subset 602-2). The first NN subset 602-1 and any other remaining NN subset 602 (e.g., the N-th NN subset 602-N) are not modified. In the other pruning pipeline(s) 606, each of the remaining NN subsets 602 is similarly pruned like the first and second NN subsets 602-1 and 602-2.

[0046] For each of the pruning pipelines 606, the first NN model 502 has only one NN subset 602 pruned and the remaining NN subsets unchanged. Pruning only one NN subset 602 at a time retains the overall accuracy of the first NN model 502, because all other unpruned NN subsets 602 are already well-trained. A size of the pruned NN subset 602 is reduced, and the computational resource needed for data inference also drops. A corresponding data inference accuracy may be slightly compromised for the pruned NN subset 602, and therefore, the pruned NN subset 602 is not used in a different pipeline 606 to prune any other NN subset 602. Additionally, when the plurality of pruning pipelines 606 prune each of the NN subsets 602 in the context of the first NN model 502 separately, the first NN model 502, which has a relatively large model size, does not need to be pruned as a whole, and the pruning task is therefore divided among the plurality of pruning pipelines 606 in a manageable and efficient manner. Moreover, given that each pruning pipeline 606 handles only part of the pruning task, the pruning method of each pruning pipeline 606 (i.e., the pruning method 706 in Figure 7) can be flexibly selected and does not have to be optimized in many situations. As a result, a potentially heavy pruning process is simplified to multi-model pruning, which is progressive and easy to retrain and fine-tune.

[0047] For clarification, in some embodiments not shown, the plurality of NN subsets 602 cover less than all layers of the first NN model 502 and exclude an unpruned NN subset. After each NN subset 602 is pruned in the respective pruning pipeline 606, the respective NN subsets 602 are combined with one another and with the unpruned NN subset to form the target NN model 604. The unpruned NN subset is not pruned; however, it is involved in each pruning pipeline 606. Conversely, in some embodiments shown in Figure 6, the plurality of NN subsets 602 cover all layers of the first NN model 502, and no unpruned NN subset exists. After each NN subset 602 is pruned in the respective pruning pipeline 606, the respective NN subsets 602 are combined with one another, without any unpruned NN subset, to form the target NN model 604.

[0048] Figure 7 is a flow diagram of a pruning pipeline 606 applied to simplify each NN subset 602 of a first NN model 502 shown in Figure 6, in accordance with some embodiments. The pruning pipeline 606 starts with a first NN model 502 including an NN subset 702 to be pruned and one or more unchanged NN subsets 704. The NN subset 702 is optionally the first NN subset 602-1, second NN subset 602-2, ..., or N-th NN subset 602-N. A pruning method 706 is applied to remove a subset of filters 506 in the NN subset 702 without changing the unchanged NN subset(s) 704. For example, in accordance with the pruning method 706, an importance score is determined for each of the filters in the NN subset 702 by combining weights associated with the respective filter, and the subset of filters 506 having the lowest importance scores is selected to suppress the corresponding floating point operations per second (FLOPs) of the first NN model 502 below a target FLOPs number that measures computational resource usage of the first NN model 502.
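One plausible reading of this selection step, sketched in Python: importance is taken as the L1 sum of each filter's weights (an assumption for "combining weights"), and the lowest-scoring filters are removed until an estimated FLOPs count falls below the target.

```python
import numpy as np

def select_filters_to_remove(filter_weights, filter_flops, total_flops, target_flops):
    """Pick the lowest-importance filters until the FLOPs estimate meets the target.

    `filter_weights[i]` is the weight tensor of filter i and `filter_flops[i]`
    is the FLOPs attributed to that filter (both hypothetical inputs).
    """
    importance = [np.abs(w).sum() for w in filter_weights]
    order = np.argsort(importance)                  # least important first
    removed, flops = [], total_flops
    for i in order:
        if flops <= target_flops:
            break
        removed.append(int(i))
        flops -= filter_flops[i]
    return removed

# Usage: four filters of 100 FLOPs each, prune until under 60% of the original.
weights = [np.random.randn(3, 3, 16) for _ in range(4)]
print(select_filters_to_remove(weights, [100, 100, 100, 100], 400, 240))
```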

[0049] In some embodiments, after the NN subset 702 is pruned, a pruned first NN model 708 is outputted without being re-trained or fine-tuned. Alternatively, in some embodiments, after the NN subset 702 is pruned, the pruned first NN model 708 is further tuned (710). For example, the target NN model 604 (i.e., the second NN model 510) is provided to a client device 104 having a filter setting, e.g., one that fits a register length of a single instruction multiple data (SIMD) computer structure. The respective number of filters in each layer of the pruned NN subset 702 of the pruned first NN model 708 is expanded based on the filter setting of the client device 104. This design avoids padding time cost during deployment on hardware and improves the subsequent inference time of the entire NN model.

[0050] Specifically, in an example, the client device 104 uses a CPU to run the target NN model 604, and the filter setting of the client device 104 requires the number of filters in each layer of the pruned NN subset 702 to be a multiple of 4. The respective number of filters of each layer of the pruned NN subset 702 is expanded to at least the nearest multiple of 4, e.g., from 27 filters to 28 filters. In another example, the client device 104 uses a graphics processing unit (GPU) to run the target NN model, and the filter setting of the client device 104 requires the number of filters in each layer of the pruned NN subset 702 to be a multiple of 16. The respective number of filters of each layer of the pruned NN subset 702 has to be expanded to the nearest multiple of 16, e.g., from 27 filters to 32 filters. In yet another example, the client device 104 uses a digital signal processor (DSP) to run the target NN model 604, and the filter setting of the client device 104 requires the number of filters in each layer of the pruned NN subset 702 to be a multiple of 32. The respective number of filters of each layer of the pruned NN subset 702 has to be expanded to the nearest multiple of 32, e.g., from 27 filters to 32 filters.
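The rounding described in these examples amounts to ceiling division by the hardware multiple, as sketched below.

```python
def expand_filter_count(num_filters, multiple):
    """Round a pruned layer's filter count up to the nearest hardware multiple."""
    return -(-num_filters // multiple) * multiple   # ceiling division

# Examples from the text: CPU (multiple of 4), GPU (16), DSP (32).
print(expand_filter_count(27, 4))    # 28
print(expand_filter_count(27, 16))   # 32
print(expand_filter_count(27, 32))   # 32
```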

[0051] In some embodiments, the server system 102 is configured to simplify the first NN model 502 to the target NN model 604 in response to a model simplification request received from the client device 104, and the model simplification request includes information associated with the filter setting of the client device 104. Alternatively, in some situations, the server system 102 is configured to pre-simplify the first NN model 502 to a plurality of NN model options based on a plurality of known filter settings that are often used by different client devices 104, and select the target NN model from the NN model options in response to receiving a model simplification request from the client device 104.

[0052] Further, in some embodiments, an L2 norm regularization is added to the pruning pipeline 606 and applied (710) to a predefined loss function associated with the pruned first NN model 708. The L2 norm regularization corresponds to a term dedicated to weights of filters 506 of the respective pruned NN subset 602. For example, the term is associated with a square of the weights of filters 506 of the respective pruned NN subset 602. Specifically, the predefined loss function includes a first loss function. Prior to dividing the first NN model 502, the first NN model 502 is trained according to a second loss function, and the first loss function is a combination of the second loss function and the term.

[0053] Based on the expanded channel number and/or L2 norm regularization, the pruned first NN model 708 is trained and fine-tuned (712) to provide an intermediate first NN network 714 having a newly pruned NN subset 702 in each pruning pipeline 606. The L2 norm regularization controls a dynamic range of weights, thereby reducing an accuracy drop after quantization. As a result, the newly pruned NN subsets 702 are obtained for the plurality of NN subsets of the first NN model 502 from the plurality of pruning pipelines 606, respectively. Each of these pruned NN subsets 702 is extracted from the intermediate first NN network 714 in each pruning pipeline 606, and is ready to be combined into the target NN model 604 (i.e., the second NN model 510) that is used by the client device 104.
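One way such a combined loss could be formed is sketched below in PyTorch; this is only an illustration, and the cross-entropy task loss and the weighting factor lam are assumptions standing in for the unspecified second loss function and regularization strength.

    import torch

    def combined_loss(outputs, targets, pruned_subset_params, lam=1e-4):
        """First loss = second (original task) loss + L2 term on the pruned subset's weights."""
        task_loss = torch.nn.functional.cross_entropy(outputs, targets)  # second loss function
        l2_term = sum(w.pow(2).sum() for w in pruned_subset_params)      # square of the weights
        return task_loss + lam * l2_term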

[0054] Figure 8 is a flow diagram of a post-pruning process 800 for improving model performance of an NN model 802 (e.g., the second NN model 510 in Figure 6) based on a precision setting of a client device 104, in accordance with some embodiments. In some embodiments associated with Figures 6 and 7, weights associated with filters of the first NN model 502 maintain a float32 format while the plurality of NN subsets 602 are separately pruned in the plurality of pruning pipelines 606. After the pruned NN subsets 602 are combined to the NN model 802 (i.e., the second NN model 510 in Figure 6), the weights of the un-pruned filters 506 in the NN model 802 are quantized to provide the target NN model 804. For example, the weights are quantized from the float32 format to an int8, uint8, int16, or uint16 format based on the precision setting of the client device 104. Specifically, in an example, the client device 104 uses a CPU to run the target NN model 804, and the CPU of the client device 104 processes 32-bit data. The weights of the NN model 802 are not quantized, and the NN model 802 is provided to the client device 104 directly. In another example, the client device 104 uses one or more GPUs to run the target NN model, and the GPU(s) process 16-bit data. The weights of the NN model 802 are quantized to an int16 format, thereby converting the NN model 802 to the target NN model 804. In yet another example, the client device 104 uses a DSP to run the target NN model, and the DSP processes 8-bit data. The weights of the NN model 802 are quantized to an int8 format, thereby converting the NN model 802 to the target NN model 804. After quantization of the weights, e.g., to a fixed 8-bit format, the target NN model 804 has fewer multiply-accumulate operations (MACs) and a smaller size, and is hardware-friendly during deployment on the client device 104.
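A minimal post-training quantization sketch follows, assuming a symmetric per-tensor scheme with the bit width taken from the device precision setting (8 for the DSP example, 16 for the GPU example); the per-tensor scaling choice is an assumption, not a statement of the scheme used in the described embodiments.

    import numpy as np

    def quantize_weights(weights: np.ndarray, bits: int = 8):
        """Symmetric per-tensor quantization of float32 weights to a signed integer format."""
        qmax = 2 ** (bits - 1) - 1                          # 127 for int8, 32767 for int16
        scale = max(float(np.abs(weights).max()) / qmax, 1e-12)
        q = np.clip(np.round(weights / scale), -qmax - 1, qmax)
        dtype = np.int8 if bits == 8 else np.int16
        return q.astype(dtype), scale                       # keep the scale to dequantize at inference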

[0055] In some embodiments, the server system 102 is configured to simplify the first NN model 502 to the target NN model 804 in response to a model simplification request received from the client device 104, and the model simplification request includes information associated with the precision setting of the client device 104. Alternatively, in some situations, the server system 102 is configured to quantize the second NN model 510 pruned from the first NN model 502 to a plurality of NN model options based on a plurality of known precision settings that are often used by different client devices 104, and select the target NN model 804 from the NN model options in response to receiving a model simplification request from the client device 104.

[0056] Generally, in some embodiments, a progressive multi-model compression pipeline (Figure 6) is established for a deep neural network using multi-model parallel pruning (Figure 7). In some embodiments, fixed 8-bit end-to-end post-training quantization (Figure 8) is applied to further compress the deep neural network. The L2 norm regularization (710) is optionally applied during pruning to facilitate subsequent quantization, while filter alignment is also used during pruning to improve hardware friendliness of the resulting target NN model. Because the deep neural network is pruned to be SIMD-friendly, it fits a corresponding hardware accelerator and increases deployment efficiency.

[0057] Various implementations of this application are directed to improving efficiency of a neural network by pruning filters, thereby reducing model storage usage and computational resource usage during a data inference stage. The core of filter pruning is a search problem of identifying a subset of filters to be removed such that a certain compression level is achieved with an acceptable loss in computational accuracy. Filter pruning is classified into predefined pruning and adaptive pruning. In predefined pruning, different metrics are applied to evaluate the importance of filters within each layer locally without changing a training loss. Model performance can be enhanced by fine-tuning after filter pruning. For example, an L2 norm of filter weights can be used as a measure of importance. Alternatively, a difference between unpruned and pruned neural networks is measured and applied as an importance score. In another example, a rank of feature maps is used as an importance measure. The rank of the feature maps can provide more information than L1/L2 norms and achieve better compression results.

[0058] Conversely, in adaptive pruning, a pruned structure is learned automatically when hyper-parameters (e.g., importance coefficients ai and bi) are given to determine a computational complexity. An adaptive pruning method can embed a pruning demand into the training loss and employ joint-retraining optimization to find an adaptive decision. For example, Lasso regularization is used with a filter norm to force filter weights to zero. Lasso regularization is added on a batch normalization layer to achieve pruning during training. In another example, a scaling factor parameter is used to learn sparse structure pruning, where filters corresponding to a scaling factor of zero are removed. In some embodiments, AutoML is applied for automatic network compression. The rationale is to explore the total space of network configurations for a final best candidate.

[0059] Figure 9A is a flow diagram of an importance-based filter pruning process 900 for simplifying an NN model 902, in accordance with some embodiments, and Figure 9B is a table 950 of two pruning settings 952 and 954 defining importance coefficients for a plurality of layers 906 of the NN model 902 shown in Figure 9A, in accordance with some embodiments. In some embodiments, the filter pruning process 900 is implemented at a server system 102 to prune the NN model 902 to a target NN model 904, and the target NN model 904 is provided to a client device 104. In some embodiments, the filter pruning process 900 is implemented directly at the client device 104 to prune the NN model 902 to the target NN model 904. The NN model 902 has a plurality of layers 906, and each layer 906 has a respective number of filters 908. The NN model 902 is pruned to a plurality of pruned NN models 910. In an example, each of the plurality of pruned NN models 910 has a pruned number of filters 908, and the NN model 902 has a first number of filters. A difference of the pruned number and the first number is equal to a predefined difference value or a predefined percentage of the first number. In another example, each of the plurality of pruned NN models 910 can be operated with a respective FLOPS number that is equal to or less than a predefined FLOPS number, and a respective subset of filters 908 are removed from the NN model 902 to obtain the respective pruned NN model 910 corresponding to the respective FLOPS number. After the plurality of pruned NN models 910 are generated, the target NN model 904 is selected from the plurality of pruned neural network models 910 based on a model selection criterion, e.g., by Auto Machine Learning (AutoML).

[0060] Specifically, for each pruned NN model 910, a respective distinct set of importance coefficients (e.g., those in the pruning setting 952) are assigned to each of the plurality of layers 906 in the NN model 902. For example, a first layer 906A is assigned with a first set of importance coefficients (e.g., a1 and b1), and a second layer 906B is assigned with a second set of importance coefficients (e.g., a2 and b2). A third layer 906C is assigned with a third set of importance coefficients (e.g., a3 and b3), and a fourth layer 906D is assigned with a fourth set of importance coefficients (e.g., a4 and b4). An importance score I is determined for each filter 908 based on the respective subset of importance coefficients of a respective layer 906 to which the respective filter 908 belongs. For example, the filter 908A is included in the third layer 906C, and an importance score I is determined based on the third subset of importance coefficients a3 and b3 assigned to the third layer 906C. The filters 908 of the entire NN model 902 are ranked based on the importance score I of each filter 908. In accordance with ranking of the filters 908, a respective subset of filters 908 are removed based on their importance scores I, thereby allowing the NN model 902 to be pruned to the respective pruned NN model 910. Specifically, each of the plurality of pruned NN models 910 has a pruned number of filters 908 satisfying a predefined difference value, percentage, or FLOPS number, and the pruned number of top-ranked filters 908 are selected based on the importance score I of each filter 908 to generate the respective pruned NN model 910.
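An illustrative sketch of producing one pruned candidate follows, assuming the average rank values of the filters are already available per layer; the function and argument names are illustrative, and the linear combination a*R + b mirrors the score formation described above.

    import numpy as np

    def prune_candidate(avg_ranks_per_layer, coeffs_per_layer, keep_count):
        """Score every filter globally with its layer's (a, b) coefficients and keep the top ones.

        avg_ranks_per_layer: list of 1-D arrays, average rank value of each filter per layer.
        coeffs_per_layer:    list of (a, b) pairs, one distinct pair per layer for this candidate.
        keep_count:          pruned number of filters to retain across the whole model.
        """
        scores, index = [], []
        for layer, (ranks, (a, b)) in enumerate(zip(avg_ranks_per_layer, coeffs_per_layer)):
            for f, r in enumerate(ranks):
                scores.append(a * r + b)         # importance score I for this filter
                index.append((layer, f))
        order = np.argsort(scores)[::-1]         # highest importance first
        kept = [index[i] for i in order[:keep_count]]
        return kept                              # (layer, filter) pairs to keep; the rest are pruned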

[0061] In an example, for each pruned NN model 910, the distinct set of importance coefficients includes a first importance coefficient ai and a second importance coefficient bi for each layer 906. For each layer 906, the first importance coefficient ai is selected from a first set of importance coefficient values in a first range, and the second importance coefficient bi is selected from a second set of importance coefficient values in a second range. In some embodiments, the first range is equal to the second range, e.g., [0, 2] or {0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2}. Alternatively, the first range is not equal to the second range. Referring to Figure 9B, both of the first importance coefficient ai and the second importance coefficient bi of each layer 906 are selected from a range of [0, 1.5]. The importance coefficients ICS-1 of the pruning setting 952 correspond to a first pruned NN model 910A, and the importance coefficients ICS-2 of the pruning setting 954 correspond to a second pruned NN model 910B. At least one of the importance coefficients corresponding to the first pruned NN model 910A is not equal to a corresponding one of the importance coefficients corresponding to the second pruned NN model 910B. In this example, three importance coefficients are different between the first and second pruned NN models 910A and 910B, and include both of the importance coefficients a1 and b1 of the first layer 906A and the second importance coefficient b4 of the fourth layer 906D.

[0062] In some embodiments, an average rank value is determined for each filter 908 in the NN model 902 to determine the respective importance score I of the respective filter 908. For example, the average rank value of each filter 908 is determined using a batch of predefined images. When the batch of predefined images are inputted into the NN model 902, each filter 908 outputs a feature map 912. The average rank value of each filter 908 is determined based on characteristics of the feature map 912, and indicates an importance rank of the respective filter 908 for processing the batch of predefined images. In another example, the average rank value of each filter 908 of the NN model 902 is an L2 norm of all weights of the respective filter 908. The importance score I of each filter 908 is generated by combining the average rank value of the respective filter 908 and the respective subset of importance coefficients of the respective layer 906 to which the respective filter 908 belongs. For each pruned NN model 910, if the distinct set of importance coefficients includes a first importance coefficient ai and a second importance coefficient bi for each layer 906, the average rank value Ri for each filter 908 in the respective layer 906 is modified to one of ai·||Ri||2 + bi (L2 norm) and ai·||Ri||1 + bi (L1 norm) using the first and second importance coefficients to generate the importance score I of each filter 908. Each of the plurality of pruned NN models 910 has a pruned number of filters 908, and the pruned number of top-ranked filters 908 are selected based on the importance score I of each filter 908 to generate the respective pruned NN model, while the low-ranked filters 908 are removed.
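For illustration, one plausible way to obtain the average rank value from a batch of feature maps is sketched below; the use of the matrix rank of each feature map averaged over the batch is an assumption about the "characteristics of the feature map" mentioned above, not a definition from this description.

    import numpy as np

    def average_rank_value(feature_maps: np.ndarray) -> float:
        """Average matrix rank of one filter's feature maps over a batch of predefined images.

        feature_maps: array of shape (batch, height, width), the filter's output per image.
        """
        ranks = [np.linalg.matrix_rank(fm) for fm in feature_maps]
        return float(np.mean(ranks))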

[0063] After the plurality of pruned NN models 910 are generated, the target NN model 904 is selected from the plurality of pruned neural network models 910. In some embodiments, in accordance with a first model selection criterion, the target NN model 904 is selected from the plurality of pruned NN models 910 by training each of the plurality of pruned NN models for a predefined number of cycles and selecting the target NN model 904 that has a loss function result better than any other pruned NN models 910. Alternatively, in some embodiments, in accordance with a second model selection criterion, each of the plurality of pruned NN models 910 is completely trained, e.g., to minimize a corresponding loss function, and the target NN model 904 is selected if the target NN model 904 has used the least number of training cycles among the plurality of pruned NN models 910. Additionally, in some embodiments, in accordance with a third model selection criterion, the target NN model 904 corresponds to the least number of FLOPS among the plurality of pruned NN models, e.g., when the NN model 902 is pruned to the plurality of pruned NN models 910 having the same pruned number of filters 908.
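A minimal sketch of the first selection criterion follows; train_for_cycles and evaluate_loss are hypothetical helpers standing in for the unspecified training and validation routines.

    def select_target_model(pruned_models, train_for_cycles, evaluate_loss, cycles=5):
        """First selection criterion: brief training, then keep the candidate with the lowest loss."""
        best_model, best_loss = None, float("inf")
        for model in pruned_models:
            train_for_cycles(model, cycles)      # predefined number of training cycles
            loss = evaluate_loss(model)
            if loss < best_loss:
                best_model, best_loss = model, loss
        return best_model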

[0064] In some embodiments, the target NN model 904 includes a plurality of weights associated with the respective number of filters 908 of each layer 906. The NN model 902 and the pruned NN models 910 maintain a float32 format to obtain the target NN model 904. The plurality of weights of the target NN model 904 are quantized (912), e.g., to an int8, uint8, int16, or uint16 format based on the precision setting of the client device 104. For example, the client device 104 uses a CPU to run the target NN model 904, and the CPU of the client device 104 processes 32-bit data. The weights of the target NN model 904 are not quantized, and the target NN model 904 is provided to the client device 104 directly. In another example, the client device 104 uses one or more GPUs to run the target NN model 904, and the GPU(s) process 16-bit data. The weights of the target NN model 904 are quantized to an int16 format. In yet another example, the client device 104 uses a DSP to run the target NN model, and the DSP processes 8-bit data. The weights of the target NN model 904 are quantized to an int8 format.

[0065] Referring to Figures 9A and 9B, in some embodiments, the filter pruning process 900 is applied based on automatically learned, global, and high rank (LGHR) feature maps. This filter pruning process 900 optionally takes into account both a rank of feature maps 912 and Auto Machine Learning (AutoML) filter pruning. The LGHR feature maps provide a global ranking of the filters 908 across different layers 906 in the NN model 902. Hyper-parameters of the LGHR feature maps (e.g., importance coefficients ai and bi) are automatically searched, thereby reducing human labor and time for parameter settings. Specifically, each filter 908 corresponds to a feature map generated at an output of the respective filter 908 when the respective filter 908 is applied to an output of a previous layer 906 in the NN model 902.

[0066] The filter pruning process 900 is directed to global filter pruning that uses high rank feature maps, and includes three stages: a rank generation stage, a search space generation stage, and an evaluation stage. Feature maps 912 are generated from the filters 908 of the NN model 902, and are used to generate the average rank values of the feature maps 912 and determine importance scores I of the filters in the NN model 902. A regularized evolutionary algorithm is optionally applied to generate a search space based on the importance scores of the filters 908 and fine-tune a candidate pruned architecture (i.e., a selected pruned NN model 910). During the rank generation stage, a batch of images from a dataset are run through the layers 906 of the NN model 902 to get the feature maps 912 and estimate the average rank value of each feature map 912 associated with a respective filter 908.

[0067] During the search space generation stage, the low rank feature maps or filters 908 are globally identified and removed across layers. In some embodiments, the importance score I of a filter 908 in each layer 906 is determined by one of the following two equations:

    I = al·||Rl||2 + bl (L2 norm), or
    I = al·||Rl||1 + bl (L1 norm),

where l is an index of a layer 906, and al and bl are two learnable parameters (also called importance coefficients) for global modification, which can scale and shift the importance scores I of filters 908 in each layer 906. Rl is the average rank value of all the feature maps of a CNN layer 906. LGHR ranks the filters by the importance score I of each filter 908 and removes low-score filters 908. Based on these defined importance scores I of individual filters 908, a regularized evolutionary algorithm (EA) is applied to generate a network architecture search space as explained above. Specifically, the network architecture search space includes the plurality of pruned NN models 910, and each pruned NN model 910 is generated based on a distinct set of importance coefficients for the plurality of layers 906.

[0068] In some embodiments, during the evaluation stage, the target NN model 904 is selected from the pruned network search space and fine-tuned via several gradient steps. A loss between the NN models 902 and 904 is used to select the importance coefficients of the layers 906. In some situations, the importance coefficients are reset iteratively until an optimized target NN model 904 is identified. In some embodiments, first importance coefficients of layers 906 are selected in a first range to generate a first batch of pruned NN models 910. One or two first target NN models 904 are selected from the plurality of pruned NN models 910, and the first importance coefficients corresponding to the one or two first target NN models 904 are used to narrow down the first range of importance coefficients to a second range. Second importance coefficients of layers 906 are selected in the second range to generate a second batch of pruned NN models 910. One or two second target NN models 904 are selected from the second batch of pruned NN models 910. The target NN model 904 is outputted to the client device 104, or the second range for importance coefficients continues to be narrowed down iteratively.
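The iterative range narrowing can be sketched as a coarse-to-fine search; build_and_score_candidates is a hypothetical helper that prunes, fine-tunes, and returns a loss for each sampled coefficient value, and the sampling density is an arbitrary choice.

    import numpy as np

    def narrow_coefficient_range(low, high, build_and_score_candidates, rounds=3, samples=8):
        """Iteratively shrink the importance-coefficient range around the best candidates."""
        for _ in range(rounds):
            values = np.linspace(low, high, samples)
            losses = build_and_score_candidates(values)       # one pruned candidate per value
            best = values[np.argsort(losses)[:2]]             # one or two best target models
            low, high = float(best.min()), float(best.max())  # narrowed range for the next batch
        return low, high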

[0069] In LGHR, an importance score measure is used based on the rank of the feature maps associated with different filters 908. An AutoML pipeline is optionally used to search for a target NN model 904. In some embodiments, LGHR takes into account both feature map ranking and AutoML filter pruning. LGHR provides a global ranking solution to combine filters 908 across different layers 906. Global ranking analysis makes it easy to set a pruning target and find an optimal target NN model 904. Additionally, LGHR modifies adaptive pruning, e.g., by identifying and removing low rank feature maps. LGHR uses AutoML to learn hyper-parameters, which can greatly reduce the workload and time for hyper-parameter settings.

[0070] Figure 10A is a flow diagram of a multistep filter pruning process 1000 for simplifying a first NN model 1002 to a target NN model 1004, in accordance with some embodiments. The first NN model 1002 has a plurality of layers 1006, and each layer 1006 has a plurality of filters 1008. The first NN model 1002 has a first model size, and is operated with at least a first computational resource usage that is measured in FLOPS. The first NN model 1002 is required to be compressed to a target model size corresponding to a target computational resource usage measured in FLOPS. In an example, the first NN model 1002 is operated with at least 4G floating point operations per second (i.e., 4G FLOPS), and needs to be compressed by 75% to the target NN model 1004 having a target model size corresponding to 1G FLOPS or less. In some embodiments, a single pruning operation is implemented to prune the first NN model 1002 down to the target model size. Alternatively, in some embodiments, a sequence of pruning operations 1010 are implemented to reach the target model size.

[0071] The intermediate model sizes are determined to approach the target model size gradually from the first NN model 1002, e.g., with an equal or varying step size. Specifically, the first model size and the target model size are used to derive one or more intermediate model sizes. The one or more intermediate model sizes and the target model size form a sequence of decreasing model sizes ordered according to magnitudes of these model sizes. When the sequence of pruning operations 1010 are implemented, each pruning operation 1010 corresponds to a respective model size in the ordered sequence of model sizes. In each pruning operation 1010, a respective subset of filters 1008 of the first NN model 1002 are identified to be removed based on the respective model size, and the first NN model 1002 is updated to remove the respective subset of filters 1008, thereby reducing a size of the first NN model 1002 to the respective model size. Additionally, in some embodiments, during each pruning operation 1010, each updated first NN model 1012 is optionally trained according to a predefined loss function. That is, after the respective subset of filters 1008 is removed, weights of remaining filters 1008 of the updated first NN model 1012 are adjusted based on the predefined loss function during training.

[0072] In an example, the first NN model 1002 is operated with the first model size of 4G FLOPS, and the target model size is 1G FLOPS. Two intermediate model sizes of 2G FLOPS and 3G FLOPS are derived based on the first and target model sizes. The ordered sequence of model sizes is 3G, 2G, and 1G FLOPS. A sequence of 3 pruning operations 1010 are implemented to reduce the computational resource usage of the first NN model 1002 from 4G FLOPS to 3G, 2G, and 1G FLOPS, successively and respectively. Optionally, at the end of each pruning operation, the updated first NN model 1012 is trained such that the weights of the remaining filters 1008 are adjusted based on the predefined loss function. In another example, the first NN model 1002 is operated with the first model size of 1000 filters, and the target model size is 400 filters. Two intermediate model sizes of 800 and 600 filters are derived. A sequence of 3 pruning operations 1010 are implemented to reduce the first NN model 1002 from 1000 filters to 800, 600, and 400 filters, successively and respectively. Alternatively, in some embodiments, one or more intermediate model sizes are determined to form a sequence of decreasing model sizes having varying step sizes. An example of the sequence is 800, 500, and 400 filters.
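A compact sketch of the equal-step schedule follows; prune_to_size is a hypothetical helper implementing one pruning operation (e.g., removing the lowest-importance filters until the given size is reached), and an optional fine-tuning step after each operation is omitted for brevity.

    def multistep_prune(model, first_size, target_size, steps, prune_to_size):
        """Derive an ordered sequence of decreasing model sizes and prune toward each in turn."""
        step = (first_size - target_size) / steps
        sizes = [first_size - step * (i + 1) for i in range(steps)]  # e.g., 3G, 2G, 1G from 4G
        for size in sizes:
            model = prune_to_size(model, size)   # optionally re-train/fine-tune after each step
        return model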

[0073] In some embodiments, the updated NN model 1012 having the target model size (i.e., the target NN model 1004) is provided to a client device 104, and the target model size satisfies a target computation criterion associated with the client device 104. For example, the client device 104 is a mobile phone, and the target model size is 100 filters. Conversely, the client device 104 is a tablet computer, and the target model size is 400 filters.

[0074] In some embodiments, each of the pruning operations 1010 receives an input NN model. A first pruning operation receives the first NN model 1002, and any following pruning operation receives the updated first NN model 1012 of an immediately preceding pruning operation 1010. For each pruning operation, an importance factor is determined for each filter 1008 of the input NN model, and a respective subset of filters 1008 having the smallest importance scores are selected among the filters of the input NN model. Further, in some embodiments, the importance factor of each filter 1008 is determined based on a sum of weights of the respective filter 1008 applied to convert inputs of the respective filter 1008. This sum is optionally a weighted sum of the weights of the respective filter 1008. In some situations, the importance factor of each filter 1008 is based on an L1 norm, e.g., is an unweighted sum or a weighted sum of an absolute value of the weights of the respective filter 1008. In some situations, the importance factor of each filter 1008 is based on an L2 norm, e.g., is a square root of an unweighted sum or a weighted sum of squares of the weights of the respective filter 1008. For each pruning operation 1010, the importance factors of the filters 1008 of the input NN model (which is a subset of the first NN model 1002) are ranked and applied to determine the subset of filters 1008 to be removed.

[0075] Figure 10B is a flow diagram of another multistep filter pruning process 1050 involving model dilation, in accordance with some embodiments. The filter pruning process 1050 includes two stages: dilating a first NN model 1002 to a dilated NN model 1014 and pruning the dilated NN model 1014 by a sequence of pruning operations to a target NN model 1004. In an example, the first NN model 1002 is dilated such that a number of filters is increased by 1.5 or 2 times. The number of filters is increased in the dilated NN model 1014 compared with the first NN model 1002 from which the dilated NN model 1014 is dilated. In some embodiments, one or more supplemental layers of filters 1008 are added to the plurality of layers 1006 of the first NN model 1002. In some embodiments, one or more supplemental filters 1008 are selectively added to each of a subset of the plurality of layers 1006 of the first NN model 1002.

[0076] In some embodiments, the dilated NN model 1014 has a dilated model size, and the one or more intermediate sizes are derived based on the dilated model size and the target model size that is determined based on the client device 104 receiving the target NN model 1004. The sequence of pruning operations 1010 are initiated from the dilated NN model 1014. In an example, the first NN model 1002 is operated with the first model size of 4G FLOPS, and the target model size is 1G FLOPS. The first model size is dilated to 6G FLOPS. One intermediate model size of 3.5G FLOPS is derived based on the dilated and target model sizes. The ordered sequence of model sizes is 3.5G and 1G FLOPS. A sequence of 2 pruning operations 1010 are implemented to reduce the computational resource usage of the dilated NN model 1014 from 6G FLOPS to 3.5G and 1G FLOPS, successively and respectively. Alternatively, in some embodiments, the sequence of model sizes is not equally spaced. Three intermediate model sizes of 4.5G, 3G, and 2G FLOPS are derived based on the dilated and target model sizes. A sequence of four pruning operations 1010 are implemented to reduce the computational resource usage of the dilated NN model 1014 from 6G FLOPS to 4.5G, 3G, 2G, and 1G FLOPS, successively and respectively. In some embodiments, due to dilation of the first NN model 1002, the multistep filter pruning process 1050 optionally includes more pruning operations than the multistep filter pruning process 1000 to reach the same target model size, while each pruning operation is similarly implemented, e.g., based on importance scores of the filters 1008 calculated based on the L1 or L2 norm.
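A small sketch of the dilation stage follows, expressed simply in terms of per-layer filter counts; in practice supplemental filters would be added and the dilated model re-trained before the sequence of pruning operations begins, and the ratio of 1.5 here is only one of the example ratios mentioned above.

    def dilate_model(layer_filter_counts, ratio=1.5):
        """Dilate a model by increasing each layer's filter count by a predefined ratio."""
        return [int(round(count * ratio)) for count in layer_filter_counts]

    # Example: a 3-layer model dilated by 1.5x before multistep pruning.
    print(dilate_model([64, 128, 256], ratio=1.5))   # -> [96, 192, 384]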

[0077] Referring to both Figures 10A and 10B, in some embodiments, during each pruning operation 1010, each updated first NN model 1012 is optionally trained according to a predefined loss function. After the respective subset of filters 1008 is removed, weights of remaining filters 1008 of the updated first NN model 1012 are adjusted based on the predefined loss function during training. Alternatively, in some embodiments, the updated first NN model 1012 is not trained after each pruning operation 1010. Rather, the updated first NN model 1012 obtained after the entire sequence of pruning operations 1010 is trained based on the predefined loss function, and fine-tuned to the target NN model 1004.

[0078] In some embodiments, the target NN model 1004 includes a plurality of weights associated with the respective number of filters 1008 of each layer 1006. The first NN model 1002 and the updated first NN model 1012 generated by each pruning operation 1010 maintain a float32 format. The plurality of weights of the target NN model 1004 are quantized (1016), e.g., to an int8, uint8, int16, or uint16 format based on the precision setting of the client device 104 that is configured to receive and use the target NN model 1004. Additionally, in some embodiments, the client device 104 receiving and using the target NN model 1004 has a filter setting that defines numbers of filters fitting a register length of a SIMD computer structure. After the first NN model 1002 is pruned to the updated NN model 1012 with the sequence of pruning operations 1010, the updated NN model 1012 is tuned based on the filter setting. The number of filters in each layer of the updated NN model 1012 is expanded based on the filter setting of the client device 104. For example, the number of filters in each layer of the updated NN model 1012 is expanded to a multiple of 8, 16, or 32 based on the filter setting of the client device 104.

[0079] In summary, each pruning operation 1010 is configured to reach a distinct model size or a distinct computational resource usage that is measured in FLOPS. The distinct model size or resource usage decreases gradually with the respective pruning operation in the sequence of pruning operations. The sequence of pruning operations is thereby implemented to prune the first NN model 1002 (Figure 10A) or the dilated NN model 1014 (Figure 10B) to reach each distinct model size or computational resource usage successively. Different pruning methods can be applied in each pruning operation 1010 of the multistep filter pruning processes 1000 and 1050. Different pruning operations 1010 in the same sequence can use different pruning methods, such as filter-wise pruning and AutoML pruning. As such, the multistep filter pruning process 1050 can provide the target NN model 1004 with a high compression rate (e.g., greater than a threshold compression rate) that is hard to reach with a single pruning operation.

[0080] Figures 11-13 are three flow diagrams of three filter pruning methods 1100, 1200, and 1300, in accordance with some embodiments. For convenience, each of the methods 1100, 1200, and 1300 is described as being implemented by a computer system (e.g., a client device 104, a server 102, or a combination thereof). An example of the client device 104 is a mobile phone. In an example, each of the methods 1100, 1200, and 1300 is applied to prune filters of a corresponding neural network model. Each of the methods 1100, 1200, and 1300 is, optionally, governed by instructions that are stored in a non-transitory computer readable storage medium and that are executed by one or more processors of the computer system. Each of the operations shown in Figures 11-13 may correspond to instructions stored in a computer memory or non-transitory computer readable storage medium (e.g., memory 206 of the system 200 in Figure 2). The computer readable storage medium may include a magnetic or optical disk storage device, solid state storage devices such as Flash memory, or other non-volatile memory device or devices. The instructions stored on the computer readable storage medium may include one or more of: source code, assembly language code, object code, or other instruction format that is interpreted by one or more processors. Some operations in each of the methods 1100, 1200, and 1300 may be combined and/or the order of some operations may be changed.

[0081] Referring to Figures 6 and 11, a computer system obtains (1102) a neural network model 502 having a plurality of layers 504, and each layer 504 has a respective number of filters 506. The neural network model 502 is divided (1104) into a plurality of neural network subsets 602, and each neural network subset 602 includes a subset of distinct and consecutive layers 504 of the neural network model 502. The computer system separately prunes (1106) each neural network subset 602 while maintaining the remaining neural network subsets 602 in the neural network model 502. In some embodiments, two or more neural network subsets 602 are pruned concurrently and in parallel. The pruned neural network subsets are combined (1108) with one another to generate a target neural network model 510, i.e., the plurality of neural network subsets 602 that are pruned are combined into the target neural network model 510. In some embodiments, the target neural network model 510 is trained (1110) according to a predefined loss function.

[0082] In some embodiments, the target neural network model 510 is provided (1110) to an electronic device (e.g., a client device 104). While pruning each neural network subset 602, the computer system controls (1112) the respective number of filters in each layer 504 of the respective neural network subset 602 according to a filter setting of the electronic device. Further, in some embodiments, each layer of the target neural network 510 has a respective updated number of filters, and the respective updated number is a multiple of 4.

[0083] In some embodiments, the neural network model 502 includes a plurality of weights associated with the respective number of filters 506 of each layer 504. The computer system maintains a float32 format for the plurality of weights while separately pruning each neural network subset 602. After generating the target neural network model 510, the computer system quantizes the plurality of weights, e.g., from a float32 format to an int8, uint8, int16, or uint16 format. Further, in some embodiments, the plurality of weights are quantized based on a precision setting of an electronic device. The target neural network model 510 having quantized weights is provided to the electronic device.

[0084] In some embodiments, each neural network subset 602 is separately pruned. For each neural network subset 602 (e.g., 602-1), the computer system updates the neural network model 502 by replacing the respective neural network subset with a respective pruned neural network subset 610 (e.g., 610-1), training the updated neural network model (e.g., in the pruning pipeline 606A) according to a predefined loss function, and extracting the respective neural network subset 602 from the updated neural network model after the updated neural network model is trained. Additionally, in some embodiments associated with L2 regularization, the predefined loss function includes a term dedicated to weights of filters of the respective pruned neural network subset 610. Further, in some embodiments, the predefined loss function applied to train the updated neural network model in the pruning pipeline 606 includes a first loss function. Prior to dividing the neural network model, the computer system trains the neural network model 502 according to a second loss function. The first loss function is a combination of the second loss function and the term.

[0085] In some embodiments, for each neural network subset 602, the computer system selects a respective set of filters 506 to be removed from respective neural network subset 602 based on a pruning method. For example, an importance score is determined for each filter 506 in the respective neural network subset 602, and the respective set of filters 506 having the smallest importance scores are selected among the filters 506 in the respective neural network subset 602. The respective set of filters 506 are removed from the respective neural network subset to obtain the respective pruned neural network subset 610.

[0086] Referring to Figures 9A-9B and 12, a computer system obtains (1202) a neural network model 902 having a plurality of layers 906, and each layer 906 has a respective number of filters 908. The computer system prunes (1204) the neural network model 902 to a plurality of pruned neural network models 910. For each pruned neural network model 910, the computer system assigns (1206) a respective distinct set of importance coefficients for the plurality of layers 906, and determines (1208) an importance score I of each filter 908 based on the respective distinct set of importance coefficients. The respective distinct set of importance coefficients includes a subset of importance coefficients for a respective layer to which the respective filter belongs. To obtain each pruned neural network model 910, the computer system ranks (1210) the filters 908 based on the importance score of each filter 908, and in accordance with ranking of the filters, prunes (1212) the neural network model 902 to the respective pruned neural network model 910 by removing a respective subset of filters. The target neural network model 904 is selected (1214) from the plurality of pruned neural network models 910 based on a model selection criterion.

[0087] In some embodiments, the computer system determines an average rank value for each filter 908 of the neural network model 902. For each pruned neural network model 910, the importance score of each filter 908 is determined by combining the average rank value of the respective filter 908 and the subset of importance coefficients of the respective layer to which the respective filter belongs. Additionally, in some embodiments, for each pruned neural network model 910, the distinct set of importance coefficients includes a first importance coefficient ai and a second importance coefficient bi for each layer 906. For each layer 906, the first importance coefficient ai is selected from a first set of importance coefficients in a first range, and the second importance coefficient bi is selected from a second set of importance coefficients in a second range. The first range is optionally equal to or distinct from the second range. Further, in some embodiments, for each layer 906, the average rank value Ri for each filter 908 in the respective layer 906 is modified to ai·||Ri||2 + bi (L2 norm) using the first and second importance coefficients ai and bi to generate the importance score I of each filter 908. Alternatively, in some embodiments, for each layer 906, the average rank value Ri for each filter 908 in the respective layer 906 is modified to ai·||Ri||1 + bi (L1 norm) using the first and second importance coefficients ai and bi to generate the importance score I of each filter 908. At least one of the first and second importance coefficients for at least one layer is distinct for every two pruning settings (e.g., ICS-1 and ICS-2 in Figure 9B) of two distinct pruned neural network models 910. Each pruning setting corresponds to the respective distinct set of importance coefficients for the plurality of layers 906 of a respective pruned neural network model 910.

[0088] In some embodiments, the average rank value for each filter 908 of the neural network model 902 is determined using a batch of predefined images. Each filter 908 outputs a feature map that is applied to generate the respective average rank value Ri, which is used to generate the respective importance score I of the respective filter 908.

[0089] In some embodiments, each of the plurality of pruned neural network models 910 is trained for a predefined number of cycles. The target neural network model 904 that has a loss function result better than any other pruned neural network models is selected from the plurality of pruned neural network models 910. Alternatively, in some embodiments, each of the plurality of pruned neural network models 910 is trained completely (e.g., until a loss function has been minimized). The target neural network model 904 that uses the least number of training cycles is selected from the plurality of pruned neural network models 910.

[0090] In some embodiments, in accordance with the model selection criterion, the target neural network model 904 corresponds to computational resource usage having the least number of floating point operations per second (FLOPS) among the plurality of pruned neural network models 910.

[0091] In some embodiments, the neural network model 902 includes a plurality of weights associated with the respective number of filters 908 of each layer 906. The computer system maintains a float32 format for the plurality of weights while pruning the neural network model 902. After generating the target neural network model 904, the computer system quantizes the plurality of weights, e.g., from a float32 format to an int8, uint8, int16, or uint16 format. Further, in some embodiments, the plurality of weights are quantized based on a precision setting of an electronic device. The target neural network model 904 having quantized weights is provided to the electronic device.

[0092] Referring to Figures 10A-10B and 13, a computer system obtains (1302) a neural network model 1002 having a plurality of layers 1006, and each layer 1006 has a respective number of filters 1008. The computer system identifies (1304) a target model size to which the neural network model is compressed. One or more intermediate model sizes are derived (1306) from the target model size of the neural network model 1002. The one or more intermediate model sizes and the target model size form (1308) an ordered sequence of model sizes. The computer system implements (1310) a sequence of pruning operations 1010, e.g., to compress the neural network model 1002 gradually. Each pruning operation corresponds to a respective model size in the ordered sequence of model sizes. For each pruning operation, the computer system identifies (1312) a respective subset of filters 1008 of the neural network model 1002 to be removed based on the respective model size, and updates (1314) the neural network model to remove the respective subset of filters 1008, thereby reducing a size of the neural network model 1002 to the respective model size. An order of each pruning operation in the sequence of pruning operations 1010 is consistent with an order of the respective model size in the sequence of model sizes.

[0093] In some embodiments, for each pruning operation 1010, the updated neural network model 1012 is trained according to a predefined loss function. Weights of unpruned filters of the updated neural network model 1012 are adjusted, e.g., to minimize the predefined loss function.

[0094] In some embodiments, the target model size of the neural network model 1002 satisfies a target computation criterion associated with an electronic device, and a target neural network model pruned by the sequence of pruning operations 1010 is provided to the electronic device. Further, in some embodiments, the computer system determines the one or more intermediate model sizes to approach the target computation criterion gradually, e.g., with equal step sizes or varying step sizes. Additionally, in some embodiments, the neural network model 1002 is obtained with a first model size, and the one or more intermediate model sizes are equally distributed between the first model size and the target model size.

[0095] In some embodiments, the respective subset of filters of the neural network model 1002 is identified for each pruning operation 1010 by determining an importance score for each filter 1008 of the neural network model 1002 or 1014 (if dilated) and selecting the respective subset of filters 1008 having the smallest importance scores I among the filters 1008 of the neural network model. Further, in some embodiments, to determine the importance score for each filter 1008, the computer system determines a sum of weights of the respective filter 1008 that are applied to convert inputs of the respective filter 1008, and associates the importance score I of the respective filter with the sum of weights of the respective filter.

[0096] In some embodiments, prior to deriving the one or more intermediate model sizes and implementing the sequence of pruning operations, the computer system dilates the neural network model. The dilated neural network model 1014 has a dilated model size. The one or more intermediate model sizes are derived based on the dilated model size and the target model size of the neural network model, and the sequence of pruning operations is initiated on the dilated neural network model 1014. In an example, the one or more intermediate model sizes are equally distributed between the dilated model size and the target model size. In another example, the one or more intermediate model sizes are not equally distributed between the dilated model size and the target model size. Further, in some embodiments, the neural network model is dilated to increase a size of the neural network model by a predefined ratio. Additionally, in some embodiments, the neural network model 1002 is dilated by at least one of: adding one or more supplemental layers to the plurality of layers 1006 of the neural network model 1002 and adding one or more supplemental filters to each of a subset of the plurality of layers 1006 of the neural network model 1002. Further, in some embodiments, the dilated neural network model 1014 is re-trained according to the predefined loss function.

[0097] In some embodiments, the neural network model 1002 includes a plurality of weights associated with the respective number of filters 1008 of each layer 1006. The computer system maintains a float32 format for the plurality of weights during the sequence of pruning operations. After generating the target neural network model 1004, the computer system quantizes the plurality of weights, e.g., from a float32 format to an int8, uint8, int16, or uint16 format. Further, in some embodiments, the plurality of weights are quantized based on a precision setting of an electronic device. The target neural network model 1004 having quantized weights is provided to the electronic device.

[0098] It should be understood that the particular order in which the operations in each of Figures 11-13 have been described is merely exemplary and is not intended to indicate that the described order is the only order in which the operations could be performed. One of ordinary skill in the art would recognize various ways to prune the neural network models as described herein. Additionally, it should be noted that details of other processes described above with respect to Figures 5-10 are also applicable in an analogous manner to each of the methods 1100, 1200, and 1300 described above with respect to Figures 11-13. Also, it should be noted that details of each of the methods 1100, 1200, and 1300 described above with respect to Figures 11-13 are also applicable in an analogous manner to any other of the methods 1100, 1200, and 1300. For brevity, these details are not repeated here.

[0099] The terminology used in the description of the various described implementations herein is for the purpose of describing particular implementations only and is not intended to be limiting. As used in the description of the various described implementations and the appended claims, the singular forms “a”, “an”, and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the term “and/or” as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It will be further understood that the terms “includes,” “including,” “comprises,” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. Additionally, it will be understood that, although the terms “first,” “second,” etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another.

[00100] As used herein, the term “if” is, optionally, construed to mean “when” or “upon” or “in response to determining” or “in response to detecting” or “in accordance with a determination that,” depending on the context. Similarly, the phrase “if it is determined” or “if [a stated condition or event] is detected” is, optionally, construed to mean “upon determining” or “in response to determining” or “upon detecting [the stated condition or event]” or “in response to detecting [the stated condition or event]” or “in accordance with a determination that [a stated condition or event] is detected,” depending on the context.

[00101] The foregoing description, for purpose of explanation, has been described with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the claims to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to best explain principles of operation and practical applications, to thereby enable others skilled in the art.

[00102] Although various drawings illustrate a number of logical stages in a particular order, stages that are not order dependent may be reordered and other stages may be combined or broken out. While some reordering or other groupings are specifically mentioned, others will be obvious to those of ordinary skill in the art, so the ordering and groupings presented herein are not an exhaustive list of alternatives. Moreover, it should be recognized that the stages can be implemented in hardware, firmware, software or any combination thereof.