Title:
METHOD AND SYSTEM FOR EDITING AN IMAGE
Document Type and Number:
WIPO Patent Application WO/2024/047637
Kind Code:
A1
Abstract:
A system for editing an input image comprises an input circuit for receiving the input image, and an image processor configured to convert the input image to a noise array, to construct an image-specific generative adversarial network (GAN) that is specific to the input image, and that synthesizes the input image from the noise array, to vary the noise array, and to apply the image-specific GAN to the varied noise array, to provide a varied generated synthetic image representing an edited version of the image.

Inventors:
SHEFFI EREZ (IL)
ROTMAN MICHAEL (IL)
WOLF LIOR (IL)
Application Number:
PCT/IL2023/050914
Publication Date:
March 07, 2024
Filing Date:
August 28, 2023
Assignee:
SHEFFI EREZ (IL)
ROTMAN MICHAEL (IL)
WOLF LIOR (IL)
International Classes:
G06T11/60; G06F18/214; G06T9/00; G06V10/82; G06N3/08; G06T5/00
Attorney, Agent or Firm:
EHRLICH, Gal et al. (IL)
Claims:
WHAT IS CLAIMED IS:

1. A method of editing an input image, comprising: converting the input image to a noise array; constructing an image-specific generative adversarial network (GAN) that is specific to the input image, and that synthesizes the input image from said noise array; varying said noise array; and applying said image-specific GAN to said varied noise array, to provide a varied generated synthetic image representing an edited version of the image.

2. The method according to claim 1, wherein said constructing comprises: receiving a non-specific GAN having preset values for a set of parameters; applying said non-specific GAN to said noise array to synthesize a test image; calculating a distance metric between said test image and the input image; applying a machine learning procedure to said distance metric to calculate new values for a set of parameters in a manner that reduces said distance metric; and updating said preset values of said non-specific GAN with said new values, thereby constructing said image-specific GAN.

3. The method according to claim 1, wherein said constructing comprises: receiving a non-specific GAN having preset values for a set of parameters; defining an objective function having a contribution representing a distance metric corresponding to the input image and said image-specific GAN, and a contribution representing a distance metric corresponding to said non-specific GAN and said image-specific GAN; applying a machine learning procedure to said objective function to calculate new values for a set of parameters in a manner that reduces said objective function; and updating said preset values of said non-specific GAN with said new values, thereby constructing said image-specific GAN.

4. The method according to any of claims 2 and 3, wherein said machine learning procedure comprises a set of machine learning modules, each corresponding to a different layer of said non-specific GAN, and being applied independently to said layer.

5. The method according to claim 4, wherein each machine learning module is a deep learning module.

6. The method according to claim 2, wherein said machine learning procedure is applied to gradients of said distance metric with respect to said parameters.

7. The method according to any of claims 3-5, wherein said machine learning procedure is applied to gradients of said distance metric with respect to said parameters.

8. The method according to claim 1, wherein the image is a real world image.

9. The method according to any of claims 2-7, wherein the image is a real world image.

10. The method according to claim 1, wherein said GAN has a StyleGAN architecture.

11. The method according to any of claims 2-9, wherein said GAN has a StyleGAN architecture.

12. The method according to claim 1, wherein said varying said noise array is selected to semantically manipulate the input image.

13. The method according to any of claims 2-11, wherein said varying said noise array is selected to semantically manipulate the input image.

14. The method according to claim 1, wherein said converting is by a technique selected from the group consisting of pSp, e4e, ReStyle, and HyperStyle.

15. The method according to any of claims 2-13, wherein said converting is by a technique selected from the group consisting of pSp, e4e, ReStyle, and HyperStyle.

16. A system for editing an input image, comprising: an input circuit for receiving the input image; and an image processor configured to convert the input image to a noise array, to construct an image-specific generative adversarial network (GAN) that is specific to the input image, and that synthesizes the input image from said noise array, to vary said noise array, and to apply said image-specific GAN to said varied noise array, to provide a varied generated synthetic image representing an edited version of the image.

17. The system according to claim 16, wherein said image processor is configured for: receiving a non-specific GAN having preset values for a set of parameters, thereby providing a synthetic image; applying said non-specific GAN to said noise array to synthesize a test image; calculating a distance metric between said test image and the input image; applying a machine learning procedure to said distance metric to calculate new values for a set of parameters in a manner that reduces said distance metric; and updating said preset values of said non-specific GAN with said new values, thereby constructing said image-specific GAN.

18. The system according to claim 16, wherein said image processor is configured for receiving a non-specific GAN having preset values for a set of parameters; and calculating an objective function having a contribution representing a distance metric corresponding to the input image and said image-specific GAN, and a contribution representing a distance metric corresponding to said non-specific GAN and said image-specific GAN; applying a machine learning procedure to said objective function to calculate new values for a set of parameters in a manner that reduces said objective function; and updating said preset values of said non-specific GAN with said new values, thereby constructing said image-specific GAN.

19. The system according to any of claims 17 and 18, wherein said machine learning procedure comprises a set of machine learning modules, each corresponding to a different layer of said non-specific GAN, and being applied independently to said layer.

20. The system according to claim 19, wherein each machine learning module is a deep learning module.

21. The system according to any of claims 17-20, wherein said machine learning procedure is applied to gradients of said distance metric with respect to said parameters.

22. The system according to any of claims 16-21, wherein the image is a real world image.

23. The system according to any of claims 16-22, wherein said GAN has a StyleGAN architecture.

24. The system according to any of claims 16-23, wherein said image processor is configured to vary said noise array so as to semantically manipulate the input image.

25. The system according to any of claims 16-24, wherein said image processor is configured to apply for said conversion a technique selected from the group consisting of pSp, e4e, ReStyle, and HyperStyle.

Description:
METHOD AND SYSTEM FOR EDITING AN IMAGE

RELATED APPLICATION

This application claims the benefit of priority of U.S. Provisional Patent Application No. 63/401,669 filed on August 28, 2022, the contents of which are incorporated herein by reference in their entirety.

FIELD AND BACKGROUND OF THE INVENTION

The present invention, in some embodiments thereof, relates to image processing and, more particularly, but not exclusively, to a method and system for editing an image.

Known in the art are Generative Adversarial Networks (GANs) capable of producing a realistic image for every vector in their input domain. Enhanced image GANs, such as Progressive Growing GAN (PG-GAN) and StyleGAN, use progressive training to encourage each layer to model the variation exhibited at given image resolutions. Based on the progressive training, the models of these GANs are able to manipulate semantic elements of the synthetic image.

SUMMARY OF THE INVENTION

According to an aspect of some embodiments of the present invention there is provided a method of editing an input image. The method comprises: converting the input image to a noise array; constructing an image-specific generative adversarial network (GAN) that is specific to the input image, and that synthesizes the input image from the noise array; varying the noise array; and applying the image-specific GAN to the varied noise array, to provide a varied generated synthetic image representing an edited version of the image.

According to an aspect of some embodiments of the present invention there is provided a system for editing an input image. The system comprises an input circuit for receiving the input image, and an image processor. The image processor is configured to convert the input image to a noise array, to construct an image-specific generative adversarial network (GAN) that is specific to the input image, and that synthesizes the input image from the noise array, to vary the noise array, and to apply the image-specific GAN to the varied noise array, to provide a varied generated synthetic image representing an edited version of the image.

According to some embodiments of the invention the GAN construction comprises: receiving a non-specific GAN having preset values for a set of parameters; applying the non-specific GAN to the noise array to synthesize a test image; calculating a distance metric between the test image and the input image; applying a machine learning procedure to the distance metric to calculate new values for a set of parameters in a manner that reduces the distance metric; and updating the preset values of the non-specific GAN with the new values, thereby constructing the image-specific GAN.

According to some embodiments of the invention the method comprises receiving a non-specific GAN having preset values for a set of parameters, defining an objective function having a contribution representing a distance metric between the input image and the image-specific GAN, and a contribution representing a distance metric between the non-specific GAN and the image-specific GAN, applying a machine learning procedure to the objective function to calculate new values for a set of parameters in a manner that reduces the objective function, and updating the preset values of the non-specific GAN with the new values, thereby constructing the image-specific GAN.

According to some embodiments of the invention the machine learning procedure comprises a set of machine learning modules, each corresponding to a different layer of the non-specific GAN, and being applied independently to the layer.

According to some embodiments of the invention each machine learning module is a deep learning module. According to some embodiments a number, L, of blocks of layers in the deep learning module is predetermined. According to some embodiments, L=1. According to some embodiments, L > 1. According to some embodiments L < 10.

According to some embodiments of the invention the machine learning procedure is applied to gradients of the distance metric with respect to the parameters.

According to some embodiments of the invention the input image is a real world image.

According to some embodiments of the invention the GAN has a PG-GAN architecture.

According to some embodiments of the invention the GAN has a StyleGAN architecture.

According to some embodiments of the invention the noise array is varied so as to semantically manipulate the input image.

According to some embodiments of the invention the conversion into a noise array is by a technique selected from the group consisting of pSp, e4e, ReStyle, and HyperStyle.

Unless otherwise defined, all technical and/or scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the invention pertains. Although methods and materials similar or equivalent to those described herein can be used in the practice or testing of embodiments of the invention, exemplary methods and/or materials are described below. In case of conflict, the patent specification, including definitions, will control. In addition, the materials, methods, and examples are illustrative only and are not intended to be necessarily limiting.

Implementation of the method and/or system of embodiments of the invention can involve performing or completing selected tasks manually, automatically, or a combination thereof. Moreover, according to actual instrumentation and equipment of embodiments of the method and/or system of the invention, several selected tasks could be implemented by hardware, by software or by firmware or by a combination thereof using an operating system.

For example, hardware for performing selected tasks according to embodiments of the invention could be implemented as a chip or a circuit. As software, selected tasks according to embodiments of the invention could be implemented as a plurality of software instructions being executed by a computer using any suitable operating system. In an exemplary embodiment of the invention, one or more tasks according to exemplary embodiments of method and/or system as described herein are performed by a data processor, such as a computing platform for executing a plurality of instructions. Optionally, the data processor includes a volatile memory for storing instructions and/or data and/or a non-volatile storage, for example, a magnetic hard-disk and/or removable media, for storing instructions and/or data. Optionally, a network connection is provided as well. A display and/or a user input device such as a keyboard or mouse are optionally provided as well.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

Some embodiments of the invention are herein described, by way of example only, with reference to the accompanying drawings. With specific reference now to the drawings in detail, it is stressed that the particulars shown are by way of example and for purposes of illustrative discussion of embodiments of the invention. In this regard, the description taken with the drawings makes apparent to those skilled in the art how embodiments of the invention may be practiced.

In the drawings:

FIG. 1 shows results for facial-domain inversion. The method of the present embodiments attains near-perfect image reconstruction: for most images, it is almost impossible to distinguish between the source image and the reconstructed one.

FIG. 2 is a schematic illustration of a structure of one residual block of a gradient modification module, according to some embodiments of the present invention.

FIGs. 3A-G show a visual comparison of image reconstruction quality on a dataset known as the CelebA-HQ dataset.

FIGs. 4A-G show a visual comparison of image reconstruction quality on a dataset known as the Stanford Cars dataset.

FIGs. 5A-G show a visual comparison of image reconstruction quality on a dataset known as the LSUN Church dataset.

FIG. 6 shows a demonstration of pose, smile and age editing on images from the CelebA-HQ dataset, by employing an editing method known as the InterFaceGAN method.

FIGs. 7A-G show a visual comparison of editing quality on the Stanford Cars dataset, by employing an editing method known as the GANSpace method.

FIGs. 8A-G show a visual comparison of editing quality on the LSUN Church dataset, by employing an editing method known as the GANSpace method.

FIGs. 9A-C show a visual comparison of the effect of training with a multi-layer face-parsing distance metric. FIG. 9A shows the ground truth, FIG. 9B shows results of training with the multi-layer face-parsing distance metric, and FIG. 9C shows results of training without the multi-layer face-parsing distance metric. As shown, when training with the parsing loss, the resulting reconstructions better preserve the identity.

FIGs. 10A and 10B show results obtained by training on an image from the CelebA-HQ dataset with (FIG. 10A) and without (FIG. 10B) localization.

FIG. 11 is a flowchart diagram of a method suitable for editing an input image according to various exemplary embodiments of the present invention.

FIG. 12 is a schematic illustration of a computing platform suitable for executing selected operations of the method according to some embodiments of the present invention.

DESCRIPTION OF SPECIFIC EMBODIMENTS OF THE INVENTION

The present invention, in some embodiments thereof, relates to image processing and, more particularly, but not exclusively, to a method and system for editing an image.

Before explaining at least one embodiment of the invention in detail, it is to be understood that the invention is not necessarily limited in its application to the details of construction and the arrangement of the components and/or methods set forth in the following description and/or illustrated in the drawings and/or the Examples. The invention is capable of other embodiments or of being practiced or carried out in various ways.

Referring now to the drawings, FIG. 11 is a flowchart diagram of a method suitable for editing an input image. It is to be understood that, unless otherwise defined, the operations described hereinbelow can be executed either contemporaneously or sequentially in many combinations or orders of execution. Specifically, the ordering of the flowchart diagrams is not to be considered as limiting. For example, two or more operations, appearing in the following description or in the flowchart diagrams in a particular order, can be executed in a different order (e.g., a reverse order) or substantially contemporaneously. Additionally, several operations described below are optional and may not be executed.

At least part of the operations described herein can be implemented by a data processing system, e.g., a dedicated circuitry or a general purpose processor, configured for executing the operations described below. At least part of the operations can be implemented by a cloud-computing facility at a remote location.

Computer programs implementing the method of the present embodiments can commonly be distributed to users by a communication network or on a distribution medium such as, but not limited to, a floppy disk, a CD-ROM, a flash memory device and a portable hard drive. From the communication network or distribution medium, the computer programs can be copied to a hard disk or a similar intermediate storage medium. The computer programs can be run by loading the code instructions either from their distribution medium or their intermediate storage medium into the execution memory of the computer, configuring the computer to act in accordance with the method of this invention. During operation, the computer can store in a memory data structures or values obtained by intermediate calculations and pull these data structures or values for use in subsequent operation. All these operations are well-known to those skilled in the art of computer systems.

Processing operations described herein may be performed by means of a processor circuit, such as a DSP, microcontroller, FPGA, ASIC, etc., or any other conventional and/or dedicated computing system.

The method of the present embodiments can be embodied in many forms. For example, it can be embodied on a tangible medium such as a computer for performing the method operations. It can be embodied on a computer readable medium, comprising computer readable instructions for carrying out the method operations. It can also be embodied in an electronic device having digital computer capabilities arranged to run the computer program on the tangible medium or execute the instructions on a computer readable medium.

The method begins at 10 and optionally and preferably continues to 11 at which the input image is received. The image is preferably a digital image and can be received from an external source, such as a storage device storing the image in a computer-readable form, and/or be transmitted to a data processor executing the method operations over a communication network, such as, but not limited to, the internet. In various exemplary embodiments of the invention the image is other than a synthetic image, namely a real world image captured by an imaging device. In some embodiments of the present invention the method comprises capturing the image using an imaging system, e.g., a scanner or a digital camera.

The method proceeds to 12 at which the input image is converted to a noise array. This can be done using an artificial neural network that acts as an encoder E and that is trained to map an input image I into a latent code w in the same style space as a generative adversarial network (GAN) generator.

Representative examples of GAN generators suitable for the present embodiments, include, without limitation, a StyleGAN generator, a DCGAN generator, and a ProGAN generator.

The present embodiments contemplate many types of encoders suitable for use as encoder E. Representative examples include, without limitation, a variational autoencoder, a hierarchical variational encoder, an invertible neural network, an adversarial inversion encoder, a Pixel2Style2Pixel (pSp) encoder [38], a residual-based encoder (e.g., ReStyle [3]), a hypernetwork-based encoder (e.g., HyperStyle [4]), and an encoder known as e4e (encoder for editing) [45].
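By way of a non-limiting illustration, the conversion at 12 can be organized as in the following Python (PyTorch) sketch. The sketch assumes that a pre-trained inversion encoder object is supplied by the user; the function name, the tensor shapes and the choice of a W*-space code are illustrative assumptions only.

import torch

def image_to_noise_array(encoder: torch.nn.Module, input_image: torch.Tensor) -> torch.Tensor:
    """Convert an input image (a 1 x 3 x H x W tensor) into a noise array w.

    `encoder` can be any pre-trained inversion encoder (e.g., pSp, e4e, ReStyle
    or HyperStyle) that maps images into the style space of the GAN generator.
    """
    encoder.eval()
    with torch.no_grad():
        w = encoder(input_image)  # e.g., a 1 x 18 x 512 code for a W*-space encoder
    return w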

The method proceeds to 13 at which an image-specific GAN, G_θ', is constructed. The GAN G_θ' is "image-specific" in the sense that it synthesizes the input image I (within a predetermined tolerance) from the noise array w. The image-specific GAN G_θ' preferably has the same architecture as the GAN for which the encoder E is trained. In various exemplary embodiments of the invention the image-specific GAN has a StyleGAN architecture. Other architectures, such as those employed by the aforementioned DCGAN generator and ProGAN generator, are also contemplated.

The GAN G_θ' is optionally and preferably constructed based on a non-specific GAN, G_θ, having preset values for a set of parameters {θ_i}.

The non-specific GAN G_θ is "non-specific" in the sense that the image that it synthesizes from the noise array w obtained at 12 is not the same as the input image received at 11. In particular, the set of parameters {θ_i} of G_θ is not adapted specifically to synthesize the input image from the array w.

The image-specific GAN G_θ' is optionally and preferably constructed by applying the non-specific GAN G_θ to the noise array w to synthesize a test image I_test, and calculating a distance metric ℒ between the test image I_test and the input image I. Once the distance metric ℒ is calculated, a machine learning procedure is applied to the distance metric so as to calculate new values θ'_i for the set of parameters of the non-specific GAN G_θ in a manner that reduces the distance metric. The image-specific GAN G_θ' is then constructed by updating the preset values θ_i of the parameters of the non-specific GAN G_θ with the calculated new values θ'_i.
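A minimal sketch of this construction is given below in Python (PyTorch). For brevity, the distance metric is represented by a pixel-wise loss only, and the generator interface (a module that maps a noise array w to an image), the number of tuning steps and the learning rate are illustrative assumptions.

import copy
import torch
import torch.nn.functional as F

def build_image_specific_gan(non_specific_gan, w, input_image,
                             steps: int = 300, lr: float = 1e-3):
    """Tune a copy of the non-specific GAN so that it synthesizes the input
    image from the fixed noise array w, yielding the image-specific GAN."""
    gan = copy.deepcopy(non_specific_gan)          # start from the preset values
    optimizer = torch.optim.Adam(gan.parameters(), lr=lr)
    for _ in range(steps):
        test_image = gan(w)                        # synthesize a test image from w
        loss = F.mse_loss(test_image, input_image) # distance metric (pixel-wise here)
        optimizer.zero_grad()
        loss.backward()                            # gradients of the metric w.r.t. the parameters
        optimizer.step()                           # update the preset values with new values
    return gan                                     # the image-specific GAN

Note that this sketch updates the generator parameters directly; the gradient modification modules described below and in the Examples section replace the plain gradient step with a learned mapping.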

The distance metric ℒ preferably comprises a contribution for a pixel-wise loss, which can be calculated as a distance (e.g., a Euclidean distance, an L1 smooth distance, a Chebyshev distance, a Minkowski distance, etc.) between the images, and may optionally and preferably also include one or more additional contributions, such as, but not limited to, a contribution for a perceptual similarity loss, a contribution for an identity-preserving loss, and a contribution for a multi-layer face-parsing loss.

Thus, the distance metric ℒ preferably has the form ℒ = ℒ_1 + ℒ_2 + ..., wherein each ℒ_i, i=1,2,..., represents another type of loss. More details regarding these losses are provided in the Examples section that follows.

Representative examples of machine learning procedures suitable for use according to some embodiments of the present invention include, without limitation, artificial neural networks, clustering, association rule algorithms, feature evaluation algorithms, subset selection algorithms, support vector machines, classification rules, cost-sensitive classifiers, vote algorithms, stacking algorithms, Bayesian networks, decision trees, neural networks, instance-based algorithms, linear modeling algorithms, k-nearest neighbors (KNN) analysis, ensemble learning algorithms, probabilistic models, graphical models, logistic regression methods (including multinomial logistic regression methods), gradient ascent methods, extreme gradient boosting, singular value decomposition methods and principal component analysis.

Preferably, the machine learning procedure comprises an artificial neural network. Artificial neural networks are a class of algorithms based on a concept of inter-connected computer code elements referred to as "artificial neurons" (oftentimes abbreviated as "neurons"). In a typical neural network, neurons contain data values, each of which affects the value of a connected neuron according to connections with pre-defined strengths, and whether the sum of connections to each particular neuron meets a pre-defined threshold. By determining proper connection strengths and threshold values (a process also referred to as training), a neural network can achieve efficient recognition of images and characters. Oftentimes, these neurons are grouped into layers in order to make connections between groups more obvious and to organize the computation of values. Each layer of the network may have differing numbers of neurons, and these may or may not be related to particular qualities of the input data.

In one implementation, called a fully-connected neural network, each of the neurons in a particular layer is connected to and provides input value to those in the next layer. These input values are then summed and this sum compared to a bias, or threshold. If the value exceeds the threshold for a particular neuron, that neuron then holds a positive value which can be used as input to neurons in the next layer of neurons. This computation continues through the various layers of the neural network, until it reaches a final layer. At this point, the output of the neural network routine can be read from the values in the final layer. Unlike fully-connected neural networks, convolutional neural networks operate by associating an array of values with each neuron, rather than a single value. The transformation of a neuron value for the subsequent layer is generalized from multiplication to convolution.

Convolutional neural networks (CNNs) include one or more convolutional layers in which the transformation of a neuron value for the subsequent layer is generated by a convolution operation. The convolution operation includes applying a convolutional kernel (also referred to in the literature as a filter) multiple times, each time to a different patch of neurons within the layer. The kernel typically slides across the layer until all patch combinations are visited by the kernel. The output provided by the application of the kernel is referred to as an activation map of the layer. Some convolutional layers are associated with more than one kernel. In these cases, each kernel is applied separately, and the convolutional layer is said to provide a stack of activation maps, one activation map for each kernel. Such a stack is oftentimes described mathematically as an object having D+1 dimensions, where D is the number of lateral dimensions of each of the activation maps. The additional dimension is oftentimes referred to as the depth of the convolutional layer. For example, in CNNs that are configured to process two-dimensional image data, a convolutional layer that receives the two-dimensional image data provides a three-dimensional output, with two-dimensional activation maps and one depth dimension.
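As a short illustration of the stack of activation maps and the depth dimension described above, consider the following PyTorch snippet (the sizes are arbitrary and serve only as an example):

import torch
import torch.nn as nn

image = torch.randn(1, 3, 64, 64)      # two-dimensional image data with 3 channels
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
activation_maps = conv(image)          # one activation map per kernel
print(activation_maps.shape)           # torch.Size([1, 16, 64, 64]): 16 maps (the depth), two lateral dimensions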

In some embodiments of the present invention an objective function ℒ_total is defined. The objective function ℒ_total optionally and preferably has a first contribution ℒ_IG representing the distance metric ℒ that corresponds to the input image I and the image-specific GAN G_θ', and a second contribution ℒ_GG representing the application of the distance metric ℒ that corresponds to the two GANs, G_θ and G_θ'. In these embodiments, the machine learning procedure is applied to reduce the objective function ℒ_total. A representative example for the first contribution ℒ_IG includes the distance metric ℒ between the input image I and the image synthesized by G_θ' from the noise array w obtained at 12. A representative example for the second contribution ℒ_GG includes the distance metric ℒ between two images that are synthesized from the result of a mapping f that is applied to a stochastically selected array z, wherein one of these images is synthesized by G_θ' and the other of these images is synthesized by G_θ, and wherein the mapping f is defined by the non-specific GAN G_θ. The stochastic selection is according to a predetermined statistical distribution, such as, but not limited to, Gaussian, Poisson, Binomial, Bernoulli, Gamma, Log-Normal, Beta, Cauchy, Weibull, Chi-Square, and the like. In experiments performed by the Inventors, a Gaussian distribution was employed.
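The following sketch illustrates how such an objective function can be assembled. The distance metric is again abbreviated to a pixel-wise loss, and the generator interfaces (a style-mapping callable f and generators that accept a style code) are assumptions made for the purpose of illustration.

import torch
import torch.nn.functional as F

def total_objective(image_specific_gan, non_specific_gan, style_mapping,
                    w, input_image, z_dim: int = 512):
    """L_total = L_IG + L_GG, with a pixel-wise metric standing in for L."""
    # L_IG: distance between the input image and the image synthesized from w
    l_ig = F.mse_loss(image_specific_gan(w), input_image)

    # L_GG: distance between the images synthesized by the two GANs from f(z),
    # where z is stochastically selected (Gaussian in this sketch)
    z = torch.randn(1, z_dim)
    fz = style_mapping(z)
    l_gg = F.mse_loss(image_specific_gan(fz), non_specific_gan(fz))

    return l_ig + l_gg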

In some embodiments of the present invention the machine learning procedure comprises a set of machine learning modules, where each module corresponds to a different layer of the non-specific GAN G_θ and is applied independently to the layer. The modules are optionally and preferably gradient modification modules, wherein the i-th module M_i maps between a gradient of the objective function for the i-th parameter θ_i, and a correction Δθ_i for this parameter. The correction Δθ_i can then be used to update the parameters of G_θ to provide the image-specific GAN G_θ'. For example, following the substitution of each parameter θ_i of G_θ with θ'_i = θ_i + Δθ_i, the GAN G_θ becomes the GAN G_θ'. In some embodiments of the present invention each module is a deep learning module. Each such module contains a sequential set of l = 1 ... L residual blocks, wherein L can be predetermined or be an input parameter. Typical values for L include, without limitation, L=1, L=2, L=3, L=4, L=5. In some embodiments of the present invention L<10 and in some embodiments of the present invention L>10. A set of gradient modification modules suitable for the present embodiments is provided in the Examples section that follows.
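The sketch below illustrates how such a set of modules can be applied layer-by-layer to obtain the updated parameters. The module architecture shown here is a deliberately simple placeholder (a single linear layer per module); a residual-block variant, corresponding to the deep learning modules described above, is sketched in the Examples section.

import torch
import torch.nn as nn

class GradientModificationModule(nn.Module):
    """Maps the gradient of the objective for one layer's parameters to a
    correction for those parameters (placeholder architecture)."""
    def __init__(self, num_params: int):
        super().__init__()
        self.net = nn.Linear(num_params, num_params)

    def forward(self, grad: torch.Tensor) -> torch.Tensor:
        return self.net(grad.flatten()).view_as(grad)      # the correction for this parameter

def apply_modules(generator: nn.Module, modules: dict) -> None:
    """Update each parameter with the correction produced by its module,
    independently per layer (modules is keyed by parameter name)."""
    with torch.no_grad():
        for name, param in generator.named_parameters():
            if param.grad is not None and name in modules:
                param.add_(modules[name](param.grad))       # theta_i' = theta_i + correction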

Once the image-specific GAN G_θ' is obtained, the method can proceed to 14 at which the noise array is varied. The variation is preferably selected to semantically manipulate the input image. Semantic image manipulation aims to vary the input image such that the varied image matches the content of the source image. Representative examples of semantic image manipulation suitable for the present embodiments include, without limitation, object replacement, modification of one or more attributes (color, size, orientation, etc.) of objects in the image, contextual change (e.g., adding or removing objects), changing the visual style (texture, hue, etc.) while maintaining the content of the input image, changing facial expressions, and the like.

The method can then proceed to 15 at which the image-specific GAN is applied to the varied noise array. This provides a varied generated synthetic image that represents an edited version of the image.
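A minimal sketch of operations 14 and 15 follows. The editing direction is assumed to be obtained from an external editing tool (e.g., InterFaceGAN or GANSpace directions), and the editing strength is an illustrative assumption.

import torch

def edit_image(image_specific_gan, w: torch.Tensor,
               direction: torch.Tensor, strength: float = 2.0) -> torch.Tensor:
    """Vary the noise array along a semantic direction and re-synthesize."""
    varied_w = w + strength * direction          # operation 14: vary the noise array
    with torch.no_grad():
        edited = image_specific_gan(varied_w)    # operation 15: apply the image-specific GAN
    return edited                                # the varied generated synthetic image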

The method ends at 16.

FIG. 12 is a schematic illustration of a client computer 130 having a hardware processor 132, which typically comprises an input/output (I/O) circuit 134, a hardware central processing unit (CPU) 136 (e.g., a hardware microprocessor), and a hardware memory 138 which typically includes both volatile memory and non-volatile memory. CPU 136 is in communication with I/O circuit 134 and memory 138. Client computer 130 preferably comprises a graphical user interface (GUI) 142 in communication with processor 132. I/O circuit 134 preferably communicates information in appropriately structured form to and from GUI 142. Also shown is a server computer 150 which can similarly include a hardware processor 152, an I/O circuit 154, a hardware CPU 156, and a hardware memory 158. I/O circuits 134 and 154 of client 130 and server 150 computers can operate as transceivers that communicate information with each other via a wired or wireless communication. For example, client 130 and server 150 computers can communicate via a network 140, such as a local area network (LAN), a wide area network (WAN) or the Internet. Server computer 150 can, in some embodiments, be a part of a cloud computing resource of a cloud computing facility in communication with client computer 130 over the network 140.

GUI 142 and processor 132 can be integrated together within the same housing or they can be separate units communicating with each other. GUI 142 can optionally and preferably be part of a system including a dedicated CPU and I/O circuits (not shown) to allow GUI 142 to communicate with processor 132. Processor 132 issues to GUI 142 graphical and textual output generated by CPU 136. Processor 132 also receives from GUI 142 signals pertaining to control commands generated by GUI 142 in response to user input. GUI 142 can be of any type known in the art, such as, but not limited to, a keyboard and a display, a touch screen, and the like.

Client 130 and server 150 computers can further comprise one or more computer-readable storage media 144, 164, respectively. Media 144 and 164 are preferably non-transitory storage media storing computer code instructions for executing the method as further detailed herein, and processors 132 and 152 execute these code instructions. The code instructions can be run by loading the respective code instructions into the respective execution memories 138 and 158 of the respective processors 132 and 152.

Each of storage media 144 and 164 can store program instructions which, when read by the respective processor, cause the processor to receive an input image as further detailed hereinabove.

The program instructions can also cause the processor to convert the input image to a noise array, to construct an image-specific GAN, to vary the noise array, and to apply the image-specific GAN to the varied noise array to provide a varied generated synthetic image representing an edited version of the image, as further detailed hereinabove.

In some embodiments of the present invention the input image is received by computer 130 locally, for example, using GUI 142 and/or storage 144 and/or by means of an imaging system 146 configured to transmit digital image data to processor 132. In these embodiments computer 130 can execute the operations of the method described herein. Alternatively, computer 130 can transmit the received input image to computer 150 via communication network 140, in which case computer 150 executes the operations of the method described herein, and transmits the varied generated synthetic image back to computer 130 for generating a displayed output on GUI 142.

In some embodiments of the present invention the input image is stored in storage 164, and that input is received by computer 130 over communication network 140. In these embodiments, computer 130 can execute the operations of the method described herein using the input received from computer 150 over communication network 140.

As used herein the term “about” refers to ± 10 %.

The terms "comprises", "comprising", "includes", "including", “having” and their conjugates mean "including but not limited to".

The term “consisting of” means “including and limited to”.

The term "consisting essentially of" means that the composition, method or structure may include additional ingredients, steps and/or parts, but only if the additional ingredients, steps and/or parts do not materially alter the basic and novel characteristics of the claimed composition, method or structure.

As used herein, the singular form "a", "an" and "the" include plural references unless the context clearly dictates otherwise. For example, the term "a compound" or "at least one compound" may include a plurality of compounds, including mixtures thereof.

Throughout this application, various embodiments of this invention may be presented in a range format. It should be understood that the description in range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of the invention. Accordingly, the description of a range should be considered to have specifically disclosed all the possible subranges as well as individual numerical values within that range. For example, description of a range such as from 1 to 6 should be considered to have specifically disclosed subranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6 etc., as well as individual numbers within that range, for example, 1, 2, 3, 4, 5, and 6. This applies regardless of the breadth of the range.

Whenever a numerical range is indicated herein, it is meant to include any cited numeral (fractional or integral) within the indicated range. The phrases “ranging/ranges between” a first indicated number and a second indicated number and “ranging/ranges from” a first indicated number “to” a second indicated number are used herein interchangeably and are meant to include the first and second indicated numbers and all the fractional and integral numerals therebetween.

It is appreciated that certain features of the invention, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the invention, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable subcombination or as suitable in any other described embodiment of the invention. Certain features described in the context of various embodiments are not to be considered essential features of those embodiments, unless the embodiment is inoperative without those elements.

Various embodiments and aspects of the present invention as delineated hereinabove and as claimed in the claims section below find experimental support in the following examples.

EXAMPLES

Reference is now made to the following examples, which together with the above descriptions illustrate some embodiments of the invention in a non-limiting fashion.

StyleGAN2 was demonstrated to be a powerful image generation engine that supports semantic editing. However, in order to manipulate a real-world image, one needs to be able to recover a vector in StyleGAN's encoding space that is decoded to an image as close as possible to the desired image. For many real-world images, such a vector does not exist. This Example presents a per-image optimization method that tunes a StyleGAN generator such that it achieves a local edit to the generator's weights, resulting in almost perfect inversion, while still allowing image editing, by keeping the rest of the mapping between an input vector and an output image relatively intact. The method is based on a one-shot training of a set of shallow update networks that modify the layers of the generator. After training, a modified generator is obtained by a single application of these networks to the original parameters, and as a result, the previous editing capabilities of the generator are maintained. The experiments described below show a sizable gap in performance over the current state of the art in this very active domain.

Introduction

The ability of an observer to distinguish between synthetic images and real ones has become increasingly challenging since the introduction of Generative Adversarial Networks (GAN) [16]. Advanced GAN networks produce a realistic image for every vector in their input domain.

The StyleGAN [23] family of generators can produce, for example, realistic faces based on random input vectors. However, for most real-world face images, one cannot find an input vector that would result in exactly the same image. This is a limitation of GANs, with far-reaching implications at the application level. It has been repeatedly shown that the StyleGAN latent space displays semantic properties that make it especially suited for image editing applications, see [6] for a survey. However, without the ability to embed real-world images within this space, these capabilities are limited mostly to synthetic images.

A set of techniques were, therefore, developed in order to tune the StyleGAN generator G such that it would produce a desired image [23, 39, 4]. Existing methods can be divided into two categories: methods that optimize one image at a time, and methods that employ pretrained feedforward networks to produce a modification of G, given an input image I. The former is believed to be more accurate, but slower at inference time than the latter.

This Example combines elements from both approaches. It provides an optimization procedure that is applied to a single image, and it uses trainable networks. The role of these networks is to estimate the change one wishes to apply to the parameters of G. By optimizing the networks to control this change rather than applying it directly, the change being applied is regularized. The per-layer networks that are trained adapt the parameters of G based on previous parameter variations. Thus, these changes are applied in a very local manner, separately to each layer. Through weight sharing, the capacity of these networks is limited, which ensures that this mapping is relatively simple.

In addition to a novel architecture, the Inventors also modify the loss term that is used and show that a face-parsing network provides a strong signal, side by side with a face-identification network that is widely used in the relevant literature.

This Example demonstrates: (i) a far more faithful inversion in comparison to the state-of-the-art methods, (ii) a more limited effect on the result of applying G to other vectors in its input space, (iii) a superb ability to support downstream editing applications.

Adversarial image generation methods, and ways to control the generated image to fit an input image, will now be described.

Generative Adversarial Networks

Generative Adversarial Networks (GAN) [16] are a family of generative models composed of two neural networks: a generator and a discriminator. The generator is tasked with learning a mapping from a latent space to a given data distribution, whereas the discriminator aims to distinguish between generated samples and real ones. GANs have been widely applied in many computer vision tasks, such as generating super-resolution images [26, 48], image-to-image translation [56], and face generation [22, 23, 24]. Once trained, given an input vector, the generator produces a realistic image. Since the mapping between the input vector and the image space is not trivial, it is hard to predict the generated image from the input vector.

One way of controlling the synthesis is by feeding additional information to the GAN during the training phase, for instance, by adding additional discrete [32, 35], or continuous [12] labels as inputs. This approach requires additional supervision. To bypass this limitation, other approaches constrain the input vector space directly, either by applying tools from information theory [7], or by limiting this space to a non-trivial topology [41]. As a result, the input vector space is disentangled: different entries of the input vector control a different aspect or "generative factor" of the generated image.

It has been observed [23, 37] that a continuous translation between two vectors in the GAN latent space leads to a continuous change in the generated image. Shen et al. [42] have further expanded this observation to face generation, where it was shown that facial attributes lie in different hyperplanes, and are therefore controllable using vector arithmetic in the latent space. Further investigation into the latent space structure has been performed by applying unsupervised methods, such as Principal Component Analysis (PCA) [18] or eigenvalue decomposition [43], or by using semantic labels [2, 42].

The existence of an underlying structure, and the presence of a semantic algebra, makes it possible to edit the generated images.

StyleGAN [23] is a style-based generator architecture. Unlike the traditional GAN, which maps a noisy signal directly to images, it splits the noisy signal into two: (i) a style-generating signal z ∈ R^512 that is mapped to a style latent space, w ∈ W = R^512, which is used globally as a set of layer-wise parameters for the GAN; (ii) additional noise, which is added to the feature maps after each convolutional layer. This design benefits from both stochastic variation and scale-specific feature synthesis. StyleGAN2 [24] introduced a path-length regularization term that enables smoother style mapping between Z and W, together with a better normalization scheme for the generator.

Inverting GANs

In order to edit a real image using a latent space modification, the originating point in the latent space is identified. The solutions of the inverse problem can be divided into three families: (i) optimization-based, (ii) encoder-based, (iii) generator-modifying. Unlike the last approach, the first two families do not alter any of the generator parameters. In the optimization-based approach, a latent code w* is evaluated given an input image, I, in an iterative manner [10, 29], such that I ≈ G(w*, θ), where θ are the generator's parameters. For the task of face generation, Karras et al. [24] proposed an optimization-based inversion scheme for StyleGAN2, where both the latent code w and the injected noise are optimized together, combined with a regularization term that minimizes the auto-correlation of the injected noise at different scales. Abdal et al. [1] extended this direction, by expanding the latent space W to W*. To accomplish this, instead of constraining the latent code, w ∈ W, to be identical for all convolutional layers, each layer is now fed with a different set of parameters. This modification extended the dimensionality of the latent from R^512 to R^(18×512). To preserve spatial details during inversion, Zhang et al. [53] considered a spatially-structured latent space, replacing the original one-dimensional representation, W.

The second family of solutions is encoder-based. These methods utilize an encoder, E, that maps between the image space and the latent space, w* = E(I). Unlike the iterative family, these encoders are trained on a set of samples [17, 30]. These image encoders have also been applied for face generation, where they are employed during training, by combining the autoencoder framework with a GAN [36], or more commonly, on pre-trained generative models, which require far less training data. These approaches focus on mapping an image to an initial latent code, w ∈ W, which may later be fine-tuned using an optimization method [55]. pSp [38] employs a different scheme, in which an additional fine-tuning of w is not performed; instead, a pyramid-based encoder is designed for each style vector of the StyleGAN2 framework. e4e [45] further expands on this idea by limiting the hierarchical structure of the w codes to be of a residual type, so that each style vector is the sum of a basis style vector and a residual part. Further improvement was achieved for the inversion problem by iteratively encoding the image onto the latent space and feeding the generated image back to the encoder as an input [3].

In the third family of solutions, an initial w latent code is evaluated using an encoder. As a second step, the generator is tuned to produce the required image from w [39]. This tuning procedure has been further improved by employing hypernetworks [4]. In this scenario, instead of directly modifying the generator weights, a set of residual weights is evaluated using an additional neural network. The input to this network is the target image. Therefore, given a new image, a new set of weights is computed.

A faithful edit also needs to preserve the original identity. Liang et al. [28] applied Neural Spline Networks to find faithful editing directions. Editing local aspects was accomplished by manipulating specific parts of the feature maps throughout the generation process [47].

Method

Given a candidate target image I to be edited, a corresponding latent code, w ∈ W* = R^(18×512), is estimated using the off-the-shelf encoder e4e [45]. The latent code, w, is fed into the pre-trained generator to produce a reconstructed image, G(w) = G(w, θ), where the right-hand side explicitly states the parameters θ of network G. Since w is not an exact solution to the inverse problem, the reconstructed image is usually of poor quality. In the present Example, w does not change. Instead, the method tunes the generator parameters θ to improve the generated image, obtaining G(w, θ') ≈ I, where θ' are the tuned parameters.

Consider the following image similarity loss function, which is the sum of four different terms. The first term is a pixel-wise loss, the L2 or the L1 smooth [15] distance between the images I1 and I2. The second term, ℒ_lpips, is a perceptual similarity loss [54] that relies on feature maps from an AlexNet [44] pre-trained on the ImageNet dataset. ℒ_sim is an identity-preserving similarity loss that is applied on a pair of real and reconstructed images. This loss term accounts for the cosine distance of feature vectors extracted from a pre-trained ArcFace [11] facial recognition network for the facial domain, and a pre-trained MoCo [8] for the non-facial domains, following [45, 4]. The ℒ_parse term is a multi-layer face-parsing loss. Similarly to ℒ_sim, it measures the layer-wise aggregated cosine distance of all the feature vectors from the contracting path of the pre-trained facial parsing network P [27], with a U-Net [40] backbone. In total, five feature vectors are used for the cosine distance evaluation.
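For concreteness, such a four-term loss can be organized as in the sketch below. The lpips package provides the AlexNet-based perceptual term; the identity network and the face-parsing network are represented by user-supplied feature extractors (e.g., a pre-trained ArcFace model and the contracting path of a U-Net face parser), and the loss weights are placeholders rather than the values used in the experiments.

import torch
import torch.nn.functional as F
import lpips  # AlexNet-based perceptual similarity (pip install lpips)

lpips_loss = lpips.LPIPS(net='alex')

def similarity_loss(i1, i2, id_features, parser_features,
                    w_pix=1.0, w_lpips=0.8, w_sim=0.1, w_parse=0.1):
    """Pixel-wise + perceptual + identity-preserving + multi-layer parsing terms."""
    l_pix = F.mse_loss(i1, i2)                                 # L2 (or smooth L1) distance
    l_lpips = lpips_loss(i1, i2).mean()                        # perceptual similarity
    l_sim = 1 - F.cosine_similarity(id_features(i1),
                                    id_features(i2), dim=-1).mean()
    l_parse = sum(1 - F.cosine_similarity(f1.flatten(1), f2.flatten(1), dim=-1).mean()
                  for f1, f2 in zip(parser_features(i1), parser_features(i2)))
    return w_pix * l_pix + w_lpips * l_lpips + w_sim * l_sim + w_parse * l_parse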

Tuning G's parameters, θ, preferably comprises the estimation of the modified parameters, θ', using a set of feed-forward networks. For the i-th layer of G, let θ_i and ∇_{θ_i}ℒ be its learnable parameters and the gradients of the objective function with respect to these parameters, respectively. Unlike PTI [39], which applies a regular gradient step to update θ_i, here the gradient updates are based on the mapping Δθ_i = M_i(∇_{θ_i}ℒ), where M = {M_i} is a set of gradient modification modules, each mapping between the original gradients, ∇_{θ_i}ℒ, and the parameter correction, Δθ_i. Each module M_i contains a sequential set of l = 1 ... L residual blocks [20]. Let y_l be the input to the l-th residual block. The output of block l, y_{l+1}, is obtained by adding y_l to the result of passing y_l through the block's two linear operators, normalization layers and activation (see FIG. 2); here σ denotes the LeakyReLU [31] activation function, with a slope of 0.01, and SN denotes the Scale Normalization [34]. The learnable parameters of each block reside in two linear operators, W_1^l and W_2^l, in the bias term, b^l, and in the scale coefficients of the SN layers.

The parameters of W_1^l and W_2^l are initialized according to [19] from the normal distribution, with an initial standard deviation factored by 0.1, whereas the bias, b^l, is sampled uniformly. A block is illustrated in FIG. 2.

All the M_i networks assigned to convolutional layers with the number of channels c_in = c_out = 512 share the same parameters, Ψ_i = {W_1^l, W_2^l, b^l, SN_1^l, SN_2^l}, where l runs from 1 to L, and L is an input parameter.
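The following sketch shows one possible implementation of such a residual block and of a module M_i built from L blocks. The exact ordering of the linear, normalization and activation operations inside the block, and the Scale Normalization formula, follow the textual description only and are therefore assumptions; the precise structure is the one illustrated in FIG. 2. Modules assigned to 512-channel layers can share a single instance of GradientModule to realize the weight sharing described above.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleNorm(nn.Module):
    """Scale Normalization: rescales the input using a learnable scale coefficient."""
    def __init__(self, init_scale: float = 1.0, eps: float = 1e-5):
        super().__init__()
        self.g = nn.Parameter(torch.tensor(init_scale))
        self.eps = eps

    def forward(self, x):
        return self.g * x / (x.norm(dim=-1, keepdim=True) + self.eps)

class ResidualBlock(nn.Module):
    """One residual block of a gradient modification module (assumed ordering)."""
    def __init__(self, dim: int):
        super().__init__()
        self.sn1, self.sn2 = ScaleNorm(), ScaleNorm()
        self.w1 = nn.Linear(dim, dim, bias=True)     # W_1^l and the bias b^l
        self.w2 = nn.Linear(dim, dim, bias=False)    # W_2^l
        for lin in (self.w1, self.w2):
            nn.init.kaiming_normal_(lin.weight)      # [19]-style normal initialization...
            with torch.no_grad():
                lin.weight.mul_(0.1)                 # ...with the standard deviation factored by 0.1
        nn.init.uniform_(self.w1.bias, -0.01, 0.01)  # bias sampled uniformly (range assumed)

    def forward(self, y):
        h = F.leaky_relu(self.sn1(self.w1(y)), negative_slope=0.01)
        return y + self.sn2(self.w2(h))              # residual connection

class GradientModule(nn.Module):
    """M_i: a sequence of L residual blocks applied to a layer's gradients."""
    def __init__(self, dim: int = 512, num_blocks: int = 1):
        super().__init__()
        self.blocks = nn.Sequential(*[ResidualBlock(dim) for _ in range(num_blocks)])

    def forward(self, grad):
        return self.blocks(grad)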

During the optimization process, (only) the parameters {Ψ_i} of the set M are optimized to minimize the loss function

ℒ_total = ℒ(I, G(w, θ')) + ℒ(G(f(z), θ), G(f(z), θ')),    (EQ. 5)

where z is sampled from a normal distribution and f is the style mapping of StyleGAN [23], so that f(z) lies in W (and not in W*). The first term in EQ. 5 is responsible for generating a high-quality image.

The optimization procedure preferably does not dramatically alter the mapping G: w → I, as a large variation in the mapping would require the re-identification of the editing directions. The second term in EQ. 5, a localization regularizer [33, 39], prevents the generator from drifting by forcing the generator to produce identical images for randomly sampled latent codes.

A description of the method is available in Algorithm 1, below.
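The algorithm listing itself is not reproduced in this text; the following Python (PyTorch) sketch summarizes the per-image optimization as described above. Only the parameters of the gradient modification modules are optimized, and the tuned generator parameters are always obtained by applying the modules to the original parameters. The optimizer, step count, distance callable and the dictionary-of-modules interface are illustrative assumptions.

import torch
from torch.func import functional_call

def tune_per_image(generator, modules, style_mapping, w, input_image,
                   distance, num_steps: int = 350, lr: float = 1e-3, z_dim: int = 512):
    """One-shot training of the gradient modification modules (sketch only)."""
    theta = {n: p.detach() for n, p in generator.named_parameters()}
    optimizer = torch.optim.Adam(
        [p for m in modules.values() for p in m.parameters()], lr=lr)

    def tuned_parameters():
        # gradients of the reconstruction loss w.r.t. the original parameters theta
        params = {n: t.clone().requires_grad_(True) for n, t in theta.items()}
        rec = distance(functional_call(generator, params, (w,)), input_image)
        grads = torch.autograd.grad(rec, list(params.values()))
        # theta'_i = theta_i + M_i(gradient_i), applied independently per layer
        return {n: theta[n] + modules[n](g) for n, g in zip(params, grads)}

    for _ in range(num_steps):
        theta_prime = tuned_parameters()
        z = torch.randn(1, z_dim)
        fz = style_mapping(z)
        loss = distance(functional_call(generator, theta_prime, (w,)), input_image) \
             + distance(functional_call(generator, theta_prime, (fz,)),
                        functional_call(generator, theta, (fz,)))   # localization term of EQ. 5
        optimizer.zero_grad()
        loss.backward()        # only the module parameters receive updates
        optimizer.step()

    return tuned_parameters()  # the tuned parameters theta'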

Experiments

This Example describes an extensive set of experiments performed on various image synthesis datasets. For all image domains, the method of the present embodiments utilizes a pre-trained StyleGAN2. For the facial domain, this pre-trained network was optimized for generating facial images distributed according to the FFHQ [23] dataset, whereas the inversion and editing capabilities of the method of the present embodiments are evaluated over images from the test set of the CelebA-HQ [22] dataset. The method of the present embodiments was also evaluated on Church and Horses from the LSUN [52] dataset, on automobiles from Stanford Cars [25], and on wildlife images from AFHQ-WILD [9]. The pre-trained models for these domains were acquired from the e4e [45] and HyperStyle [4] official GitHub repositories. The initial inverted vectors, w ∈ W*, were also obtained using the e4e encoders in these repositories. Since there is no pre-trained encoder to the W* space for the AFHQ-WILD dataset, the method of the present embodiments is evaluated on this dataset using the W latent space.

For all experiments, the Ranger [50] optimizer was used, with a learning rate of 0.001. The number of iterations and the loss coefficients used for the method of the present embodiments appear in Table 1, below. For the CelebA-HQ and AFHQ-Wild datasets the pixel-wise loss is the L2 distance, and for Stanford Cars, LSUN Church and LSUN Horses it is the L1 smooth distance with β=0.1. The difference in the number of iterations depends on the quality of the reconstruction of the original generator from w. Both LSUN Church and LSUN Horses resulted in a poor initial reconstruction, and required twice as many iterations.

Table 1

The loss coefficients and number of running iterations that were used by the method of the present embodiments for each of the different datasets.

Quantitative Evaluation

Table 2, below, presents four evaluation criteria for reconstruction quality over the CelebA-HQ dataset. The four metrics compare the input image I with the one generated by the inversion method, G(w, θ'). These metrics are the pixel-wise similarity, which is the Euclidean norm L2, the perceptual similarity, LPIPS [54], the structural similarity score, MS-SSIM [49], and the identity similarity, ID [21], between the input image, I, and its reconstruction, G(w, θ').
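Before turning to the table, the following sketch shows how these metrics can be computed; the lpips and pytorch_msssim packages are assumed to be available, the images are assumed to be scaled to [0, 1], and the identity network is represented by a user-supplied feature extractor.

import torch
import torch.nn.functional as F
import lpips                        # perceptual similarity (LPIPS)
from pytorch_msssim import ms_ssim  # structural similarity (MS-SSIM)

lpips_metric = lpips.LPIPS(net='alex')

def reconstruction_metrics(real, recon, id_features):
    """Pixel-wise L2, LPIPS, MS-SSIM and identity similarity between an input
    image and its reconstruction (N x 3 x H x W tensors in [0, 1])."""
    return {
        'L2': F.mse_loss(recon, real).item(),
        'LPIPS': lpips_metric(recon * 2 - 1, real * 2 - 1).mean().item(),  # LPIPS expects [-1, 1]
        'MS-SSIM': ms_ssim(recon, real, data_range=1.0).item(),
        'ID': F.cosine_similarity(id_features(recon), id_features(real), dim=-1).mean().item(),
    }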

Table 2

Reconstruction metrics on the CelebA-HQ [22] test set. Results for other methods are taken from [4], except for Near Perfect GAN Inversion [14], HFGI [5], HyperInverter [13] and Feature-Style Encoder [51], whose results were obtained from their respective papers.

The method of the present embodiments outperforms all other methods in all metrics except LPIPS, where the original StyleGAN2 inversion approach matches the method of the present embodiments. Table 3, below evaluates the reconstruction quality for the Stanford Cars dataset, Table 4 for the AFHQ-Wild dataset, Table 5 for the LSUN church dataset and Table 6 for the LSUN horses dataset, using three evaluation criteria: the pixel-wise Euclidean norm, L2, the perceptual similarity, LPIPS, and the MS-SSIM [49] score.

Table 3

Reconstruction metrics on the Stanford Cars dataset [25] test set. Results for other methods are taken from [4], except for Near Perfect GAN Inversion, whose results were obtained from [14].

Table 4

Reconstruction metrics for the AFHQ-Wild [9] test set. Results for other methods are taken from [4], except for Near Perfect GAN Inversion, whose results were obtained from [14].

Table 5

Reconstruction metrics on the LSUN churches (outdoor scene) dataset [52] test set. Results for other methods are taken from [51], except for HFGI, whose results were obtained from [5].

Table 6

Reconstruction metrics on the LSUN horses dataset [52] (on 500 randomly sampled images). Results for ReStyle and Near Perfect GAN Inversion are taken from [14].

The method of the present embodiments outperforms all other methods on the AFHQ-Wild and LSUN church datasets. In the Stanford Cars and the LSUN horses datasets, the method of the present embodiments surpasses other approaches in all evaluation metrics, except for the L2 metric, where it is second to Near Perfect GAN Inversion [14].

Qualitative Analysis

Inversion Quality

Following is a qualitative evaluation of reconstructed images. FIGs. 3A-G demonstrate the reconstruction of facial images taken from the CelebA-HQ [22] dataset, and FIGs. 4A-G demonstrate the reconstruction of car images taken from the Stanford Cars [25] dataset. The reconstructed images produced by the method of the present embodiments are compared with the following previous approaches: pSp [38], e4e [45], ReStyle-pSp [3], ReStyle-e4e [3], and HyperStyle [4].

As shown in FIGs. 3A-G, the method of the present embodiments is able to generate almost identical reconstructions. The first and third rows demonstrate the reconstruction of difficult examples. The method of the present embodiments produces a near-identical reconstruction, whereas all other methods struggle to achieve an inversion of good quality. The second row demonstrates the reconstruction of a relatively easy example. Although all methods are able to produce a meaningful reconstruction, only the method of the present embodiments truly preserves identity and properly reconstructs fine details (such as gaze, eye color, dimples, etc.).

As shown in FIGs. 4A-G, there is a large gap between the method of the present embodiments and the other approaches. Specifically, all baseline methods struggle to reconstruct various elements, both fine and coarse. In all three rows, only the method of the present embodiments is able to reconstruct the coarse shape of the car properly. In the first row, only the method of the present embodiments properly reconstructs: (1) the coarse shape of the car, (2) the shape of the headlight, (3) the fact that the lights are on, (4) the absence of a roof, and (5) the car's logo. In the second row, only the method of the present embodiments properly reconstructs: (1) the shape of the headlights, (2) the side mirrors, (3) the color tone, and (4) the reflection. In the third row, only the method of the present embodiments properly reconstructs: (1) the wheels, (2) the placement and size of the passenger window, and (3) the shape and color of the side mirror.

Editing Quality

The ability of the method of the present embodiments to retain the original editing directions of StyleGAN is shown in FIG. 6 for the CelebA-HQ dataset; the editing directions and tools were taken from InterFaceGAN [42]. A visual comparison with other approaches for the Stanford Cars and LSUN Church datasets is shown in FIGs. 7A-G and FIGs. 8A-G, respectively. For these datasets, the editing directions were taken from GANSpace [18]. As shown, unlike the other approaches, the method of the present embodiments retains the editing direction without harming image quality: for instance, on the automotive image it is the only approach that modifies the color uniformly, and in the outdoor image it does not introduce artifacts or change the image's content.
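As an illustration of how such latent-space editing is typically performed, the following minimal sketch shifts the inverted latent along a semantic direction and decodes it with the fine-tuned generator. The `direction` vector (for example, one obtained from InterFaceGAN [42] or GANSpace [18]) and the generator call signature are assumptions made for the purpose of the sketch.

def edit(generator, w, direction, strength):
    """Decode an edited version of the inverted latent w.

    `direction` is assumed to be a vector in the same latent space as w,
    and `strength` controls how far the latent is moved along it.
    """
    w_edited = w + strength * direction   # move along the editing direction
    return generator(w_edited)            # decode with the tuned generator weights

# Example usage: sweep the edit strength to obtain a sequence of edits.
# images = [edit(generator, w, age_direction, s) for s in (-3, -1, 0, 1, 3)]

Because the generator weights were tuned for this specific image, the same editing direction can be reused without re-inverting the edited result.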

Ablation study

In order to assess the contribution of the normalization scheme utilized by the method of the present embodiments, and the effectiveness of the parsing loss, the method of the present embodiments was evaluated on the first 200 samples from the test set of CelebA-HQ. First, it was tested whether the inputs to the modules Mi should be normalized using Scale Normalization (SN), Instance Normalization (IN) [46], or left untouched. Additionally, the experiments tested which normalization scheme is better applied inside each residual block. Table 7, below, summarizes the different normalization choices.

Table 7

Different normalization schemes applied to 200 samples from the test set of the CelebA-HQ dataset. MN stands for normalizing the inputs prior to the application of Mi, and BN stands for the normalization scheme applied in each residual block. SN is Scale Normalization [34] and IN is one-dimensional Instance Normalization [46]. As shown, not applying normalization in the residual block harms all evaluation metrics, whereas normalizing the inputs prior to feeding them into the gradient modification modules, Mi, is not necessary.
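For reference, the following sketch shows the two normalization options compared in Table 7. The ScaleNorm implementation follows the published definition in [34] (a single learned scale applied to l2-normalized activations); the residual block itself, including its layer types and tensor shapes, is an assumption for illustration rather than the actual architecture of the modules Mi.

import torch
import torch.nn as nn

class ScaleNorm(nn.Module):
    """Scale Normalization [34]: one learned scale g applied to the
    l2-normalized activations, here over the channel dimension."""
    def __init__(self, init_scale, dim=1, eps=1e-5):
        super().__init__()
        self.g = nn.Parameter(torch.tensor(float(init_scale)))
        self.dim, self.eps = dim, eps

    def forward(self, x):
        return self.g * x / (x.norm(dim=self.dim, keepdim=True) + self.eps)

class ResidualBlock(nn.Module):
    """Illustrative residual block on (batch, channels, length) tensors,
    with the normalization choice ablated in Table 7."""
    def __init__(self, channels, norm="SN"):
        super().__init__()
        self.conv1 = nn.Conv1d(channels, channels, kernel_size=1)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size=1)
        if norm == "SN":
            self.norm = ScaleNorm(channels ** 0.5)
        elif norm == "IN":
            self.norm = nn.InstanceNorm1d(channels)   # one-dimensional IN [46]
        else:
            self.norm = nn.Identity()                 # "no normalization" row
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x):
        h = self.act(self.conv1(self.norm(x)))
        return x + self.conv2(h)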

Table 8, below, shows the benefit of using the parsing loss. Although it harms the identity preservation (ID) and LPIPS scores, it does in fact lead to better-looking images.

Table 8

Effect of applying the parsing loss to 200 samples from the test set of the CelebA-HQ dataset.

FIGs. 9A-C show a comparison of applying the parsing loss to images from both the CelebA-HQ and AFHQ-Wild datasets. As shown, for the facial image, when the parsing loss is used, the reconstructed image retains the fine details in the lips. Surprisingly, even though the parsing loss utilizes a network that was trained on the task of facial semantic segmentation, it also improves the reconstruction quality for non-human facial domains. The benefits of using the localization term in EQ. 5 during optimization are shown in FIGs. 10A-B. The localization term prevents model drifting, which is often manifested as additional artifacts, in this case the diffusion of the skin color onto the teeth.
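A hedged sketch of a parsing-style loss of this kind is given below; `face_parser` is a placeholder for a frozen, pretrained semantic segmentation network, and the exact formulation used by the method (including the localization term of EQ. 5) may differ from this simplification.

import torch
import torch.nn.functional as F

def parsing_loss(face_parser, target, recon):
    """Compare per-pixel class probabilities of the input and the reconstruction.

    `face_parser` maps an image batch to segmentation logits of shape
    (N, classes, H, W); only the reconstruction branch keeps gradients,
    so the loss can be backpropagated into the generator.
    """
    with torch.no_grad():
        target_probs = F.softmax(face_parser(target), dim=1)
    recon_probs = F.softmax(face_parser(recon), dim=1)
    return F.l1_loss(recon_probs, target_probs)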

Conclusions

The proliferation of work devoted to effectively inverting StyleGANs indicates an acute need for such technologies, in order to perform image editing and semantic image manipulations. This Example presented a novel method that employs learned mappings between the loss gradient of a layer and a suggested shift to the layer's parameters. These shifts are used within an iterative optimization process, in order to fine-tune the StyleGAN generator. The learning of the mapping networks and the optimization of the generator occur concurrently and on a single sample.

A set of experiments considerably more extensive than those presented in recent relevant contributions was conducted. The experiments described herein show that the modified generator produces the target image more accurately than all other methods in this very active field. Moreover, it supports downstream editing tasks more convincingly than the alternatives.

Although the invention has been described in conjunction with specific embodiments thereof, it is evident that many alternatives, modifications and variations will be apparent to those skilled in the art. Accordingly, it is intended to embrace all such alternatives, modifications and variations that fall within the spirit and broad scope of the appended claims.

It is the intent of the applicant(s) that all publications, patents and patent applications referred to in this specification are to be incorporated in their entirety by reference into the specification, as if each individual publication, patent or patent application was specifically and individually noted when referenced that it is to be incorporated herein by reference. In addition, citation or identification of any reference in this application shall not be construed as an admission that such reference is available as prior art to the present invention. To the extent that section headings are used, they should not be construed as necessarily limiting. In addition, any priority document(s) of this application is/are hereby incorporated herein by reference in its/their entirety.

REFERENCES:

[1] Rameen Abdal, Yipeng Qin, and Peter Wonka. Image2stylegan: How to embed images into the stylegan latent space? 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 4431-4440, 2019.

[2] Rameen Abdal, Peihao Zhu, Niloy J Mitra, and Peter Wonka. Styleflow: Attribute-conditioned exploration of stylegan-generated images using conditional continuous normalizing flows. ACM Transactions on Graphics (TOG), 40(3):1-21, 2021.

[3] Yuval Alaluf, Or Patashnik, and Daniel Cohen-Or. Restyle: A residual-based stylegan encoder via iterative refinement. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2021.

[4] Yuval Alaluf, Omer Tov, Ron Mokady, Rinon Gal, and Amit H Bermano. Hyperstyle: Stylegan inversion with hypernetworks for real image editing. arXiv preprint arXiv:2111.15666, 2021.

[5] Qingyan Bai, Yinghao Xu, Jiapeng Zhu, Weihao Xia, Yujiu Yang, and Yujun Shen. High-fidelity gan inversion with padding space. ArXiv, abs/2203.11105, 2022.

[6] Amit H Bermano, Rinon Gal, Yuval Alaluf, Ron Mokady, Yotam Nitzan, Omer Tov, Or Patashnik, and Daniel Cohen-Or. State-of-the-art in the architecture, methods and applications of stylegan. arXiv preprint arXiv:2202.14020, 2022.

[7] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. Advances in neural information processing systems, 29, 2016.

[8] Xinlei Chen, Haoqi Fan, Ross B. Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. ArXiv, abs/2003.04297, 2020.

[9] Yunjey Choi, Youngjung Uh, Jaejun Yoo, and Jung-Woo Ha. Stargan v2: Diverse image synthesis for multiple domains. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020.

[10] Antonia Creswell and Anil Anthony Bharath. Inverting the generator of a generative adversarial network. IEEE transactions on neural networks and learning systems, 30(7):1967-1974, 2018.

[11] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4690-4699, 2019.

[12] Xin Ding, Yongwei Wang, Zuheng Xu, William J Welch, and Z. Jane Wang. CcGAN: Continuous conditional generative adversarial networks for image generation. In International Conference on Learning Representations, 2021.

[13] Tan M. Dinh, A. Tran, Rang Ho Man Nguyen, and Binh-Son Hua. Hyperinverter: Improving stylegan inversion via hypernetwork. ArXiv, abs/2112.00719, 2021.

[14] Qianli Feng, Viraj Shah, Raghudeep Gadde, Pietro Perona, and Aleix Martinez. Near perfect gan inversion. ArXiv, abs/2202.11833, 2022.

[15] Ross Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 1440-1448, 2015.

[16] Ian Goodfellow, Jean Pouget- Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neural information processing systems, 27, 2014.

[17] Shanyan Guan, Ying Tai, Bingbing Ni, Feida Zhu, Feiyue Huang, and Xiaokang Yang. Collaborative learning for faster stylegan embedding. arXiv preprint arXiv:2007.01758, 2020.

[18] Erik Härkönen, Aaron Hertzmann, Jaakko Lehtinen, and Sylvain Paris. Ganspace: Discovering interpretable gan controls. Advances in Neural Information Processing Systems, 33:9841-9850, 2020.

[19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026-1034, 2015.

[20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European conference on computer vision, pages 630-645. Springer, 2016.

[21] Yuge Huang, Yuhan Wang, Ying Tai, Xiaoming Liu, Pengcheng Shen, Shaoxin Li, Jilin Li, and Feiyue Huang. Curricularface: adaptive curriculum learning loss for deep face recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5901-5910, 2020.

[22] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.

[23] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4401-4410, 2019.

[24] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8110-8119, 2020.

[25] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), Sydney, Australia, 2013.

[26] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4681-4690, 2017.

[27] Cheng-Han Lee, Ziwei Liu, Lingyun Wu, and Ping Luo. Maskgan: Towards diverse and interactive facial image manipulation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

[28] Hanbang Liang, Xianxu Hou, and Linlin Shen. Ssflow: Style-guided neural spline flows for face image manipulation. In Proceedings of the 29th ACM International Conference on Multimedia, pages 79-87, 2021.

[29] Huan Ling, Karsten Kreis, Daiqing Li, SeungWook Kim, Antonio Torralba, and Sanja Fidler. Editgan: High-precision semantic image editing. In Advances in Neural Information Processing Systems (NeurIPS), 2021.

[30] Junyu Luo, Yong Xu, Chenwei Tang, and Jiancheng Lv. Learning inverse mapping by autoencoder based generative adversarial nets. In International Conference on Neural Information Processing, pages 207-216. Springer, 2017.

[31] Andrew L Maas, Awni Y Hannun, Andrew Y Ng, et al. Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML, volume 30(1), page 3. Citeseer, 2013.

[32] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.

[33] Eric Mitchell, Charles Lin, Antoine Bosselut, Chelsea Finn, and Christopher D. Manning. Fast model editing at scale. CoRR, 2021.

[34] Toan Q Nguyen and Julian Salazar. Transformers without tears: Improving the normalization of self-attention. arXiv preprint arXiv:1910.05895, 2019.

[35] Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with auxiliary classifier gans. In International conference on machine learning, pages 2642-2651. PMLR, 2017.

[36] Stanislav Pidhorskyi, Donald A Adjeroh, and Gianfranco Doretto. Adversarial latent autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14104-14113, 2020.

[37] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

[38] Elad Richardson, Yuval Alaluf, Or Patashnik, Yotam Nitzan, Yaniv Azar, Stav Shapiro, and Daniel Cohen-Or. Encoding in style: a stylegan encoder for image-to-image translation. arXiv preprint arXiv:2008.00951, 2020.

[39] Daniel Roich, Ron Mokady, Amit H Bermano, and Daniel Cohen-Or. Pivotal tuning for latent-based editing of real images. arXiv preprint arXiv:2106.05744, 2021.

[40] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234-241. Springer, 2015.

[41] Michael Rotman, Amit Dekel, Shir Gur, Yaron Oz, and Lior Wolf. Unsupervised disentanglement with tensor product representations on the torus. In International Conference on Learning Representations, 2022.

[42] Yujun Shen, Jinjin Gu, Xiaoou Tang, and Bolei Zhou. Interpreting the latent space of gans for semantic face editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9243-9252, 2020.

[43] Yujun Shen and Bolei Zhou. Closed-form factorization of latent semantics in gans. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1532-1540, 2021.

[44] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[45] Omer Tov, Yuval Alaluf, Yotam Nitzan, Or Patashnik, and Daniel Cohen-Or. Designing an encoder for stylegan image manipulation. ACM Transactions on Graphics (TOG), 40(4):1-14, 2021.

[46] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.

[47] Rui Wang, Jian Chen, Gang Yu, Li Sun, Changqian Yu, Changxin Gao, and Nong Sang. Attribute-specific control units in stylegan for fine-grained image manipulation. In Proceedings of the 29th ACM International Conference on Multimedia, pages 926-934, 2021.

[48] Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Yu Qiao, and Chen Change Loy. Esrgan: Enhanced super-resolution generative adversarial networks. In Proceedings of the European conference on computer vision (ECCV) workshops, pages 0-0, 2018.

[49] Zhou Wang, Eero P Simoncelli, and Alan C Bovik. Multiscale structural similarity for image quality assessment. In The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, volume 2, pages 1398-1402. IEEE, 2003.

[50] Less Wright. Ranger - a synergistic optimizer. www(dot)github(dot)com/lessw2020/Ranger-Deep- Learning-Optimizer, 2019.

[51] Xu Yao, Alasdair Newson, Yann Gousseau, and Pierre Hellier. Feature-style encoder for style-based gan inversion. ArXiv, abs/2202.02183, 2022.

[52] Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.

[53] Lingyun Zhang, Xiuxiu Bai, and Yao Gao. Sals-gan: Spatially-adaptive latent space in stylegan for real image embedding. In Proceedings of the 29th ACM International Conference on Multimedia, pages 5176-5184, 2021.

[54] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586-595, 2018.

[55] Jiapeng Zhu, Yujun Shen, Deli Zhao, and Bolei Zhou. In-domain gan inversion for real image editing. Lecture Notes in Computer Science, page 592-608, 2020.

[56] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pages 2223-2232, 2017.