Abstract:
The inversion of real images into StyleGAN's latent space is a well-studied problem. Nevertheless, applying existing approaches to real-world scenarios remains an open challenge, due to an inherent trade-off between reconstruction and editability: latent space regions which can accurately represent real images typically suffer from degraded semantic control. Recent work proposes to mitigate this trade-off by fine-tuning the generator to add the target image to well-behaved, editable regions of the latent space. While promising, this fine-tuning scheme is impractical for prevalent use as it requires a lengthy training phase for each new image. In this work, we introduce this approach into the realm of encoder-based inversion. We propose HyperStyle, a hypernetwork that learns to modulate StyleGAN's weights to faithfully express a given image in editable regions of the latent space. A naive modulation approach would require training a hypernetwork with over three billion parameters. Through careful network design, we reduce this to be in line with existing encoders. HyperStyle yields reconstructions comparable to those of optimization techniques with the near real-time inference capabilities of encoders. Lastly, we demonstrate HyperStyle's effectiveness on several applications beyond the inversion task, including the editing of out-of-domain images which were never seen during training.
CVPR 2022
Overview
HyperStyle introduces hypernetworks for learning to refine the weights of a pre-trained StyleGAN generator with respect to a given input image.
Doing so enables optimization-level reconstructions with encoder-like inference times and high editability.
Most works studying inversion search for a latent code that most accurately reconstructs a given image.
Some recent works have proposed a per-image fine-tuning of the generator weights to achieve a high-quality reconstruction for a given target image.
With HyperStyle, we aim to bring these generator-tuning approaches into the realm of interactive applications by adapting them to an encoder-based scheme.
We train a single hypernetwork to learn how to refine the generator weights with respect to a desired target image.
By learning this mapping, HyperStyle efficiently predicts the desired generator weights in less than 2 seconds per image, making it applicable to a wide range of applications.
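To make this concrete, the following minimal PyTorch-style sketch shows one way such predicted offsets can modulate the generator's convolutional weights. The names (RefinementHead, modulate_conv_weight) and the channel-wise, multiplicative form of the update are illustrative assumptions for this sketch, not the official implementation.

    import torch
    import torch.nn as nn

    class RefinementHead(nn.Module):
        # Hypothetical per-layer head: maps a shared image feature to one
        # multiplicative offset per output channel of a generator conv layer.
        def __init__(self, feat_dim, out_channels):
            super().__init__()
            self.fc = nn.Linear(feat_dim, out_channels)

        def forward(self, feats):
            # feats: (batch, feat_dim) -> offsets: (batch, out_channels)
            return self.fc(feats)

    def modulate_conv_weight(weight, offset):
        # weight: (out_c, in_c, k, k) pre-trained generator conv weight
        # offset: (out_c,) offset predicted for a single input image
        # Refined weight: w_hat = w * (1 + delta); the latent code itself is untouched.
        return weight * (1.0 + offset.view(-1, 1, 1, 1))

Because the refinement lives in weight space rather than latent space, the inverted code can remain in a well-behaved region of the latent space.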
Below, we illustrate HyperStyle's reconstructions alongside the original input images.
We additionally demonstrate that even after modifying the pre-trained generator weights, HyperStyle preserves the appealing structure and semantics of the original latent space and generator.
This allows one to easily leverage off-the-shelf editing techniques such as StyleCLIP and InterFaceGAN on the resulting inversions.
Below, we show various edits achieved with HyperStyle and StyleCLIP including changes to expression, facial hair, make-up, and hairstyle.
These semantics are also well preserved in other domains, such as cars, allowing us to easily apply various edits using GANSpace.
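As a rough illustration, applying such an edit to an inverted code typically amounts to shifting it along a pre-computed semantic direction before synthesizing with the refined generator. The tensor shapes and the refined_generator handle below are placeholder assumptions, not an official API.

    import torch

    def apply_linear_edit(w_inv, direction, strength):
        # w_inv:     inverted latent code, e.g. shape (1, 18, 512) in W+
        # direction: editing direction (e.g. from InterFaceGAN or GANSpace),
        #            broadcastable to the shape of w_inv
        # strength:  scalar controlling edit magnitude; its sign flips the edit
        return w_inv + strength * direction

    # Illustrative usage with placeholder objects:
    # w_edit = apply_linear_edit(w_inv, smile_direction, strength=3.0)
    # edited_image = refined_generator.synthesis(w_edit)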
Interestingly, weight offsets learned by HyperStyle trained on FFHQ are also applicable for modifying fine-tuned generators.
This allows us to transfer a source image to a target style while preserving identity and other facial features.
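A rough sketch of this idea follows, assuming the predicted offsets are stored per parameter name and that the fine-tuned generator shares the FFHQ generator's architecture; all names below are placeholders rather than the released code.

    def transfer_offsets(finetuned_state_dict, predicted_offsets):
        # finetuned_state_dict: weights of a generator fine-tuned to a target style (e.g. toons)
        # predicted_offsets:    per-parameter multiplicative offsets predicted by HyperStyle
        #                       with respect to the original FFHQ generator
        refined = {}
        for name, weight in finetuned_state_dict.items():
            if name in predicted_offsets:
                # Re-use the same w * (1 + delta) modulation on the fine-tuned weights.
                refined[name] = weight * (1.0 + predicted_offsets[name])
            else:
                refined[name] = weight
        return refined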
HyperStyle generalizes well to out-of-domain images, even those never observed during the training of the hypernetwork or the generator.
This suggests that HyperStyle does not merely learn to correct specific flawed attributes, but rather learns to refine the generator in a more general sense.
Reference
@misc{alaluf2021hyperstyle,
      title={HyperStyle: StyleGAN Inversion with HyperNetworks for Real Image Editing},
      author={Yuval Alaluf and Omer Tov and Ron Mokady and Rinon Gal and Amit H. Bermano},
      year={2021},
      eprint={2111.15666},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}