StyleGAN : A Style-Based Generator Architecture for Generative Adversarial Networks
- Title : A Style-Based Generator Architecture for Generative Adversarial Networks
- arXiv ID : 1812.04948
- GitHub code : https://github.com/NVlabs/stylegan
- Authors : T. Karras, S. Laine, T. Aila
- Year : 2018
Overview
- Problem : While image generation based on generative adversarial networks (GANs) has seen rapid improvement in resolution and quality, the generators still operate as black boxes.
- The properties of the latent space and other aspects of the image synthesis process are still not fully understood.
- Solution : re-design the generator architecture to expose novel ways of controlling the synthesis process.
- The generator starts from a learned constant input and adjusts the “style” of the image at each convolution layer.
- Leads to automatic, unsupervised separation of high-level attributes (pose, identity) from stochastic variation (freckles, etc) in generated images.
- No modification of the discriminator or the loss function.
- Embeds the input latent code into an intermediate latent space.
- This frees the latent space from the restriction of having to follow the training data distribution, allowing it to be less entangled.
- Metrics : perceptual path length and linear separability.
- Face image synthesis based on generative adversarial networks (GANs) has shown steep improvement in resolution and overall quality, but the details of the synthesis process and the latent space have remained largely uncharted territory.
- As a result, fine-grained attributes of the synthesized photo, such as gender or age, have been difficult to control.
Style-based generator
- Mapping Network:
- Instead of injecting the latent code \(\mathbf{z} \in \mathcal{Z}\) directly, the paper maps it through \(f : \mathcal{Z} \rightarrow \mathcal{W}\) to an intermediate latent code \(\mathbf{w} \in \mathcal{W}\)
- This can be viewed as a way to draw samples for each style from a learned distribution.
- The dimensionality of \(\mathcal{Z}\) and \(\mathcal{W}\) is the same (512).
- The mapping is implemented with 8 fully-connected layers (see the sketch below).
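A minimal PyTorch sketch of such a mapping network. The layer width, activation, and input normalization are assumptions based on the paper's description, and `MappingNetwork` is a hypothetical name rather than the official implementation:

```python
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """Maps a latent code z in Z to an intermediate latent code w in W (both 512-D)."""
    def __init__(self, latent_dim=512, num_layers=8):
        super().__init__()
        layers = []
        for _ in range(num_layers):
            layers += [nn.Linear(latent_dim, latent_dim), nn.LeakyReLU(0.2)]
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        # Normalize z before the fully-connected stack (assumed pixel-norm style normalization).
        z = z / torch.sqrt(torch.mean(z ** 2, dim=1, keepdim=True) + 1e-8)
        return self.net(z)

w = MappingNetwork()(torch.randn(4, 512))  # four 512-D intermediate latent codes
```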
- Synthesis network (Generator):
- The mapped \(\mathbf{w}\) is then specialized by learned affine transformations into styles \(y = (y_{s}, y_{b})\) for adaptive instance normalization (AdaIN), which is applied at each layer of the generator.
- This can be viewed as a way to generate a novel image based on a collection of styles.
- \[\mathrm{AdaIN}(x_{i}, y) = y_{s,i}\,\frac{x_{i}-\mu(x_{i})}{\sigma(x_{i})} + y_{b,i}\]
- Each feature map is normalized separately, and then scaled/biased from style y.
- Thanks to the normalization, the statistics imposed by the style do not depend on the original feature statistics; each AdaIN modifies the relative importance of features for the following convolution.
- Modifying a specific subset of the styles can therefore be expected to affect only certain aspects of the image (see the AdaIN sketch below).
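A minimal AdaIN sketch following the formula above, including the learned affine transformation that specializes \(\mathbf{w}\) into \(y = (y_{s}, y_{b})\). Class and argument names are illustrative, not taken from the official code:

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    """Instance-normalizes each feature map, then applies a per-channel scale/bias (the style)."""
    def __init__(self, w_dim=512, num_channels=64):
        super().__init__()
        # Learned affine transform that specializes w into the style y = (y_s, y_b).
        self.affine = nn.Linear(w_dim, num_channels * 2)

    def forward(self, x, w):
        # x: (N, C, H, W) feature maps, w: (N, w_dim) intermediate latent codes
        y_s, y_b = self.affine(w).chunk(2, dim=1)   # scale and bias, (N, C) each
        mu = x.mean(dim=(2, 3), keepdim=True)
        sigma = x.std(dim=(2, 3), keepdim=True) + 1e-8
        x_norm = (x - mu) / sigma                   # normalize each feature map separately
        return y_s[:, :, None, None] * x_norm + y_b[:, :, None, None]

out = AdaIN()(torch.randn(2, 64, 16, 16), torch.randn(2, 512))  # styled feature maps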
- Style Mixing
- Mixing regularization : a given percentage of images (90% in the paper) is generated using two random latent codes instead of one during training.
- How? Style mixing : at a randomly selected point in the synthesis network, switch from one latent code to the other (see the sketch after this list).
- Prevents the network from assuming that adjacent styles are correlated.
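A sketch of mixing regularization, under the assumption that each synthesis layer consumes one style vector; `sample_layer_styles` and its arguments are hypothetical helpers, while the 90% probability follows the paper:

```python
import random
import torch

def sample_layer_styles(mapping, num_layers, batch_size=4, mixing_prob=0.9, latent_dim=512):
    """Return one style vector per synthesis layer, switching latent codes at a random layer."""
    w1 = mapping(torch.randn(batch_size, latent_dim))
    if random.random() < mixing_prob:
        w2 = mapping(torch.randn(batch_size, latent_dim))
        crossover = random.randint(1, num_layers - 1)   # layer index where the code switches
    else:
        w2, crossover = w1, num_layers
    # Layers before the crossover point use w1, the remaining layers use w2.
    return [w1 if i < crossover else w2 for i in range(num_layers)]

# Example with the MappingNetwork sketch above and an 18-layer synthesis network:
# styles = sample_layer_styles(MappingNetwork(), num_layers=18)
```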
- Stochastic variation
- Stochastic factors (ex : hair placement, freckles) can be randomized
- Previous : the network had to invent its own way of generating spatially-varying pseudorandom numbers, which consumes network capacity and is not always successful.
- Solution : add per-pixel noise after each convolution, scaled by learned per-channel factors (sketched after this list).
- Noise only affects inconsequential stochastic variation
- Spatially invariant statistics (gram matrix, channel-wise mean/variance, etc) reliably encode the style of an image, while variant features encode specific instances.
- Since the style scales and biases complete feature maps with the same per-channel values, global effects can be controlled coherently.
- The noise is applied per-pixel, thus suited for controlling stochastic variation.
- If the generator attempted to use noise to control global effects, the decision would be spatially inconsistent and thus penalized by the discriminator.
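A minimal sketch of noise injection, assuming the paper's scheme of single-channel per-pixel noise broadcast to all feature maps with learned per-channel scaling; names are illustrative:

```python
import torch
import torch.nn as nn

class NoiseInjection(nn.Module):
    """Adds single-channel per-pixel Gaussian noise, scaled by a learned per-channel factor."""
    def __init__(self, num_channels):
        super().__init__()
        self.scale = nn.Parameter(torch.zeros(1, num_channels, 1, 1))  # learned scaling factors

    def forward(self, x, noise=None):
        if noise is None:
            # Fresh noise for every call: one value per pixel, broadcast across all feature maps.
            noise = torch.randn(x.shape[0], 1, x.shape[2], x.shape[3], device=x.device)
        return x + self.scale * noise

out = NoiseInjection(64)(torch.randn(2, 64, 16, 16))  # same shape, with stochastic detail added
```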
Disentanglement studies
- Previous : if some combination of attributes is missing from the training set (ex : long-haired males), a generator sampling from a fixed input latent distribution must warp the mapping so that the missing combination is never produced; this forces the mapping to be non-linear and prevents full disentanglement.
- Solution : the intermediate latent space \(\mathcal{W}\) is not fixed in the same manner; it is instead induced by the learned piecewise continuous mapping \(f\).
- Since it should be easier for the generator to produce realistic images from a disentangled representation than from an entangled one, training pressure pushes the factors of variation to become more linear in \(\mathcal{W}\).
- Previously proposed disentanglement metrics require an encoder network, which is unsuitable for the StyleGAN architecture, so two new metrics are introduced :
- Perceptual path length
- Interpolating latent-space vectors can produce non-linear changes in the image (ex : a feature appearing out of nowhere during linear interpolation), which indicates entanglement of the latent space.
- Measure the level of change during interpolation in latent space:
- A less “curved” (i.e. more linear) latent space should show smoother transitions.
- Perceptual path length : weighted difference between VGG16 embeddings.
- Given \(z_1, z_2\) from the input latent space, \(t \sim Uniform(0,1)\), \(G\) as the generator, and \(d(\cdot,\cdot)\) as the perceptual difference between two generated images: \(l_{\mathcal{Z}} = \mathbb{E}\left[\frac{1}{\epsilon^2}\,d\big(G(\mathrm{slerp}(z_1,z_2;t)),\, G(\mathrm{slerp}(z_1,z_2;t+\epsilon))\big)\right]\)
- The total perceptual length of a path is the sum of the perceptual differences over each segment.
- Epsilon : a small subdivision step (\(\epsilon = 10^{-4}\) in the paper); see the estimation sketch below.
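A sketch of how \(l_{\mathcal{Z}}\) could be estimated by Monte Carlo sampling; `generator` and `perceptual_distance` (e.g., a VGG16-embedding distance) are stand-ins for whatever implementations are available, not the official evaluation code:

```python
import torch

def slerp(a, b, t):
    """Spherical interpolation between latent vectors a and b at fraction t."""
    a_n = a / a.norm(dim=-1, keepdim=True)
    b_n = b / b.norm(dim=-1, keepdim=True)
    omega = torch.acos((a_n * b_n).sum(dim=-1, keepdim=True).clamp(-1.0, 1.0))
    return (torch.sin((1 - t) * omega) * a + torch.sin(t * omega) * b) / torch.sin(omega)

def path_length_z(generator, perceptual_distance, num_samples=100, eps=1e-4, latent_dim=512):
    """Monte Carlo estimate of l_Z: perceptual change per interpolation step in Z."""
    total = 0.0
    for _ in range(num_samples):
        z1, z2 = torch.randn(1, latent_dim), torch.randn(1, latent_dim)
        t = torch.rand(1)
        img_a = generator(slerp(z1, z2, t))
        img_b = generator(slerp(z1, z2, t + eps))
        total += perceptual_distance(img_a, img_b) / (eps ** 2)
    return total / num_samples
```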