PIE: Portrait Image Embedding for Semantic Control

제목 : PIE : Portrait Image Embedding with Semantic Control
아카이브 ID : 2009.09485
깃허브 코드 :
저자 : A.Tewari, M.Zollhofer, C.Theobalt 등
발표 년도 : 2020
컨퍼런스 :

Abstract

주제 : 얼굴 이미지 편집
선행조건 : Parametrized control with semantic meaning
이전시도 : Synthetic StyleGAN image 한정
해결책 : Real Image 대상의 Pose / Expression / Illumination 편집
방법 1 : StyleRig (pretrained NN)
- 3DMM –> StyleRig –> StyleGAN latent space
- non-linear optimization problem
방법 2 : Identity preservation energy term ($E(w)$)
- 중요 1 : $StyleGAN(w)$ 의 high fidelity 보장
- 중요 2 : Expression 등의 편집을 가능케 함
- 중요 3 : Identity 등의 보존
- Non-linear optimization : Network weight를 추가로 배우는 것이 아니며, pretrained 된 네트워크를 기반으로 구성되므로 ground truth image 가 필요없다.

Notation

$I$ : 진짜 얼굴 이미지 (입력 이미지)
$w$ : I 의 편집을 가능케 하는, StyleGAN latent embedding (출력 코드)
- $v$ : 다른 general StyleGAN embedding
$\theta$ : 3DMM parameter
- $\theta =(\phi, \rho, \alpha, \delta, \beta, \gamma ) \in \mathbb{R}^{257}$
  - $(\phi, \rho) \in \mathbb{R}^{6}$ : rotation / translation
  - $\alpha \in \mathbb{R}^{80}$ : identity
  - $\beta \in \mathbb{R}^{64}$ : expression
  - $\delta \in \mathbb{R}^{80}$ : texture
  - $\gamma \in \mathbb{R}^{27}$ : illumination
- $\tau \in {\phi, \beta, \gamma }$ : editable semantic variables
  - $\theta^{\tau}$ : extraction of $\tau$ component
  - $\theta’ = [\theta^{\bar{\tau}}_1, \theta^{\tau}_2]$ : combine $\bar{\tau}$ components of $\theta_1$ with $\tau$ components of $\theta_2$
  - $\theta(v), \theta_v$ : StyleGAN 이미지 $I(v)$ 에서 추출된 3DMM parameter
    - Pretrain 된 Model-based Face AutoEncoder (MoFA) 네트워크를 사용하여 StyleGAN을 통해 생성된 이미지에서 3DMM parameter를 추출한다.

1. Person-specific Video Editing (Model based)

단일 인물의 많은 양의 사진을 필요로 하는 경우; 긴 길이의 비디오가 필요하며, 특정 인물의 단일 이미지로 사용될 수 없음

Source Video & Target Video
- Face2Face (Thies, 2016~2019)
  - (Source Video Expression) + (Target Video)
  - Target의 비디오 필요, pose / illumination 편집 불가
- Deep Video Portraits (Kim, 2018)
  - Expression / Pose 편집
  - Speaking Style 편집 불가
- Neural Style-Preserving Visual Dubbing (Kim, 2019)
  - Speaking Style 을 보존하면서 Expression / Pose 편집
    - Style Translation Network
Source Video & Target Face
- Recycle-GAN (Bansal, 2018)
  - Speaking Style 을 보존하면서 Expression / Pose 편집
  - Spatio-temporal video domain의 recycle loss

3. Few-shot Editing

단일 인물의 적은 양의 사진을 필요로 하는 경우;

X2Face (Wiles, 2018)
- Encoder-decoder 구조, 정면 얼굴 및 음성을 입력으로 받는다.
Few-shot Video-to-Video (Wang, 2019)
- 스케치 영상을 사진으로 만드는 네트워크 + attention mechanism
Realistic Neural Talking Head Models (Zakharov, 2019)
- 1. Generator (Landmark –> Image)
- 1. Embedding (Representation for generator conditioning)
- 1. Discriminator

4. Single-shot Editing

paGAN (Nagano, 2018)
- Personalized avatar from a single image
- Does not synthesize photo-realistic hair
Bringing Portraits to Life (Averbuch-Elor, 2017)
- Perform 2D Warp to face image, to animate expression and pose
Warp-Guided GAN (Geng, 2018)
- Deep Generative models for higher quality
- Spatial Motion Field
- Network 1(Skin Detail)
- Network 2(Mouth Interior)
First Order Model (Siarohin, 2019)
- Detect keypoints & generate warp field out of it.
- Can work both for face and full-body image

5. Image Editing with StyleGAN

Image2StyleGAN++
- Embed real image into StyleGAN latent space, with high fidelity
InterFaceGAN
StyleFlow
In-domain GAN Inversion
StyleRig

Semantic Editing of Real Facial Images

StyleRig의 한계 : 진짜 얼굴 이미지가 아닌, StyleGAN 합성 이미지만 사용 가능.
- 해결책 : 얼굴 이미지를 StyleGAN latent embedding으로 변환
  - 가장 말이 되는 방향으로 찾는 점이 중요

$E(w) = E_{synth}(w) + E_{identity}(w) + E_{edit}(w) + E_{invariance}(w) + E_{recognition}(w)$

High-Fidelity Image Synthesis (E_synth)

$E_{synth}(w) = \lambda_{l_2}|I-I_w|^2_2+\lambda_{p}|\Phi(I)-\Phi(I_w) |^2_2$

latent code $w$ 를 바탕으로 구성한 StyleGAN 이미지 $I_w$ 가 원본 이미지 $I$와 유사하도록 한다.
Term 1 : 두 이미지 간 L2 Distance
Term 2 : 두 이미지 간 Perceptual Distance
- $\Phi(\cdot)$ : VGG-16 layer 에서 얻어지는 feature들
이 공식 만 가지고 원본 이미지와 유사한 이미지를 생성할 수 있지만, 편집을 하는 목적에는 suboptimal 하다.

Face Image Editing : Identity Preservation (E_identity)

$E_{identity}(w) = \lambda_{identity}|w-RigNet(w, \theta^\tau_w) |^2_2$

RigNet을 통하여 latent code $w$를 재구성 하여도 원래 $w$ 와 똑같아야 한다.

Face Image Editing : Editing Property (E_edit)

$\forall v : I_v=I_{([\theta^{\bar{\tau}}_v,\theta^{\tau}_{RigNet(w, \theta^{\tau}_v)}])}$ : Edit Property $(\theta^{\tau}_{v} \approx RigNet(w, \theta^{\tau}_{v}))$

$\theta_v$의 non-semantic component와, RigNet을 통하여 재구성된 $\theta_v$의 semantic component를 합친 결과로 이미지를 구성시, 원래 StyleGAN 이미지 $I_v$ 와 같아야 한다.
- 물론 StyleGAN 이미지와 mesh rendered 이미지가 같을 순 없지만, 그 차이를 최소화 해야한다.

$\ell(I’, \theta) = \lambda_{photometric}|I’ - I_\theta|^2_{face} + \lambda_{landmark}|\mathcal{L}_{I’} - \mathcal{L}_\theta |^2_F$

Term 1 : Photometric loss : $I’$ 와 3DMM parameter $\theta$를 토대로 렌더링한 $I_\theta$ 사이, 얼굴 부분에만 한정 지은 L2 Distance
Term 2 : Landmark loss : 두 이미지의 랜드마크 ($\mathcal{L}_I \in \mathbb{R}^{66\times2}$) 간 Frobenius norm

$E_{edit}(w) = \lambda_{edit}\mathbb{E}_v[\ell(I_v, [\theta^{\bar{\tau}}_v, \theta^{\tau}_{RigNet(w, \theta^{\tau}_v)}])]$

Face Image Editing : Editing Property (E_invariance)

$\forall v : I=I_{([\theta^{\bar{\tau}}_{RigNet(w, \theta^{\tau}_v)}, \theta^{\tau}_I, ])}$ : Invariance Property $(\theta^{\bar{\tau}}_w \approx RigNet(w, \theta^{\tau}_v)$

RigNet을 통하여 재구성된 $\theta_w$의 non-semantic component와, $\theta_I$의 semantic component를 합친 결과로 이미지를 구성시, 원래 이미지 $I$와 같아야 한다.

$E_{invariance}(w) = \lambda_{invariance}\mathbb{E}_v[\ell(I, [\theta^{\bar{\tau}}_{RigNet(w, \theta^{\tau}_v)}, \theta^{\tau}_I])]$

Face Recognition

VGG-Face에서 사람이 얼굴 인지와 관련된 feature들을 사용하여 Face Recognition Consistency를 유지한다.
- $\Psi(I)$ : VGG-Face를 통해 얻어진 feature들

$\ell_{recog}(I’, v) = | \Psi(I’) - \Psi(I_v)|^2_F$ $E_{recognition} = \lambda_{r_w}\ell_{recog}(I, w) + \lambda_{r_{\hat{w}}}\mathbb{E}_{V}[\ell_{recog}(I, RigNet(w, \theta^{\tau}_{v})]$

Optimization

$E(w)$ 에는 StyleGAN, MoFA, $\Psi$, $\Phi$ 등의 non-linear function등이 pretrained neural network의 형태로 사용된다.
Tensorflow 기반의 AdaDelta optimization을 사용하며, 각 Iteration 마다 다른 v값을 사용한다.
Hierarchical Optimization:
- StyleGAN 의 Latent Space 은 Hierarchical order로, 여러개의 $ W = 512$ 가 쌓여서 $ W^{+} = 18 \times 512$를 이루는 형태.
- $W^{+}$는 이미지를 표현하는데 최상의 latent space이지만, 직접적으로 $W^{+}$ 에 optimization을 실행하는 것은 좋은 퀄러티의 결과물을 생산하지 않는다.
  - 해당 결과물은 StyleGAN의 prior distribution과 멀리 떨어진 경우가 많기 때문
- 따라서, $W$ 상에서 optimization을 실행한 후 (2000 step), $W^{+}$ 로 결과물을 변환시킨후 다시 optimization을 실행한다 (1000 step)
  - 이리 함으로서 coarse-to-fine 구조를 확립시킨다.
    
    Written with StackEdit.