StyleRig : Rigging StyleGAN for 3D Control over Portrait Images
- arXiv ID : 2004.00121
- GitHub code :
- Authors : A. Tewari, M. Zollhöfer, C. Theobalt, et al.
- Year of publication : 2020
- Conference :
- StyleGAN : $I_w = StyleGAN(w)$
- latent code w : $w \in \mathbb{R}^l$
- $l = 18 \times 512$
- output of the mapping network in StyleGAN, disentangled
- 18 latent vectors are used for different resolutions.
- result portrait image : $I_w \in \mathbb{R}^{3 \times w \times h}$
- Quality and resolution are good, but there is no semantic control over the output (pose, expression, illumination)
- StyleRig : $\hat{w} = RigNet(w, p)$ such that $I_{\hat{w}}=StyleGAN(\hat{w})$
- 3DMM parameters : $p=(\alpha, \beta, \delta, \gamma, R, t) \in \mathbb{R}^f$
- Identity : $\alpha \in \mathbb{R}^{80}$
- Texture : $\beta \in \mathbb{R}^{80}$
- Expression : $\delta \in \mathbb{R}^{64}$
- Illumination : $\gamma \in \mathbb{R}^{27}$
- Rotation : $R \in SO(3)$
- Translation : $t \in \mathbb{R}^3$
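As a quick check on the dimensions listed above, the following sketch lays the six parameter groups out in one flat vector $p$. It assumes that the rotation $R \in SO(3)$ is stored as three Euler angles, which these notes do not state; under that assumption $f = 257$. The names `PARAM_DIMS`, `param_slices`, `SLICES` and `F_DIM` are illustrative only.

```python
# Sizes of each 3DMM parameter group, taken from the list above.
# ASSUMPTION: the rotation R in SO(3) is stored as 3 Euler angles.
PARAM_DIMS = {
    "alpha": 80,   # identity
    "beta": 80,    # texture
    "delta": 64,   # expression
    "gamma": 27,   # illumination
    "R": 3,        # rotation (assumed Euler angles)
    "t": 3,        # translation
}


def param_slices(dims):
    """Return {name: slice} giving each group's position in the flat vector p."""
    slices, start = {}, 0
    for name, size in dims.items():
        slices[name] = slice(start, start + size)
        start += size
    return slices


SLICES = param_slices(PARAM_DIMS)
F_DIM = sum(PARAM_DIMS.values())  # length f of p under the assumption above

p = [0.0] * F_DIM
p[SLICES["R"]] = [0.1, 0.2, 0.3]  # overwrite the rotation group only
```

Keeping the groups addressable by name makes it easy to edit a single semantic property (for example rotation) while leaving the rest of $p$ untouched.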
Network architecture
- Linear Two-layer perceptron (MLP)
- 2-way cycle consistency losses
- Differentiable Face Reconstruction
- Maps latent code to 3DMM parameters
- $\mathcal{F} : \mathbb{R}^l \to\mathbb{R}^f$
- $p_w = \mathcal{F}(w)$
- 3 Layer MLP, ELU activation, self supervised
- Render Layer : maps 3DMM to rendered face image
- $\mathcal{R} : \mathbb{R}^{f} \to \mathbb{R}^{3 \times w \times h}$
- $S_w = \mathcal{R}(p_w)$
- Rendering loss : combination of a photometric alignment loss and a sparse landmark loss
- $\mathcal{L}_{render}(I_w, p) = \mathcal{L}_{photo}(I_w, p) + \lambda_{land}\mathcal{L}_{land}(I_w, p)$
- Photometric alignment loss : L2 loss between $I_w$ and the rendering $\mathcal{R}(p)$, inside the binary mask $M$ (the region where the face mesh is rendered)
- $\mathcal{L}_{photo}(I_w, p) = \| M \odot (I_w - \mathcal{R}(p)) \|^2_2$
- Sparse landmark loss : L2 loss between 66 automatically detected landmarks $L_{I_w}$ on the image $I_w$ and the corresponding landmarks $L_M$ of the rendered face model
- $\mathcal{L}_{land}(I_w, p) = \| L_{I_w} - L_{M} \|^2_2$
- Statistical regularization is applied to the parameters of the face model
- Once trained, weights of $\mathcal{F}$ are fixed
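The rendering loss above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the differentiable renderer and the landmark detector are out of scope, so the rendered image and both landmark sets are passed in as arrays. The function names, the `(66, 2)` landmark shape and the value of `lambda_land` are assumptions made for the example.

```python
import numpy as np


def photometric_loss(image, rendering, mask):
    """Squared L2 norm of the masked difference between image and rendering."""
    diff = mask * (image - rendering)
    return float(np.sum(diff ** 2))


def landmark_loss(landmarks_image, landmarks_model):
    """Squared L2 distance between two (66, 2) arrays of 2D landmarks."""
    return float(np.sum((landmarks_image - landmarks_model) ** 2))


def render_loss(image, rendering, mask, lm_image, lm_model, lambda_land=1.0):
    # lambda_land=1.0 is a placeholder weight, not a value from the paper.
    photo = photometric_loss(image, rendering, mask)
    land = landmark_loss(lm_image, lm_model)
    return photo + lambda_land * land


# Toy 3 x 4 x 4 example: the rendering is off by 0.5 everywhere,
# but the mask only covers a 2 x 2 "face" region.
image = np.ones((3, 4, 4))
rendering = np.full((3, 4, 4), 0.5)
mask = np.zeros((1, 4, 4))
mask[:, 1:3, 1:3] = 1.0
lm = np.zeros((66, 2))
loss = render_loss(image, rendering, mask, lm, lm + 1.0, lambda_land=0.5)
```

Pixels outside the mask contribute nothing, so background and hair, which the face model cannot explain, do not affect the photometric term.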
RigNet Encoder & Decoder
- Encoder : Maps latent code to lower-dimensional vector $I$
- $I \in \mathbb{R}^{18\times32}$
- Independent linear encoders for each $i \in \{0, \dots, 17\}$
- Decoder : Maps $I$ and 3DMM parameters $p$ to output $\hat{w}$
- Independent linear decoders for each $i \in \{0, \dots, 17\}$
- Each layer concatenates $I_i$ and $p$ and transforms to $d_i$
- Final output $\hat{w} = d + w$
- Input : Two separate (latent code, image) pairs; $(v, I_v)$, $(w, I_w)$
- Output : latent code $\hat{w}$ that keeps the identity, texture, etc. of the original image $I_w$ while taking the pose, expression, etc. of the new image $I_v$.
- Loss structure : $\mathcal{L}_{total} = \mathcal{L}_{rec} + \mathcal{L}_{edit} + \mathcal{L}_{consist}$
- Optimization : AdaDelta w/ lr = 0.01
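The encoder/decoder structure described in this section can be sketched as a forward pass in NumPy. The weights here are random and untrained, biases are omitted, and `F_DIM = 257` assumes the rotation is stored as three Euler angles; all names are illustrative rather than taken from any released code.

```python
import numpy as np

N_LAYERS, W_DIM, CODE_DIM = 18, 512, 32  # 18 x 512 latent code, 18 x 32 encoding
F_DIM = 257  # ASSUMPTION: 80 + 80 + 64 + 27 + 3 (Euler angles) + 3

rng = np.random.default_rng(0)
# One independent linear encoder and decoder per latent vector (untrained).
enc_weights = rng.normal(scale=0.01, size=(N_LAYERS, CODE_DIM, W_DIM))
dec_weights = rng.normal(scale=0.01, size=(N_LAYERS, W_DIM, CODE_DIM + F_DIM))


def rignet(w, p):
    """w: (18, 512) latent code, p: (F_DIM,) 3DMM parameters -> w_hat: (18, 512)."""
    d = np.empty_like(w)
    for i in range(N_LAYERS):
        code_i = enc_weights[i] @ w[i]            # encode w_i to 32 dimensions
        joint = np.concatenate([code_i, p])       # append the 3DMM parameters
        d[i] = dec_weights[i] @ joint             # decode back to 512 dimensions
    return w + d                                  # residual output: w_hat = d + w


w = rng.normal(size=(N_LAYERS, W_DIM))
p = rng.normal(size=F_DIM)
w_hat = rignet(w, p)
```

Because the output is `w + d`, a decoder that outputs zeros leaves the latent code unchanged, which is a convenient starting point for learning small edits.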
Reconstruction Loss
- L2 loss between the original latent code and the reconstructed latent code
- Anchors the learned mapping at the right location, prevents degradation of image quality.
- Note that DFR ($\mathcal{F}$) is pretrained, thus the semantics of control space are enforced.
- $\mathcal{L}_{rec} = \| RigNet(w, \mathcal{F}(w)) - w \|^2_2$
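A minimal sketch of the reconstruction loss, written against two callables so that it does not depend on any particular network code. `toy_rignet` and `toy_face_recon` are hypothetical stand-ins for RigNet and the pretrained $\mathcal{F}$, included only so the example executes.

```python
import numpy as np


def reconstruction_loss(rignet, face_recon, w):
    """Squared L2 distance between RigNet(w, F(w)) and w for one latent code."""
    w_rec = rignet(w, face_recon(w))
    return float(np.sum((w_rec - w) ** 2))


# Hypothetical stand-ins, only to make the sketch executable.
def toy_face_recon(w):
    return w.mean(axis=0)[:4]   # pretend these 4 numbers are 3DMM parameters


def toy_rignet(w, p):
    return w + 0.1              # a "rig" that shifts every entry by 0.1


w = np.zeros((18, 512))
loss = reconstruction_loss(toy_rignet, toy_face_recon, w)
```

The loss is zero exactly when feeding a latent code's own parameters back into the rig returns the same latent code.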
Cycle-consistent, Per-pixel loss
- No ground truth (only one picture per person), thus we use cycle-consistent editing & consistency loss.
- Input $(w, I_w)$ : latent code & image to which semantics are transferred
- Input $(v, I_v)$ : latent code & image from which semantics are transferred
- Goal: $I_{\hat{w}}$ should correspond to $I_w$ while being modified according to $p_v$.
- $p_v = \mathcal{F}(v)$
- $\hat{w} = RigNet(w, p_v)$
- $I_{\hat{w}} = StyleGAN(\hat{w})$
- $p_{\hat{w}} = \mathcal{F}(\hat{w})$
- Enforces that $\hat{w}$ takes on the edited parameters of $p_v$ (e.g. $p_{\hat{w}}$’s rotation should be similar to $p_v$’s), while keeping the remaining parameters of $p_w$
- Why not compare $p_v$ and $p_{\hat{w}}$ directly?
- Bad practice; the same numerical error in different parameters can have a very different perceptual effect in image space, so the comparison is done on rendered images instead
Editing Loss
- $p_{edit}$ : modified version of $p_v$ in which all parameters being edited are replaced with those of $p_{\hat{w}}$.
- $\mathcal{L}_{edit} = \mathcal{L}_{render}(I_v, p_{edit})$
Consistency Loss
- $p_{consist}$ : modified version of $p_w$ in which all parameters that need to be kept are replaced with those of $p_{\hat{w}}$.
- $\mathcal{L}_{consist} = \mathcal{L}_{render}(I_w, p_{consist})$
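The construction of $p_{edit}$ and $p_{consist}$ can be sketched as swapping parameter groups between flat vectors. The example edits only the rotation. The flat layout, the three-angle rotation and all function names are assumptions for illustration; the two resulting vectors would then be passed to the rendering loss together with $I_v$ and $I_w$ respectively.

```python
import numpy as np

# ASSUMPTION: flat layout [alpha | beta | delta | gamma | R | t], R as 3 Euler angles.
SLICES = {
    "alpha": slice(0, 80),
    "beta": slice(80, 160),
    "delta": slice(160, 224),
    "gamma": slice(224, 251),
    "R": slice(251, 254),
    "t": slice(254, 257),
}


def swap_groups(base, source, groups):
    """Copy `base`, then overwrite the named parameter groups with those of `source`."""
    out = base.copy()
    for name in groups:
        out[SLICES[name]] = source[SLICES[name]]
    return out


def edit_and_consist_params(p_v, p_w, p_w_hat, edited=("R",)):
    kept = [name for name in SLICES if name not in edited]
    p_edit = swap_groups(p_v, p_w_hat, edited)    # to be rendered against I_v
    p_consist = swap_groups(p_w, p_w_hat, kept)   # to be rendered against I_w
    return p_edit, p_consist


# Constant vectors make it easy to see where each entry came from.
p_v, p_w, p_w_hat = np.full(257, 1.0), np.full(257, 2.0), np.full(257, 3.0)
p_edit, p_consist = edit_and_consist_params(p_v, p_w, p_w_hat)
```

In `p_edit` only the rotation comes from $p_{\hat{w}}$, so the rendering matches $I_v$ only if the edit was actually transferred; in `p_consist` everything except the rotation comes from $p_{\hat{w}}$, so the rendering matches $I_w$ only if the unedited properties were preserved.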
Siamese Training
- Reverse order of $w$ and $v$, and train again
- Two-way cycle consistency loss
Style Mixing
- StyleGAN : Latent codes at different resolutions are copied from source to target to generate new images.
- Coarse style : Identity, pose
- Medium style : Expression, hair structure, illumination
- Fine style : color scheme
- StyleRig : similar, but more control on expression, illumination and pose
- Pose (rotation) : Coarse latent code
- Expression : Coarse + Medium latent code
- Illumination : Medium + Fine latent code
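Style mixing by copying latent vectors can be sketched as row assignment on the $18 \times 512$ latent code. The index ranges for coarse, medium and fine styles are an assumption here: they follow the usual StyleGAN grouping at $1024 \times 1024$ (two latent vectors per resolution), which these notes do not spell out. `STYLE_GROUPS` and `mix_styles` are illustrative names.

```python
import numpy as np

# ASSUMPTION: grouping of the 18 latent vectors by resolution, two per resolution.
STYLE_GROUPS = {
    "coarse": range(0, 4),   # 4x4 to 8x8
    "medium": range(4, 8),   # 16x16 to 32x32
    "fine": range(8, 18),    # 64x64 to 1024x1024
}


def mix_styles(target, source, groups):
    """Return a copy of `target` whose latent vectors in `groups` come from `source`."""
    mixed = target.copy()
    for group in groups:
        idx = list(STYLE_GROUPS[group])
        mixed[idx] = source[idx]
    return mixed


target = np.zeros((18, 512))
source = np.ones((18, 512))
# The notes associate expression with the coarse + medium latent vectors.
mixed = mix_styles(target, source, ["coarse", "medium"])
```

The mixed code would then be passed through the StyleGAN generator; that step is omitted here because it requires the pretrained model.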
Interactive Rig Control
- Explicit semantic control of StyleGAN-generated images through 3DMM parameters
- UI that allows interaction with a face mesh, whose parameters are fed to RigNet
- RigNet cannot transfer all modes of parametric control to similar changes in StyleGAN
- Ex : in-plane rotation of face mesh is ignored.
- Ex : many expressions do not transfer well
- Why? Bias in the StyleGAN training set
- No in-plane rotations, minimal expression variation, limited lighting
Conditional Image Generation
- Fix pose / expression / illumination and generate images under those conditions.
- Efficient way to generate images
