Introduction
- By optimizing a volumetric scene function from a sparse set of input views, we can synthesize novel views of complex scenes.
- Input : scenes represented as continuous spatial locations $(x, y, z)$ and viewing directions $(\theta, \phi)$
- Output : volume density and view-dependent color at that spatial location.
- Density : differential opacity controlling how much radiance is accumulated by a ray passing through that position.
- Model : MLP, without convolutional layers.
- Because a basic implementation does not converge to a sufficiently high-resolution representation, we use the following:
    - We transform the input coordinates with a positional encoding, allowing the MLP to represent higher-frequency functions (see the sketch after this list).
    - We propose a hierarchical sampling procedure to reduce the number of queries required (also sketched below).
- Using traditional volume rendering techniques, we can project the output colors and densities into synthesized images (see the quadrature sketch under "Volume Rendering with Radiance Field").
- Implicit representation of 3D shapes as level sets
    - Method #1 : map $(x, y, z)$ coordinates to signed distance functions.
    - Method #2 : map $(x, y, z)$ coordinates to occupancy fields.
    - Limited by the requirement of ground-truth 3D geometry (e.g. ShapeNet).
    - Recent work relaxes this requirement with differentiable rendering functions that allow neural implicit shape representations to be optimized using only 2D images.
- Light field sample interpolation
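A minimal sketch of the two additions above, assuming PyTorch; the function names and tensor shapes are illustrative, and the frequency counts follow the paper's defaults ($L = 10$ for $\mathbf{x}$, $L = 4$ for $\mathbf{d}$).

```python
import torch

def positional_encoding(p, L=10):
    # gamma(p) = (sin(2^0 pi p), cos(2^0 pi p), ..., sin(2^{L-1} pi p), cos(2^{L-1} pi p)),
    # applied to each coordinate independently so the MLP can fit high frequencies.
    freqs = (2.0 ** torch.arange(L)) * torch.pi             # [L]
    angles = p[..., None] * freqs                           # [..., dim, L]
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
    return enc.flatten(-2)                                  # [..., dim * 2L]

def sample_pdf(bins, weights, n_fine):
    # Hierarchical sampling: treat the coarse pass's per-segment weights as a
    # piecewise-constant PDF along the ray and draw n_fine extra depths by
    # inverse-transform sampling, concentrating fine queries where matter is.
    # bins: [n_rays, n_bins + 1] segment edges; weights: [n_rays, n_bins].
    pdf = weights / (weights.sum(-1, keepdim=True) + 1e-8)
    cdf = torch.cumsum(pdf, dim=-1)
    cdf = torch.cat([torch.zeros_like(cdf[..., :1]), cdf], dim=-1)
    u = torch.rand(*cdf.shape[:-1], n_fine)                 # uniform draws in [0, 1)
    idx = torch.searchsorted(cdf, u, right=True).clamp(1, cdf.shape[-1] - 1)
    cdf_lo, cdf_hi = torch.gather(cdf, -1, idx - 1), torch.gather(cdf, -1, idx)
    bin_lo, bin_hi = torch.gather(bins, -1, idx - 1), torch.gather(bins, -1, idx)
    t = (u - cdf_lo) / (cdf_hi - cdf_lo + 1e-8)
    return bin_lo + t * (bin_hi - bin_lo)                   # [n_rays, n_fine]
```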
NeRF Representation
- Input :
- 3D Location $\mathbf{x} = (x, y, z)$
- 2D Viewing direction $\mathbf{d} = (\theta, \phi)$
- Output :
- Emitted color $\mathbf{c} = (r, g, b)$
- Volume density $\sigma$
- Network
- $F_\Theta : (\mathbf{x}, \mathbf{d}) \rightarrow (\mathbf{c}, \sigma)$
    - To make the representation multiview-consistent, we restrict the network so that the volume density $\sigma$ is predicted from the location $\mathbf{x}$ alone, regardless of viewing direction.
    - $\mathbf{x} \rightarrow \text{MLP}(\text{8 layers, 256 channels}) \rightarrow (\sigma, \text{feature vector})$
    - $\text{concat}(\text{feature vector}, \mathbf{d}) \rightarrow \text{MLP}(\text{1 layer, 128 channels}) \rightarrow \mathbf{c}$
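A minimal PyTorch sketch of this two-branch network. Layer counts and widths follow the bullets above; the skip connection that re-injects the encoded input partway through the trunk follows the paper's appendix, and `x_dim=60` / `d_dim=24` assume the positional encoding from the introduction.

```python
import torch
import torch.nn as nn

class NeRF(nn.Module):
    # gamma(x) -> 8 fully connected layers (256 channels, ReLU) -> (sigma, feature vector);
    # concat(feature vector, gamma(d)) -> 1 layer (128 channels) -> RGB.
    def __init__(self, x_dim=60, d_dim=24, width=256):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.Linear(x_dim, width)] +
            [nn.Linear(width + (x_dim if i == 5 else 0), width) for i in range(1, 8)]
        )
        self.sigma_head = nn.Linear(width, 1)       # density from position only
        self.feature = nn.Linear(width, width)
        self.rgb_head = nn.Sequential(
            nn.Linear(width + d_dim, 128), nn.ReLU(),
            nn.Linear(128, 3), nn.Sigmoid(),        # colors constrained to [0, 1]
        )

    def forward(self, x_enc, d_enc):
        h = x_enc
        for i, layer in enumerate(self.layers):
            if i == 5:                              # skip connection: re-inject gamma(x)
                h = torch.cat([h, x_enc], dim=-1)
            h = torch.relu(layer(h))
        sigma = torch.relu(self.sigma_head(h))      # nonnegative density
        rgb = self.rgb_head(torch.cat([self.feature(h), d_enc], dim=-1))
        return rgb, sigma
```

Note that $\sigma$ is read off the trunk before the viewing direction is concatenated, which is exactly the multiview-consistency restriction above.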
Volume Rendering with Radiance Field
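- The paper composites the sampled $(\mathbf{c}_i, \sigma_i)$ along each ray with the quadrature rule

$$\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \left(1 - e^{-\sigma_i \delta_i}\right) \mathbf{c}_i, \qquad T_i = \exp\Big(-\sum_{j=1}^{i-1} \sigma_j \delta_j\Big),$$

where $\delta_i$ is the distance between adjacent samples. A minimal per-ray sketch, again assuming PyTorch with illustrative names:

```python
import torch

def render_ray(rgb, sigma, deltas):
    # rgb: [n_samples, 3]; sigma: [n_samples]; deltas: [n_samples]
    # distances between adjacent samples along the ray.
    alpha = 1.0 - torch.exp(-sigma * deltas)           # per-segment opacity
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=0)  # accumulated transmittance
    trans = torch.cat([torch.ones(1), trans[:-1]])     # shift so T_1 = 1
    weights = trans * alpha                            # T_i * (1 - exp(-sigma_i * delta_i))
    return (weights[:, None] * rgb).sum(dim=0)         # expected ray color, shape [3]
```

The same `weights` are what the hierarchical sampling sketch in the introduction consumes as its PDF.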