Novel View Synthesis of Dynamic Scenes with Globally Coherent Depths from a Monocular Camera

Jae Shin Yoon1, Kihwan Kim2, Orazio Gallo2, Hyun Soo Park1, and Jan Kautz2

1University of Minnesota                       2NVIDIA

Figure 1: We present a new method to synthesize an image from arbitrary views and times given a collection of images of a dynamic scene. (Left) A dynamic scene is captured from a monocular camera from the locations V0 to Vk. Each image captures people jumping at each time step (t=0 to t=k). (Middle) A novel view from an arbitrary location between V0 and V1 (denoted as an orange frame) is synthesized with the dynamic contents observed at the time t=k. The estimated depth at Vk is shown in the inset. (Right) For the novel view (orange frame), we can also synthesize the dynamic content that appeared across any views at different times (traces of the foreground at each time step are shown).

Abstract

This paper presents a new method to synthesize an image from arbitrary views and times given a collection of images of a dynamic scene. A key challenge for novel view synthesis arises from dynamic scene reconstruction, where epipolar geometry does not apply to the local motion of dynamic contents. To address this challenge, we propose to combine the depth from single view (DSV) and the depth from multi-view stereo (DMV), where DSV is complete, i.e., a depth is assigned to every pixel, yet view-variant in its scale, while DMV is view-invariant yet incomplete. Our insight is that although its scale and quality are inconsistent with other views, the depth estimation from a single view can be used to reason about the globally coherent geometry of dynamic contents. We cast this problem as learning to correct the scale of DSV, and to refine each depth with locally consistent motions between views to form a coherent depth estimation. We integrate these tasks into a depth fusion network in a self-supervised fashion. Given the fused depth maps, we synthesize a photorealistic virtual view in a specific location and time with our deep blending network that completes the scene and renders the virtual view. We evaluate our method of depth estimation and view synthesis on diverse real-world dynamic scenes and show that it outperforms existing methods.
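
To make the complementary roles of DSV and DMV concrete, the following is a minimal sketch of scale correction, and it is not the paper's learned depth fusion network. It assumes a single global scale factor per view, estimated in closed form by least squares over the pixels where DMV is available, and it fills the missing DMV pixels with the rescaled DSV. The function names `align_scale` and `fuse_depth`, and the use of `None` to mark missing DMV pixels, are hypothetical choices made for illustration.

```python
from typing import List, Optional


def align_scale(dsv: List[float], dmv: List[Optional[float]]) -> float:
    """Least-squares scale s minimizing sum((s * dsv - dmv)^2) over valid DMV pixels.

    Illustrative assumption: one global scale per view. The paper instead
    learns the correction with a depth fusion network.
    """
    num = 0.0
    den = 0.0
    for d_single, d_multi in zip(dsv, dmv):
        if d_multi is None:  # DMV is incomplete: skip pixels with no stereo depth
            continue
        num += d_single * d_multi
        den += d_single * d_single
    if den == 0.0:
        raise ValueError("no valid multi-view depth to align against")
    return num / den


def fuse_depth(dsv: List[float], dmv: List[Optional[float]]) -> List[float]:
    """Keep DMV where it exists; elsewhere use the scale-corrected DSV."""
    s = align_scale(dsv, dmv)
    return [
        d_multi if d_multi is not None else s * d_single
        for d_single, d_multi in zip(dsv, dmv)
    ]


# Toy example: flattened depth maps with four pixels.
dsv_example = [1.0, 2.0, 3.0, 4.0]    # complete, but at an arbitrary scale
dmv_example = [2.0, None, 6.0, None]  # correct scale, but with holes
fused_example = fuse_depth(dsv_example, dmv_example)
```

A single global scale cannot correct the per-region inconsistencies of single-view depth, which is one reason the paper casts the correction and refinement as a learning problem rather than a closed-form fit.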

Paper

Jae Shin Yoon, Kihwan Kim, Orazio Gallo, Hyun Soo Park, and Jan Kautz, "Novel View Synthesis of Dynamic Scenes with Globally Coherent Depths from a Monocular Camera", CVPR 2020 [Paper, PDF_supplementary]


Supplementary Video

10 Minute Overview




Dataset with ground-truth
: Input images of dynamic scenes, foreground masks, camera calibration, and GT of view synthesis (multiview images) and depth estimation

Jumping [dataset]

Skating [dataset]

Truck [dataset]


DynamicFace [dataset]

Umbrella [dataset]

Balloon1 [dataset]


Balloon2 [dataset]

Teadybear [dataset]

Playground [dataset]



Dataset without ground-truth
: Input images of dynamic scenes, foreground masks, and camera calibration

Teatime [dataset]

Feeling [dataset]

Hand [dataset]

Zebra [dataset]




*Full multiview videos can be found here. Note that you need to calibrate and undistort the images.


Reference
: Please cite the following paper if you use our dataset.

@inproceedings{yoon2020dynamic,
title={Novel View Synthesis of Dynamic Scenes with Globally Coherent Depths from a Monocular Camera},
author={Yoon, Jae Shin and Kim, Kihwan and Gallo, Orazio and Park, Hyun Soo and Kautz, Jan},
booktitle={The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
month={June},
year={2020}
}