Cascade-Zero123: One Image to Highly Consistent 3D with Self-Prompted Nearby Views


ECCV 2024

1Department of Computer Science and Engineering, Shanghai Jiao Tong University
2Huawei Inc.
3Huazhong University of Science and Technology
4Department of Electronic Engineering, Shanghai Jiao Tong University
* denotes equal contribution.


Abstract

Synthesizing multi-view 3D content from a single image is a significant and challenging task. Toward this goal, Zero-1-to-3 methods extend a 2D latent diffusion model to the 3D scope, generating a target-view image conditioned on a single source-view image and a relative camera pose. However, the one-to-one generation adopted in Zero-1-to-3 makes it difficult to build geometric and visual consistency across views, especially for complex objects. To tackle this issue, we propose Cascade-Zero123, a cascade generation framework built from two Zero-1-to-3 models, which progressively extracts 3D information from the source image. Specifically, a self-prompting mechanism first generates several nearby views; these views are then fed into the second-stage model together with the source image as generation conditions. With the self-prompted views as supplementary information, Cascade-Zero123 generates novel-view images with considerably higher consistency than Zero-1-to-3. The improvement is significant for various complex and challenging scenes, including insects, humans, transparent objects, stacked objects, etc.

Framework

The architecture of Cascade-Zero123. Cascade-Zero123 consists of two parts. The left part is Base-0123, which takes a set of camera rotations and translations (R, T) as input and generates the corresponding multi-view images. These output images are concatenated with the input condition image and its camera pose, forming a self-prompted input, denoted as a set of c(xc, ∆R, ∆T), for the right part, Refiner-0123. The relative camera pose from each condition image to the target image is recalculated, as shown in the detailed camera pose rotations. After each training iteration, Base-0123 is updated from Refiner-0123 via an exponential moving average (EMA).
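Below is a minimal Python sketch of this two-stage flow and the EMA update, intended only to make the data flow concrete; it is not the authors' released implementation. The Zero123Model interface (generate, generate_multi_cond), the Pose tuple, relative_pose, and nearby_poses are hypothetical placeholders, and the pose condition is simplified to (Δelevation, Δazimuth, Δradius).

# A minimal sketch of the cascade flow described above, not the authors'
# released implementation. `Zero123Model`, `Pose`, `relative_pose`, and the
# `generate*` methods are hypothetical placeholders.
import copy
from typing import List, NamedTuple, Tuple

import torch


class Pose(NamedTuple):
    elevation: float
    azimuth: float
    radius: float


def relative_pose(src: Pose, tgt: Pose) -> Tuple[float, float, float]:
    # Zero-1-to-3 conditions on the relative camera transform; here it is
    # parametrized as (Δelevation, Δazimuth, Δradius).
    return (tgt.elevation - src.elevation,
            tgt.azimuth - src.azimuth,
            tgt.radius - src.radius)


class CascadeZero123:
    def __init__(self, model: "Zero123Model", ema_decay: float = 0.999):
        self.refiner = model                 # Refiner-0123 (trainable)
        self.base = copy.deepcopy(model)     # Base-0123 (EMA copy of the refiner)
        self.ema_decay = ema_decay

    @torch.no_grad()
    def self_prompt(self, x_src, pose_src: Pose, nearby_poses: List[Pose]):
        # Stage 1: Base-0123 renders a few nearby views around the source image.
        return [(self.base.generate(x_src, relative_pose(pose_src, p)), p)
                for p in nearby_poses]

    def generate(self, x_src, pose_src: Pose, pose_tgt: Pose,
                 nearby_poses: List[Pose]):
        # Stage 2: Refiner-0123 is conditioned on the source image plus all
        # self-prompted views; each condition carries its own relative pose
        # toward the target view.
        conditions = [(x_src, relative_pose(pose_src, pose_tgt))]
        for x_p, pose_p in self.self_prompt(x_src, pose_src, nearby_poses):
            conditions.append((x_p, relative_pose(pose_p, pose_tgt)))
        return self.refiner.generate_multi_cond(conditions)

    @torch.no_grad()
    def ema_update(self):
        # After each training iteration, Base-0123 tracks Refiner-0123 via an
        # exponential moving average of the weights.
        for p_base, p_ref in zip(self.base.parameters(),
                                 self.refiner.parameters()):
            p_base.mul_(self.ema_decay).add_(p_ref, alpha=1 - self.ema_decay)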


Rectifying Structural Errors in the Zero-1-to-3

Polishing the Backside Colors of Zero-1-to-3

Addressing the Problem that Transparent or High-brightness Objects are Learned as White Clouds.

Novel View Synthesis Comparison with Zero-1-to-3 and SyncDreamer


Citation

@inproceedings{chen2024cascadezero123,
          author = {Yabo Chen and Jiemin Fang and Yuyang Huang and Taoran Yi and Xiaopeng Zhang and Lingxi Xie and Xinggang Wang and Wenrui Dai and Hongkai Xiong and Qi Tian},
          title = {Cascade-Zero123: One Image to Highly Consistent 3D with Self-Prompted Nearby Views},
          year = {2024},
          booktitle = {ECCV}}