Cascade-Zero123: One Image to Highly Consistent 3D with Self-Prompted Nearby Views


1Department of Computer Science and Engineering, Shanghai Jiao Tong University
2Huawei Inc.
3Huazhong University of Science and Technology
4Department of Electronic Engineering, Shanghai Jiao Tong University
* denotes equal contributions.



Synthesizing multi-view 3D content from a single image is a significant and challenging task. To this end, Zero-1-to-3 methods extend a 2D latent diffusion model to the 3D domain. These approaches generate the target-view image conditioned on a single-view source image and the camera pose. However, the one-to-one manner adopted in Zero-1-to-3 makes it difficult to build geometric and visual consistency across views, especially for complex objects. To tackle this issue, we propose Cascade-Zero123, a cascade generation framework built from two Zero-1-to-3 models that progressively extracts 3D information from the source image. Specifically, a self-prompting mechanism first generates several nearby views. These views are then fed into the second-stage model along with the source image as generation conditions. With the self-prompted views as supplementary information, Cascade-Zero123 generates more consistent novel-view images than Zero-1-to-3. The improvement is significant for various complex and challenging scenes, including insects, humans, transparent objects, and multiple stacked objects.


The architecture of Cascade-Zero123. Cascade-Zero123 consists of two parts. The left part is Base-0123, which takes a set of R and T values as input and generates the corresponding multi-view images. These output images are concatenated with the input condition image and its corresponding camera pose, forming a self-prompted input, denoted as a set of c(x_c, ∆R, ∆T), for the right part, Refiner-0123. The camera pose transition from each condition image to the target image must be recalculated, as shown in the detailed camera pose rotations. After each training iteration, Base-0123 is updated from Refiner-0123 through an exponential moving average (EMA).
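The two bookkeeping steps in the caption, recalculating each condition view's pose transition to the target and updating Base-0123 by EMA, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: poses are assumed to be spherical coordinates (polar, azimuth, radius), model weights are plain dictionaries of float lists, and the names `relative_pose`, `build_self_prompt`, `ema_update` and the `decay` value are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Pose = Tuple[float, float, float]  # assumed layout: (polar, azimuth, radius)


def relative_pose(source: Pose, target: Pose) -> Pose:
    """Pose transition from a condition view to the target view (assumed spherical form)."""
    d_polar = target[0] - source[0]
    # Wrap the azimuth difference into [0, 2*pi) so it stays a valid angle.
    d_azimuth = (target[1] - source[1]) % (2.0 * math.pi)
    d_radius = target[2] - source[2]
    return (d_polar, d_azimuth, d_radius)


def build_self_prompt(views: List[Tuple[str, Pose]], target: Pose) -> List[Tuple[str, Pose]]:
    """Pair every condition image (input plus generated nearby views) with its
    own recalculated transition to the target view."""
    return [(image, relative_pose(pose, target)) for image, pose in views]


def ema_update(base: Dict[str, List[float]],
               refiner: Dict[str, List[float]],
               decay: float = 0.999) -> None:
    """In-place EMA: base <- decay * base + (1 - decay) * refiner."""
    for name, refiner_values in refiner.items():
        base_values = base[name]
        for i, value in enumerate(refiner_values):
            base_values[i] = decay * base_values[i] + (1.0 - decay) * value


# Usage: the input image plus one self-prompted nearby view, both re-posed to the target.
target_pose = (1.0, math.pi / 2, 1.5)
views = [("input.png", (0.5, 0.0, 1.5)), ("nearby_0.png", (0.5, math.pi, 1.5))]
prompt = build_self_prompt(views, target_pose)

# Usage: one EMA step after a training iteration (placeholder weights).
base_weights = {"w": [0.0, 1.0]}
refiner_weights = {"w": [1.0, 1.0]}
ema_update(base_weights, refiner_weights, decay=0.9)
```

Each condition view receives its own transition because the generated nearby views no longer share the input image's camera pose. A decay close to 1 keeps Base-0123 changing slowly relative to Refiner-0123; the value used in the actual training is not stated here.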


Rectifying Structural Errors in Zero-1-to-3

Polishing the Backside Colors of Zero-1-to-3

Addressing the Problem that Transparent or High-brightness Objects are Learned as White Clouds

Novel View Synthesis Comparison with Zero-1-to-3 and SyncDreamer




  author = {Yabo Chen and Jiemin Fang and Yuyang Huang and Taoran Yi and Xiaopeng Zhang and Lingxi Xie and Xinggang Wang and Wenrui Dai and Hongkai Xiong and Qi Tian},
  title = {Cascade-Zero123: One Image to Highly Consistent 3D with Self-Prompted Nearby Views},
  year = {2023},
  journal={arXiv preprint arXiv:2312.04424}