AGG: Amortized Generative 3D Gaussians for Single Image to 3D

TL;DR: We design a novel cascaded generation pipeline that produces 3D Gaussian-based objects without per-instance optimization. Our Amortized Generative 3D Gaussians (AGG) framework involves a coarse generator that predicts a hybrid representation for 3D Gaussians at a low resolution and a super-resolution module that delivers dense 3D Gaussians in the fine stage.

Abstract

Given the growing need for automatic 3D content creation pipelines, various 3D representations have been studied to generate 3D objects from a single image. Due to their superior rendering efficiency, 3D Gaussian splatting-based models have recently excelled in both 3D reconstruction and generation. However, 3D Gaussian splatting approaches for image-to-3D generation are often optimization-based, requiring many computationally expensive score-distillation steps. To overcome these challenges, we introduce an Amortized Generative 3D Gaussian framework (AGG) that instantly produces 3D Gaussians from a single image, eliminating the need for per-instance optimization. Utilizing an intermediate hybrid representation, AGG decomposes the generation of 3D Gaussian locations and other appearance attributes for joint optimization. Moreover, we propose a cascaded pipeline that first generates a coarse representation of the 3D data and later upsamples it with a 3D Gaussian super-resolution module. Our method is evaluated against existing optimization-based 3D Gaussian frameworks and sampling-based pipelines utilizing other 3D representations, where AGG showcases competitive generation abilities both qualitatively and quantitatively while being several orders of magnitude faster.

Method

We introduce a cascaded generation pipeline that produces 3D Gaussian-based objects without per-instance optimization. Our AGG framework pairs a coarse generator, which predicts a hybrid representation for 3D Gaussians at low resolution, with a super-resolution module that delivers dense 3D Gaussians in the fine stage. The model is trained with a rendering loss defined on multi-view images and warmed up with a Chamfer distance loss against 3D Gaussian pseudo labels.
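
As a concrete illustration, the PyTorch-style sketch below combines these two supervision signals. Everything here is an illustrative assumption rather than our released training code: render_fn stands in for a differentiable Gaussian splatting renderer, and the unit loss weights and the warm-up flag are placeholders.

import torch
import torch.nn.functional as F

def chamfer_distance(pred_xyz, gt_xyz):
    """Symmetric Chamfer distance between point sets of shape (B, N, 3) and (B, M, 3)."""
    d = torch.cdist(pred_xyz, gt_xyz) ** 2      # pairwise squared distances, (B, N, M)
    return d.min(dim=2).values.mean() + d.min(dim=1).values.mean()

def training_loss(pred_gaussians, pseudo_xyz, views, render_fn, warmup):
    """Rendering loss over multi-view images, plus a Chamfer warm-up on Gaussian centers."""
    loss = 0.0
    for camera, gt_image in views:              # multi-view supervision
        loss = loss + F.mse_loss(render_fn(pred_gaussians, camera), gt_image)
    if warmup:                                  # early training: align centers with pseudo labels
        loss = loss + chamfer_distance(pred_gaussians["xyz"], pseudo_xyz)
    return loss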

We first use a DINOv2 image encoder to extract essential features from the input image. Two transformers then map learnable query tokens to Gaussian locations and a texture field, respectively. The texture field accepts location queries from the geometry branch, and a decoding MLP converts the interpolated plane features into the remaining Gaussian attributes (e.g., color and opacity).
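
The coarse stage can be sketched roughly as follows. This is a minimal PyTorch sketch under our own assumptions: the triplane texture field, the sample_triplane helper, the number of Gaussians, and all layer sizes are illustrative choices, not the exact architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

def sample_triplane(planes, xyz):
    """Bilinearly sample and sum features from the XY, XZ, and YZ planes at each 3D point.
    planes: (B, 3, R, R, C); xyz: (B, N, 3) with coordinates in [-1, 1]."""
    out = 0.0
    for i, uv in enumerate([xyz[..., [0, 1]], xyz[..., [0, 2]], xyz[..., [1, 2]]]):
        plane = planes[:, i].permute(0, 3, 1, 2)                # (B, C, R, R)
        grid = uv.unsqueeze(2)                                  # (B, N, 1, 2)
        feat = F.grid_sample(plane, grid, align_corners=False)  # (B, C, N, 1)
        out = out + feat.squeeze(-1).transpose(1, 2)            # (B, N, C)
    return out

class CoarseGenerator(nn.Module):
    """Maps image features to low-resolution Gaussian locations plus a triplane texture field."""
    def __init__(self, n_gaussians=2048, dim=384, plane_res=32):
        super().__init__()
        self.plane_res = plane_res
        self.geo_queries = nn.Parameter(torch.randn(n_gaussians, dim))
        self.tex_queries = nn.Parameter(torch.randn(3 * plane_res ** 2, dim))
        make = lambda: nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True), num_layers=4)
        self.geo_transformer, self.tex_transformer = make(), make()
        self.to_xyz = nn.Linear(dim, 3)
        self.attr_mlp = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(),
                                      nn.Linear(dim, 11))       # color(3)+opacity(1)+scale(3)+rotation(4)

    def forward(self, img_tokens):                              # img_tokens: (B, T, dim) DINOv2 features
        B = img_tokens.shape[0]
        geo = self.geo_transformer(self.geo_queries.expand(B, -1, -1), img_tokens)
        xyz = torch.tanh(self.to_xyz(geo))                      # Gaussian centers in [-1, 1]^3
        tex = self.tex_transformer(self.tex_queries.expand(B, -1, -1), img_tokens)
        planes = tex.view(B, 3, self.plane_res, self.plane_res, -1)
        attrs = self.attr_mlp(sample_triplane(planes, xyz))     # query the texture field at the centers
        return xyz, attrs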

Once we obtain the first-stage result and the image features, we fuse them through cross-attention at the intermediate feature maps. Super-resolution is performed at the bottleneck feature map, and the features are then decoded jointly into dense 3D Gaussians.
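
A minimal sketch of this fine stage, again under illustrative assumptions: the single cross-attention block, the token-splitting upsampler, and the 14-channel Gaussian parameterization are placeholders rather than the actual architecture.

import torch
import torch.nn as nn

class GaussianSuperResolution(nn.Module):
    """Fuses coarse-stage Gaussian features with image features and decodes dense Gaussians."""
    def __init__(self, dim=384, n_heads=8, up_factor=4):
        super().__init__()
        self.up_factor = up_factor
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.up = nn.Linear(dim, up_factor * dim)     # each bottleneck token spawns up_factor fine tokens
        self.decoder = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(),
                                     nn.Linear(dim, 14))  # xyz(3)+color(3)+opacity(1)+scale(3)+rotation(4)

    def forward(self, coarse_tokens, img_tokens):
        # Inject image information into the coarse Gaussian features via cross-attention.
        attended, _ = self.cross_attn(coarse_tokens, img_tokens, img_tokens)
        fused = self.norm(coarse_tokens + attended)   # residual connection
        B, N, D = fused.shape
        # Super-resolve at the bottleneck, then decode all fine tokens jointly.
        fine = self.up(fused).view(B, N * self.up_factor, D)
        return self.decoder(fine)                     # dense per-Gaussian parameters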

Results

We evaluate AGG against existing optimization-based 3D Gaussian frameworks and sampling-based pipelines built on other 3D representations. AGG achieves competitive generation quality, both qualitatively and quantitatively, while running several orders of magnitude faster.

Citation

If you want to cite our work, please use:

@article{xu2024agg,
  title={AGG: Amortized Generative 3D Gaussians for Single Image to 3D},
  author={Xu, Dejia and Yuan, Ye and Mardani, Morteza and Liu, Sifei and Song, Jiaming and Wang, Zhangyang and Vahdat, Arash},
  journal={arXiv preprint arXiv:2401.04099},
  year={2024}
}