Idea23D: Collaborative LMM Agents Enable 3D Model Generation from Interleaved Multimodal Inputs


Junhao Chen*1,2 , Xiang Li*3 , Xiaojun Ye4 , Chao Li5 , Zhaoxin Fan†6 , Hao Zhao†1

1 Institute for AI Industry Research (AIR), Tsinghua University
2 Tsinghua Shenzhen International Graduate School, Tsinghua University
3 School of Software and Microelectronics, Peking University
4 College of Computer Science, Zhejiang University
5 College of Computer Science and Technology, Harbin Engineering University
6 Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing, School of Artificial Intelligence, Beihang University
* Indicates Equal Contribution
† Indicates Corresponding Author

📰 News: Our paper has been accepted by COLING 2025! 🎉 See you in Abu Dhabi, UAE, from January 19 to 24, 2025!

The Idea-2-3D framework synergizes Large Multimodal Models (LMMs), Text-to-Image (T2I) models, and Image-to-3D (I23D) models to transform complex multimodal input IDEAs into tangible 3D models. The process begins with the user articulating a high-level 3D design requirement (the IDEA). The LMM then generates textual prompts (Prompt Generation), which are converted into 3D models. These candidates are rendered and assessed in a Multiview Image Generation and Evaluation step, leading to the Selection of an Optimal 3D Model. The Text-to-3D (T-2-3D) prompt is subsequently refined (Feedback Generation) using critiques from GPT-4V. An integrated memory module (see Sec. Memory Module), not depicted here, records each iteration, enabling a multimodal, iterative self-refinement cycle within the framework (see the sketch below).
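The loop below is a minimal sketch of this cycle, not the released implementation: the LMM, the T-2-3D chain (T2I + I23D), the multi-view renderer, and the evaluator are assumed to be supplied as callables, and all names are illustrative.

```python
from typing import Callable, List, Optional


def idea_to_3d(
    idea,                       # interleaved multimodal IDEA (text + images)
    generate_prompt: Callable,  # LMM: (idea, feedback_history, best_prompt) -> text prompt
    text_to_3d: Callable,       # T-2-3D chain (T2I + I23D): text prompt -> 3D model
    render_views: Callable,     # off-the-shelf multi-view renderer: 3D model -> images
    evaluate: Callable,         # LMM: (idea, candidate views, best views) -> (is_better, feedback)
    rounds: int = 5,
):
    """Sketch of the multimodal iterative self-refinement cycle (illustrative only)."""
    feedback_history: List[str] = []   # memory: feedback from all previous rounds
    best_prompt: Optional[str] = None  # memory: best T-2-3D prompt so far
    best_model = None                  # memory: best 3D model so far

    for _ in range(rounds):
        # Prompt Generation: the LMM drafts a T-2-3D prompt from the IDEA and the memory.
        prompt = generate_prompt(idea, feedback_history, best_prompt)
        # T-2-3D: text -> image -> 3D model.
        model = text_to_3d(prompt)
        # Multiview Image Generation and Evaluation, then Selection of an Optimal 3D Model.
        views = render_views(model)
        best_views = render_views(best_model) if best_model is not None else None
        is_better, feedback = evaluate(idea, views, best_views)
        # Feedback Generation: store the critique so the next prompt can be refined.
        feedback_history.append(feedback)
        if is_better:
            best_prompt, best_model = prompt, model

    return best_model, best_prompt
```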

Method


Pipeline Diagram

Overview of the Idea-2-3D framework, which employs an LMM to exploit the T-2-3D model's potential through multimodal iterative self-refinement, producing valid T-2-3D prompts for the user's input IDEA. Green rounded rectangles indicate steps completed by GPT-4V. Purple rounded rectangles indicate T-2-3D modules, comprising the T2I and I23D models. The yellow rounded rectangle indicates the off-the-shelf multi-view rendering algorithm for 3D models. To obtain a better reconstruction, the background of the generated image is removed between steps 2 and 3. Blue indicates the memory module, which stores all feedback from previous rounds, the best 3D model, and the best text prompt, as sketched below.
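As a concrete reading of the memory module described above, the following sketch (illustrative, not the paper's code) keeps exactly the three things the text mentions: the feedback from every previous round, the best 3D model, and the best text prompt.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class IterationRecord:
    """One refinement round: the prompt tried, the resulting model, and the LMM's feedback."""
    prompt: str
    model: Any
    feedback: str


@dataclass
class Memory:
    """Memory module sketch: accumulates per-round records and tracks the best result so far."""
    history: List[IterationRecord] = field(default_factory=list)
    best_prompt: Optional[str] = None
    best_model: Any = None

    def record(self, prompt: str, model: Any, feedback: str, is_better: bool) -> None:
        # Save this round and, if the candidate won the comparison, promote it to "best".
        self.history.append(IterationRecord(prompt, model, feedback))
        if is_better:
            self.best_prompt, self.best_model = prompt, model

    def all_feedback(self) -> List[str]:
        # Feedback from all previous rounds, passed back to the LMM for the next prompt.
        return [r.feedback for r in self.history]
```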

Evaluation Dataset and Showcase




Citation


@article{chen2024idea23d,
  title={Idea-2-3D: Collaborative LMM Agents Enable 3D Model Generation from Interleaved Multimodal Inputs},
  author={Junhao Chen and Xiang Li and Xiaojun Ye and Chao Li and Zhaoxin Fan and Hao Zhao},
  year={2024},
  eprint={2404.04363},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}