SceneAssistant: A Visual Feedback Agent for Open-Vocabulary 3D Scene Generation

MVP Investment

Total: $9K - $12K
Timeline: 6-10 weeks

Engineering: $8,000
Cloud Hosting: $240
SaaS Stack: $300
Domain & Legal: $100

6mo ROI: 2-4x
3yr ROI: 10-20x

Lightweight AI tools can reach profitability quickly. At an average contract of $500/month, 20 customers yield $10K in monthly recurring revenue (MRR) by month 6, growing to 200+ customers by year 3.
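
The revenue arithmetic above can be checked with a short sketch. The function name `monthly_recurring_revenue` is hypothetical, and the year-3 figure assumes "200+" refers to customers and uses 200 as a lower bound; the cost line simply sums the four itemized MVP costs listed above.

```python
def monthly_recurring_revenue(customers: int, avg_contract_usd: float) -> float:
    """MRR = number of paying customers x average monthly contract value."""
    if customers < 0 or avg_contract_usd < 0:
        raise ValueError("customers and contract value must be non-negative")
    return customers * avg_contract_usd


AVG_CONTRACT_USD = 500.0  # from the text: $500/mo average contract

# 20 customers at month 6 (figure from the text).
mrr_6mo = monthly_recurring_revenue(20, AVG_CONTRACT_USD)

# Assumption: "200+" means at least 200 customers by year 3.
mrr_3yr_floor = monthly_recurring_revenue(200, AVG_CONTRACT_USD)

# Sum of the itemized costs: Engineering, Cloud Hosting, SaaS Stack, Domain & Legal.
itemized_cost_usd = 8_000 + 240 + 300 + 100

print(f"MRR at 6 months: ${mrr_6mo:,.0f}")
print(f"MRR floor at 3 years: ${mrr_3yr_floor:,.0f}")
print(f"Itemized MVP cost: ${itemized_cost_usd:,}")
```

The itemized costs sum to $8,640, slightly below the quoted $9K - $12K range, so the range presumably includes a margin that the page does not break down.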

Talent Scout

Jun Luo, Peking University
Jiaxiang Tang, NVIDIA
Ruijie Lu, Peking University
Gang Zeng, Peking University

Founder's Pitch

"SceneAssistant transforms text commands into high-quality 3D scenes with minimal user input."

Category: 3D Content Creation
Score: 8

Commercial Viability Breakdown

Scores are on a 0-10 scale.

High Potential: 7.5 (3/4 signals)
Quick Build: 10 (4/4 signals)
Series A Potential: 10 (4/4 signals)

Sources used for this analysis

arXiv Paper: Full-text PDF analysis of the research paper
GitHub Repository: Code availability, stars, and contributor activity
Citation Network: Semantic Scholar citations and co-citation patterns
Community Predictions: Crowd-sourced unicorn probability assessments

Analysis model: GPT-4o · Last scored: 3/12/2026

Why It Matters

This research streamlines the creation of 3D scenes from text, reducing the manual effort needed in industries like gaming and virtual reality.

Product Angle

To productize, offer SceneAssistant as a cloud service that gives digital artists and developers plug-and-play 3D scene generation through an API.

Disruption

It could replace current labor-intensive methods of 3D content creation, which require expertise in complex software.

Product Opportunity

The gaming, animation, and virtual reality sectors have significant demand for rapid, high-quality scene creation. Studios and content creators would pay for tools that reduce development time and cost.

Use Case Idea

A commercial application could be an online platform where users describe scenes in natural language and the platform generates 3D scenes for games or VR environments on demand.

Science

The paper introduces an agentic framework using Vision-Language Models (VLMs) for open-vocabulary 3D scene generation. By leveraging visual feedback and a suite of action APIs, the system iteratively refines 3D scenes based on natural language descriptions.
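
The iterative loop described above (observe the scene, choose an action, apply it, repeat) can be illustrated with a minimal sketch. This is not the paper's implementation: every name here (`Scene`, `Action`, `add_object`, `move_object`, `render`, `make_stub_critic`, `refine`) is hypothetical, the "render" returns a dictionary snapshot rather than an image, and the critic is a rule-based stub standing in for a VLM that would inspect a rendered view.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple

Vec3 = Tuple[float, float, float]
Observation = Dict[str, Vec3]


@dataclass
class Scene:
    """Toy scene: object name -> position."""
    objects: Dict[str, Vec3] = field(default_factory=dict)


@dataclass
class Action:
    name: str       # key into ACTIONS
    target: str     # object the action applies to
    position: Vec3  # where the object should end up


# Hypothetical action APIs the agent is allowed to call.
def add_object(scene: Scene, target: str, position: Vec3) -> None:
    scene.objects[target] = position


def move_object(scene: Scene, target: str, position: Vec3) -> None:
    if target not in scene.objects:
        raise KeyError(f"cannot move missing object: {target}")
    scene.objects[target] = position


ACTIONS: Dict[str, Callable[[Scene, str, Vec3], None]] = {
    "add": add_object,
    "move": move_object,
}

Critic = Callable[[str, Observation], Optional[Action]]


def render(scene: Scene) -> Observation:
    """Stand-in for rendering; a real system would produce an image."""
    return dict(scene.objects)


def make_stub_critic(target_layout: Dict[str, Vec3]) -> Critic:
    """Rule-based stand-in for a VLM: propose one fix per call, or None when done."""
    def critic(description: str, observation: Observation) -> Optional[Action]:
        for name, position in target_layout.items():
            if name not in observation:
                return Action("add", name, position)
            if observation[name] != position:
                return Action("move", name, position)
        return None
    return critic


def refine(description: str, scene: Scene, critic: Critic, max_steps: int = 10) -> int:
    """Observe, ask the critic for an action, apply it; return the number of actions applied."""
    for step in range(max_steps):
        action = critic(description, render(scene))
        if action is None:
            return step
        ACTIONS[action.name](scene, action.target, action.position)
    return max_steps


scene = Scene(objects={"lamp": (0.0, 0.0, 0.0)})
target = {"table": (0.0, 0.0, 0.0), "lamp": (0.0, 0.0, 0.8)}
steps = refine("a lamp on a table", scene, make_stub_critic(target))
print(steps, scene.objects)
```

Keeping the critic behind a plain callable means the stub could be swapped for a real model call without touching the loop, and `max_steps` bounds the number of refinement rounds if the critic never reports completion.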

Method & Eval

The method is evaluated through a combination of qualitative human evaluation and quantitative benchmarks, which show superior performance over existing methods in spatial accuracy and scene coherence.

Caveats

The approach depends on the inherent capabilities of VLMs, which might not fully capture or interpret user intent in complex scenarios.

Author Intelligence

Jun Luo (lead), Peking University
Jiaxiang Tang, NVIDIA
Ruijie Lu, Peking University
Gang Zeng, Peking University
