State of the Field
The field of generative video is currently focused on enhancing the realism and interactivity of synthesized video, with recent advances addressing critical challenges in human-object interaction and embodied intelligence. New frameworks generate talking avatars that interact with their environments from text prompts, significantly improving the quality of grounded human-object interactions. Concurrently, benchmarks such as RBench are being established to evaluate robotic video generation, highlighting deficiencies in physical realism and driving the creation of large annotated datasets to support model training. Counterfactual video generation is being explored to mitigate hallucinations in video-language models, while memory-augmented video editing aims to maintain consistency across iterative edits. These developments enhance the fidelity and usability of generative video systems and open applications in education, entertainment, and robotics, where high-quality, context-aware video content is increasingly in demand.
Papers
CounterVid: Counterfactual Video Generation for Mitigating Action and Temporal Hallucinations in Video-Language Models
Video-language models (VLMs) achieve strong multimodal understanding but remain prone to hallucinations, especially when reasoning about actions and temporal order. Existing mitigation strategies, suc...
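To make the counterfactual idea concrete, here is a minimal sketch of counterfactual probing for temporal hallucinations. It illustrates the general technique only, not CounterVid's actual pipeline, and `vlm_answer` is a hypothetical stand-in for any video-language model API:

```python
def vlm_answer(frames, question: str) -> str:
    """Placeholder: returns the model's answer for a clip and a question.
    Plug in a real VLM here; this is a hypothetical stand-in."""
    raise NotImplementedError("connect a real video-language model")

def probe_temporal_hallucination(frames, question: str) -> bool:
    """Ask the same order-sensitive question on the clip and on its
    time-reversed counterfactual. If the answers agree, the model is
    likely ignoring temporal order, i.e., hallucinating it."""
    original = vlm_answer(frames, question)
    counterfactual = vlm_answer(frames[::-1], question)  # reversed playback
    return original == counterfactual  # True -> suspected hallucination
```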
Memory-V2V: Augmenting Video-to-Video Diffusion Models with Memory
Recent foundational video-to-video diffusion models have achieved impressive results in editing user-provided videos by modifying appearance, motion, or camera movement. However, real-world video edit...
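As a rough illustration of the memory-augmentation pattern for iterative editing (not Memory-V2V's actual architecture), the sketch below threads a session-level memory of past edits through a hypothetical `edit_fn` diffusion call so that later edits can stay consistent with earlier ones:

```python
from dataclasses import dataclass, field

@dataclass
class EditMemory:
    """Stores past (prompt, result) pairs so each new edit can be
    conditioned on earlier ones, keeping the session consistent."""
    history: list = field(default_factory=list)

    def record(self, prompt, result):
        self.history.append((prompt, result))

    def context(self, k: int = 3):
        # Naive retrieval: hand the model the k most recent edits.
        return self.history[-k:]

def iterative_edit(video, prompts, edit_fn, memory=None):
    """Apply a sequence of edit prompts, feeding memory of prior edits
    into each call. `edit_fn(video, prompt, context)` is hypothetical."""
    memory = memory or EditMemory()
    current = video
    for prompt in prompts:
        current = edit_fn(current, prompt, memory.context())
        memory.record(prompt, current)
    return current
```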
Making Avatars Interact: Towards Text-Driven Human-Object Interaction for Controllable Talking Avatars
Generating talking avatars is a fundamental task in video generation. Although existing methods can generate full-body talking avatars with simple human motion, extending this task to grounded human-o...
Rethinking Video Generation Model for the Embodied World
Video generation models have significantly advanced embodied intelligence, unlocking new possibilities for generating diverse robot data that capture perception, reasoning, and action in the physical ...
ShareVerse: Multi-Agent Consistent Video Generation for Shared World Modeling
This paper presents ShareVerse, a video generation framework enabling multi-agent shared world modeling, addressing a gap left by existing works, which lack support for unified shared world construction wi...
PrevizWhiz: Combining Rough 3D Scenes and 2D Video to Guide Generative Video Previsualization
In pre-production, filmmakers and 3D animation experts must rapidly prototype ideas to explore a film's possibilities before full-scale production, yet conventional approaches involve trade-offs in eff...
Morphe: High-Fidelity Generative Video Streaming with Vision Foundation Model
Video streaming is a fundamental Internet service, yet its quality still cannot be guaranteed, especially under poor network conditions such as those in bandwidth-constrained or remote areas. Existing works mai...
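The toy sketch below shows the general generative-streaming trade-off: the sender transmits only every k-th frame, and the client reconstructs the rest. Linear interpolation stands in for the generative model purely as a placeholder; Morphe's actual vision-foundation-model reconstruction is not shown:

```python
import numpy as np

def sender(frames: np.ndarray, k: int = 4) -> np.ndarray:
    """Keep every k-th frame, cutting bandwidth roughly by a factor of k.
    `frames` is assumed to be a float array of shape (T, H, W, C)."""
    return frames[::k]

def receiver(keyframes: np.ndarray, k: int = 4) -> np.ndarray:
    """Reconstruct the skipped frames between consecutive keyframes.
    A real system would swap this blend for a generative model."""
    out = []
    for a, b in zip(keyframes[:-1], keyframes[1:]):
        for t in range(k):
            out.append((1 - t / k) * a + (t / k) * b)
    out.append(keyframes[-1])
    return np.stack(out)
```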
PaperTok: Exploring the Use of Generative AI for Creating Short-form Videos for Research Communication
The dissemination of scholarly research is critical, yet researchers often lack the time and skills to create engaging content for popular media such as short-form videos. To address this gap, we expl...
SkeletonGaussian: Editable 4D Generation through Gaussian Skeletonization
4D generation has made remarkable progress in synthesizing dynamic 3D objects from input text, images, or videos. However, existing methods often represent motion as an implicit deformation field, whi...
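As a hedged sketch of the skeletonization idea, the snippet below applies linear blend skinning to Gaussian centers, making motion an explicit, editable quantity rather than an implicit deformation field. The shapes and weighting scheme are illustrative assumptions, not SkeletonGaussian's exact formulation:

```python
import numpy as np

def skin_gaussian_centers(centers, bone_transforms, skin_weights):
    """Deform Gaussian means with linear blend skinning (LBS).
    centers: (N, 3) Gaussian means in the rest pose.
    bone_transforms: (B, 4, 4) per-bone rigid transforms for this frame.
    skin_weights: (N, B) convex weights tying each Gaussian to the bones.
    Returns the deformed (N, 3) centers."""
    homo = np.concatenate([centers, np.ones((len(centers), 1))], axis=1)  # (N, 4)
    # Blend each bone's transform per Gaussian: (N, 4, 4)
    blended = np.einsum("nb,bij->nij", skin_weights, bone_transforms)
    deformed = np.einsum("nij,nj->ni", blended, homo)
    return deformed[:, :3]
```

Editing a pose then reduces to changing a handful of bone transforms, instead of retraining or re-fitting a dense deformation field.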
Flow caching for autoregressive video generation
Autoregressive models, often built on Transformer architectures, represent a powerful paradigm for generating ultra-long videos by synthesizing content in sequential chunks. However, this sequential g...
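A minimal sketch of the caching pattern for chunked autoregressive generation, assuming a hypothetical `denoise_frame` callable: consecutive chunks overlap, and the frames in the overlap hit a cache instead of being re-denoised. The paper's actual flow-caching scheme may differ from this simplified per-frame version:

```python
def generate_video(num_frames, chunk, overlap, denoise_frame):
    """Generate `num_frames` latents in chunks of `chunk` frames, with each
    chunk overlapping the previous one by `overlap` frames (chunk > overlap).
    Frames in the overlap region are served from the cache."""
    cache = {}  # frame index -> latent
    start = 0
    while start < num_frames:
        end = min(start + chunk, num_frames)
        for i in range(start, end):
            if i not in cache:  # cache hit in the overlap region skips work
                context = [cache[j] for j in range(i)]  # frames generated so far
                cache[i] = denoise_frame(i, context)
        if end == num_frames:
            break
        start += chunk - overlap
    return [cache[i] for i in range(num_frames)]
```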