State of the Field
Recent advancements in robotics AI are increasingly focused on enhancing the generalization and efficiency of robotic manipulation. Researchers are developing frameworks that integrate multimodal inputs, such as vision and tactile feedback, to improve robots' ability to understand and interact with complex environments. For instance, new models address the challenge of "Information Collapse," where robots rely too heavily on visual cues, by incorporating language instructions into their decision-making; this shift is crucial for enabling robots to follow diverse commands in real-world settings.

Innovative training methods, such as reinforcement learning inside learned world models, are also showing promise in reducing reliance on extensive expert demonstrations and improving real-world performance. These developments enhance robot capabilities in tasks like bimanual coordination and pave the way for more efficient, adaptable systems that operate effectively in dynamic environments, with potential applications ranging from manufacturing to healthcare.
Papers
Generalizable Geometric Prior and Recurrent Spiking Feature Learning for Humanoid Robot Manipulation
Humanoid robot manipulation is a crucial research area for executing diverse human-level tasks, involving high-level semantic reasoning and low-level action generation. However, precise scene understanding...
BayesianVLA: Bayesian Decomposition of Vision Language Action Models via Latent Action Queries
Vision-Language-Action (VLA) models have shown promise in robot manipulation but often struggle to generalize to new instructions or complex multi-task scenarios. We identify a critical pathology in c...
Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning
Recent video generation models demonstrate a remarkable ability to capture complex physical interactions and scene evolution over time. To leverage their spatiotemporal priors, robotics works have adapted...
ViTaS: Visual Tactile Soft Fusion Contrastive Learning for Visuomotor Learning
Tactile information plays a crucial role in human manipulation tasks and has recently garnered increasing attention in robotic manipulation. However, existing approaches mostly focus on the alignment ...
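The excerpt above does not spell out the architecture, but "visual tactile fusion contrastive learning" generally implies pulling paired visual and tactile embeddings together in a shared space. As a minimal, generic sketch of that idea only (not the ViTaS method; the encoders, batch pairing, and temperature below are assumptions), a symmetric InfoNCE loss in PyTorch might look like this:

```python
# Generic sketch: contrastive alignment of paired visual and tactile embeddings.
# NOT the ViTaS architecture, whose details are not given in the excerpt.
import torch
import torch.nn.functional as F

def info_nce(visual_emb: torch.Tensor, tactile_emb: torch.Tensor, temperature: float = 0.07):
    """Symmetric InfoNCE over a batch of paired (visual, tactile) embeddings."""
    v = F.normalize(visual_emb, dim=-1)           # (B, D)
    t = F.normalize(tactile_emb, dim=-1)          # (B, D)
    logits = v @ t.T / temperature                # (B, B) cross-modal similarities
    targets = torch.arange(v.size(0), device=v.device)
    # Matched (visual_i, tactile_i) pairs are positives; all other pairs are negatives.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

# Hypothetical usage with placeholder encoders:
# loss = info_nce(visual_encoder(images), tactile_encoder(tactile_frames))
```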
Pretrained Vision-Language-Action Models are Surprisingly Resistant to Forgetting in Continual Learning
Continual learning is a long-standing challenge in robot policy learning, where a policy must acquire new skills over time without catastrophically forgetting previously learned ones. While prior work...
World-Gymnast: Training Robots with Reinforcement Learning in a World Model
Robot learning from interacting with the physical world is fundamentally bottlenecked by the cost of physical interaction. The two alternatives, supervised finetuning (SFT) from expert demonstrations ...
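The excerpt frames world-model RL as a way around the cost of physical interaction and expert demonstrations. Below is a minimal, Dreamer-style sketch of that general idea, training a policy on imagined latent rollouts; it is not the World-Gymnast algorithm, and all modules, dimensions, and hyperparameters are hypothetical placeholders:

```python
# Minimal sketch: optimize a policy on imagined rollouts inside a learned
# world model instead of through physical interaction. Generic illustration only.
import torch
import torch.nn as nn

LATENT, ACTION = 32, 8  # hypothetical dimensions

class WorldModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.dynamics = nn.Linear(LATENT + ACTION, LATENT)  # z' = f(z, a)
        self.reward = nn.Linear(LATENT, 1)                   # r  = g(z)

    def step(self, z, a):
        return self.dynamics(torch.cat([z, a], dim=-1))

world_model = WorldModel()
policy = nn.Sequential(nn.Linear(LATENT, 64), nn.Tanh(), nn.Linear(64, ACTION))
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

def imagined_return(z, horizon=15):
    """Roll the policy forward purely in the world model's latent space."""
    total = 0.0
    for _ in range(horizon):
        a = torch.tanh(policy(z))                 # act in imagination
        z = world_model.step(z, a)                # predicted next latent state
        total = total + world_model.reward(z).mean()
    return total

z0 = torch.randn(16, LATENT)                      # batch of starting latents
opt.zero_grad()
loss = -imagined_return(z0)                       # maximize imagined return
loss.backward()
opt.step()
```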
Attention-Based Neural-Augmented Kalman Filter for Legged Robot State Estimation
In this letter, we propose an Attention-Based Neural-Augmented Kalman Filter (AttenNKF) for state estimation in legged robots. Foot slip is a major source of estimation error: when slip occurs, kinematic...
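The excerpt identifies foot slip as the point where kinematic measurements become unreliable. One common pattern for neural augmentation of a Kalman filter is to let a learned module inflate the measurement noise when slip is suspected; the toy update below illustrates only that general pattern (it is not the AttenNKF formulation, and the slip score, dimensions, and inflation factor are assumptions):

```python
# Toy illustration: a Kalman measurement update whose noise covariance is
# inflated by a learned slip score, so unreliable kinematic measurements are
# down-weighted. Generic sketch, NOT the AttenNKF formulation.
import numpy as np

def kalman_update(x, P, z, H, R_base, slip_score):
    """Standard Kalman update; slip_score in [0, 1] inflates measurement noise."""
    R = R_base * (1.0 + 10.0 * slip_score)        # hypothetical noise inflation
    y = z - H @ x                                  # innovation
    S = H @ P @ H.T + R                            # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)                 # Kalman gain
    x_new = x + K @ y
    P_new = (np.eye(len(x)) - K @ H) @ P
    return x_new, P_new

# Hypothetical usage: slip_score would come from an attention/NN module fed
# with IMU and joint signals; here it is just a constant for illustration.
x = np.zeros(2); P = np.eye(2)
z = np.array([0.1]); H = np.array([[1.0, 0.0]]); R_base = np.eye(1) * 0.01
x, P = kalman_update(x, P, z, H, R_base, slip_score=0.8)
```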
BiManiBench: A Hierarchical Benchmark for Evaluating Bimanual Coordination of Multimodal Large Language Models
Multimodal Large Language Models (MLLMs) have significantly advanced embodied AI, and using them to benchmark robotic intelligence has become a pivotal trend. However, existing frameworks remain predominantly...
Affordances Enable Partial World Modeling with LLMs
Full models of the world require complex knowledge of immense detail. While pre-trained large models have been hypothesized to contain similar knowledge due to extensive pre-training on vast amounts of...
Off-Policy Actor-Critic with Sigmoid-Bounded Entropy for Real-World Robot Learning
Deploying reinforcement learning in the real world remains challenging due to sample inefficiency, sparse rewards, and noisy visual observations. Prior work leverages demonstrations and human feedback...
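The title suggests bounding the entropy term of an off-policy actor-critic objective with a sigmoid. As a rough illustration of how such a bound could appear in a SAC-style actor loss (not the paper's objective; the network shapes, alpha, and scale below are hypothetical), the entropy bonus is squashed so its magnitude cannot grow without limit:

```python
# Rough sketch: SAC-style actor loss with a sigmoid-bounded entropy bonus.
# Illustration of the idea suggested by the title, NOT the paper's method.
import torch
import torch.nn as nn

OBS, ACT = 10, 3                                  # hypothetical dimensions
mean_head = nn.Linear(OBS, ACT)                   # Gaussian policy mean
log_std = nn.Parameter(torch.zeros(ACT))          # Gaussian policy log-std
q_net = nn.Linear(OBS + ACT, 1)                   # toy critic

def actor_loss(obs, alpha=0.2, scale=5.0):
    dist = torch.distributions.Normal(mean_head(obs), log_std.exp())
    action = dist.rsample()                       # reparameterized sample
    log_prob = dist.log_prob(action).sum(-1)
    q = q_net(torch.cat([obs, action], dim=-1)).squeeze(-1)
    # Bounded entropy bonus: sigmoid keeps the term in (0, scale) rather than
    # letting -log_prob dominate the objective under noisy observations.
    entropy_bonus = scale * torch.sigmoid(-log_prob)
    return (-(q + alpha * entropy_bonus)).mean()

loss = actor_loss(torch.randn(16, OBS))
loss.backward()
```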