Google’s DeepMind Reveals New Methods For Training Robots With Video And Large Language Models


2024 is poised to be a groundbreaking year for the intersection of generative AI, large foundational models, and robotics. Google’s DeepMind Robotics researchers have unveiled ongoing research aimed at providing robots with a better understanding of human expectations.

Key Takeaway

Google’s DeepMind Robotics team is pioneering the use of large language models and video input to train robots, showcasing significant advancements in robotics and AI integration.

AutoRT: Harnessing Large Foundational Models

Traditionally, robots have been limited to performing a single task over and over. The newly announced AutoRT from DeepMind is designed to put large foundational models to work in several roles at once. The system manages a fleet of camera-equipped robots and uses a Visual Language Model (VLM) to give each one situational awareness: an understanding of its environment and the objects within it. A large language model then suggests tasks that the hardware can actually accomplish, reducing the need to hard-code skills.
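The division of labor described above can be illustrated with a minimal Python sketch. It is not DeepMind's code: the Robot class, the function names, and the skill templates are all hypothetical, and the two model calls are replaced by trivial stand-ins so that the flow (perceive, propose, filter by hardware capability) can be run end to end.

```python
from dataclasses import dataclass


@dataclass
class Robot:
    name: str
    skills: set  # hypothetical names of skills this hardware supports


def describe_scene(camera_frame):
    """Stand-in for a VLM call. Here a 'frame' is assumed to already be
    a list of object labels; we only deduplicate and sort them."""
    return sorted(set(camera_frame))


def propose_tasks(objects):
    """Stand-in for an LLM call. Pairs every object with every skill
    template (hypothetical) to produce candidate tasks."""
    templates = ["pick", "wipe", "open"]
    return [(skill, obj) for obj in objects for skill in templates]


def feasible_tasks(robot, tasks):
    """Keep only tasks whose skill the robot's hardware supports."""
    return [task for task in tasks if task[0] in robot.skills]


def assign(fleet, frames):
    """Run perceive -> propose -> filter for every robot in the fleet."""
    plan = {}
    for robot in fleet:
        objects = describe_scene(frames[robot.name])
        candidates = propose_tasks(objects)
        plan[robot.name] = feasible_tasks(robot, candidates)
    return plan
```

Calling assign([Robot("r1", {"pick"})], {"r1": ["cup", "cup", "sponge"]}) keeps only the pick tasks for that robot. In a real system the two stand-ins would be replaced by actual model calls, and the filtering step would likely involve safety checks as well as capability checks.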

RT-Trajectory: Leveraging Video Input for Robotic Learning

DeepMind has also introduced RT-Trajectory, which uses video input for robotic learning. The approach overlays a two-dimensional sketch of the arm's motion on the video, giving the model practical visual hints as it learns its robot-control policies. In testing across 41 tasks, training with RT-Trajectory reached a 63% success rate, more than double the 29% achieved by the baseline it was compared against.
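To make the idea of overlaying a two-dimensional sketch on a frame concrete, here is a small Python illustration. It is only a sketch under stated assumptions, not the RT-Trajectory method itself: a frame is represented as a plain grid of integers, the waypoints are hypothetical (row, column) positions that are assumed to lie inside the frame, and the function name overlay_trajectory is invented for this example.

```python
def overlay_trajectory(frame, waypoints, marker=255):
    """Return a copy of `frame` with straight segments drawn between
    consecutive waypoints. Waypoints are assumed to be inside the frame."""
    out = [row[:] for row in frame]  # copy so the input frame is untouched
    for (r0, c0), (r1, c1) in zip(waypoints, waypoints[1:]):
        # Sample enough points that no pixel along the segment is skipped.
        steps = max(abs(r1 - r0), abs(c1 - c0), 1)
        for i in range(steps + 1):
            r = r0 + round((r1 - r0) * i / steps)
            c = c0 + round((c1 - c0) * i / steps)
            out[r][c] = marker
    return out
```

Applied to every frame of a video clip, a routine like this would produce images in which the intended path is visible alongside the scene, which is the kind of visual hint the paragraph above describes.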