Google DeepMind introduces Robotics Transformer 2, a vision-language-action AI model for robot control

TL;DR:

  • Google DeepMind introduces Robotics Transformer 2 (RT-2), a VLA AI model for robot control.
  • RT-2 uses a fine-tuned large language model (LLM) to generate motion commands.
  • It outperforms baseline models by up to 3x in emergent skill evaluations.
  • RT-2 comes in two variants, based on PaLM-E and PaLI-X, with 12B and 55B parameters, respectively.
  • The model learns to output a new “language” of motion commands, taking a user command and an image of the robot’s workspace as input.
  • Google’s prior LLM-based systems include SayCan and Code-as-Policies.
  • RT-2 builds on RT-1’s concept of direct robot command output.
  • It excels in symbol understanding, reasoning, and human recognition tasks, surpassing baseline performance.
  • The model, however, does not acquire new physical skills beyond those present in its robot training data.

Main AI News:

Google DeepMind has unveiled Robotics Transformer 2 (RT-2), a cutting-edge vision-language-action (VLA) AI model for robot control. RT-2 harnesses a fine-tuned large language model (LLM) to generate precise motion control commands, opening the door to a new class of robotic capabilities. What sets RT-2 apart is its ability to execute tasks that were not explicitly part of its training data, improving on baseline models by up to 3x when evaluated on emergent skills.

DeepMind trained two variants of RT-2. The first is a 12-billion-parameter version built on PaLM-E, while the second, larger model packs 55 billion parameters and is built on PaLI-X. The LLM at the heart of RT-2 is co-fine-tuned on a blend of general vision-language datasets and robot-specific data. In essence, RT-2 learns a novel language of robot motion commands, expressed as a string of integers. This approach enables the model to receive an image of the robot’s workspace together with a user command, such as “retrieve the teetering bag from the table’s edge,” and generate the motion commands needed to fulfill the task, as sketched below.
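To make the integer-string action format concrete, here is a minimal sketch of how such a tokenization could work: a continuous end-effector action is clipped, discretized into a fixed number of bins per dimension, and emitted as a space-separated string of integers that a language model can produce as ordinary text. The action layout, the bin count, the ranges, and the helper names below are illustrative assumptions, not DeepMind’s published implementation.

```python
import numpy as np

# Assumed action layout: termination flag, 3 position deltas, 3 rotation
# deltas, gripper command. Bin count and ranges are illustrative choices.
NUM_BINS = 256
ACTION_LOW = np.array([0.0, -0.1, -0.1, -0.1, -0.5, -0.5, -0.5, 0.0])
ACTION_HIGH = np.array([1.0, 0.1, 0.1, 0.1, 0.5, 0.5, 0.5, 1.0])

def action_to_token_string(action: np.ndarray) -> str:
    """Discretize a continuous action vector into a string of integer tokens."""
    clipped = np.clip(action, ACTION_LOW, ACTION_HIGH)
    normalized = (clipped - ACTION_LOW) / (ACTION_HIGH - ACTION_LOW)
    bins = np.round(normalized * (NUM_BINS - 1)).astype(int)
    return " ".join(str(b) for b in bins)

def token_string_to_action(tokens: str) -> np.ndarray:
    """Invert the mapping: recover an approximate continuous action."""
    bins = np.array([int(t) for t in tokens.split()], dtype=float)
    normalized = bins / (NUM_BINS - 1)
    return ACTION_LOW + normalized * (ACTION_HIGH - ACTION_LOW)

# Example: a small translation with the gripper half closed.
action = np.array([0.0, 0.02, 0.05, 0.0, 0.0, 0.0, 0.0, 0.5])
tokens = action_to_token_string(action)
print(tokens)                          # integer-token string the model could emit
print(token_string_to_action(tokens))  # reconstructed (approximate) action
```

The key point of the design is that the robot’s action space is folded into the model’s ordinary text vocabulary, so the same decoder that answers visual questions can also emit executable commands.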

This marks yet another stride in Google Robotics and DeepMind’s ongoing work on leveraging LLMs for robot control. As we reported in 2022, Google unveiled SayCan, an LLM-powered system that crafts high-level action plans for robots, alongside Code-as-Policies, which uses an LLM to generate Python code for robot control. Both systems rely on text-only LLMs to process user input, with vision handled by separate robot modules. Earlier this year, we covered Google’s PaLM-E, a system that handles multimodal input from robotic sensors and translates it into a series of high-level action steps.

RT-2 evolves from its predecessor, RT-1, and follows the same core principle of training the model to directly output robot commands, a significant departure from prior efforts that produced higher-level abstractions of motion. Both RT-2 and RT-1 share the same input format: an image and a textual task description. However, where RT-1 relied on a separate vision module to generate visual tokens that were then fed into its policy model, RT-2 takes a more streamlined approach by using a single unified vision-language model, such as PaLM-E, as the sketch below illustrates.
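As a rough illustration of that architectural difference, the sketch below contrasts the two pipelines at the interface level: both map an image and an instruction to an action-token string, but the RT-1-style path assumes a separate vision encoder feeding a policy network, while the RT-2-style path assumes a single vision-language-model call. All class and function names here are hypothetical stand-ins, not DeepMind’s actual APIs.

```python
from typing import Protocol

class VisionEncoder(Protocol):
    def encode(self, image: bytes) -> list[float]: ...

class PolicyModel(Protocol):
    def predict(self, visual_tokens: list[float], instruction: str) -> str: ...

class VisionLanguageModel(Protocol):
    def generate(self, image: bytes, prompt: str) -> str: ...

def rt1_style_step(image: bytes, instruction: str,
                   encoder: VisionEncoder, policy: PolicyModel) -> str:
    """RT-1-style pipeline: a separate vision module produces visual tokens
    that are fed, together with the instruction, into the policy model."""
    visual_tokens = encoder.encode(image)
    return policy.predict(visual_tokens, instruction)

def rt2_style_step(image: bytes, instruction: str,
                   vlm: VisionLanguageModel) -> str:
    """RT-2-style step: one vision-language model consumes the image and the
    instruction directly and emits the action as an integer-token string."""
    return vlm.generate(image, prompt=instruction)
```

Collapsing the pipeline into one model is what lets knowledge from web-scale vision-language pre-training flow directly into the action outputs.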

DeepMind subjected RT-2 to a rigorous evaluation of more than 6,000 trials to scrutinize its emergent capabilities. Specifically, the research team focused on tasks that were absent from the robot-specific training data but arise from the model’s vision-language pre-training. RT-2 was tested on three task categories: symbol understanding, reasoning, and human recognition. When pitted against baselines, RT-2 achieved an average success rate more than three times that of its closest rival. Nevertheless, it is worth noting that the model did not acquire any physical skills beyond those present in the robot’s training data.

Conclusion:

DeepMind’s RT-2 presents a groundbreaking leap in robot control, leveraging vision-language AI to redefine robotic capabilities. This advancement signifies a pivotal shift towards more adaptable and precise robot interactions, with potential implications for industries reliant on autonomous systems, from manufacturing to healthcare and beyond. Businesses should monitor and explore the integration of such AI models to enhance operational efficiency and expand the horizons of automation.

Source