Robotic Transformer 2 (RT-2) is a novel vision-language-action (VLA) model that learns from both web and robotics data and translates this knowledge into generalized instructions for robotic control
High-capacity visual language models (VLMs) are trained on web-scale datasets, making these systems extremely good at recognizing visual or linguistic patterns and operating across different languages. But for robots to achieve a similar level of competence, they will need to collect robot data, first-hand, on every object, environment, task and situation.
In our paper, we present Robotic Transformer 2 (RT-2), a novel vision-language-action (VLA) model that learns from both web and robotics data and translates this knowledge into generalized instructions for robotic control, while retaining web-scale capabilities.
A Visual Language Model (VLM) pretrained on web-scale data learns from RT-1 robotics data to become RT-2, a Visual Language Action Model (VLA) that can control a robot.
This project is based on Robotic Transformer 1 (RT-1), a model trained on multi-task demonstrations that can learn combinations of tasks and objects seen in robotic data. More specifically, our work used RT-1 robot demonstration data collected with 13 robots over 17 months in an office kitchen environment.
RT-2 shows improved generalization capabilities and semantic and visual understanding beyond the robotic data it was exposed to. This includes interpreting new commands and responding to user commands by performing rudimentary reasoning, such as reasoning about classes of objects or high-level descriptions.
We also show that incorporating chain-of-thought reasoning allows RT-2 to perform multi-stage semantic reasoning, such as deciding which object could be used as an improvised hammer (a rock), or which type of drink is best for a tired person (an energy drink).
Adapting VLMs for robotic control
RT-2 is based on VLMs that take one or more images as input and produce a sequence of tokens that, conventionally, represent natural language text. Such VLMs have been successfully trained on web-scale data to perform tasks such as visual question answering, image captioning, or object recognition. In our work, we adapt the Pathways Language and Image model (PaLI-X) and Pathways Language model Embodied (PaLM-E) to act as the backbones of RT-2.
To control a robot, it must be trained to output actions. We address this challenge by representing actions as tokens in the model's output – similar to language tokens – and describe actions as strings that can be processed by standard natural language tokenizers, shown here:
Representation of an action string used in RT-2 training. An example of such a string could be a sequence of robot action token numbers, e.g. “1 128 91 241 5 101 127 217”.
The string begins with a flag indicating whether to continue or terminate the current episode, without executing the subsequent commands, and is followed by commands to change the position and rotation of the end effector, as well as the desired extension of the robot gripper.
We use the same discretized version of robot actions as in RT-1 and show that converting it to a string representation makes it possible to train VLM models on robot data – as the input and output spaces of such models need not be changed.
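As a concrete illustration, the sketch below shows how a single robot action could be discretized into integer bins and serialized as a token string of the kind shown above. The bin count, value ranges, and dimension ordering are assumptions for illustration, not the released RT-2 tokenizer.

```python
import numpy as np

# Illustrative sketch (not the RT-2 codebase): discretize one robot action into
# integer bins per dimension and serialize it as a whitespace-separated token
# string, mirroring the action representation described above. The bin count
# and value ranges are assumptions for illustration.

N_BINS = 256  # assumed number of discretization bins per action dimension


def encode_action(terminate, delta_position, delta_rotation, gripper_extension,
                  low=-1.0, high=1.0):
    """Map a continuous robot action to a string of integer tokens."""
    values = np.array([*delta_position, *delta_rotation, gripper_extension])
    # Clip to the assumed range and quantize each dimension into N_BINS bins.
    bins = np.round((np.clip(values, low, high) - low) / (high - low) * (N_BINS - 1))
    tokens = [int(terminate)] + bins.astype(int).tolist()
    return " ".join(str(t) for t in tokens)


def decode_action(action_string, low=-1.0, high=1.0):
    """Invert encode_action: recover the flag and approximate continuous values."""
    tokens = [int(t) for t in action_string.split()]
    values = np.array(tokens[1:]) / (N_BINS - 1) * (high - low) + low
    return bool(tokens[0]), values[:3], values[3:6], values[6]


# Produces a token string in the same format as the example above,
# e.g. "1 128 92 240 5 101 128 217".
print(encode_action(True, (0.0, -0.28, 0.88), (-0.96, -0.21, 0.0), 0.7))
```

Because these action tokens are plain text, they can pass through the same tokenizer and output head the VLM already uses for language.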
RT-2 architecture and training: we co-fine-tune a pre-trained VLM model on robotics and web data. The resulting model takes in robot camera images and directly predicts actions for the robot to perform.
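To make the co-fine-tuning idea more concrete, here is a minimal sketch of how robot timesteps and web vision-language examples could be cast into one shared (image, prompt, target text) format. The class, function names, and prompt wording are hypothetical, not part of RT-2's released code.

```python
from dataclasses import dataclass
from typing import Any

# Hypothetical sketch: robot episodes and web vision-language examples cast into
# one (image, prompt, target text) format, so a single VLM can be co-fine-tuned
# on both without changing its input or output spaces. Names are illustrative.


@dataclass
class VLMExample:
    image: Any          # camera frame or web image
    prompt: str         # instruction or question fed to the VLM
    target_text: str    # answer text, or a serialized action token string


def robot_step_to_example(image, instruction, action_string):
    """A robot timestep: the target is the serialized action token string."""
    return VLMExample(image, f"What should the robot do to {instruction}?", action_string)


def web_vqa_to_example(image, question, answer):
    """A web VQA example: the target is ordinary natural-language text."""
    return VLMExample(image, question, answer)

# During co-fine-tuning, batches mix both kinds of examples, and the VLM is
# trained with its usual next-token objective on target_text in either case.
```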
Generalization and emergent skills
We performed a series of qualitative and quantitative experiments on our RT-2 models, across more than 6,000 robotic trials. Exploring the emergent capabilities of RT-2, we first searched for tasks that require combining knowledge from web-scale data with the robot’s experience, and then defined three skill categories: symbol understanding, reasoning, and human recognition.
Each task required understanding visual-semantic concepts and the ability to perform robotic control to operate on those concepts. Commands such as “pick up the bag about to fall off the table” or “move the banana to the sum of two plus one” – where the robot is asked to perform a manipulation task on objects or scenarios never seen in the robotic data – required knowledge translated from web-based data in order to operate.
Examples of emergent robotic skills that are not present in the robotics data and require knowledge transfer from web-based pre-training.
Across all categories, we observed increased generalization performance (more than a 3x improvement) compared with previous baselines, such as prior RT-1 models and models like Visual Cortex (VC-1), which were pre-trained on large visual datasets.
Success rates of emergent skill evaluations: our RT-2 models outperform both the previous robotics transformer (RT-1) and visual pre-training (VC-1) baselines.
We also performed a series of quantitative evaluations, beginning with the original RT-1 tasks, for which we have examples in the robot data, and continuing with varying degrees of objects, backgrounds, and environments previously unseen by the robot, which required generalization learned from VLM pre-training.
Examples of environments not previously seen by the robot, where RT-2 generalizes to new situations.
RT-2 retained performance on the original tasks seen in the robot data and improved performance on scenarios not previously seen by the robot, from RT-1’s 32% to 62%, demonstrating the considerable benefit of large-scale pre-training.
In addition, we observed significant improvements over baselines pre-trained on visual-only tasks, such as VC-1 and Reusable Representations for Robotic Manipulation (R3M), and over algorithms that use VLMs for object identification, such as Open-World Object Manipulation (MOO).
RT-2 achieves high performance on in-distribution tasks and outperforms multiple baselines in out-of-distribution tasks.
Evaluating our model on the open-source Language Table suite of robotic tasks, we achieved a 90% success rate in simulation, significantly improving over previous baselines including BC-Z (72%), RT-1 (74%), and LAVA (77%).
We then evaluated the same model in the real world (as it was trained on simulated and real data) and demonstrated its ability to generalize to novel objects, as shown below, where none of the objects except the blue cube were present in the training data.
RT-2 performs well on real-world Language Table robot tasks. None of the objects except the blue cube were present in the training data.
Inspired by chain-of-thought prompting methods used in LLMs, we probed our models to combine robotic control with chain-of-thought reasoning, to enable learning of long-horizon planning and low-level skills within a single model.
Specifically, we fine-tuned a variant of RT-2 for just a few hundred gradient steps to increase its ability to use language and actions jointly. We then augmented the data to include an additional “Plan” step, first describing in natural language the purpose of the action the robot is about to take, followed by “Action” and the action tokens. Here we show an example of such reasoning and the robot’s resulting behavior:
Chain-of-thought reasoning enables learning a self-contained model that can both plan long-horizon skill sequences and predict the robot’s actions.
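As a rough illustration of this augmented data format, the sketch below composes a training target that states a natural-language “Plan” before the “Action” token string; the exact wording and the helper function are assumptions for illustration, not the paper's released code.

```python
# Hypothetical sketch of the augmented chain-of-thought target described above:
# the training target first states a natural-language "Plan", then the "Action"
# token string. The exact formatting is an assumption for illustration.

def make_plan_action_target(plan, action_string):
    """Compose a single text target containing a plan and the action tokens."""
    return f"Plan: {plan}. Action: {action_string}"


# e.g. for an instruction like "pick up something to use as a makeshift hammer":
print(make_plan_action_target("pick up the rock", "1 128 91 241 5 101 127 217"))
# -> Plan: pick up the rock. Action: 1 128 91 241 5 101 127 217
```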
With this process, RT-2 can perform more involved commands that require reasoning about the intermediate steps needed to execute a user’s instruction. Thanks to its VLM backbone, RT-2 can also plan from both image and text commands, enabling visually grounded planning, whereas current plan-and-act approaches like SayCan cannot see the real world and rely entirely on language.
Advancing robotic control
RT-2 shows that vision language models (VLMs) can be transformed into powerful vision-language-action (VLA) models, which can directly control a robot by combining VLM pre-training with robotic data.
With two VLA instantiations based on PaLM-E and PaLI-X, RT-2 yields highly improved robotic policies and, more importantly, leads to significantly better generalization performance and emergent capabilities, inherited from web-scale vision-language pre-training.
RT-2 is not only a simple and effective modification over existing VLM models, but also shows the promise of building a general-purpose physical robot that can reason, problem-solve, and interpret information to perform a diverse range of tasks in the real world.