
Google DeepMind Shows Off New Generative AI-Based RT-2 Robot Command Language

Google DeepMind has unveiled a communication language for instructing robots based on a large language model (LLM). The Robotic Transformer 2 (RT-2) vision-language-action (VLA) model leverages Internet data and generative AI to enhance a robot’s understanding of the language and commands it receives, similar to how ChatGPT converses with users, albeit with an expanded focus on completing tasks in the physical world.

RT-2 Robots

RT-2’s ultimate goal is to enable robots to follow orders and grasp the meaning of commands as well as a human could, without requiring any specialized command language. The model is trained on text and images from the web, and it folds that data into its pattern recognition so that it could, in theory, complete a task it was never specifically trained on. For instance, training on instructions and images that describe fetching a wrench from a tool bench could generalize to hammers and other tools, or to locations other than a tool bench. RT-2 shows how a robot could apply what it knows about one scenario to a whole category of them, including working out how a command applies in a new context. The LLM aspect resembles generative AI chatbots, but with added physical context rather than just raw information.

“Unlike chatbots, robots need “grounding” in the real world and their abilities. Their training isn’t just about, say, learning everything there is to know about an apple: how it grows, its physical properties, or even that one purportedly landed on Sir Isaac Newton’s head. A robot needs to be able to recognize an apple in context, distinguish it from a red ball, understand what it looks like, and most importantly, know how to pick it up,” DeepMind head of robotics Vincent Vanhoucke explained in a blog post. “RT-2’s ability to transfer information to actions shows promise for robots to more rapidly adapt to novel situations and environments. In testing RT-2 models in more than 6,000 robotic trials, the team found that RT-2 functioned as well as our previous model, RT-1, on tasks in its training data, or “seen” tasks. And it almost doubled its performance on novel, unseen scenarios to 62% from RT-1’s 32%. In other words, with RT-2, robots are able to learn more like we do — transferring learned concepts to new situations.”
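As a quick sanity check on the “almost doubled” claim, the two unseen-task success rates quoted above can be compared directly. This is only a back-of-the-envelope sketch: the variable names are illustrative, and the only inputs are the 32% and 62% figures from the quote.

```python
# Success rates on novel, unseen scenarios, as quoted from the DeepMind blog post.
rt1_unseen = 0.32  # RT-1
rt2_unseen = 0.62  # RT-2

# Relative improvement: how many times better RT-2 did than RT-1.
ratio = rt2_unseen / rt1_unseen

# Absolute improvement, expressed in percentage points.
gain_points = (rt2_unseen - rt1_unseen) * 100

print(f"RT-2 vs. RT-1 on unseen tasks: {ratio:.2f}x")
print(f"Absolute gain: {gain_points:.0f} percentage points")
```

The ratio comes out just under 2, which matches the “almost doubled” wording, while the absolute gain is 30 percentage points.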

Robot Seagulls and Dogs

Better robot languages are a goal for many tech firms. Amazon has also been exploring ways of reducing the friction in robot interactions through its AI research, including the Alexa Prize SimBot Challenge. The University of Michigan recently won the first SimBot Challenge with the Seagull virtual robot, an “interactive embodied agent” trained in virtual space for potential use in physical robots.

Vanhoucke’s point that LLMs need to go beyond generative AI chatbots to perform well as a robot language makes sense, but that hasn’t stopped people from trying. In fact, engineers at robotic AI software developer Levatas embedded ChatGPT into one of the Boston Dynamics robot dogs, and it performed surprisingly well. With ChatGPT combined with Google’s Text-to-Speech synthetic voice API, the robot dog understood commands and would attempt to carry out tasks when asked. Though the robot was far from perfect at understanding or carrying out commands, just having ChatGPT as an interface made it possible to communicate with it in a more casual manner.

