Imagine you’re visiting a friend abroad and looking inside their fridge to see what they’d make for a great breakfast. Many of the items seem foreign to you at first, each encased in unfamiliar packaging and containers. Despite these visual distinctions, you begin to understand what each is used for and take them as needed.
Inspired by humans’ ability to manipulate unfamiliar objects, a team from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) designed Feature Fields for Robotic Manipulation (F3RM), a system that blends 2D images with foundation model features into 3D scenes to help robots identify and grasp nearby objects. F3RM can interpret open-ended language prompts from humans, making the method useful in real-world environments containing thousands of objects, such as warehouses and households.
F3RM gives robots the ability to interpret open-ended text prompts using natural language, helping the machines manipulate objects. As a result, the machines can understand less-specific requests from humans and still complete the desired task. For example, if a user asks the robot to “lift a tall mug,” the robot can locate and grab the object that best fits that description.
“Building robots that can really generalize to the real world is incredibly difficult,” says Ge Yang, a postdoctoral fellow at the National Science Foundation AI Institute for Artificial Intelligence and Fundamental Interactions and MIT CSAIL. “We really want to figure out how to do that, so with this project, we’re trying to push for an aggressive level of generalization, from three or four objects to anything we find at MIT’s Stata Center. We wanted to learn how to make robots as flexible as we are, since we can grasp and place objects even if we’ve never seen them before.”
Learning “what is where by looking”
The method could help robots pick items in large fulfillment centers, with their inevitable clutter and unpredictability. In these warehouses, robots are often given a text description of the inventory they need to identify. The robots must match that text to an object, regardless of variations in packaging, so that customers’ orders are shipped correctly.
For example, the fulfillment centers of large online retailers can contain millions of items, many of which a robot will never have encountered before. To operate at such a scale, robots must understand the geometry and semantics of different objects, some in tight spaces. With F3RM’s advanced spatial and semantic capabilities, a robot could become more efficient at locating an object, placing it in a bin, and then sending it for packaging. Ultimately, this would help factory workers ship customer orders more efficiently.
“One thing that often surprises people with F3RM is that the same system also works at room and building scale and can be used to create simulation environments for robot learning and large maps,” says Yang. “But before we scale this project further, we first want to get this system up and running very quickly. That way, we can use this type of representation for more dynamic robotic control tasks, hopefully in real time, so that robots handling more dynamic tasks can use it for perception.”
The MIT team notes that F3RM’s ability to understand different scenes could make it useful in urban and household environments. For example, the approach could help personal robots locate and pick up specific objects. The system helps robots understand their surroundings, both physically and perceptually.
“Visual perception was defined by David Marr as the problem of knowing ‘what is where by looking,'” says senior author Phillip Isola, MIT associate professor of electrical engineering and computer science and CSAIL principal investigator. “Recent foundation models have gotten really good at knowing what they’re looking at. They can recognize thousands of object categories and provide detailed text descriptions of images. At the same time, radiance fields have become very good at representing where things are in a scene. Combining these two approaches can create a representation of what is where in 3D, and what our work shows is that this combination is particularly useful for robotic tasks that require manipulating objects in 3D.”
Creating a “digital twin”
F3RM begins to understand its environment by taking pictures on a selfie stick. The mounted camera snaps 50 images at different poses, enabling it to build a neural radiance field (NeRF), a deep learning method that takes 2D images to construct a 3D scene. This collection of RGB photos creates a “digital twin” of its surroundings in the form of a 360-degree representation of what’s nearby.
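To make that capture-and-reconstruct step concrete, here is a minimal sketch of fitting a NeRF-style model to posed RGB images. It assumes PyTorch, a toy-sized MLP, and randomly generated placeholder rays and pixel colors; it illustrates the general technique rather than F3RM’s actual implementation.

```python
# Minimal NeRF-style sketch: an MLP maps 3D points to density and color,
# and colors are volume-rendered along camera rays. Network size, sampling
# scheme, and the placeholder ray batch are illustrative assumptions.
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    """MLP mapping a 3D point to (density, RGB color)."""
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # 1 density channel + 3 color channels
        )

    def forward(self, xyz):
        out = self.net(xyz)
        density = torch.relu(out[..., :1])
        rgb = torch.sigmoid(out[..., 1:])
        return density, rgb

def render_rays(model, origins, dirs, near=0.5, far=3.0, n_samples=32):
    """Volume-render a color for each ray via simple quadrature."""
    t = torch.linspace(near, far, n_samples)                        # sample depths
    pts = origins[:, None, :] + dirs[:, None, :] * t[None, :, None]  # (R, S, 3)
    density, rgb = model(pts)                                        # (R, S, 1), (R, S, 3)
    delta = (far - near) / n_samples
    alpha = 1.0 - torch.exp(-density.squeeze(-1) * delta)            # per-sample opacity
    trans = torch.cumprod(torch.cat(
        [torch.ones_like(alpha[:, :1]), 1.0 - alpha[:, :-1]], dim=1), dim=1)
    weights = alpha * trans                                          # per-sample contribution
    return (weights[..., None] * rgb).sum(dim=1)                     # (R, 3) rendered colors

# Placeholder batch of rays and target pixel colors; in practice these come
# from the ~50 posed images captured by the camera on the selfie stick.
origins = torch.zeros(1024, 3)
dirs = torch.nn.functional.normalize(torch.randn(1024, 3), dim=-1)
target_rgb = torch.rand(1024, 3)

model = TinyNeRF()
opt = torch.optim.Adam(model.parameters(), lr=5e-4)
for step in range(200):
    pred = render_rays(model, origins, dirs)
    loss = ((pred - target_rgb) ** 2).mean()   # photometric reconstruction loss
    opt.zero_grad(); loss.backward(); opt.step()
```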
In addition to a highly detailed neural radiance field, F3RM also builds a feature field to augment the geometry with semantic information. The system uses CLIP, a vision foundation model trained on hundreds of millions of images to efficiently learn visual concepts. By reconstructing the 2D CLIP features for the images captured by the selfie stick, F3RM effectively lifts the 2D features into a 3D representation.
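One way to picture this lifting step is a second head that maps 3D points to CLIP-sized feature vectors, supervised so that features rendered along each ray match the 2D CLIP features of the captured views. The sketch below assumes PyTorch, a 512-dimensional embedding, and placeholder sample points, rendering weights, and target features; it is not F3RM’s actual architecture.

```python
# Sketch of a language-aligned feature field: a small MLP predicts a
# CLIP-dimensional feature at each 3D point, and features blended with
# volume-rendering weights are trained to match per-ray 2D CLIP features.
import torch
import torch.nn as nn

CLIP_DIM = 512  # assumed CLIP embedding width

class FeatureField(nn.Module):
    """MLP mapping a 3D point to a CLIP-dimensional feature vector."""
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, CLIP_DIM),
        )

    def forward(self, xyz):
        return self.net(xyz)

def render_features(field, pts, weights):
    """Blend per-sample features along each ray with volume-rendering weights."""
    feats = field(pts)                              # (rays, samples, CLIP_DIM)
    return (weights[..., None] * feats).sum(dim=1)  # (rays, CLIP_DIM)

# Placeholders: sample points and weights would come from the trained NeRF
# geometry; target features from a CLIP image encoder run on the 2D views.
pts = torch.randn(1024, 32, 3)
weights = torch.softmax(torch.randn(1024, 32), dim=-1)
target_clip = torch.nn.functional.normalize(torch.randn(1024, CLIP_DIM), dim=-1)

field = FeatureField()
opt = torch.optim.Adam(field.parameters(), lr=5e-4)
for step in range(200):
    rendered = render_features(field, pts, weights)
    loss = (1 - torch.nn.functional.cosine_similarity(rendered, target_clip, dim=-1)).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```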
Keeping things open
After receiving a few demonstrations, the robot applies what it knows about geometry and semantics to grasp objects it has never encountered before. Once a user submits a text query, the robot searches through the space of possible grasps to identify those most likely to succeed in picking up the item the user requested. Each candidate grasp is scored based on its relevance to the prompt, its similarity to the demonstrations the robot has been trained on, and whether it causes any collisions. The highest-scoring grasp is then selected and executed.
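A rough sketch of that scoring idea follows. The weighting scheme, helper names, and placeholder inputs are assumptions for illustration; F3RM’s actual scoring function may differ.

```python
# Illustrative grasp ranking: combine language relevance, similarity to
# demonstrated grasps, and a collision penalty into one score per candidate.
import torch
import torch.nn.functional as F

def score_grasps(grasp_feats, text_emb, demo_feats, collision_mask,
                 w_text=1.0, w_demo=1.0, collision_penalty=1e3):
    """Return a score per candidate grasp; higher is better.

    grasp_feats:    (G, D) feature-field features pooled around each grasp pose
    text_emb:       (D,)   CLIP embedding of the user's text query
    demo_feats:     (N, D) features recorded from the few human demonstrations
    collision_mask: (G,)   True where the grasp collides with the scene
    """
    grasp_feats = F.normalize(grasp_feats, dim=-1)
    text_sim = grasp_feats @ F.normalize(text_emb, dim=-1)                        # (G,)
    demo_sim = (grasp_feats @ F.normalize(demo_feats, dim=-1).T).max(dim=-1).values
    return w_text * text_sim + w_demo * demo_sim - collision_penalty * collision_mask.float()

# Placeholder inputs: in practice these come from the feature field, a CLIP
# text encoder, the stored demonstrations, and a scene-geometry collision check.
G, N, D = 64, 3, 512
scores = score_grasps(torch.randn(G, D), torch.randn(D),
                      torch.randn(N, D), torch.rand(G) > 0.9)
best = scores.argmax()  # the highest-scoring grasp would be executed
```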
To demonstrate the system’s ability to interpret open-ended requests from humans, the researchers asked the robot to pick up Baymax, a character from Disney’s “Big Hero 6.” While F3RM had never been directly trained to pick up a toy of the cartoon superhero, the robot used its spatial awareness and the vision-language features from the foundation models to decide which object to grasp and how to pick it up.
F3RM also allows users to specify which object they want the robot to handle at different levels of linguistic detail. For example, if there is a metal mug and a glass mug, the user can ask the robot for the “glass mug.” If the robot sees two glass mugs and one is filled with coffee and the other with juice, the user can ask for the “glass mug with coffee.” The vision foundation model features embedded in the feature field enable this level of open-ended understanding.
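As a sketch of how more specific prompts could disambiguate objects, the snippet below embeds two text queries with a CLIP text encoder and compares them against features queried near two candidate objects. The use of OpenAI’s `clip` package and the random placeholder object features are assumptions for illustration.

```python
# Compare text-prompt embeddings against 3D feature-field features at two
# candidate objects; the robot would act on the best-matching object.
import torch
import torch.nn.functional as F
import clip  # https://github.com/openai/CLIP

model, _ = clip.load("ViT-B/32", device="cpu")
prompts = ["glass mug", "glass mug with coffee"]
with torch.no_grad():
    text_emb = F.normalize(model.encode_text(clip.tokenize(prompts)).float(), dim=-1)

# Placeholder: features the feature field might return at two detected mugs.
object_feats = F.normalize(torch.randn(2, text_emb.shape[-1]), dim=-1)

similarity = text_emb @ object_feats.T   # (n_prompts, n_objects) cosine similarities
target = similarity.argmax(dim=-1)       # best-matching object for each prompt
```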
“If I showed a person how to grasp a mug by the rim, they could easily transfer that knowledge to pick up objects with similar geometry, such as bowls, measuring cups, or even rolls of tape. For robots, achieving this level of adaptability has been quite difficult,” says MIT PhD student, CSAIL collaborator and co-lead author William Shen. “F3RM combines geometric understanding with semantics from foundation models trained on Internet-scale data to enable this level of aggressive generalization from just a small number of demonstrations.”
Shen and Yang wrote the paper under Isola’s supervision, with co-authors MIT professor and CSAIL principal investigator Leslie Pack Kaelbling and undergraduate students Alan Yu and Jansen Wong. The team was supported, in part, by Amazon.com Services, the National Science Foundation AI Institute for Artificial Intelligence and Fundamental Interactions, the Air Force Office of Scientific Research, the Office of Naval Research’s Multidisciplinary University Research Initiative, the Army Research Office, the MIT-IBM Watson AI Lab, and the MIT Quest for Intelligence. Their work will be presented at the 2023 Conference on Robot Learning.