A key aspect of intelligence is the ability to quickly learn how to perform a new task when given a brief instruction. For example, a child can identify real animals at the zoo after seeing a few pictures of them in a book, despite the differences between the two. But for a typical visual model to learn a new task, it must be trained on tens of thousands of examples labeled specifically for that task. If the goal is to count and identify the animals in an image, as in “three zebras,” one would have to collect thousands of images and annotate each one with the animals’ quantity and type. This process is inefficient, expensive and resource-intensive: it requires large amounts of annotated data, and a new model must be trained each time a new task arises. As part of DeepMind’s mission to solve intelligence, we investigated whether an alternative model could make this process easier and more efficient, given only limited task-specific information.
Today, in our new paper, we introduce Flamingo, a single visual language model (VLM) that sets a new state of the art in few-shot learning across a wide range of open-ended multimodal tasks. This means that Flamingo can tackle a range of difficult problems with only a handful of task-specific examples (in “few shots”), without requiring any additional training. Flamingo’s simple interface makes this possible: it takes as input a prompt consisting of interleaved images, videos and text, and outputs the associated language.
Similar to the behavior of large language models (LLMs), which can tackle a language task by processing examples of the task in their text prompt, Flamingo’s visual and textual interface can steer the model toward solving a multimodal task. Given a few example pairs of visual inputs and expected text responses composed into Flamingo’s prompt, the model can be asked a question about a new image or video and then generate an answer.
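To make the prompting format concrete, here is a minimal sketch of how a few-shot interleaved prompt might be assembled in Python. The `FlamingoModel` class, its `generate` method, the image file names and the question text are all hypothetical placeholders for illustration, not a real public API.

```python
# Hypothetical sketch of assembling a few-shot, interleaved image/text prompt.
# `FlamingoModel` and `generate` are illustrative placeholders, not a real API.
from dataclasses import dataclass
from typing import List, Tuple, Union

@dataclass
class Image:
    path: str  # stand-in for actual pixel data

PromptItem = Union[Image, str]  # a prompt is an ordered mix of images and text

def build_few_shot_prompt(examples: List[Tuple[Image, str]], query: Image) -> List[PromptItem]:
    """Interleave (image, answer) support examples, then append the query image."""
    prompt: List[PromptItem] = []
    for image, answer in examples:
        prompt += [image, f"Question: what is in this picture? Answer: {answer}"]
    # The model is expected to complete the text that follows the query image.
    prompt += [query, "Question: what is in this picture? Answer:"]
    return prompt

examples = [
    (Image("chinchillas.jpg"), "two chinchillas"),
    (Image("flamingos.jpg"), "three flamingos"),
]
prompt = build_few_shot_prompt(examples, Image("zebras.jpg"))
# model = FlamingoModel.load(...)   # hypothetical loading step
# print(model.generate(prompt))     # expected completion, e.g. "three zebras"
```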
On the 16 tasks we studied, Flamingo outperforms all previous few-shot learning approaches when given as few as four examples per task. In several cases, the same Flamingo model outperforms methods that are fine-tuned and optimized for each task independently and that use orders of magnitude more task-specific data. This could allow non-experts to quickly and easily apply accurate visual language models to new tasks.
In practice, Flamingo fuses large language models with powerful visual representations – each separately pre-trained and frozen – by adding novel architectural components in between. It is then trained on a mixture of complementary large-scale multimodal data sourced only from the web, without using any data annotated for machine learning purposes. Following this method, we start from Chinchilla, our recently introduced 70B-parameter language model, to train our final Flamingo model, an 80B-parameter VLM. Once this training is done, Flamingo can be adapted directly to vision tasks via simple few-shot learning, without any additional task-specific tuning.
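To illustrate the general idea of bridging frozen pre-trained components with new trainable layers in between, here is a much-simplified PyTorch-style sketch. The module names, dimensions, and the tanh-gated cross-attention adapter shown here are our own simplified stand-ins for the idea, not the published architecture.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """Trainable adapter inserted between frozen components (simplified sketch).

    The text stream cross-attends to visual features, and a tanh gate
    initialised at zero lets the frozen language model start from its
    original behaviour. All dimensions are placeholders.
    """
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # starts as an identity mapping

    def forward(self, text_tokens: torch.Tensor, visual_tokens: torch.Tensor) -> torch.Tensor:
        attended, _ = self.cross_attn(text_tokens, visual_tokens, visual_tokens)
        return text_tokens + torch.tanh(self.gate) * attended

# Frozen, separately pre-trained components (toy placeholders here).
vision_encoder = nn.Linear(2048, 512)   # stands in for a pre-trained image encoder
language_block = nn.Linear(512, 512)    # stands in for a pre-trained LM layer
for module in (vision_encoder, language_block):
    for p in module.parameters():
        p.requires_grad = False

adapter = GatedCrossAttentionBlock()    # only these new parameters would be trained

visual_tokens = vision_encoder(torch.randn(1, 64, 2048))  # (batch, patches, dim)
text_tokens = torch.randn(1, 16, 512)                     # (batch, tokens, dim)
fused = adapter(text_tokens, visual_tokens)
output = language_block(fused)
```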
We also tested the model’s qualitative capabilities beyond our current benchmarks. As part of this process, we compared our model’s captions for images related to gender and skin color, and ran the captions generated by our model through Google’s Perspective API, which assesses the toxicity of text. Although the initial results are positive, more research into the assessment of ethical risks in multimodal systems is vital, and we urge people to evaluate and carefully consider these issues before deploying such systems in the real world.
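For readers curious how such a toxicity check might look in practice, below is a small sketch that sends a generated caption to the publicly documented Perspective API. The request shape follows Google’s public documentation; the API key, the example caption, and the 0.5 threshold are placeholders.

```python
import requests

# Placeholder values; a real key is obtained through Google Cloud.
PERSPECTIVE_URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"
API_KEY = "YOUR_API_KEY"

def toxicity_score(caption: str) -> float:
    """Return the Perspective API TOXICITY summary score (0.0-1.0) for a caption."""
    body = {
        "comment": {"text": caption},
        "requestedAttributes": {"TOXICITY": {}},
    }
    resp = requests.post(PERSPECTIVE_URL, params={"key": API_KEY}, json=body, timeout=10)
    resp.raise_for_status()
    return resp.json()["attributeScores"]["TOXICITY"]["summaryScore"]["value"]

# Example: flag any generated caption whose score exceeds a chosen threshold.
captions = ["a group of flamingos standing in a lake"]  # placeholder model outputs
flagged = [c for c in captions if toxicity_score(c) > 0.5]
```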
Multimodal capabilities are essential for important AI applications, such as helping the visually impaired with everyday visual challenges or improving the identification of hateful content on the web. Flamingo makes it possible to adapt efficiently to these and other tasks on the fly without modifying the model. Interestingly, the model also demonstrates multimodal dialogue capabilities out of the box.
Flamingo is an effective and efficient family of general-purpose models that can be applied to image and video understanding tasks with minimal task-specific examples. Models like Flamingo hold great promise to benefit society in practical ways, and we continue to improve their flexibility and capabilities so that they can be deployed safely for the benefit of all. Flamingo’s capabilities pave the way toward rich interactions with learned visual language models that can enable better interpretability and exciting new applications, such as a visual assistant that helps people in everyday life – and we’re thrilled with the results so far.