A key aspect of intelligence is the ability to quickly learn how to perform a new task when given a brief instruction. For example, a child can identify real animals at the zoo after seeing a few pictures of them in a book, despite the differences between the two. But for a typical visual model to learn a new task, it must be trained on tens of thousands of examples labeled specifically for that task. If the goal is to count and identify the animals in an image, as in “three zebras,” one would have to collect thousands of images and annotate each one with the animals’ quantity and type. This process is inefficient, expensive and resource-intensive: it requires large amounts of annotated data, and a new model must be trained each time a new task arises. As part of DeepMind’s mission to solve intelligence, we investigated whether an alternative model could make this process easier and more efficient, given only limited task-specific information.
Today, in our new paper, we introduce Flamingo, a single visual language model (VLM) that sets a new state of the art in few-shot learning across a wide range of open-ended multimodal tasks. This means that Flamingo can tackle a range of difficult problems with only a handful of task-specific examples (in “few shots”), without requiring any additional training. Flamingo’s simple interface makes this possible: it takes as input a prompt consisting of interleaved images, videos and text, and outputs the associated language.
Similar to the behavior of large language models (LLMs), which can tackle a language task by processing examples of the task in their text prompt, Flamingo’s visual and textual interface can steer the model toward solving a multimodal task. Given a few example pairs of visual inputs and expected text responses composed into Flamingo’s prompt, the model can be asked a question about a new image or video and then generate an answer.
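To make the prompting format concrete, here is a minimal sketch of how a few-shot interleaved prompt might be assembled in Python. The `FlamingoModel` class, its `generate` method, the image file names and the question text are all hypothetical placeholders for illustration, not a real public API.

```python
# Hypothetical sketch of assembling a few-shot, interleaved image/text prompt.
# `FlamingoModel` and `generate` are illustrative placeholders, not a real API.
from dataclasses import dataclass
from typing import List, Tuple, Union

@dataclass
class Image:
    path: str  # stand-in for actual pixel data

PromptItem = Union[Image, str]  # a prompt is an ordered mix of images and text

def build_few_shot_prompt(examples: List[Tuple[Image, str]], query: Image) -> List[PromptItem]:
    """Interleave (image, answer) support examples, then append the query image."""
    prompt: List[PromptItem] = []
    for image, answer in examples:
        prompt += [image, f"Question: what is in this picture? Answer: {answer}"]
    # The model is expected to complete the text that follows the query image.
    prompt += [query, "Question: what is in this picture? Answer:"]
    return prompt

examples = [
    (Image("chinchillas.jpg"), "two chinchillas"),
    (Image("flamingos.jpg"), "three flamingos"),
]
prompt = build_few_shot_prompt(examples, Image("zebras.jpg"))
# model = FlamingoModel.load(...)   # hypothetical loading step
# print(model.generate(prompt))     # expected completion, e.g. "three zebras"
```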
On the 16 tasks we studied, Flamingo outperforms all previous few-shot learning approaches when given as few as four examples per task. In several cases, the same Flamingo model outperforms methods that are fine-tuned and optimized for each task independently and that use orders of magnitude more task-specific data. This could allow non-experts to quickly and easily apply accurate visual language models to new tasks.
In practice, Flamingo fuses large language models with powerful visual representations – each separately pre-trained and frozen – by adding novel architectural components in between. It is then trained on a mixture of complementary large-scale multimodal data sourced only from the web, without using any data annotated for machine learning purposes. Following this method, we start from Chinchilla, our recently introduced 70B-parameter language model, to train our final Flamingo model, an 80B-parameter VLM. Once this training is done, Flamingo can be adapted directly to vision tasks via simple few-shot learning, without any additional task-specific tuning.
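To illustrate the general idea of bridging frozen pre-trained components with new trainable layers in between, here is a much-simplified PyTorch-style sketch. The module names, dimensions, and the tanh-gated cross-attention adapter shown here are our own simplified stand-ins for the idea, not the published architecture.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """Trainable adapter inserted between frozen components (simplified sketch).

    The text stream cross-attends to visual features, and a tanh gate
    initialised at zero lets the frozen language model start from its
    original behaviour. All dimensions are placeholders.
    """
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # starts as an identity mapping

    def forward(self, text_tokens: torch.Tensor, visual_tokens: torch.Tensor) -> torch.Tensor:
        attended, _ = self.cross_attn(text_tokens, visual_tokens, visual_tokens)
        return text_tokens + torch.tanh(self.gate) * attended

# Frozen, separately pre-trained components (toy placeholders here).
vision_encoder = nn.Linear(2048, 512)   # stands in for a pre-trained image encoder
language_block = nn.Linear(512, 512)    # stands in for a pre-trained LM layer
for module in (vision_encoder, language_block):
    for p in module.parameters():
        p.requires_grad = False

adapter = GatedCrossAttentionBlock()    # only these new parameters would be trained

visual_tokens = vision_encoder(torch.randn(1, 64, 2048))  # (batch, patches, dim)
text_tokens = torch.randn(1, 16, 512)                     # (batch, tokens, dim)
fused = adapter(text_tokens, visual_tokens)
output = language_block(fused)
```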
We also tested the model’s qualitative capabilities beyond our current benchmarks. As part of this process, we compared our model’s captions for images related to gender and skin color, and ran the captions generated by our model through Google’s Perspective API, which assesses the toxicity of text. Although the initial results are positive, more research into the assessment of ethical risks in multimodal systems is vital, and we urge people to evaluate and carefully consider these issues before deploying such systems in the real world.
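For readers curious how such a toxicity check might look in practice, below is a small sketch that sends a generated caption to the publicly documented Perspective API. The request shape follows Google’s public documentation; the API key, the example caption, and the 0.5 threshold are placeholders.

```python
import requests

# Placeholder values; a real key is obtained through Google Cloud.
PERSPECTIVE_URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"
API_KEY = "YOUR_API_KEY"

def toxicity_score(caption: str) -> float:
    """Return the Perspective API TOXICITY summary score (0.0-1.0) for a caption."""
    body = {
        "comment": {"text": caption},
        "requestedAttributes": {"TOXICITY": {}},
    }
    resp = requests.post(PERSPECTIVE_URL, params={"key": API_KEY}, json=body, timeout=10)
    resp.raise_for_status()
    return resp.json()["attributeScores"]["TOXICITY"]["summaryScore"]["value"]

# Example: flag any generated caption whose score exceeds a chosen threshold.
captions = ["a group of flamingos standing in a lake"]  # placeholder model outputs
flagged = [c for c in captions if toxicity_score(c) > 0.5]
```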
Multimodal capabilities are essential for important AI applications, such as helping the visually impaired with everyday visual challenges or improving the identification of hateful content on the web. Flamingo makes it possible to adapt efficiently to these and other tasks on the fly without modifying the model. Interestingly, the model also demonstrates multimodal dialogue capabilities out of the box.
Flamingo is an effective and efficient family of general-purpose models that can be applied to image and video understanding tasks with minimal task-specific examples. Models like Flamingo hold great promise to benefit society in practical ways, and we continue to improve their flexibility and capabilities so that they can be deployed safely for the benefit of all. Flamingo’s capabilities pave the way toward rich interactions with learned visual language models that can enable better interpretability and exciting new applications, such as a visual assistant that helps people in everyday life – and we’re thrilled with the results so far.