Data is the new soil, and in this fertile new ground, MIT researchers are planting more than just pixels. By using synthetic images to train machine learning models, a team of scientists recently surpassed results obtained from traditional “real-image” training methods.
At the core of the approach is a system called StableRep, which doesn’t just use synthetic images; it generates them through extremely popular text-to-image models like Stable Diffusion. It’s like creating worlds with words.
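To make the generation step concrete, here is a minimal sketch, not the authors’ released pipeline, of how a single caption can yield several distinct synthetic images with the open-source Hugging Face diffusers library; the checkpoint ID, prompt, and parameter values are illustrative assumptions.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a public Stable Diffusion checkpoint (an illustrative choice).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# One caption, several samples: the sampler's randomness makes each image a
# different rendering of the same underlying concept.
prompt = "a golden retriever catching a frisbee in a park"  # hypothetical prompt
images = pipe(prompt, num_images_per_prompt=4, guidance_scale=7.5).images
```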
So what’s in StableRep’s secret sauce? A strategy called “multi-positive contrastive learning.”
“We’re teaching the model to learn more about high-level concepts through context and variation, not just by feeding it data,” says Lijie Fan, an MIT doctoral student in electrical engineering affiliated with the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) and lead researcher on the work. “When multiple images, all generated from the same text, are treated as depictions of the same underlying thing, the model dives deeper into the concepts behind the images, say the object, not just their pixels.”
This approach considers multiple images spawned from identical text prompts as positive pairs, providing additional information during training, not just adding more diversity but specifying to the vision system which images are alike and which are different. Remarkably, StableRep outperformed top-tier models trained on real images, such as SimCLR and CLIP, on extensive datasets.
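What such a loss could look like in code: below is a minimal PyTorch sketch of a multi-positive contrastive objective, assuming `emb` holds L2-normalized image embeddings and `caption_ids[i]` records which caption image `i` was generated from; the paper’s exact implementation may differ.

```python
import torch
import torch.nn.functional as F

def multi_positive_contrastive_loss(emb, caption_ids, temperature=0.1):
    # emb: (n, d) L2-normalized embeddings; caption_ids: (n,) caption index per
    # image. Each caption should contribute at least two images so that every
    # anchor has at least one positive.
    sim = emb @ emb.t() / temperature                # (n, n) similarity logits
    self_mask = torch.eye(sim.size(0), dtype=torch.bool, device=emb.device)
    sim = sim.masked_fill(self_mask, -1e9)           # exclude self-comparisons

    # Target distribution: uniform over the other images from the same caption.
    pos = (caption_ids.unsqueeze(0) == caption_ids.unsqueeze(1)) & ~self_mask
    target = pos.float() / pos.sum(dim=1, keepdim=True)

    # Cross-entropy between the softmax over similarities and the target.
    return -(target * F.log_softmax(sim, dim=1)).sum(dim=1).mean()
```

Compared with a standard single-positive contrastive loss, the only change is that the target distribution spreads its mass over every same-caption image rather than over a single augmented view.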
“While StableRep helps mitigate the challenges of data acquisition in machine learning, it also marks a stride toward a new era of AI training techniques. The capacity to produce high-caliber, diverse synthetic images on command could help curtail cumbersome expenses and resources,” says Fan.
The data collection process has never been simple. In the 1990s, researchers had to manually take photographs to compile datasets about objects and faces. The 2000s saw people searching the internet for data. However, this raw, unedited data often contained discrepancies compared to real-world scenarios and reflected societal biases, presenting a distorted view of reality. The task of cleaning data sets through human intervention is not only expensive, but also extremely difficult. Imagine, however, if this painstaking data collection could be distilled down to something as simple as issuing a command in natural language.
A key aspect of StableRep’s triumph is the adjustment of the “guidance scale” in the generative model, which ensures a careful balance between the diversity and fidelity of the synthesized images. When finely tuned, the synthetic images used to train these self-supervised models were found to be just as effective, if not more so, than real images.
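In classifier-free guidance, the guidance scale weights how strongly sampling is steered toward the text prompt: low values give diverse but loosely matched images, high values give faithful but more uniform ones. Continuing the illustrative pipeline sketch above, tuning might amount to a simple sweep, with the candidate values below being arbitrary.

```python
# Reuses `pipe` and `prompt` from the earlier sketch. Each candidate setting
# would be scored by how well the resulting images train a self-supervised
# model, balancing diversity against fidelity.
for scale in (2.0, 4.0, 8.0, 12.0):  # arbitrary candidate values
    batch = pipe(prompt, num_images_per_prompt=4, guidance_scale=scale).images
    # ...train and evaluate a self-supervised model on images at this scale...
```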
Taking it a step further, language supervision was added to the mix, creating an enhanced variant: StableRep+. When trained with 20 million synthetic images, StableRep+ not only achieved superior accuracy but also displayed remarkable efficiency compared to CLIP models trained with a staggering 50 million real images.
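One generic way language supervision can enter the mix is a CLIP-style image-text contrastive term computed on the captions themselves; the sketch below illustrates that standard idea, not necessarily StableRep+’s exact formulation, and assumes paired, L2-normalized image and caption embeddings.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    # Matched image/caption pairs lie on the diagonal of the logit matrix; the
    # symmetric cross-entropy pulls them together and pushes the rest apart.
    logits = img_emb @ txt_emb.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.t(), labels))
```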
However, the path ahead is not without its bumps. The researchers candidly acknowledge several limitations, including the current slow pace of image generation, semantic mismatches between text prompts and the resulting images, the potential amplification of biases, and complexities in image attribution, all of which must be addressed for future advances. Another constraint is that StableRep requires first training the generative model on large-scale real data. The team recognizes that starting with real data remains a necessity; however, once you have a good generative model, you can repurpose it for new tasks, like training recognition models and visual representations.
While StableRep offers a good solution by reducing the dependence on vast collections of real images, it also surfaces concerns about hidden biases within the uncurated data used for these text-to-image models. The choice of text prompts, an integral part of the image-synthesis process, is not entirely free of bias, “suggesting the essential role of careful text selection or possible human curation,” says Fan.
“Using the latest text-to-image models, we’ve gained unprecedented control over image generation, allowing for a diverse range of visuals from a single text input. This surpasses real-world image collection in efficiency and versatility. It proves especially useful in specialized tasks, like balancing image variety in long-tail recognition, and presents a practical complement to using real images for training,” says Fan. “Our work signifies a step forward in visual learning, toward the goal of offering cost-effective training alternatives while highlighting the need for ongoing improvements in data quality and synthesis.”
“A dream of generative model learning has long been to be able to generate data useful for training discriminative models,” says Google DeepMind researcher and University of Toronto computer science professor David Fleet, who was not involved in the paper. “While we have seen some signs of life, the dream has been elusive, especially on large-scale complex domains like high-resolution images. This paper provides compelling evidence, for the first time to my knowledge, that the dream is becoming a reality. They show that contrastive learning from massive amounts of synthetic image data can produce representations that outperform those learned from real data at scale, with the potential to improve myriad downstream vision tasks.”
Fan is joined by Yonglong Tian PhD ’22 as lead authors of the paper, as well as MIT associate professor of electrical engineering and computer science and CSAIL principal investigator Phillip Isola; Google researcher and OpenAI technical staff member Huiwen Chang; and Google staff research scientist Dilip Krishnan. The team will present StableRep at the 2023 Conference on Neural Information Processing Systems (NeurIPS) in New Orleans.