In recent years, autoregressive Transformers have brought a steady stream of breakthroughs in generative modeling. These models generate each element of a sample – the pixels of an image, the characters of a text (usually in “token” chunks), the samples of an audio waveform, and so on – by predicting one element after another. When predicting the next element, the model can look back at those generated earlier.
However, each of a Transformer's layers becomes more expensive as more elements are used as input, and practitioners can only afford to train deep Transformers on sequences no longer than about 2,048 elements. As a result, most Transformer-based models ignore all elements beyond the most recent past (about 1,500 words or 1/6 of a small image) when making a prediction.
In contrast, recently developed Perceiver models give excellent results on a variety of real-world tasks with up to around 100,000 elements. Perceivers use cross-attention to encode inputs into a latent space, decoupling the input's compute requirements from model depth. Perceivers also spend a fixed cost, regardless of input size, at nearly every layer.
While latent-space encoding handles all elements in a single pass, autoregressive generation assumes that processing happens one element at a time. To address this problem, Perceiver AR proposes a simple solution: align the latents one-to-one with the final elements of the input, and carefully mask the input so that each latent only sees earlier elements.
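To make this concrete, here is a minimal sketch of that masked cross-attention step, not the released implementation: it assumes a single attention head, hypothetical shapes, and made-up names (`causal_cross_attention`, `num_latents`). Each latent is aligned with one of the last `num_latents` input positions and is masked so it can only attend to positions at or before its own.

```python
# A minimal, single-head sketch of causally masked cross-attention into latents.
# Shapes, names, and the single-head simplification are assumptions for illustration.
import jax
import jax.numpy as jnp

def causal_cross_attention(inputs, num_latents, w_q, w_k, w_v):
    """inputs: [M, d] token embeddings; returns [num_latents, d] latent states."""
    M, d = inputs.shape
    # Latents are aligned with the final `num_latents` input positions.
    latents = inputs[M - num_latents:]                    # [N, d]
    q = latents @ w_q                                     # [N, d]
    k = inputs @ w_k                                      # [M, d]
    v = inputs @ w_v                                      # [M, d]
    logits = q @ k.T / jnp.sqrt(d)                        # [N, M]
    # Latent i sits at input position (M - N + i); mask out strictly later positions.
    latent_pos = jnp.arange(M - num_latents, M)[:, None]  # [N, 1]
    input_pos = jnp.arange(M)[None, :]                    # [1, M]
    logits = jnp.where(input_pos <= latent_pos, logits, -1e30)
    weights = jax.nn.softmax(logits, axis=-1)             # [N, M]
    return weights @ v                                    # [N, d]

# Example: a 4096-element input compressed into 1024 causally masked latents;
# subsequent self-attention layers operate only on these 1024 latents.
d = 64
x = jax.random.normal(jax.random.PRNGKey(0), (4096, d))
params = [jax.random.normal(jax.random.PRNGKey(i), (d, d)) / jnp.sqrt(d) for i in range(1, 4)]
z = causal_cross_attention(x, 1024, *params)
print(z.shape)  # (1024, 64)
```

After this initial cross-attention, the rest of the stack is ordinary self-attention over the latents alone, which is what keeps per-layer cost independent of the input length.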
The result is an architecture (shown above) that attends to inputs up to 50 times longer than standard Transformers can, while deploying as widely (and essentially as easily) as standard decoder-only Transformers.
Perceiver AR scales considerably better with size than both standard Transformers and Transformer-XL models across a range of sequence lengths, in real terms. This property allows us to build very effective long-context models. For example, we find that a 60-layer Perceiver AR with a context length of 8192 outperforms a 42-layer Transformer-XL on a book-length generation task, while running faster in terms of real wall-clock time.
On standard long-context image (ImageNet 64×64), language (PG-19), and music (MAESTRO) generation benchmarks, Perceiver AR produces state-of-the-art results. Increasing the input context by decoupling input size from compute budget leads to several intriguing results:
- The compute budget can be adapted at evaluation time, allowing us to spend less and smoothly degrade quality, or spend more for improved generation (see the sketch after this list).
- A larger context allows Perceiver AR to outperform Transformer-XL, even when spending the same amount on compute. We find that larger context leads to improved model performance even at affordable scale (~1B parameters).
- Perceiver AR’s sample quality is much less sensitive to the order in which it generates elements. This makes it easy to apply Perceiver AR to settings without a natural left-to-right order, such as data like images, whose structure spans more than one dimension.
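The first point can be made concrete with a toy cost count, shown below. It is not a measurement: the layer/context figures are borrowed from the 60-layer, 8192-context example above, the width `d` and the `attention_flops` helper are hypothetical, and the counting is rough. It illustrates why the budget is adjustable at evaluation time: after the initial cross-attention over the full input, every remaining layer operates only on the N latents, so shrinking N shrinks compute while the input length stays fixed.

```python
# Toy flop count (hypothetical dimensions, rough arithmetic) for one forward pass.
def attention_flops(m_in, n_out, d):
    # One attention layer reading m_in positions and writing n_out positions:
    # roughly the QK^T product plus the weighted sum over values.
    return 2 * n_out * m_in * d

M, d, layers = 8192, 1024, 60
for N in (2048, 1024, 256):
    cross = attention_flops(M, N, d)                        # first layer: latents attend to the full input
    self_stack = (layers - 1) * attention_flops(N, N, d)    # remaining layers: latents only
    print(f"N={N}: ~{(cross + self_stack) / 1e9:.1f} GFLOPs")
```

Fewer latents means less compute per prediction with a graceful drop in quality, and the same trained weights can be run at any of these settings.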
Using a dataset of piano music, we trained Perceiver AR to create new pieces of music from scratch. Because each new note is predicted based on the full sequence of notes that came before it, Perceiver AR is able to produce pieces with a high level of melodic, harmonic, and rhythmic coherence:
Learn more about using Perceiver AR:
- Download the JAX code for training Perceiver AR on GitHub
- Read our paper on arXiv
- Check out our spotlight presentation at ICML 2022
- Check out the Google Magenta post with even more music!