In recent years, autoregressive Transformers have brought a steady stream of breakthroughs in generative modeling. These models generate each element of a sample – the pixels of an image, the characters of a text (usually in “token” chunks), the samples of an audio waveform, and so on – by predicting one element after another. When predicting the next element, the model can look back at those generated earlier.
However, each of a Transformer's layers becomes more expensive as more elements are used as input, and practitioners can only afford to train deep Transformers on sequences no longer than about 2,048 elements. As a result, most Transformer-based models ignore all elements beyond the most recent past (about 1,500 words or 1/6 of a small image) when making a prediction.
In contrast, recently developed Perceiver models give excellent results on a variety of real-world tasks with up to around 100,000 elements. Perceivers use cross-attention to encode inputs into a latent space, decoupling the input's compute requirements from model depth. Perceivers also spend a fixed cost, regardless of input size, at nearly every layer.
While latent-space encoding handles all elements in a single pass, autoregressive generation assumes that processing happens one element at a time. To address this problem, Perceiver AR proposes a simple solution: align the latents one-to-one with the final elements of the input, and carefully mask the input so that each latent only sees earlier elements.
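To make this concrete, here is a minimal sketch of that masked cross-attention step, not the released implementation: it assumes a single attention head, hypothetical shapes, and made-up names (`causal_cross_attention`, `num_latents`). Each latent is aligned with one of the last `num_latents` input positions and is masked so it can only attend to positions at or before its own.

```python
# A minimal, single-head sketch of causally masked cross-attention into latents.
# Shapes, names, and the single-head simplification are assumptions for illustration.
import jax
import jax.numpy as jnp

def causal_cross_attention(inputs, num_latents, w_q, w_k, w_v):
    """inputs: [M, d] token embeddings; returns [num_latents, d] latent states."""
    M, d = inputs.shape
    # Latents are aligned with the final `num_latents` input positions.
    latents = inputs[M - num_latents:]                    # [N, d]
    q = latents @ w_q                                     # [N, d]
    k = inputs @ w_k                                      # [M, d]
    v = inputs @ w_v                                      # [M, d]
    logits = q @ k.T / jnp.sqrt(d)                        # [N, M]
    # Latent i sits at input position (M - N + i); mask out strictly later positions.
    latent_pos = jnp.arange(M - num_latents, M)[:, None]  # [N, 1]
    input_pos = jnp.arange(M)[None, :]                    # [1, M]
    logits = jnp.where(input_pos <= latent_pos, logits, -1e30)
    weights = jax.nn.softmax(logits, axis=-1)             # [N, M]
    return weights @ v                                    # [N, d]

# Example: a 4096-element input compressed into 1024 causally masked latents;
# subsequent self-attention layers operate only on these 1024 latents.
d = 64
x = jax.random.normal(jax.random.PRNGKey(0), (4096, d))
params = [jax.random.normal(jax.random.PRNGKey(i), (d, d)) / jnp.sqrt(d) for i in range(1, 4)]
z = causal_cross_attention(x, 1024, *params)
print(z.shape)  # (1024, 64)
```

After this initial cross-attention, the rest of the stack is ordinary self-attention over the latents alone, which is what keeps per-layer cost independent of the input length.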
The result is an architecture (shown above) that attends to inputs up to 50 times longer than standard Transformers can, while deploying as widely (and essentially as easily) as standard decoder-only Transformers.
Perceiver AR scales considerably better with size than both standard Transformers and Transformer-XL models across a range of sequence lengths, in real terms. This property allows us to build very effective long-context models. For example, we find that a 60-layer Perceiver AR with a context length of 8192 outperforms a 42-layer Transformer-XL on a book-length generation task, while running faster in terms of real wall-clock time.
On standard long-context image (ImageNet 64×64), language (PG-19), and music (MAESTRO) generation benchmarks, Perceiver AR produces state-of-the-art results. Increasing the input context by decoupling input size from compute budget leads to several intriguing results:
- The compute budget can be adapted at evaluation time, allowing us to spend less and smoothly degrade quality, or spend more for improved generation (see the sketch after this list).
- A larger context allows Perceiver AR to outperform Transformer-XL, even when spending the same amount on compute. We find that larger context leads to improved model performance even at affordable scale (~1B parameters).
- Perceiver AR’s sample quality is much less sensitive to the order in which it generates elements. This makes it easy to apply Perceiver AR to settings without a natural left-to-right order, such as data like images, whose structure spans more than one dimension.
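The first point can be made concrete with a toy cost count, shown below. It is not a measurement: the layer/context figures are borrowed from the 60-layer, 8192-context example above, the width `d` and the `attention_flops` helper are hypothetical, and the counting is rough. It illustrates why the budget is adjustable at evaluation time: after the initial cross-attention over the full input, every remaining layer operates only on the N latents, so shrinking N shrinks compute while the input length stays fixed.

```python
# Toy flop count (hypothetical dimensions, rough arithmetic) for one forward pass.
def attention_flops(m_in, n_out, d):
    # One attention layer reading m_in positions and writing n_out positions:
    # roughly the QK^T product plus the weighted sum over values.
    return 2 * n_out * m_in * d

M, d, layers = 8192, 1024, 60
for N in (2048, 1024, 256):
    cross = attention_flops(M, N, d)                        # first layer: latents attend to the full input
    self_stack = (layers - 1) * attention_flops(N, N, d)    # remaining layers: latents only
    print(f"N={N}: ~{(cross + self_stack) / 1e9:.1f} GFLOPs")
```

Fewer latents means less compute per prediction with a graceful drop in quality, and the same trained weights can be run at any of these settings.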
Using a dataset of piano music, we trained Perceiver AR to create new pieces of music from scratch. Because each new note is predicted based on the full sequence of notes that came before it, Perceiver AR is able to produce pieces with a high level of melodic, harmonic, and rhythmic coherence:
Learn more about using Perceiver AR:
- Download the JAX code for training Perceiver AR on GitHub
- Read our paper on arXiv
- Check out our spotlight presentation at ICML 2022
- Check out the Google Magenta post with even more music!