Increasingly, the AI industry is moving toward generative AI models with longer context windows. But models with large context windows tend to be computationally intensive. Or Dagan, head of product at AI startup AI21 Labs, asserts that doesn’t have to be the case, and his company is releasing a production model to prove it.
Contexts, or context windows, refer to the input data (e.g. text) that a model considers before generating output (more text). Models with small context windows tend to forget the content of even very recent conversations, while models with larger contexts avoid this pitfall and, as an added bonus, better grasp the flow of the data they take in.
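To make that “forgetting” concrete, here’s a minimal Python sketch of what happens when a conversation outgrows a small window: only the most recent messages that fit are ever passed to the model. The whitespace token count and the fit_to_context helper are illustrative stand-ins, not any real model’s API.

```python
# Minimal sketch of why small context windows "forget": only the most
# recent input that fits in the window ever reaches the model.
# Whitespace splitting stands in for a real subword tokenizer here.

def fit_to_context(messages, window_tokens):
    kept, used = [], 0
    for msg in reversed(messages):   # walk backward from the newest message
        n = len(msg.split())         # crude stand-in for a token count
        if used + n > window_tokens:
            break                    # everything older gets dropped
        kept.append(msg)
        used += n
    return list(reversed(kept))

history = [
    "first message laying out the project scope",
    "a long middle message " * 20,
    "latest question that refers back to the scope",
]
# With a small window, the model never sees the original scope.
print(fit_to_context(history, window_tokens=30))
```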
AI21 Labs’ Jamba, a new text generation and analysis model, can perform many of the same tasks that models such as OpenAI’s ChatGPT and Google’s Gemini do. Trained on a combination of public and proprietary data, Jamba can write text in English, French, Spanish and Portuguese.
Jamba can handle up to 140,000 tokens while running on a single GPU with at least 80GB of memory (such as a high-end Nvidia A100). That translates to about 105,000 words or 210 pages — a decent-sized novel.
Meta’s Llama 2, by comparison, has a 32,000-token context window, on the smaller side by today’s standards, but only requires a GPU with ~12GB of memory to run. (Context windows are typically measured in tokens, which are chunks of raw text and other data.)
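Those word and page figures rest on rough conversion rules. A quick back-of-the-envelope check, assuming roughly 0.75 English words per token and about 500 words per page (common approximations, not numbers from AI21), reproduces them:

```python
# Back-of-the-envelope check of the figures above, using the common
# rules of thumb of ~0.75 English words per token and ~500 words per
# page (approximations, not numbers from AI21).
WORDS_PER_TOKEN = 0.75
WORDS_PER_PAGE = 500

jamba_tokens = 140_000
words = jamba_tokens * WORDS_PER_TOKEN   # 105,000 words
pages = words / WORDS_PER_PAGE           # 210 pages
print(f"{words:,.0f} words, {pages:,.0f} pages")
```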
On its face, Jamba is unremarkable. There are many freely available, downloadable AI models, from Databricks’ recently released DBRX to the aforementioned Llama 2.
But what makes Jamba unique is what’s under the hood. It uses a combination of two model architectures: transformers and state space models (SSMs).
Transformers are the architecture of choice for complex reasoning tasks, powering models such as GPT-4 and Google’s Gemini, for example. They have many unique features, but the defining characteristic of transformers is their “attention mechanism”: for each piece of input data (e.g. a sentence), transformers weigh the relevance of every other input (other sentences) and draw from them to produce the output (a new sentence).
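As a rough illustration of that mechanism, here is a minimal NumPy sketch of scaled dot-product self-attention, the core operation inside transformer layers (learned query, key and value projections are omitted for brevity). It also shows why long contexts get expensive: the score matrix grows quadratically with input length.

```python
import numpy as np

def self_attention(x):
    """Scaled dot-product self-attention over a (positions, dims) array."""
    # Relevance scores: how much each position attends to every other one.
    scores = x @ x.T / np.sqrt(x.shape[-1])
    # Softmax turns scores into weights that sum to 1 for each position.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # The output mixes all inputs, weighted by relevance.
    return weights @ x

# Toy example: 4 input positions with 8-dimensional embeddings.
x = np.random.default_rng(0).normal(size=(4, 8))
print(self_attention(x).shape)  # (4, 8)
# The scores matrix is (positions x positions), so compute and memory
# grow quadratically with input length.
```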
SSMs, on the other hand, combine several qualities of older types of AI models, such as recurrent neural networks and convolutional neural networks, to create a more computationally efficient architecture capable of handling long sequences of data.
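That efficiency comes from the recurrent form. A minimal sketch of a discrete linear state space model, with toy matrices that are not drawn from Mamba or Jamba, shows the key property: a fixed-size state is updated once per input, so cost grows linearly with sequence length.

```python
import numpy as np

def ssm_scan(A, B, C, u):
    """Discrete linear SSM: x_t = A @ x_{t-1} + B * u_t, y_t = C @ x_t."""
    x = np.zeros(A.shape[0])      # fixed-size state summarizing the past
    ys = []
    for u_t in u:                 # one cheap update per input step
        x = A @ x + B * u_t
        ys.append(C @ x)
    return np.array(ys)

# Toy values: a 2-dimensional state reading a 1-D input sequence.
A = np.array([[0.9, 0.1],
              [0.0, 0.8]])        # state transition (decaying memory)
B = np.array([1.0, 0.5])          # input projection
C = np.array([0.5, 0.5])          # output readout
y = ssm_scan(A, B, C, np.sin(np.linspace(0, 3, 10)))
print(y.shape)  # (10,) -- cost grows linearly with sequence length
```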
SSMs have their limitations. But some early incarnations, including an open source model called Mamba from Princeton and Carnegie Mellon researchers, can handle larger inputs than their transformer-based equivalents while outperforming them on language generation tasks.
Jamba in fact uses Mamba as part of its core model, and Dagan claims it delivers three times the throughput on long contexts compared to transformer-based models of comparable size.
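Schematically, a hybrid of this kind might interleave a few attention layers among mostly SSM-style layers. The sketch below is purely illustrative; the stubbed layer internals and the one-attention-layer-in-four ratio are assumptions, not AI21’s published design.

```python
import numpy as np

# Purely illustrative: interleave occasional attention layers among
# mostly SSM-style layers. Layer internals are stubbed, and the 1-in-4
# ratio is an assumption, not AI21's published design.

def ssm_layer(x):
    # Linear-time recurrence over the sequence (stubbed as a running mean).
    return np.cumsum(x, axis=0) / np.arange(1, len(x) + 1)[:, None]

def attention_layer(x):
    # Quadratic-time all-pairs mixing (same toy self-attention as above).
    scores = x @ x.T / np.sqrt(x.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ x

def hybrid_forward(x, num_layers=8, attention_every=4):
    for i in range(num_layers):
        layer = attention_layer if i % attention_every == 0 else ssm_layer
        x = layer(x)
    return x

x = np.random.default_rng(0).normal(size=(16, 32))  # 16 tokens, 32 dims
print(hybrid_forward(x).shape)  # (16, 32)
```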
“While there are a few initial academic examples of SSM models, this is the first commercial-grade, production-scale model,” Dagan told TechCrunch. “This architecture, in addition to being innovative and interesting for further research by the community, opens up great efficiency and throughput possibilities.”
Now, while Jamba has been released under the Apache 2.0 license, an open source license with relatively few usage restrictions, Dagan stresses that it’s a research release not intended for commercial use. The model has no safeguards to prevent it from generating toxic text, nor mitigations to address potential bias; a fine-tuned, ostensibly “safer” version will be made available in the coming weeks.
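For anyone who wants to experiment with the research release, loading it should look much like any other Hugging Face checkpoint. The sketch below assumes the transformers library and a repo ID of ai21labs/Jamba-v0.1; check AI21’s model page for the actual ID and hardware requirements.

```python
# Assumes the Hugging Face transformers library; the repo ID below is
# an assumption -- check AI21's model page for the actual checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ai21labs/Jamba-v0.1"  # assumed repo ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",       # use the checkpoint's native precision
    device_map="auto",        # spread weights across available GPUs
    trust_remote_code=True,   # may be needed for custom architectures
)

inputs = tokenizer("In the distant future,", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
# Note: this research release has no safety tuning, so outputs are unfiltered.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```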
But Dagan claims that Jamba demonstrates the promise of the SSM architecture even at this early stage.
“The added value of this model, both because of its size and because of its innovative architecture, is that it can be easily fitted onto a single GPU,” he said. “We believe performance will further improve as Mamba receives additional tweaks.”