Chang She, previously vice president of engineering at Tubi and a Cloudera veteran, has years of experience building data tools and infrastructure. But when he started working in the AI space, he quickly ran into problems with traditional data infrastructure, problems that prevented him from bringing AI models to production.
“Machine learning engineers and AI researchers are often stuck with a degraded development experience,” Chang told TechCrunch in an interview. “Data infrastructure companies don’t really understand the machine learning data problem at a fundamental level.”
So Chang, who is one of the co-creators of Pandas, the wildly popular Python data science library, teamed up with software engineer Lei Xu to co-found LanceDB.
LanceDB builds the eponymous open source database software, which is designed to support multimodal AI models: models that train on and generate images, videos and more in addition to text. Backed by Y Combinator, LanceDB this month raised $8 million in a seed funding round led by CRV, Essence VC and Swift Ventures, bringing its total raised to $11 million.
“If multimodal AI is critical to your company’s future success, you want your very precise AI team to focus on modeling and bridging AI to business value,” Chang said. “Unfortunately, today, AI teams spend most of their time dealing with low-level data infrastructure details. LanceDB provides the foundation AI teams need so they can be free to focus on what really matters to business value and bring AI products to market much faster than otherwise possible.”
LanceDB is essentially a vector database: a database that stores series of numbers (“vectors”) that encode the meaning of unstructured data (e.g., images, text and so on).
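To make the idea concrete, here is a minimal sketch of the kind of similarity lookup a vector database performs. It is plain Python, not LanceDB's API, and the three-number vectors and their labels are made up for illustration; real embeddings come from a model and typically have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query, index, k=1):
    """Return the k keys in `index` whose vectors are most similar to `query`."""
    scored = [(cosine_similarity(query, vec), key) for key, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [key for _, key in scored[:k]]

# Hypothetical toy "embeddings"; the values are invented for this example.
index = {
    "photo of a cat": [0.9, 0.1, 0.0],
    "photo of a dog": [0.8, 0.3, 0.1],
    "quarterly sales report": [0.0, 0.2, 0.9],
}

query = [0.85, 0.15, 0.05]  # stands in for the embedding of a search query
results = nearest(query, index, k=2)
print(results)
```

This brute-force scan compares the query against every stored vector, which is fine for three items. Production vector databases generally rely on approximate indexes to keep lookups fast across millions or billions of vectors.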
As my colleague Paul Sawers recently wrote, vector databases are having a moment as the AI hype cycle peaks. That's because they're useful for all kinds of AI applications, from content recommendations on e-commerce and social media platforms to reducing hallucinations.
Vector database competition is fierce — see Qdrant, Vespa, Weaviate, Pinecone, and Chroma to name a few vendors (not counting the Big Tech incumbents). So what makes LanceDB unique? Better flexibility, performance and scalability, according to Chang.
For one, Chang says, LanceDB, which is built on top of Apache Arrow, is powered by a custom data format, Lance Format, which is optimized for multimodal AI training and analysis. Lance Format enables LanceDB to handle up to billions of vectors and petabytes of text, images and video, and lets engineers manage various forms of metadata associated with that data.
“Until now, there has never been a system that can bring together training, exploration, search and large-scale data processing,” Chang said. “Lance Format enables AI researchers and engineers to have a single source of truth and lightning-fast performance across their entire AI pipeline. It’s not just about storing vectors.”
LanceDB makes money by selling fully managed versions of its open source software with added features like hardware acceleration and governance controls — and business appears to be going strong. The company’s client list includes text-to-image platform Midjourney, chatbot unicorn Character.ai, self-driving car startup WeRide and Airtable.
Chang insisted, however, that LanceDB’s recent VC backing won’t take his attention away from the open source project, which he says is now seeing about 600,000 downloads a month.
“We wanted to create something that would make it 10x easier for AI teams working with large-scale multimodal data,” he said. “LanceDB offers – and will continue to offer – a very rich set of ecosystem integrations to minimize adoption effort.”