How vector databases can revolutionize our relationship with generative AI
Join top executives in San Francisco on July 11-12, to hear how leaders are integrating and optimizing AI investments for success.
Generative AI has received a lot of attention already this year in the tech world and beyond. Whether it’s ChatGPT’s prose or Stable Diffusion’s art, 2022 provided an insight into the potential for AI to disrupt creative industries.
But behind the headlines, 2022 brought an even more important development in AI: the rise of the vector database.
While their impact is less immediately obvious, the adoption of vector databases could completely upend the way we interact with our devices and dramatically improve our productivity in a vast range of administrative and clerical tasks.
Ultimately, vector databases will be essential infrastructure in bringing about the societal and economic changes promised by AI.
But what is a vector database? To understand that, we have to make sense of the underlying problem it addresses: unstructured data.
The database dilemma
Databases are one of the software industry’s longest-lasting and most resilient verticals. The total spend on databases and database management solutions more than doubled from $38.6B in 2017 to $80B in 2021. And since 2020, databases have only further entrenched their position as one of the most rapidly growing software categories, owing to further digitization following mass shifts to remote working.
However, the modern database is still constrained by a problem that has persisted for decades: the problem of unstructured data. Up to 80% of the data stored globally is unstructured, meaning it has not been formatted, tagged or structured in a way that allows it to be rapidly searched or recalled.
For a simple analogy of structured vs. unstructured data, think of a spreadsheet with multiple columns per row. In this case, a row of “structured data” has all the relevant columns filled in, whereas a row of “unstructured data” does not. In the case of the unstructured entry, it may be that the data has been automatically imported into the first column of the row; someone now needs to break up that cell and populate data into relevant columns.
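As a minimal sketch of this analogy, the Python snippet below takes a row whose data all landed in the first cell and splits it into columns. The column names, the sample row and the comma-separated layout are assumptions made purely for illustration.

```python
# Hypothetical column layout for the spreadsheet in the analogy.
COLUMNS = ["name", "email", "city"]


def structure_row(raw_cell):
    """Split one comma-separated cell into named columns.

    Assumes values appear in the same order as COLUMNS.
    Columns with no value are filled with None.
    """
    parts = [part.strip() for part in raw_cell.split(",")]
    # Pad short rows so every column gets a value.
    parts += [None] * (len(COLUMNS) - len(parts))
    return dict(zip(COLUMNS, parts))


# An "unstructured" entry: everything was imported into a single cell.
unstructured = "Ada Lovelace, ada@example.com, London"
structured = structure_row(unstructured)
print(structured)
```

Real imports rarely split this cleanly, which is why someone usually has to review the result by hand.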
Why is unstructured data a problem? In short, it makes it harder to sort, search, review and use information in a database. However, our understanding of unstructured data is relative to how data is usually structured.
Missing tags or misaligned formatting mean that unstructured entries can be missed in searches, or wrongly included in or excluded from filtered results. This introduces a risk of error to many database operations, which we can only address by reviewing and structuring those entries manually. This doesn’t mean that the data itself necessarily lacks structure; it just requires more manual intervention than our usual means of storing data.
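To make the risk concrete, here is a small Python sketch of a tag-based filter skipping an entry that was imported without tags. The records, field names and tag values are hypothetical.

```python
# Hypothetical records: the second one was imported without any tags.
records = [
    {"id": 1, "text": "Quarterly invoice for cloud hosting", "tags": ["invoice"]},
    {"id": 2, "text": "Invoice: office chairs, paid in March", "tags": []},
    {"id": 3, "text": "Team offsite agenda", "tags": ["meeting"]},
]


def filter_by_tag(rows, tag):
    """Return the ids of rows whose tag list contains the given tag."""
    return [row["id"] for row in rows if tag in row["tags"]]


found = filter_by_tag(records, "invoice")

# Entries that mention invoices in their text but were skipped by the filter.
missed = [
    row["id"]
    for row in records
    if "invoice" in row["text"].lower() and row["id"] not in found
]

print(found)   # ids matched by the tag filter
print(missed)  # invoice-related ids the tag filter skipped
```

The second record is clearly an invoice to a human reader, but the filter only sees the empty tag list.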
We often hear about the burden of manual review with claims such as data scientists spending 80% of their time on data preparation. But in practice, this is something we all do to some extent, or at least live with the effects of. If you’ve had to wrestle with a file explorer to find something on your hard drive or spend lots of time screening out irrelevant search engine results, you’ve likely been hit by the unstructured data problem.
This wasted time on manual formatting, reviewing and filtering is not a new or exclusively digital problem. For example, librarians manually arrange books according to the Dewey Decimal System. The unstructured data problem is just a digital version of a fundamental challenge with every record-keeping task humans have had since we invented writing: We need to classify information to store and use it.
This is where vector databases prove particularly exciting. Rather than relying on distinct categories and lists to organize our records, vector databases instead place them on a map.
Vectors and mapping
Vector databases use a concept in machine learning and deep learning called vector embeddings. Vector embedding is a technique where words or phrases in a text are mapped to high-dimensional vectors, also known as word embeddings. These vectors are learned in such a way that semantically similar words are close together in the vector space.
This representation allows deep neural networks to process textual data more effectively, and has proven very useful in a variety of natural language processing tasks such as text classification, translation and sentiment analysis.
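The idea that similar words sit close together can be sketched with cosine similarity, one common way to compare vectors. The three-dimensional vectors below are made up by hand for illustration; real word embeddings are learned by a model and typically have hundreds of dimensions.

```python
import math

# Toy vectors chosen by hand for illustration only, not learned embeddings.
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "kitten": [0.85, 0.75, 0.2],
    "car": [0.1, 0.2, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


sim_close = cosine_similarity(toy_vectors["cat"], toy_vectors["kitten"])
sim_far = cosine_similarity(toy_vectors["cat"], toy_vectors["car"])

print(f"cat vs kitten: {sim_close:.3f}")
print(f"cat vs car:    {sim_far:.3f}")
```

In this toy setup, "cat" and "kitten" score much closer to 1.0 than "cat" and "car" do.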
In the database context, vector embedding is effectively a numerical representation of a group of properties we want to measure.
To create an embedding, we take a trained machine learning model and instruct it to monitor for those properties in entries in a dataset.
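As a simplified illustration, in the case of a text string the model could be told to log the average word length, sentiment analysis scores, or occurrence of specific words. In practice, the dimensions of a learned embedding are usually not hand-picked properties like these, which is part of why they can be hard to interpret.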
The final embedding takes the form of a series of numbers corresponding to the “scores” logged in the audit of properties. A vector database takes the scores of the vector embeddings and plots them on a graph. Every property we measure in a vector embedding constitutes a dimension of the graph, resulting in it usually having many more than the three dimensions we can conventionally visualize.
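The "scores" picture above can be sketched in a few lines of Python. This is a deliberately simplified, hand-built embedding with three dimensions: average word length, a crude sentiment score and a keyword count. The word lists, the keyword and the function name `embed` are all assumptions for illustration. Production embedding models learn their dimensions from data rather than using hand-written rules like these.

```python
# Hypothetical word lists for a crude sentiment score.
POSITIVE = {"great", "love", "excellent"}
NEGATIVE = {"bad", "hate", "poor"}
# Hypothetical keyword whose occurrences we count.
KEYWORD = "database"


def embed(text):
    """Turn a text string into a 3-number vector of hand-picked properties.

    Dimensions: [average word length, sentiment score, keyword count].
    """
    words = [w.strip(".,!?").lower() for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return [0.0, 0.0, 0.0]
    avg_len = sum(len(w) for w in words) / len(words)
    # +1 for each positive word, -1 for each negative word, scaled by length.
    sentiment = sum((w in POSITIVE) - (w in NEGATIVE) for w in words) / len(words)
    keyword_count = float(words.count(KEYWORD))
    return [avg_len, sentiment, keyword_count]


vector = embed("I love this database")
print(vector)
```

Each entry in the returned list is one dimension of the "graph", so this toy embedding places every text at a point in three-dimensional space.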
With all this information plotted, we can still calculate how “far” away any one embedding is from another embedding in the same way we can in any other graph. Perhaps more importantly, we can engage in a novel way of searching data. By generating a vector embedding of an inputted search query, we plot a point on the graph we want to target. Then, we can discover the embeddings that are the nearest to our search point.
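A minimal sketch of this kind of search, assuming embeddings are already available as plain lists of numbers: the query is compared against every stored vector by Euclidean distance and the closest entries are returned. The document names and vectors are hypothetical, and this brute-force scan is only for illustration. Vector databases generally rely on specialized indexes so they do not have to compare the query against every entry.

```python
import math

# Hypothetical stored embeddings, keyed by document name.
store = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
    "doc_d": [0.0, 0.0, 1.0],
}


def nearest(query, embeddings, k=2):
    """Return the names of the k embeddings closest to the query vector."""
    ranked = sorted(
        embeddings,
        key=lambda name: math.dist(query, embeddings[name]),
    )
    return ranked[:k]


# The embedding of the search query is just another point in the same space.
query = [0.8, 0.2, 0.0]
top = nearest(query, store, k=2)
print(top)
```

Note that no tags or keywords are involved: the ranking depends only on where the points sit relative to the query.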
Vector embeddings are not a perfect solution for everything. They are typically learned in an unsupervised manner, making it difficult to interpret their meaning and how they contribute to the overall model performance. Pre-trained embeddings can also contain biases present in the training data, such as gender, racial or political biases, which can negatively impact model performance.
The potential of vector search
A vector database doesn’t rely on tags, labels, metadata or other tools typically used to structure data. Instead, because a vector embedding can track any property we deem relevant, vector databases allow us to obtain search results based on overall similarity.
Whereas current searches of unstructured data involve manual reviewing and interpreting, vector databases will allow searches to actually reflect the meaning behind our queries rather than superficial properties like keywords.
This change stands to revolutionize data handling, record-keeping and most administrative and clerical work. Because they reduce “false positive” search results and the need to pre-screen and format queries for a system, vector databases can dramatically boost the productivity and efficiency of just about any job in the knowledge economy.
Aside from gains in administrative productivity, these advanced search capabilities will allow us to rely on databases to engage more effectively with creative and open-ended queries.
This is an ideal complement to the rise of generative AI. Because vector databases reduce the need to structure data, we can substantially speed up training times for generative AI models by automating much of the work around processing unstructured data for training and production.
As a result, many organizations can simply import their unstructured data into a vector database and tell it what properties they want to be measured in their embeddings. With those embeddings generated, an organization can rapidly train and deploy a generative model by simply letting it search the vector database to gather information for tasks.
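The pattern of letting a generative model search a vector store for information is often called retrieval-augmented generation. The sketch below shows only the retrieval half, under loud assumptions: the vocabulary, documents and helper names are hypothetical, the "embedding" is a simple word count rather than a learned model, and no generative model is actually called. The output is a prompt that could then be passed to one.

```python
import math

# Hypothetical vocabulary: each term is one dimension of the toy embedding.
VOCAB = ["refund", "shipping", "password", "invoice"]

# Hypothetical unstructured documents an organization might import.
DOCUMENTS = [
    "Refund requests are processed within five days.",
    "Shipping takes two to four business days.",
    "Reset your password from the account settings page.",
]


def embed(text):
    """Toy embedding: count how often each vocabulary term appears."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return [float(words.count(term)) for term in VOCAB]


def cosine(a, b):
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Stand-in for the vector database: each document stored with its embedding.
INDEX = [(doc, embed(doc)) for doc in DOCUMENTS]


def retrieve(question, k=1):
    """Return the k stored documents most similar to the question."""
    q = embed(question)
    ranked = sorted(INDEX, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]


def build_prompt(question):
    """Combine retrieved context and the question into one prompt string."""
    context = "\n".join(retrieve(question))
    return f"Context:\n{context}\n\nQuestion: {question}"


prompt = build_prompt("How do I reset my password?")
print(prompt)
```

The design choice here is that the documents are never tagged or categorized; the only preparation step is computing an embedding for each one.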
The vector database is set to dramatically improve our productivity and revolutionize how we field queries to computers. Altogether, this makes vector databases one of the most important emergent technologies of the coming decade.
Rick Hao is a partner at Speedinvest.