Published on 04/07/2023 by Kostas Pardalis
Hey, since SaaS is not in vogue anymore, and free tiers for mailing list software are not VC subsidized, for updates just follow us on Twitter: Kostas & Demetrios
This is a series of articles explaining WTF Vector Databases are.
Articles in this series
It’s obviously a database, right? 😄 but how is it different from whatever you’ve heard until now that is a database? Like MySQL or PostgreSQL?
Let’s start by going through the basics and trust me, by the end of this you will have a much better understanding of WTF is a Vector Database!
WTF is a Vector Database?
It’s a database
I’ll perform some plagiarism here but it’s better to hear from someone who knows much better than me of what a database is.
“01 Course Intro & Relational Model - Intro to database systems (15-445/645)” Andy Pavlo, Carnegie Mellon University.
Vector databases do organize inter-related data that models some aspect of the real-world! They are not a core component of most computer applications yet but maybe if the AI revolution proves its current hype, they might be.
Databases usually come packaged as Database Management Systems (DBMS) you probably have also heard this term already and it’s important to keep in mind the difference between a database and a DBMS.
A set of CSV files in your file system can definitely be a database. It absolutely follows the above definition. It can contain information that is inter-related and that models some aspect of the real-world.
Database to DBMS
But what turns a database into a DBMS is what makes databases hard in general. A DBMS includes functionality for:
- Ensuring Data Integrity
- Data manipulation and access, i.e. add new data
- Durability, i.e. what if the database crashes?
If in the above functionality we also add APIs for generic software to interact with the database for storing and processing data, then we have the definition of what a DBMS is.
Data Models & Databases
Let’s see what CMU-DB and Prof. Pavlo have to say about data models.
“01 Course Intro & Relational Model - Intro to database systems (15-445/645)” Andy Pavlo, Carnegie Mellon University.
And most importantly let’s see some examples of Data models.
“01 Course Intro & Relational Model - Intro to database systems (15-445/645)” Andy Pavlo, Carnegie Mellon University.
The course is about Relational databases but you might have noticed that there’s a mention to vectors in there!
This is important here because it gives us the first concrete definition of what a vector database is.
As we will see it’s pretty easy to add support for vectors in most of the existing relational databases that exist today but what makes vector databases a different breed of databases is the native support they have for specific operations around vectors that are important useful for Machine Learning and AI.
What is a vector?
Lets take a trip down memory lane. Hopefully the following definition rings some good memories from your youth.
This is the definition of a vector that most people have encountered at some point in their life.
If we go a little bit deeper into Wikipedia, we will also find the following general definition of vectors.
To the above definition let’s add the one that refers to what a vector is in computer science.
The above definition refers to “array” but array and vector are used interchangeably.
We will talk more about vector spaces and features and all the cool stuff of AI a bit later but for now the above definitions are what matter.
First, you have to forget what you might think of vectors at school, we are not talking about euclidian vectors here. Magnitude and direction are not important.
What is important is the way we plan to represent the world in our database.
The above gives us the how we want to represent the world and the how to store this information in a way that a machine can process.
Why do we need vectors?
tldr - vectors can allow machines to understand how things like, text, photos and video are related to one another
So far we’ve been a bit too technical and offering definitions that might make things a bit more clear but we haven’t talked at all about why we even care about vectors. What is wrong with whatever traditional relational databases already offer?
It all started with our need to represent rich text documents not just syntactically but also semantically.
The idea is that we can try to represent a document as vectors of identifiers. These vectors now define a document or vector space which happens to also be an algebraic model.
Because of that, we hope that we can use the mathematical tools of algebraic vector spaces to do interesting things like figuring out how similar two documents are!
This idea is not new. Do you know this guy?
By Tim Bray (talk) - I created this work entirely by myself., CC BY-SA 3.0,
In case you don’t, this is Doug Cutting and he’s the author of Apache Lucene that was open sourced in 1999. Lucene is probably the first and most well known library for indexing and searching text. Lucene implements the “vector space model” we talked about.
Vectors and vector spaces are a powerful way to represent information in a way that we can perform search and comparisons beyond what the standard scalar operators allow us to do.
But hopefully you are already wondering why although Lucene and the vector space model concept exists since the 90’s, we care about vector databases today. Also, is Lucene a vector database?
To understand why, we need to talk about a few more things first. But before we do that, let’s summarize.
Embeddings
Since 1999 and Lucene, it took us about another 13 years to do the next step in information retrieval.
Welcome to 2013 and to the work of Tomas Mikolov at Google, called Word2Vec.
Word2Vec is a technique that uses neural networks to learn word associations in from a large corpus of text. These neural networks are generating what is usually called in the NLP literature, word embeddings, which are representations of words.
the representation is a real-valued vector that encodes the meaning of the word in such a way that words that are closer in the vector space are expected to be similar in meaning.
I hope you see how often the term “vector” is being used.
The beauty of these algorithms is that after we have created these embeddings or vectors, we can use mathematical functions like the cosine similarity, to measure the semantic similarity of words.
It’s also important to note that these embeddings or representations are represented as real-valued vectors.
Enter Transformers
Today, Word2Vec is not the state of the art in generating embeddings anymore. Instead we are using Transformers that are deep learning models. Models like GPT are based on Transformers.
Regardless of the model used though, the output remains the same. Our information is represented as a real-valued vector and we can still use math to retrieve semantic information from our data.
let’s put everything together
In 2023 we have some amazing technologies that can take rich information as input, e.g. a novel, and turn it into a new representation which we can query using machines.
To do that, these technologies turn the information into real-valued vectors.
To work with this information we now need efficient systems to store and process these real-valued vectors and do that at scale.
That’s exactly what a vector database is.
Let’s see now what are the unique characteristics of a Vector Database and how one is built.
Written by Kostas Pardalis
← Back to blog