What is vector search and why might you need it?

Vector search and vector databases have been in the news more often recently. If you haven’t used them you might be wondering what they are and why you might care.

This post gives a high level overview of what vector search is and why you might use it. It covers use-cases but not implementation.

Vector search is increasingly common. Google appears to have switched over to vector search entirely. Machine learning approaches to search, recommendation and personalization increasingly use vector representations of things rather than explicit features.

Previous-generation search engines represent documents primarily by the text they contain, and perform various processing steps on that text to improve search performance: preprocessing, stemming, lemmatization, query expansion, synonym lists etc.

All of these are attempts to solve the problem of synonymy and to generate a good representation of the underlying meaning of a text. How do you match two pieces of text when the meaning is the same but the words don’t match exactly? For example, if I search for car, but a document contains automobile.

In addition to being a weak approach to synonymy, representing a document as a list of the tokens it contains (and possibly a list of other tokens that have similar meaning) is rather shallow. It works well enough for some queries, but it doesn’t really capture the underlying meaning of a document. It’s pattern matching on a large scale.

Since deep learning has become popular we’ve started to represent things using embeddings, rather than explicit features.

Embeddings

An embedding is a learned set of vectors. I’ll use the terms embedding, vector and vector set interchangeably in this post; when I refer to vectors or vector sets I’m referring to learned embedding vectors. 1

For text that means we now represent tokens as vectors of floats, rather than using the raw strings. The vectors themselves are learned from the cooccurrence of text in large corpora. A document is then represented by combining the vectors for each token to get a document vector.
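
A minimal sketch of that combination step, assuming we already have a lookup table of token vectors (the tokens and values here are made up for illustration; a common simple choice is to average the token vectors):

```python
import numpy as np

# Hypothetical pre-trained token vectors (tiny 4-dimensional toy example;
# real embeddings are typically 100-1000 dimensions).
token_vectors = {
    "the":  np.array([0.1, 0.0, 0.2, 0.1]),
    "car":  np.array([0.9, 0.8, 0.1, 0.0]),
    "is":   np.array([0.0, 0.1, 0.1, 0.2]),
    "fast": np.array([0.7, 0.2, 0.6, 0.1]),
}

def document_vector(tokens):
    """Mean-pool the token vectors to get one vector for the document."""
    vecs = [token_vectors[t] for t in tokens if t in token_vectors]
    return np.mean(vecs, axis=0)

doc = document_vector(["the", "car", "is", "fast"])
```

Mean pooling is only one option; weighted averages (e.g. by TF-IDF) and learned pooling are also common.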

One of the first commonly used approaches to learning these vectors was word2vec. Word2vec learns a vector for each token in its training data based on how often words co-occur. Words that are often used in the same context end up in the same area of the resulting vector space. Here’s a good explanation of how word2vec works.

This fastText tutorial shows how to train a word embedding on text from wikipedia 2.

What you end up with at the end of this process is a set of token vectors, where each token in the vocabulary is represented by a 100-dimensional vector of floats. These vectors represent points in a 100-dimensional space, and we’ve learned - based on word occurrence in the training data - where to place each token in this space, such that words that occur in similar contexts tend to end up close together in that vector space.
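
“Close together” is usually measured with cosine similarity. A toy sketch (these vectors are invented to illustrate the idea, not learned):

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for learned embeddings: "car" and "automobile"
# have been placed close together, "banana" far away.
vocab = {
    "car":        np.array([0.90, 0.80, 0.10]),
    "automobile": np.array([0.85, 0.75, 0.15]),
    "banana":     np.array([0.10, 0.20, 0.90]),
}

sims = {w: cosine(vocab["car"], v) for w, v in vocab.items()}
nearest = max((w for w in sims if w != "car"), key=sims.get)
```

With real learned embeddings, this is exactly how you’d find a word’s nearest neighbours in the vector space.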

Word2vec wasn’t just used to learn vectors for text tokens. A similar approach can be used to learn vectors for other kinds of sequential data. So people have used it to learn vectors for paragraphs and documents, graphs, products, customers, searches etc.

More recently the advent of language models such as BERT and its successors has led to an explosion of vector representations of text. These models produce richer vector representations than those produced by word2vec et al. Most importantly, they produce contextual vectors. Each word doesn’t get a single vector; rather, the model produces different vectors for a word depending on the context in which it was used. This gives a deeper, richer vector representation of text than we would get with previous approaches. When using a BERT model, you would pass in a token and its surrounding context and BERT will produce a 768-dimensional vector of floats that represents that token in that context 3.

Is vector search better?

Vector representations of documents are richer than what went before. They capture more of the semantic meaning of the underlying document. Searches using vectors built from language models are much closer to searching on the semantic meaning of the query, rather than matching against the raw tokens of the query and documents. So for a lot of queries they perform much better than previous approaches.

However, there are still lots of queries that language models don’t do a good job on. Sometimes exact matching is important. Sometimes the user wants to restrict results to those matching some particular token. In these cases semantic search alone isn’t good enough. A good semantic search approach also needs to support falling back to exact-match filtering. This doesn’t mean reverting to the previous text representations, but rather supplementing vector search with exact-match filtering. E.g. search using vectors, but limit candidates to those with an exact match on some other field. 4

Searching through vector collections

So now companies using ML are representing everything with vectors - documents, users, sessions, products etc. Everything is a vector. These vectors are used for search, recommendations, personalization. These vector representations become building blocks that are reused as input into various ML systems.

One of the new problems that arises from working with these vectors is: how do you efficiently search through a set of vectors?

Usually you have a vector (representing something, e.g. a product, or a query, or a user) and you want to find k-most-similar vectors. When you have a small set of vectors you can brute-force search through the entire collection to find the most similar vectors. But that gets slow quickly as the number of vectors increases.
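
The brute-force version is straightforward, which is why it’s the natural starting point. A sketch with random data standing in for real item vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(10_000, 64)).astype("float32")  # the collection
query = vectors[42]  # pretend this is the query vector

# Brute force: score every vector against the query, then take the k best.
# Cosine similarity, computed as normalised dot products.
norms = np.linalg.norm(vectors, axis=1)
scores = vectors @ query / (norms * np.linalg.norm(query))

k = 5
top_k = np.argsort(-scores)[:k]  # O(n log n) over the whole collection
```

Every query touches every vector, so cost grows linearly with collection size (and the sort adds more on top) - fine for thousands of vectors, painful for hundreds of millions.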

As your vector set gets bigger, you need something better to maintain decent query performance. Commonly, Approximate Nearest Neighbour (ANN) algorithms are used.

ANN algorithms

ANN algorithms trade off search time against accuracy. They generally work by building an index that groups similar vectors into buckets, then searching only the most promising buckets and skipping the rest, instead of searching through all the vectors. The more “approximate” your search is, the faster, and the less accurate, it is. This is fine for many applications.
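
One way to build intuition for the bucketing idea is random-projection hashing, the basis of the LSH family of ANN methods (real systems like HNSW work quite differently; this is just an illustrative sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
vectors = rng.normal(size=(5_000, 32)).astype("float32")

# Hash each vector by the sign of its dot product with a few random
# hyperplanes; nearby vectors tend to land in the same bucket.
planes = rng.normal(size=(8, 32))

def bucket_of(v):
    return tuple((planes @ v) > 0)

buckets = {}
for i, v in enumerate(vectors):
    buckets.setdefault(bucket_of(v), []).append(i)

# At query time we brute-force only the query's own bucket rather than
# the whole collection - approximate, but much cheaper.
query = vectors[7]
candidates = buckets[bucket_of(query)]
```

The approximation shows up as missed neighbours: a true nearest neighbour that hashed into a different bucket will never be returned, which is exactly the accuracy-for-speed trade-off described above.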

The most commonly used ANN algorithm is HNSW. It’s implemented by almost all of the vector search engines. Google’s Vertex AI Matching Engine uses a newer approach, ScaNN, and is substantially faster than HNSW on large vector sets.

There are several libraries that implement ANN for vector search. You can see comparisons of various approaches at ANN Benchmarks. Some commonly used ones are faiss, annoy and nmslib.

These allow you to build an index from a set of vectors and run fast nearest-neighbours search against it.

For many applications, using one of these libraries is enough. E.g. if you were building a simple recommendation system you might learn a vector for each item, create a faiss index from those vectors, then load that index into memory in your application so that you can search it. I’ve used faiss with FastAPI to make vector sets available for querying.

This works well if you want to search across one smallish set of vectors. But it gets awkward as your vector sets grow or you start to deal with multiple different vector sets. You might want to attach metadata to your vectors (e.g. taxonomy information for product vectors, tags for document vectors). Instead of querying the entire collection, you might want to query only the subset of vectors that have some particular attribute (filtered search). For example, in an online store you might want to restrict search to products in a certain category. You also want convenient methods for managing multiple sets of vectors (tables).

This is where vector search engines come in.

Vector search engines

Vector search engines do for vectors what traditional search engines do for text, or what relational databases do for tabular data. They provide ways of indexing large sets of vectors in tables and doing fast querying and retrieval of those vectors, using ANN algorithms for fast retrieval. In addition they often provide features such as attaching attributes to vector items, and doing filtered search using those attributes (where you want to run vector search only across items with particular attributes).

You can find a list of vector search engines at awesome vector search. I’ve tried several. If you want a simple vector search engine that is easy to set up, supports filtered search and has decent performance, I’d recommend Weaviate. Each of the major cloud providers also provides vector search services. I’ve used Vertex AI Matching Engine. It has some limitations (updates are batch rather than immediate, and the filter types for searching are limited) but it’s very fast and can handle very large vector sets without any degradation in performance.

Do you actually need a vector search engine?

Since vector search started gaining prominence, a plethora of dedicated vector search engines have appeared. These solve a very specific set of problems:

  • You have a large set of vectors representing your domain objects and you want to search through them. Or…
  • You have a large set of documents and you want to vectorize a query and then use that vector to search against the vectors representing a collection of documents.

While faiss is widely used, one of its main shortcomings is that it doesn’t support attaching attributes to vectors, so it doesn’t support filtered search. One common use case is searching only the subset of vectors that have some specific attribute. E.g. an online store might let you search for products within a certain part of the taxonomy, or only search through products that are currently in stock etc. You can work around this to a degree if you only have a small number of attributes, by building an index for each attribute value, or by getting more results than you need and filtering the ANN results. But these approaches don’t scale well.
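
The over-fetch-and-filter workaround can be sketched like this (toy numpy data; the `in_stock` attribute and exhaustive scoring stand in for a real attribute store and a real ANN result list):

```python
import numpy as np

rng = np.random.default_rng(2)
product_vectors = rng.normal(size=(1_000, 16))
in_stock = rng.random(1_000) < 0.3  # hypothetical per-product attribute

def filtered_search(query, k, keep):
    """Over-fetch from the ranked results, then post-filter on `keep`."""
    scores = product_vectors @ query
    ranked = np.argsort(-scores)           # stand-in for an ANN result list
    hits = [i for i in ranked if keep[i]]  # drop items failing the filter
    return hits[:k]

query = rng.normal(size=16)
results = filtered_search(query, k=5, keep=in_stock)
```

The weakness is visible in the sketch: if the filter is very selective, you have to over-fetch a large multiple of k to be confident of keeping k matching results, which erodes the speed advantage of the ANN index.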

So if you need filtered search you’ll need to move beyond using a library like faiss. If your vector sets are not large and you don’t need filtered search then using a library like faiss may be enough.

The newer releases of the major search engines include decent support for vector fields.

  • Solr 9 has support for dense vectors that can be indexed and searched using HNSW.
  • Elasticsearch 8 similarly added support for dense vectors and ANN search.
  • Vespa has supported vector fields and HNSW for a while and has the most mature set of search options.

So if you are already using one of these you can now avail of the benefits of vector search without needing an additional dedicated vector search engine.

There is also database support for vector similarity queries in Postgres via pgvector. If you are using vector search for something like finding similar items among items in your database, this might be a good option.

With improving support for vector search among these commonly used tools, the need for a dedicated vector search engine diminishes for many use-cases. Unless you know that you need a dedicated vector search engine, you probably don’t. There are some advantages to them though. E.g. Google’s Vertex AI Matching Engine is substantially faster at querying large vector sets than any of the other options, and something like Weaviate is significantly easier to set up and get started with than Solr/Elastic/Vespa.

Whether a dedicated vector-search engine is appropriate depends a lot on the problem domain, the number of vectors you are dealing with and how well your existing tools support vector search. But you can achieve a lot without a dedicated vector search engine, either by using a library such as faiss or storing your vectors in your existing db or search engine.

Notes

  1. Previous approaches to text representation often did represent text as a vector, but it was using term vectors and one-hot encoding. E.g. a vector where each element represents whether a particular term from the vocabulary is present in a document. 

  2. fastText is similar to word2vec. It’s a library for learning text embeddings, with the additional improvement that it can learn vectors for subwords and use those subword vectors to generate vectors for new words that weren’t seen in the training vocabulary. 

  3. BERT doesn’t actually output an explicit token vector, but there are multiple ways to extract one, e.g. by averaging or concatenating the final layers of the network. 

  4. This is often referred to as filtered search in the context of vector search engines and it is one of the things that distinguishes vector search engines such as Weaviate from ANN libraries such as faiss.