What if you have to search text based on the meaning or similarity of a phrase? A normal inverted-index-based search query won’t help. That’s where semantic search comes into the picture.
What is Semantic Search?
Semantic search is a type of search algorithm that takes into account the meaning of words and phrases used in a search query, as well as the context in which they are used. This is achieved through the use of machine learning algorithms that are able to identify relationships between different pieces of content and derive a semantic understanding of the underlying data.
For example, consider the query “best laptop for gaming”. A traditional search engine would look for web pages that contain these exact words, without considering the meaning behind the words. A semantic search engine, on the other hand, would look for pages that contain information about laptops that are specifically designed for gaming, regardless of whether or not they contain the exact phrase “best laptop for gaming”.
Common Algorithms used for similarity search
Vector databases are becoming increasingly popular for similarity searches because they allow for efficient retrieval and analysis of high-dimensional data. To perform similarity searches in vector databases, a number of algorithms can be used, including the following:
Cosine similarity: Cosine similarity is a commonly used algorithm for vector databases. It calculates the cosine of the angle between two vectors, which represents the similarity between them. Cosine similarity is efficient for high-dimensional data and is widely used in text-based similarity search.
Euclidean distance: Euclidean distance is another commonly used algorithm for similarity search in vector databases. It measures the straight-line distance between two vectors in a multi-dimensional space: the smaller the distance, the more similar the vectors. Euclidean distance is computationally efficient but may not be as effective as cosine similarity for high-dimensional data.
Jaccard similarity: Jaccard similarity is a measure of the similarity between two sets of data. It is used in cases where the data is binary, such as in text analysis where the presence or absence of words is used to create a binary vector. Jaccard similarity can be used in combination with other algorithms such as cosine similarity to improve accuracy.
Locality-sensitive hashing (LSH): LSH is a technique that involves hashing similar vectors into the same hash bucket, which allows for efficient similarity search in large databases. LSH can be used in conjunction with cosine similarity to improve the speed of similarity searches.
Product quantisation: Product quantisation is a technique that involves dividing a vector into multiple subvectors and quantising each subvector separately against a small codebook. This compresses the data substantially and can improve both the speed and the memory footprint of similarity search algorithms.
Approximate nearest neighbor (ANN) search: ANN search is a technique that involves finding the nearest neighbor to a query vector in a vector database, but with some approximation to reduce the computational complexity. ANN search can be used in conjunction with other algorithms such as LSH or product quantisation to further improve the efficiency of similarity searches.
These algorithms are commonly used in vector databases for similarity search, and their effectiveness may depend on the specific characteristics of the data being analysed. The short sketches below illustrate the three similarity measures, LSH, and product quantisation in code.
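To make the three similarity measures concrete, here is a minimal numpy sketch comparing cosine similarity, Euclidean distance, and Jaccard similarity on two binary word vectors (the same vectors that appear in the practical example below):

```python
import numpy as np

a = np.array([1, 1, 0, 1, 1, 1, 0])  # binary bag-of-words vector for sentence 1
b = np.array([1, 0, 1, 1, 1, 0, 1])  # binary bag-of-words vector for sentence 2

# Cosine similarity: cosine of the angle between the vectors (1.0 = same direction)
cosine = a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Euclidean distance: straight-line distance (smaller means more similar)
euclidean = np.linalg.norm(a - b)

# Jaccard similarity: |intersection| / |union| of the two binary sets
jaccard = np.logical_and(a, b).sum() / np.logical_or(a, b).sum()

print(f"cosine={cosine:.2f}, euclidean={euclidean:.2f}, jaccard={jaccard:.2f}")
# cosine=0.60, euclidean=2.00, jaccard=0.43
```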
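LSH comes in several flavours; the toy sketch below shows the random-hyperplane variant for cosine similarity, with the number of hyperplanes and the vectors made up for illustration. Vectors whose bit signatures match land in the same bucket, so only those need an exact comparison at query time:

```python
import numpy as np

rng = np.random.default_rng(42)

def lsh_signature(vec, hyperplanes):
    """Hash a vector to a bit signature: one bit per random hyperplane."""
    bits = (hyperplanes @ vec) >= 0  # which side of each hyperplane?
    return tuple(bits.tolist())

dim, n_bits = 7, 4
hyperplanes = rng.normal(size=(n_bits, dim))  # random hyperplanes through the origin

a = np.array([1, 1, 0, 1, 1, 1, 0], dtype=float)
b = np.array([1, 0, 1, 1, 1, 0, 1], dtype=float)

# Bucket vectors by signature; similar vectors tend to share a bucket.
buckets = {}
for name, vec in [("sentence 1", a), ("sentence 2", b)]:
    buckets.setdefault(lsh_signature(vec, hyperplanes), []).append(name)
print(buckets)
```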
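Product quantisation can likewise be sketched in a few lines; the version below uses scikit-learn’s KMeans to learn one small codebook per subvector, with all sizes chosen purely for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 8)).astype(np.float32)  # 1,000 vectors, 8 dimensions

n_subvectors, n_centroids = 2, 16                     # illustrative sizes
sub_dim = data.shape[1] // n_subvectors

codebooks, codes = [], []
for i in range(n_subvectors):
    sub = data[:, i * sub_dim:(i + 1) * sub_dim]      # one 4-dim slice of every vector
    km = KMeans(n_clusters=n_centroids, n_init=10, random_state=0).fit(sub)
    codebooks.append(km.cluster_centers_)             # codebook for this slice
    codes.append(km.labels_)                          # each slice collapses to a small code

codes = np.stack(codes, axis=1)  # compressed dataset: shape (1000, 2), tiny integers

# Reconstruct an approximation of the first vector from its codes.
approx = np.hstack([codebooks[i][codes[0, i]] for i in range(n_subvectors)])
print(data[0].round(2))
print(approx.round(2))
```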
A practical example
For instance, let’s consider two sentences: “The cat sat on the mat” and “The dog sat on the rug.”
Here, we can use a bag-of-words approach, where each element in the vector marks whether a specific vocabulary word appears in the sentence.
Bag of words (vocabulary): [the, cat, dog, sat, on, mat, rug]
We can represent each sentence as a binary vector over this vocabulary, with 1 for a word that is present and 0 for one that is absent:
Sentence 1: “The cat sat on the mat” → [1, 1, 0, 1, 1, 1, 0]
Sentence 2: “The dog sat on the rug” → [1, 0, 1, 1, 1, 0, 1]
To calculate the cosine similarity between these two sentences, we first take the dot product of the vectors:
dot_product = (1 × 1) + (1 × 0) + (0 × 1) + (1 × 1) + (1 × 1) + (1 × 0) + (0 × 1) = 3
Next, we calculate the magnitude of each vector:
magnitude₁ = √(1² + 1² + 0² + 1² + 1² + 1² + 0²) = √5
magnitude₂ = √(1² + 0² + 1² + 1² + 1² + 0² + 1²) = √5
Finally, we calculate the cosine similarity as the dot product divided by the product of the magnitudes:
cosine_similarity = dot_product / (magnitude₁ × magnitude₂) = 3 / (√5 × √5) = 3/5 = 0.6
Therefore, we can conclude that these two sentences have a similarity score of 0.6 based on their vector representation.
By comparison, the cosine similarity between “The cat sat on the mat” and “The cat sat on the rug”, computed the same way, works out to 0.8: the more words two sentences share, the higher the score.
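This arithmetic is easy to verify in code; here is a minimal numpy sketch of the example above:

```python
import numpy as np

vocab = ["the", "cat", "dog", "sat", "on", "mat", "rug"]

def to_vector(sentence):
    """Binary bag-of-words: 1 if a vocabulary word appears in the sentence, else 0."""
    words = set(sentence.lower().split())
    return np.array([1 if word in words else 0 for word in vocab])

def cosine_similarity(u, v):
    return u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v))

s1 = to_vector("The cat sat on the mat")
s2 = to_vector("The dog sat on the rug")
s3 = to_vector("The cat sat on the rug")

print(round(cosine_similarity(s1, s2), 2))  # 0.6
print(round(cosine_similarity(s1, s3), 2))  # 0.8
```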
How are Vector Databases used in Semantic Search?
Vector databases are used in semantic search to store and retrieve high-dimensional data, such as text and images. In a vector database, each piece of data is represented as a vector in a high-dimensional space. These vectors can then be used to compare the similarity between different pieces of data.
For example, in a text-based search engine, each document is represented as a vector in a high-dimensional space, where each dimension represents a unique word in the document. When a user enters a search query, the query is also represented as a vector in the same high-dimensional space. The similarity between the query vector and the document vectors can then be calculated, and the most similar documents can be returned as search results.
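In a production system those vectors would come from a learned embedding model; to keep this sketch self-contained, the stand-in below uses scikit-learn’s TfidfVectorizer (which only captures word overlap, not meaning) to project a few documents and a query into the same space and rank the documents by cosine similarity. The store-and-rank mechanics are the same whichever model produces the vectors:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Gaming laptops with dedicated graphics cards",
    "How to bake sourdough bread at home",
    "Top graphics cards for high frame rates",
]

# Fit on the documents, then project the query into the same vector space.
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)
query_vector = vectorizer.transform(["best laptop for gaming"])

# Rank the documents by cosine similarity to the query.
scores = cosine_similarity(query_vector, doc_vectors).ravel()
for score, doc in sorted(zip(scores, documents), reverse=True):
    print(f"{score:.2f}  {doc}")  # the gaming-laptops document ranks first
```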
Common use cases for vector databases
- Recommendation systems: Vector databases can be used to store and query user profiles and product embeddings, enabling fast and accurate product recommendations based on user preferences and behaviour (see the sketch after this list).
- Image and video search: Vector databases can be used to store and search for image and video embeddings, allowing for fast and accurate search results.
- Natural language processing: Vector databases can be used to store and search for word embeddings, enabling efficient querying of large text datasets.
- Fraud detection: Vector databases can be used to store and search for embeddings of fraudulent behaviour patterns, allowing for efficient identification of fraudulent transactions or behaviour.
- Anomaly detection: Vector databases can be used to store and search for embeddings of normal behaviour patterns, allowing for efficient identification of anomalies or outliers.
- Biometric identification: Vector databases can be used to store and search for embeddings of biometric data, such as facial recognition or fingerprint data.
- Voice recognition: Vector databases can be used to store and search for embeddings of voice data, allowing for efficient voice recognition and transcription.
- Product categorisation: Vector databases can be used to store and search for embeddings of product features, allowing for efficient categorisation and organisation of product data.
- Personalised advertising: Vector databases can be used to store and query user profiles and ad embeddings, enabling personalised ad targeting based on user preferences and behaviour.
These are just a few examples of the many possible use cases for vector databases. As machine learning and artificial intelligence continue to advance, it’s likely that vector databases will become even more important for a wide range of applications.
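As an illustration of the recommendation use case from the list above, here is a toy sketch: a handful of made-up product embeddings stand in for the vector database, and products are ranked by cosine similarity to a user vector derived from past behaviour:

```python
import numpy as np

# Made-up embeddings standing in for vectors stored in a vector database.
products = {
    "gaming laptop":       np.array([0.9, 0.8, 0.1]),
    "mechanical keyboard": np.array([0.6, 0.9, 0.3]),
    "office chair":        np.array([0.1, 0.2, 0.9]),
}
user = np.array([0.85, 0.7, 0.15])  # the user's taste vector

def cosine(u, v):
    return u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Recommend the products closest to the user's taste vector.
ranked = sorted(products, key=lambda name: cosine(user, products[name]), reverse=True)
print(ranked)  # ['gaming laptop', 'mechanical keyboard', 'office chair']
```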
Popular Vector Databases
There are several popular vector databases available today, each with its own unique features and capabilities. Let’s take a closer look at some of the most popular ones:
- Elasticsearch: Elasticsearch is a distributed search and analytics engine that can be used for text, numeric, and geospatial data. It also supports vector similarity search through dense vector fields and k-nearest neighbour (kNN) search. Elasticsearch can be used to build scalable search and recommendation systems.
- Faiss: Faiss is a library for efficient similarity search and clustering of dense vectors. It provides several algorithms for nearest neighbor search, including brute-force search, k-means clustering, and hierarchical navigable small world (HNSW) graphs. Faiss is widely used for applications such as image and audio search; a short usage sketch follows this list.
- Milvus: Milvus is an open-source vector database that supports efficient storage and retrieval of large-scale vector data. It includes support for approximate nearest neighbor search algorithms such as HNSW and IVFADC. Milvus is designed to be highly scalable and can be used for a wide range of applications such as e-commerce, recommendation systems, and image and video search.
- Weaviate: Weaviate is a cloud-native, real-time vector search engine that is designed to handle large-scale vector data. It includes support for natural language queries and enables users to build complex search queries using its GraphQL API. Weaviate is ideal for applications such as chatbots, virtual assistants, and recommendation systems.
- Pinecone: Pinecone is a managed vector search service that enables users to build real-time applications that require vector similarity search. It includes support for indexing and querying high-dimensional vector data and includes several built-in features such as filtering, live index updates, and horizontal scaling.
- Qdrant: Qdrant is a vector similarity search engine that includes support for extended filtering and aggregation features. It uses the HNSW algorithm for approximate nearest neighbour search. Qdrant is designed to be highly scalable and can be used for a wide range of applications such as e-commerce, recommendation systems, and image and video search.
- Vespa: Vespa is an open-source, scalable search and recommendation engine that includes support for vector similarity search using the HNSW algorithm for approximate nearest neighbour search. Vespa is designed to handle large-scale data and is ideal for applications such as e-commerce, recommendation systems, and content personalisation.
- Vald: Vald is a highly scalable distributed vector search engine built on the NGT approximate nearest neighbour library. It is designed to handle large-scale vector data and can be used for a wide range of applications such as image and video search, recommendation systems, and natural language processing.
- ScaNN: ScaNN (Scalable Nearest Neighbors) is a library from Google Research for efficient vector similarity search at scale, built on techniques such as partitioning and anisotropic vector quantisation. ScaNN is designed to be highly scalable and can be used for a wide range of applications such as e-commerce, recommendation systems, and content personalisation.
- pgvector: pgvector is an open-source vector similarity search extension for PostgreSQL. It supports exact and approximate nearest neighbour search, with index types such as IVFFlat and HNSW. pgvector can be used for a wide range of applications such as image and video search and recommendation systems.
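To give a flavour of how these tools are used, here is a minimal Faiss sketch with random vectors for illustration; it builds the exact (brute-force) IndexFlatL2 index, though the approximate index types mentioned above are the usual choice for larger collections:

```python
import numpy as np
import faiss  # pip install faiss-cpu

dim = 64
rng = np.random.default_rng(0)
vectors = rng.random((10_000, dim)).astype("float32")  # database vectors
query = rng.random((1, dim)).astype("float32")

# Exact L2 index; Faiss also provides approximate indexes such as
# IndexHNSWFlat and IndexIVFPQ for larger datasets.
index = faiss.IndexFlatL2(dim)
index.add(vectors)

distances, ids = index.search(query, 5)  # 5 nearest neighbours
print(ids[0], distances[0])
```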
Conclusion
Overall, vector databases are a powerful technology that can provide significant benefits for organisations working with high-dimensional data. As machine learning continues to advance, vector databases are likely to become increasingly important for a wide range of applications.