How to Implement Cosine Similarity in Python

DataStax
Nov 30, 2023


By Phil Miesle

Image generated with DALL-E 2

Cosine similarity has many real-world applications, and with embedding vectors we can use it to compare meanings programmatically. Python is one of the most popular languages for data science, and it offers several libraries that make calculating cosine similarity easy. In this article, we’ll discuss how you can implement cosine similarity in Python with the help of the scikit-learn and NumPy libraries.

What is cosine similarity?

Cosine similarity is a measure of similarity between two non-zero vectors in an n-dimensional space. It is used in various applications, such as text analysis and recommendation systems, to determine how similar two vectors are in terms of their direction in the vector space.

Cosine similarity formula

The cosine similarity between two vectors A and B is calculated using the following formula:

Cosine Similarity (A, B) = (A · B) / (||A|| * ||B||)

In this formula, A · B represents the dot product of vectors A and B. This is calculated by multiplying the corresponding components of the two vectors and summing up the results. ||A|| represents the Euclidean norm (magnitude) of vector A, which is the square root of the sum of the squares of its components. It’s calculated as ||A|| = √(A₁² + A₂² + … + Aₙ²). ||B|| represents the Euclidean norm (magnitude) of vector B, calculated in the same way as ||A||.
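As a worked illustration, take two sample three-dimensional vectors, A = [5, 3, 4] and B = [4, 2, 4] (the same values the Python examples in this article use). Substituting them into the formula gives approximately:

```latex
\begin{aligned}
A \cdot B &= (5)(4) + (3)(2) + (4)(4) = 42 \\
\lVert A \rVert &= \sqrt{5^2 + 3^2 + 4^2} = \sqrt{50} \approx 7.071 \\
\lVert B \rVert &= \sqrt{4^2 + 2^2 + 4^2} = \sqrt{36} = 6 \\
\text{Cosine Similarity}(A, B) &= \frac{42}{\sqrt{50} \times 6} \approx 0.990
\end{aligned}
```

A value this close to 1 indicates that the two vectors point in nearly the same direction.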

How to calculate cosine similarity

To calculate cosine similarity, first compute the dot product of the two vectors, then divide it by the product of their magnitudes. The resulting value will be in the range of -1 to 1, where:

  • If the cosine similarity is 1, it means the vectors have the same direction and are perfectly similar.
  • If the cosine similarity is 0, it means the vectors are perpendicular to each other and have no similarity.
  • If the cosine similarity is -1, it means the vectors have opposite directions and are perfectly dissimilar.
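The three cases above can be checked with a small sketch. The helper function and the two-dimensional sample vectors here are illustrative assumptions, chosen only because their directions are easy to see. Floating-point rounding means the results may differ from exactly 1 and -1 in the last decimal places.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Same direction: the second vector is the first scaled by 2
same_direction = cosine_similarity([1, 2], [2, 4])

# Perpendicular: the vectors lie along different axes
perpendicular = cosine_similarity([1, 0], [0, 1])

# Opposite direction: the second vector is the first negated
opposite = cosine_similarity([1, 2], [-1, -2])

print(f"same direction: {same_direction:.3f}")
print(f"perpendicular:  {perpendicular:.3f}")
print(f"opposite:       {opposite:.3f}")
```

Note that scaling a vector does not change the result: cosine similarity depends only on direction, not on magnitude.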

In text analysis, cosine similarity is used to measure the similarity between document vectors, where each document is represented as a vector in a high-dimensional space, with each dimension corresponding to a term or word in the corpus. By calculating the cosine similarity between document vectors, you can determine how similar or dissimilar two documents are to each other.
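To make the document-vector idea concrete, here is a minimal sketch that turns short texts into term-count vectors and compares them. The three sample sentences, the whitespace tokenization, and the helper names (term_vector, cosine_similarity) are all assumptions made for illustration; real text pipelines usually add steps such as stop-word removal or TF-IDF weighting.

```python
import math
from collections import Counter

def term_vector(text, vocabulary):
    """Count how often each vocabulary term appears in the text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

def cosine_similarity(a, b):
    """Return the cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical sample documents
doc_a = "the cat sat on the mat"
doc_b = "the cat lay on the rug"
doc_c = "stock prices fell sharply today"

# One dimension per distinct word across the whole corpus
vocabulary = sorted(set(" ".join([doc_a, doc_b, doc_c]).lower().split()))

vec_a = term_vector(doc_a, vocabulary)
vec_b = term_vector(doc_b, vocabulary)
vec_c = term_vector(doc_c, vocabulary)

sim_ab = cosine_similarity(vec_a, vec_b)  # documents that share several words
sim_ac = cosine_similarity(vec_a, vec_c)  # documents that share no words

print(f"doc_a vs doc_b: {sim_ab:.2f}")
print(f"doc_a vs doc_c: {sim_ac:.2f}")
```

The first pair shares the words "the", "cat", and "on", so it scores about 0.75, while the second pair has no words in common and scores 0.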

Libraries for cosine similarity calculation

  • NumPy: Great for numerical operations, and it’s optimized for speed.
  • scikit-learn: Offers various machine learning algorithms and includes a method for cosine similarity in its metrics package.

The following examples show how cosine similarity can be calculated using Python. We’ll use two example book review vectors, [5, 3, 4] and [4, 2, 4].

Straight Python

You can work this out by hand, but of course a computer can do it for you. Here is how you can compute cosine similarity using Python with no additional libraries:

A = [5, 3, 4]
B = [4, 2, 4]

# Calculate dot product
dot_product = sum(a*b for a, b in zip(A, B))

# Calculate the magnitude of each vector
magnitude_A = sum(a*a for a in A)**0.5
magnitude_B = sum(b*b for b in B)**0.5

# Compute cosine similarity
cosine_similarity = dot_product / (magnitude_A * magnitude_B)
print(f"Cosine Similarity using standard Python: {cosine_similarity}")
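If you plan to reuse this calculation, one option is to wrap it in a function. The sketch below is one possible shape, not a standard API: the function name and the choice to raise ValueError are assumptions. It adds two checks the script above leaves out, for vectors of different lengths and for zero vectors, where the formula would divide by zero.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine similarity of two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")

    dot_product = sum(x * y for x, y in zip(a, b))
    magnitude_a = math.sqrt(sum(x * x for x in a))
    magnitude_b = math.sqrt(sum(y * y for y in b))

    # Cosine similarity is only defined for non-zero vectors
    if magnitude_a == 0 or magnitude_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")

    return dot_product / (magnitude_a * magnitude_b)

result = cosine_similarity([5, 3, 4], [4, 2, 4])
print(f"Cosine similarity: {result:.4f}")
```

Raising an error is only one way to handle a zero vector; some code bases prefer to return 0.0 instead, so pick whichever suits your application.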

NumPy

Embedding vectors will typically have many dimensions: hundreds, thousands, even millions or more. With NumPy, you can calculate cosine similarity using array operations that are highly optimized.

import numpy as np

A = np.array([5, 3, 4])
B = np.array([4, 2, 4])

dot_product = np.dot(A, B)
magnitude_A = np.linalg.norm(A)
magnitude_B = np.linalg.norm(B)

cosine_similarity = dot_product / (magnitude_A * magnitude_B)
print(f"Cosine Similarity using NumPy: {cosine_similarity}")

scikit-learn

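Scikit-learn’s cosine_similarity function makes this even simpler. It takes two-dimensional arrays with one vector per row and returns a matrix of pairwise similarities, which is why the example wraps each vector in an extra pair of brackets and reads the result at [0][0]: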

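import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Each input is a 2D array with one row per vector
A = np.array([[5, 3, 4]])
B = np.array([[4, 2, 4]])

# The result is a 1x1 matrix, so read the single value at [0][0]
cosine_similarity_result = cosine_similarity(A, B)
print(f"Cosine Similarity using scikit-learn: {cosine_similarity_result[0][0]}")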
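Because the function works on whole matrices, it can also compare several vectors in one call. As a small sketch, passing a single array computes the similarity of every row against every other row. The third row below is a made-up review vector added only for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# One row per review vector; the third row is a hypothetical extra example
reviews = np.array([
    [5, 3, 4],
    [4, 2, 4],
    [1, 5, 1],
])

# With a single argument, every row is compared against every other row
similarity_matrix = cosine_similarity(reviews)

# Entry [i][j] is the similarity between row i and row j
print(np.round(similarity_matrix, 3))
```

The result is a symmetric 3x3 matrix whose diagonal is 1, since every vector is perfectly similar to itself.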

Tips for optimizing cosine similarity calculations in Python

If you’re going to use Python to directly compute cosine similarity, there are some things to consider:

  • Use optimized libraries like NumPy or scikit-learn: These libraries are optimized for performance and are generally faster than vanilla Python.
  • Use Numba: Numba is an open source JIT compiler for Python and NumPy code, built specifically to optimize scientific computing functions.
  • Use GPUs: If you have access to a GPU, use Python libraries such as TensorFlow that have been optimized for use on a GPU.
  • Parallelize computations: If you have the hardware capabilities, consider parallelizing your computations to speed them up.
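As a rough sketch of the first tip, NumPy can compare one query vector against many stored vectors in a single matrix operation instead of a Python loop. The random data, the vector count (1,000), and the dimensionality (64) are placeholder assumptions, not recommendations.

```python
import numpy as np

# Placeholder data: 1,000 stored vectors and one query, each with 64 dimensions
rng = np.random.default_rng(seed=0)
vectors = rng.normal(size=(1000, 64))
query = rng.normal(size=64)

# Normalize each vector to unit length once, so that a plain dot product
# between unit vectors equals their cosine similarity
unit_vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
unit_query = query / np.linalg.norm(query)

# One matrix-vector product gives all 1,000 similarities at once
similarities = unit_vectors @ unit_query

# Indices of the five most similar stored vectors, best match first
top_5 = np.argsort(similarities)[::-1][:5]
print(top_5, similarities[top_5])
```

If the stored vectors do not change, the normalized copy can be computed once and reused for every query.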

Search large quantities of vectors with vector search on DataStax Astra DB

If you need to search large quantities of vectors, you might find it more efficient and scalable to use a vector database such as DataStax Astra DB. Vector search on Astra DB runs vector searches with built-in cosine similarity calculations, so you can get more insights from your data.

To start your journey with Astra DB, register for a free account here.
