On various news portals and e-commerce platforms, you may have seen recommendations for articles and products related to the main article or product.
Recommending similar products, articles, or documents involves complex algorithms, but Elasticsearch lets us use such algorithms with very little effort.
Mainly, recommending similar products, articles, or documents is accomplished through content similarity or by considering the user’s query history.
Amazon’s product recommendations are a good example. In the image below, the first section, “Get similar item Fast”, is based on content similarity, while the second one, “Customers who viewed this item also viewed”, is based on the user’s query history.
Setup Elasticsearch locally
Before diving deep into finding related documents, let’s set up the Elasticsearch project on your machine. I won’t go into detail on how to set up Elasticsearch locally; instead, I will use a project that I have already published on GitHub.
You can clone it from my public GitHub repo, dockerize-elasticsearch.
Or you can just use the docker-compose.yml file below to run Elasticsearch and Kibana locally.
Once your Elasticsearch and Kibana project is up you can browse Kibana via
Kibana is an open source analytics and visualization platform designed to work with Elasticsearch.
Similar Document Search With Elasticsearch
In this post, I will focus on content similarity with an example in Elasticsearch.
In Elasticsearch, content similarity can be found in two ways:
- k-nearest neighbor (kNN) search.
- More-like-this query.
k-nearest neighbor (kNN) search
A k-nearest neighbor (kNN) search finds the k nearest vectors to a query vector, as measured by a similarity metric.
Common use cases for kNN include:
- Relevance ranking based on natural language processing (NLP) algorithms
- Product recommendations and recommendation engines
- Similarity search for images or videos
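To make the definition of kNN concrete, here is a minimal brute-force sketch in plain Python. It is not how Elasticsearch implements kNN search; it only illustrates "find the k nearest vectors as measured by a similarity metric", using cosine similarity as the metric. The document ids and 3-dimensional vectors are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn(query_vector, vectors, k):
    """Return the ids of the k vectors most similar to query_vector."""
    scored = [
        (cosine_similarity(query_vector, vec), doc_id)
        for doc_id, vec in vectors.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Hypothetical embeddings, invented for this example only.
vectors = {
    "doc-a": [1.0, 0.0, 0.0],
    "doc-b": [0.9, 0.1, 0.0],
    "doc-c": [0.0, 1.0, 0.0],
}

print(knn([1.0, 0.0, 0.0], vectors, k=2))  # ['doc-a', 'doc-b']
```

Comparing the query against every stored vector like this is fine for a handful of documents but does not scale, which is why search engines use dedicated index structures for kNN.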
More-like-this query
The More Like This (MLT) query finds documents that are similar to a given input.
It selects a set of representative terms from the input, builds a new query from those terms, and runs that query against the index.
In this post, I will focus on the more-like-this query. I will write a separate blog post about k-nearest neighbor (kNN) search.
You can find the full Elasticsearch documentation at elastic.co for input parameters, term selection parameters, and query formation parameters. I will not delve into the details of these aspects.
In this post, I will demonstrate how to recommend similar reviews for Apple products.
For this demo, I will use the like document input parameter and the min_term_freq, max_query_terms, and min_doc_freq term selection parameters.
min_term_freq: This sets the minimum term frequency, below which terms will be ignored from the input document.
In our case, min_term_freq is set to 2, so only terms that occur two or more times in the input document are used.
max_query_terms: This defines the maximum number of query terms that will be included in the generated query. In our case, max_query_terms is set to 12, which means that at most 12 of the highest-scoring terms from the input are used. It limits the size of the generated query, not the length of the documents that can match.
min_doc_freq: This specifies the minimum document frequency that a term must have to be considered when generating a query.
Document frequency refers to the number of documents in which a term appears. Terms with a low document frequency are less common and more specific, whereas terms with a high document frequency are more common and general.
In our case, min_doc_freq is set to 5, indicating that a term must be present in at least 5 documents. If a term’s document frequency is less than 5, the term is ignored when the query is built.
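The way these three parameters interact can be sketched in plain Python. This is a simplified illustration, not Elasticsearch's actual implementation: the tokenizer, the tf-idf weighting, and the placeholder corpus are all assumptions made for the example, and the real query also applies the field's analyzer and further options.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def select_terms(like_text, corpus, min_term_freq=2, min_doc_freq=5, max_query_terms=12):
    """Pick the terms a more_like_this-style query would be built from."""
    term_freq = Counter(tokenize(like_text))
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(tokenize(doc)))  # count each term once per document

    scored = []
    for term, tf in term_freq.items():
        df = doc_freq[term]
        if df == 0 or tf < min_term_freq or df < min_doc_freq:
            continue  # the term is ignored
        idf = math.log(1 + len(corpus) / df)  # simplified weight, not Lucene's formula
        scored.append((tf * idf, term))

    scored.sort(reverse=True)  # highest-scoring terms first
    return [term for _, term in scored[:max_query_terms]]


like_text = "Love Apple AirPods, the sound and quality are amazing as always with Apple"

# Placeholder corpus, invented for this sketch: 5 of 7 documents mention "apple".
corpus = [f"placeholder review {i} about apple" for i in range(1, 6)]
corpus += [f"placeholder review {i} about something else" for i in (6, 7)]

print(select_terms(like_text, corpus))  # ['apple']
```

Only "apple" survives: it is the only term that occurs at least twice in the input, and it appears in 5 of the placeholder documents. Raising min_doc_freq to 6 would leave no terms at all.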
Create Index with config map
You can use the create index API to add a new index to an Elasticsearch cluster. When creating an index, you can specify the following:
- Settings for the index
- Mappings for fields in the index
- Index aliases
The create index API allows for providing a mapping definition.
For this demo, I am creating the apple-review-index index with a single text field named review. The review field will store each user’s review of an Apple product.
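A create-index request along these lines could be assembled as follows. This is a sketch that assumes Elasticsearch 7 or later (mappings without a type name); it only builds and prints the request, which can then be pasted into Kibana Dev Tools or sent with any HTTP client.

```python
import json

INDEX_NAME = "apple-review-index"

# Mapping with a single analyzed text field called "review".
create_index_body = {
    "mappings": {
        "properties": {
            "review": {"type": "text"}
        }
    }
}

print(f"PUT /{INDEX_NAME}")
print(json.dumps(create_index_body, indent=2))
```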
Once the index is successfully created, you will see a 200 response code with the response below.
The result of creating the index is attached below.
Store data in Field
We have already created the apple-review-index index with a review text field. In apple-review-index, I will store the 7 reviews below.
Let’s store all the reviews one by one in the Elasticsearch index
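As a sketch of what storing the reviews one by one involves, the snippet below builds one index request per review. The index_request helper is a hypothetical name introduced here, and the review texts are placeholders to be replaced with the actual reviews; the requests are printed rather than sent.

```python
import json

INDEX_NAME = "apple-review-index"


def index_request(review_text):
    """Build the method, path, and JSON body for indexing one review."""
    return "POST", f"/{INDEX_NAME}/_doc", {"review": review_text}


# Placeholder texts; substitute the actual reviews here.
reviews = ["<first review text>", "<second review text>"]

for review in reviews:
    method, path, body = index_request(review)
    print(method, path)
    print(json.dumps(body))
```

Posting to the _doc endpoint without an id lets Elasticsearch generate a document id for each review.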
Once a record is successfully created in the index, you will see a 201 (Created) response code with the response below.
A screenshot of storing a review in the index through the Kibana UI is attached below.
Retrieve all the data from index
We have already inserted all 7 reviews in the index.
Let’s retrieve all of them to confirm whether all the reviews are in the Elasticsearch index.
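A match_all search is one simple way to do this. The sketch below builds and prints such a request. It relies on the default page size of 10 hits, which is enough to show all 7 reviews.

```python
import json

INDEX_NAME = "apple-review-index"

# match_all returns every document in the index (up to the page size).
search_all_body = {"query": {"match_all": {}}}

print(f"GET /{INDEX_NAME}/_search")
print(json.dumps(search_all_body, indent=2))
```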
You will see the response below, which contains all the reviews that we stored earlier.
A screenshot of the search result is attached below.
Apply more_like_this Query
Let’s proceed with finding similar reviews in apple-review-index by applying a “more_like_this” query.
To do this, you’ll need to provide the actual review content that you want to use as a source for finding similar reviews.
Let’s create a query for Elasticsearch using the “more_like_this” query. Here are the details:
- Actual review content: “Love Apple AirPods, the sound and quality are amazing as always with Apple”
- min_term_freq : 2 (indicating that a term in the actual review content must occur two or more times)
- max_query_terms : 12 (indicating that at most 12 terms from the actual review content are used in the generated query)
- min_doc_freq : 5 (indicating that a term must appear in at least 5 reviews)
In our actual review content, the term “Apple” appears twice.
Here is the similar review search query
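A request of this shape can be assembled as shown in the sketch below. It assumes the search targets the review field and uses min_doc_freq as described in the parameter section earlier in this post. The request is only built and printed, so it can be pasted into Kibana Dev Tools or sent with any HTTP client.

```python
import json

INDEX_NAME = "apple-review-index"

mlt_body = {
    "query": {
        "more_like_this": {
            "fields": ["review"],  # assumption: compare against the review field
            "like": "Love Apple AirPods, the sound and quality are amazing as always with Apple",
            "min_term_freq": 2,     # a term must occur at least twice in the input
            "max_query_terms": 12,  # use at most 12 selected terms
            "min_doc_freq": 5,      # a term must appear in at least 5 documents
        }
    }
}

print(f"GET /{INDEX_NAME}/_search")
print(json.dumps(mlt_body, indent=2))
```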
The query should return only 5 reviews, while the two reviews below should be ignored because they do not have the term “Apple” in their content.
Let’s see the response
In the above search response, we have observed only five hits, which is as expected. :)
A screenshot of the search result for similar reviews is attached below.
In conclusion, this post has provided an overview of performing similar document searches with Elasticsearch, focusing on content similarity using the “more_like_this” query. Elasticsearch offers powerful features for finding related documents based on the content of the documents. We have explored the use of various parameters such as min_term_freq, max_query_terms, and min_doc_freq to fine-tune our similarity search.
By creating an index, storing data in fields, and applying the “more_like_this” query, we were able to effectively find similar reviews in our Elasticsearch index. Elasticsearch is a versatile tool that can be used for a wide range of applications, including recommendation systems, content similarity analysis, and more.
This post serves as a starting point for those looking to implement content-based recommendation systems and leverage Elasticsearch’s capabilities for similar document searches. Further exploration of Elasticsearch’s capabilities and parameter tuning can lead to more refined and accurate results in real-world applications.