To build a web application that recommends products to customers based on their reviews at an online grocery store. The web application is a prototype that mimics a real-time application, where recommendations are rendered with product images for a given input string (a review).
Check it out live here
The dataset was obtained from the Amazon review dataset released in 2014, provided by UCSD. The dataset contains 287,209 products with 5,074,160 reviews and ratings by 157,386 unique users.
- Python - Scikit-learn, Pipeline
- Flask
- Docker
- PowerShell
- Heroku
We use the k-means algorithm to cluster all the products based on their reviews. The TF-IDF scores of the review text form the features from which the unsupervised clusters are built.
How do we do that?
Each product's reviews are collected and concatenated into a single string. Thus, each product has a feature set of TF-IDF scores for its concatenated string of reviews. These TF-IDF feature vectors are then used to compute the Euclidean distance between points in the feature space, which allows us to apply the k-means algorithm.
Based on the number of categories in the grocery store, we choose the number of centroids and bucket the products under a label. We train the model over iterations, allowing the clusters to move farther apart in space, and save the model, i.e. we save the centroid points in the space. Further, we pickle-dump the dictionary of cluster labels and their corresponding products.
The conversion of a string into TF-IDF scores and the lookup of its nearest centroid (cluster) are done using a pipeline function and saved into a joblib file.
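Below is a minimal sketch of how such a pipeline could be built, trained, and persisted. The variable names (`documents`, `asins`), the sample data, the number of clusters, and the file names are illustrative placeholders, not the project's actual ones:

```python
import pickle

import joblib
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Placeholder data: one concatenated review string per product, aligned with its ASIN
documents = [
    "best chocolate in the world rich and smooth",
    "fresh organic apples crisp and sweet",
    "dark chocolate bar a little bitter but good",
]
asins = ["B0000001", "B0000002", "B0000003"]

n_clusters = 2  # in the project this is chosen from the number of grocery categories

# TF-IDF features feed directly into k-means inside a single pipeline
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("kmeans", KMeans(n_clusters=n_clusters, n_init=10, random_state=42)),
])

labels = pipeline.fit_predict(documents)

# Lookup table: cluster label -> list of ASINs bucketed under that label
label_to_asins = {}
for asin, label in zip(asins, labels):
    label_to_asins.setdefault(int(label), []).append(asin)

# Persist the pipeline (vectorizer vocabulary + centroids) and the lookup dictionary
joblib.dump(pipeline, "model_pipeline.joblib")
with open("label_to_asins.pkl", "wb") as f:
    pickle.dump(label_to_asins, f)

# At prediction time, an input review string is assigned to its nearest centroid
print(pipeline.predict(["I love this chocolate"]))
```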
What is a TF-IDF score?
Given a document (the concatenated review string of a product) in a corpus (the reviews across all products), it tells how rarely a word occurs across the corpus and how frequently it occurs in that particular document.
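For reference, the textbook form of the score for a term t in a document d is shown below (scikit-learn's TfidfVectorizer uses a smoothed variant of the same idea):

tf-idf(t, d) = tf(t, d) × log(N / df(t))

where tf(t, d) is the frequency of t in d, N is the number of documents (products) in the corpus, and df(t) is the number of documents that contain t.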
Example for intuition
Consider comparing reviews of chocolates. Let's assume there are two variants of chocolate available in the market.
Review for Variant 1: This is the best chocolate in the world.
Review for Variant 2: I liked this chocolate.
Given that the similarity of two sentences here is based on Euclidean distance, the two reviews would be close to each other due to the presence of the word "chocolate".
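As a quick numerical illustration (a toy sketch, not the project's actual feature set), we can vectorize the two example reviews with TF-IDF, add an unrelated review for contrast, and compare the Euclidean distances:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import euclidean_distances

reviews = [
    "This is the best chocolate in the world.",  # Variant 1
    "I liked this chocolate.",                   # Variant 2
    "These apples were fresh and crisp.",        # unrelated review for contrast
]

# Fit TF-IDF on the tiny toy corpus
vectors = TfidfVectorizer(stop_words="english").fit_transform(reviews)

# Pairwise Euclidean distances between the review vectors
distances = euclidean_distances(vectors)
print(distances.round(3))
# The distance between the two chocolate reviews (rows 0 and 1) comes out smaller
# than their distance to the apple review, because they share the word "chocolate".
```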
However, there could still be noise and misallocations, but the chances are low, since grocery reviews usually carry enough context to express the reviewer's thoughts. Also, we concatenate all the reviews for a product, which reduces the noise, since the TF-IDF score of each word is computed over the full set of reviews.
This section explains the flow of data for a given input string. The input string is passed to the Flask application, which accesses the pipeline saved in the joblib file. The pipeline returns a cluster label, which is used by the lookup file to return the ASIN IDs. These ASIN IDs are used to generate product URLs, which are used to render the recommendations.
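A minimal sketch of such a Flask endpoint is shown below. The file names (`model_pipeline.joblib`, `label_to_asins.pkl`), the route, and the JSON response are assumptions for illustration; the actual app renders an HTML page with product images instead of JSON:

```python
import pickle

import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the saved artefacts once at startup (file names are assumptions)
pipeline = joblib.load("model_pipeline.joblib")      # TF-IDF + k-means pipeline
with open("label_to_asins.pkl", "rb") as f:
    label_to_asins = pickle.load(f)                  # cluster label -> list of ASINs


def asin_to_url(asin):
    # Amazon product pages can be addressed by ASIN with this URL pattern
    return f"https://www.amazon.com/dp/{asin}"


@app.route("/recommend", methods=["POST"])
def recommend():
    review = request.form["review"]
    # The pipeline turns the string into TF-IDF scores and returns the nearest cluster
    label = int(pipeline.predict([review])[0])
    asins = label_to_asins.get(label, [])[:10]       # cap the number of recommendations
    return jsonify({"cluster": label, "products": [asin_to_url(a) for a in asins]})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

The endpoint can then be exercised with a form POST containing the review text, mirroring the input box in the web UI.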
Once the model is trained on the dataset using the algorithm above, we save the model pipeline for deploying the Flask application on a local server or on the cloud.
Since the Flask app requires a fixed environment and performs predictions through the saved model pipeline, we build an image using Docker. The Docker container can be hosted on the local system or on the cloud as per requirements.
We first build a Docker image, which can be run as a container on the local system to test the application. Building an image makes it easy to flexibly spin up containers on any server.
However, you could also push the application directly into containers already provided by cloud environments. We host the web application on a Heroku container directly, without building any image on the local system.
Both methods are illustrated below.
Docker Image
- A docker folder is created on your local system with the requirements file. Check the folder here
- Once the folder is in place, use the PowerShell command line to build the image. On the CLI, navigate to the folder we just created.
- Building the image:
docker image build -t recommsys .
- Running the application from the Docker image (the port mapping below assumes Flask's default port 5000; adjust it if the app listens on a different port):
docker run -p 5000:5000 recommsys
- Verify the container is up and running without any errors, and further validate the web application.
- Validating the application
Heroku container (Cloud deployment)
- A docker folder is created on your local system with the requirements file. Check the folder here
- Once the folder is in place, use the command prompt with the Heroku CLI installed on the system. We use the Heroku container registry to push the Docker folder and create the image on Heroku to host the application.
- Check if Heroku is installed by running the command below; otherwise, check here
heroku
- Now log in to the Heroku container registry
heroku container:login
Create the Heroku application
heroku create recommsys
- Now push to the web directly from the docker folder
heroku container:push web --app recommsys
Now release the app to the web:
heroku container:release web --app recommsys
- Now open the web page; the command-line shortcut is
heroku open --app recommsys
Here you can see that, for a given input, you are presented with relevant recommendations.

- https://docs.docker.com/develop/develop-images/dockerfile_best-practices/
- Data Science in Production: Building Scalable Model Pipelines with Python – Ben G. Weber
- https://help.heroku.com/4RNZSHL2/
- https://aws.amazon.com/
- https://towardsdatascience.com/deploy-machine-learning-pipeline-on-cloud-using-docker-container-bec64458dc01
- https://medium.com/analytics-vidhya/deploy-your-machine-learning-model-on-docker-ee2b931e133c
- https://medium.com/analytics-vidhya/deploy-machinelearning-model-with-flask-and-heroku-2721823bb653