This project demonstrates a complete, end-to-end machine learning pipeline for the Olist e-commerce dataset. The primary goal is to predict customer satisfaction by classifying their potential review score (1-5 stars) based on initial order data. The entire lifecycle, from raw data ingestion to a containerized API, is managed using professional MLOps practices.
**Docker Hub Image:** `shaikhilhaam/olist-review-api:latest`
| Category | Technologies |
|---|---|
| Data Storage & Querying | PostgreSQL, SQL |
| Data Science & EDA | Python, Pandas, Jupyter Notebook, Matplotlib, Seaborn, Tableau |
| Feature Engineering | Scikit-learn, Sentence-Transformers (for NLP), Haversine |
| Modeling & MLOps | XGBoost, MLflow (Tracking & Model Registry), Optuna, SHAP |
| API & Deployment | FastAPI, Uvicorn |
| Containerization | Docker |
| Automation & CI/CD | Git, GitHub Actions |
This project follows a structured, multi-stage pipeline:
- **Data Engineering:** Raw CSV files are ingested and structured into a PostgreSQL database. A comprehensive master table is generated by a single SQL query that joins 9 different tables, keeping the heavy data lifting inside the database.
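The master-table idea can be sketched as follows, with SQLite standing in for PostgreSQL and only three of the nine tables shown. Table and column names follow the public Olist dataset, but treat them as assumptions here:

```python
import sqlite3

# Minimal sketch: build a "master" row set by joining orders, reviews,
# and customers inside the database, rather than merging in pandas.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id TEXT, customer_id TEXT, order_status TEXT);
CREATE TABLE order_reviews (order_id TEXT, review_score INTEGER);
CREATE TABLE customers (customer_id TEXT, customer_state TEXT);
INSERT INTO orders VALUES ('o1', 'c1', 'delivered');
INSERT INTO order_reviews VALUES ('o1', 5);
INSERT INTO customers VALUES ('c1', 'SP');
""")
rows = conn.execute("""
    SELECT o.order_id, c.customer_state, r.review_score
    FROM orders o
    JOIN order_reviews r ON r.order_id = o.order_id
    JOIN customers c ON c.customer_id = o.customer_id
""").fetchall()
print(rows)  # [('o1', 'SP', 5)]
```

The real query extends this pattern across order items, payments, products, sellers, and geolocation tables.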
- **Exploratory Data Analysis (EDA):** An extensive analysis revealed key business insights, most notably that delivery performance (speed and accuracy) is the single biggest driver of customer satisfaction, and that the platform has an extremely low customer retention rate (~3%). This insight guided the pivot from an LTV model to a more impactful review-score prediction model.
- **Feature Engineering:** A sophisticated feature set was engineered, including:
  - **Logistical Features:** `delivery_time_vs_estimated`, `customer_seller_distance`.
  - **Temporal Features:** Cyclical encoding of the month and day of the week.
  - **NLP Features:** State-of-the-art text embeddings of review comments, compressed with PCA.
  - **Seller Features:** The seller's average review score and total order count.
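Two of these features are simple enough to sketch directly: cyclical encoding maps a periodic value onto the unit circle so that December and January end up adjacent, and the haversine formula gives the great-circle distance behind `customer_seller_distance`. The function names below are illustrative, not taken from the project:

```python
import math

def cyclical_encode(value, period):
    """Map a cyclic value (e.g. month 1-12) onto the unit circle."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# December (month 12) and January (month 1) land close together on the circle:
print(cyclical_encode(12, 12), cyclical_encode(1, 12))
# Sao Paulo -> Rio de Janeiro, roughly 360 km:
print(haversine_km(-23.55, -46.63, -22.91, -43.17))
```

Without the sin/cos encoding, a tree or linear model would see month 12 and month 1 as maximally far apart.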
- **Modeling & Experiment Tracking:**
  - An XGBoost multiclass classifier was trained to predict the 1-5 star review score.
  - MLflow was used to manage the entire modeling lifecycle: experiments were tracked, and the final, best-performing model was versioned and stored in the MLflow Model Registry.
  - Model performance was analyzed in depth using a confusion matrix, classification reports, and SHAP for expert-level explainability.
- **Containerization & Deployment:**
  - The registered model is served via a FastAPI application, which exposes a `/predict` endpoint.
  - The entire application is containerized with Docker, creating a lightweight, portable, production-ready image.
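A Dockerfile for an image like this typically looks something like the sketch below. This is a hypothetical reconstruction, not the project's actual Dockerfile; only the `api:app` entrypoint and port 8000 are taken from the run instructions:

```dockerfile
# Hypothetical sketch; the project's real Dockerfile may differ.
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8000
CMD ["uvicorn", "api:app", "--host", "0.0.0.0", "--port", "8000"]
```

Copying `requirements.txt` before the rest of the source lets Docker cache the dependency layer across rebuilds.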
- **Automation (CI/CD):** A GitHub Actions workflow automates the entire process. On every push to `main`, the workflow builds the Docker image and pushes the `:latest` tag to Docker Hub, ensuring the published image is always up to date.
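A minimal workflow of this shape, using the standard Docker actions, could look like the following. The secret names are assumptions; only the trigger branch and image tag come from the description above:

```yaml
# Hypothetical sketch of the CI/CD workflow; secret names are assumptions.
name: build-and-push
on:
  push:
    branches: [main]
jobs:
  docker:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}
      - uses: docker/build-push-action@v6
        with:
          context: .
          push: true
          tags: shaikhilhaam/olist-review-api:latest
```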
To run the pipeline locally:

- Clone the repository.
- Create a Python virtual environment and install dependencies: `pip install -r requirements.txt`
- Set up a PostgreSQL database and populate it using the `ingest_data.py` script.
- Create a `.env` file with your database credentials.
- Generate the final modeling dataset: `python main.py`
- Train the model and register it with MLflow: `python src/train.py`
- Start the FastAPI server: `uvicorn api:app --reload`
- Access the interactive API documentation at http://127.0.0.1:8000/docs.
To run with Docker instead:

- Build the Docker image: `docker build -t olist-review-api .`
- Run the container: `docker run -p 8000:8000 olist-review-api`