🎓 Volis Case Study: Higher Education Success & Institutional Classification

A Data Science and Machine Learning project focused on analyzing higher education metrics to support strategic decision-making.



📌 Project Overview

The Board of Directors of a higher education institution ("Volis") requested a comprehensive analysis of a dataset containing information from various institutions. The primary goal is to identify patterns and trends that influence academic success, financial sustainability, and institutional profiles.

This project leverages Exploratory Data Analysis (EDA), Feature Engineering, and Machine Learning algorithms to extract actionable insights and predict institutional types (Private vs. Public).

Author: Ricardo Daniel Teixeira Gonçalves
Date: March 2026

🎯 Key Objectives

  1. Exploratory Data Analysis (EDA): Identify and resolve data quality issues, including missing values, logical inconsistencies, and extreme outliers.
  2. Correlation & Trend Analysis: Discover significant relationships between institutional variables, with a special focus on graduation rates and application volumes.
  3. Feature Engineering: Create derived metrics (e.g., Acceptance Rate, Enrollment Rate) to reduce multicollinearity among the underlying raw variables and enhance predictive power.
  4. Predictive Modeling: Develop and evaluate a supervised Machine Learning model to classify institutions (Private vs. Public) to support strategic benchmarking.
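
Objectives 1 and 3 can be illustrated with a minimal sketch. It is not the notebook's code: the column names (`applications`, `accepted`, `enrolled`) and the toy values are assumptions made for illustration, and the 1.5 × IQR fence is the conventional Tukey default rather than a setting confirmed by this README.

```python
import pandas as pd

# Hypothetical columns and toy values; the real dataset may use other names.
df = pd.DataFrame({
    "applications": [1000, 2500, 400, 52000, 1800],
    "accepted":     [800, 1500, 450, 30000, 900],
    "enrolled":     [300, 600, 120, 9000, 350],
})


def iqr_outlier_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside Tukey's fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)


# Objective 1: flag extreme values and logically impossible rows.
df["apps_outlier"] = iqr_outlier_mask(df["applications"])
df["inconsistent"] = (df["accepted"] > df["applications"]) | (
    df["enrolled"] > df["accepted"]
)

# Objective 3: ratios replace strongly related raw counts.
df["acceptance_rate"] = df["accepted"] / df["applications"]
df["enrollment_rate"] = df["enrolled"] / df["accepted"]

print(df)
```

A ratio above 1 is itself a sign of a logical inconsistency, so the derived metrics double as a validation check.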

🛠️ Methodologies & Tech Stack

Technologies

| Category | Tools/Libraries |
|---|---|
| Language | Python 3.x |
| Data Processing & EDA | pandas, numpy, matplotlib, seaborn |
| Machine Learning | scikit-learn (Logistic Regression, Random Forest) |

Data Pipeline

  • Data Cleaning & Integrity: IQR-based outlier detection, domain-constraint validation, and median imputation fitted only on the training set (after the train-test split) to prevent data leakage.
  • Feature Transformation: Logarithmic transformation (log1p) for skewed variables and StandardScaler for standardization (zero mean, unit variance).
  • Modeling Strategies:
    • Logistic Regression (Baseline)
    • Logistic Regression with class_weight='balanced' (for imbalanced data)
    • Random Forest Classifier (Final selected model)
  • Evaluation Metrics: Accuracy, Precision, Recall, F1-Score, and Classification Reports.
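
The pipeline above could be sketched roughly as follows. This is an illustration under stated assumptions, not the notebook's implementation: the data is synthetic (`make_classification`), the 30% hold-out and `max_iter=1000` are arbitrary choices, and whether the original Random Forest received the same preprocessing is not stated in this README. Wrapping the steps in a scikit-learn pipeline means the imputer and scaler are fitted on the training split only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Synthetic stand-in for the real dataset (assumption).
X, y = make_classification(
    n_samples=600, n_features=6, n_informative=4,
    weights=[0.3, 0.7], random_state=42,
)
X = np.exp(X)  # positive, right-skewed features so log1p is meaningful
rng = np.random.default_rng(42)
X[rng.random(X.shape) < 0.05] = np.nan  # inject missing values

# Split first; every fitted step below only ever sees the training part.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)


def preprocessing_steps():
    """Median imputation -> log1p -> standardization."""
    return [
        SimpleImputer(strategy="median"),
        FunctionTransformer(np.log1p),
        StandardScaler(),
    ]


models = {
    "logreg_baseline": make_pipeline(
        *preprocessing_steps(), LogisticRegression(max_iter=1000)
    ),
    "logreg_balanced": make_pipeline(
        *preprocessing_steps(),
        LogisticRegression(max_iter=1000, class_weight="balanced"),
    ),
    "random_forest": make_pipeline(
        *preprocessing_steps(), RandomForestClassifier(random_state=42)
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)  # imputer and scaler fitted on train only
    scores[name] = f1_score(y_test, model.predict(X_test))

print(scores)
```

The scores printed here describe the synthetic data only and are not comparable to the results reported in this README.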

📊 Key Findings & Results

1. Data Insights (EDA)

| Insight | Description |
|---|---|
| Graduation Rate Drivers | Out-of-state tuition fees showed a strong positive correlation with graduation rates ($r \approx 0.54$) after data cleaning. Purely academic metrics (e.g., expenditure per student, faculty with PhDs) showed weaker correlations with graduation rates. |
| Application Volume | The structural distinction between Public and Private institutions proved to be the strongest predictor of application volumes, overshadowing traditional academic prestige metrics. |
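
For readers unfamiliar with the $r$ statistic quoted above, the following sketch shows how a Pearson correlation can be computed with pandas and checked by hand. The data is synthetic and the column names (`outstate_tuition`, `grad_rate`) are hypothetical, so the resulting value is unrelated to the reported $r \approx 0.54$.

```python
import numpy as np
import pandas as pd

# Synthetic data with a built-in positive relationship (assumption).
rng = np.random.default_rng(42)
n = 500
tuition = rng.normal(10_000, 4_000, n)
grad_rate = 40 + 0.002 * tuition + rng.normal(0, 12, n)
df = pd.DataFrame({"outstate_tuition": tuition, "grad_rate": grad_rate})

# pandas computes the Pearson coefficient by default.
r_pandas = df["outstate_tuition"].corr(df["grad_rate"])

# Same quantity from the definition: covariance over the product of spreads.
x = df["outstate_tuition"] - df["outstate_tuition"].mean()
y = df["grad_rate"] - df["grad_rate"].mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

print(f"pandas: {r_pandas:.4f}  manual: {r_manual:.4f}")
```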

2. Machine Learning Performance (Institutional Classification)

Task: Classify institutions as Private (1) or Public (0).
Challenge: Class imbalance (280 Public vs. 637 Private).
Test Set: 276 unseen observations.
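
The figures above can be put in context with a short sketch. The split ratio is not stated in this README; a stratified 30% hold-out is assumed here only because it yields a 276-row test set from 917 rows. The labels are reconstructed from the class counts alone, and the feature matrix is a placeholder.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels rebuilt from the reported class counts: 0 = Public, 1 = Private.
y = np.array([0] * 280 + [1] * 637)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features

# Assumed 70/30 stratified split (not confirmed by the README).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

# Accuracy of a trivial model that always predicts "Private".
majority_share = (y == 1).mean()

print(len(y_train), len(y_test), round(majority_share, 4))
```

The majority-class share (about 0.69 over the full dataset) is the accuracy a constant "Private" prediction would reach, which is a useful floor when reading the accuracy figures below.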

| Metric | Logistic Regression (Balanced) | Random Forest Classifier |
|---|---|---|
| Accuracy | 0.8370 | 0.8400 |
| Precision | 0.9298 | 0.9200 |
| Recall | 0.8281 | 0.8400 |
| F1-Score | 0.8760 | 0.8800 |

Conclusion: The Random Forest Classifier was selected as the primary model. It edged out the balanced Logistic Regression on F1-Score (0.8800 vs. 0.8760) and Recall (0.8400 vs. 0.8281) at a small cost in Precision (0.9200 vs. 0.9298), which suggests that the ensemble captured non-linear relationships while separating institutional types without a heavy bias toward the Private majority class.
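
As a consistency check, the reported F1-Scores can be recomputed from the reported Precision and Recall, since F1 is their harmonic mean. The sketch below uses only the numbers from the table above; the Random Forest figures appear to be rounded to two decimals, so the comparison for that model is made at that precision.

```python
def f1_from(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Values copied from the results table.
f1_logreg = f1_from(0.9298, 0.8281)
f1_forest = f1_from(0.9200, 0.8400)

print(f"Logistic Regression (Balanced): {f1_logreg:.4f}")
print(f"Random Forest Classifier:       {f1_forest:.2f}")
```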


🚀 How to Run the Project

To reproduce this analysis locally:

1. Clone the repository

git clone https://github.com/rdtg94/volis-case-study.git
cd volis-case-study

2. Install dependencies

(It is highly recommended to use a virtual environment)

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install pandas numpy scikit-learn matplotlib seaborn jupyter

3. Execute the analysis

Run the notebook in a Jupyter environment:

jupyter notebook volis_case_study_Ricardo_Goncalves.ipynb

📂 Project Structure

volis-case-study/
├── volis_case_study_Ricardo_Goncalves.ipynb  # Main analysis notebook
├── volis_dataset.csv                         # Raw dataset
└── README.md                                 # This file


📚 References

The technical development of this project was grounded in standard industry literature:

  1. Géron, A. (2019). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (2nd Ed.). O'Reilly Media.
  2. Zheng, A., & Casari, A. (2018). Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists. O'Reilly Media.
  3. Tukey, J. W. (1977). Exploratory Data Analysis. Addison-Wesley.
  4. Scikit-Learn Developers. (2024). User Guide: 1.11. Ensemble methods (Random Forests). https://scikit-learn.org/stable/modules/ensemble.html#random-forests

⚠️ Important Notes

  • Data Leakage Prevention: All imputation and scaling operations were fitted exclusively on the training set and then applied to the test set.
  • Class Imbalance Handling: The class_weight='balanced' parameter was used in Logistic Regression to mitigate bias toward the majority class.
  • Reproducibility: Random states were fixed (random_state=42) across all stochastic operations to ensure consistent results.
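
To make the effect of class_weight='balanced' concrete, the sketch below computes the weights scikit-learn would derive from the class counts reported in this README (280 Public, 637 Private). It uses the full-dataset counts as an assumption; in the actual analysis the weights would come from the training split, so the exact values may differ slightly.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Labels rebuilt from the reported counts: 0 = Public, 1 = Private.
y = np.array([0] * 280 + [1] * 637)
classes = np.array([0, 1])

# 'balanced' uses n_samples / (n_classes * count_per_class).
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

for label, weight in zip(classes, weights):
    print(f"class {label}: weight {weight:.4f}")
```

Under this scheme the minority Public class receives a weight above 1 and the majority Private class a weight below 1, so misclassifying a Public institution costs more during training.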

