🎓 Volis Case Study: Higher Education Success & Institutional Classification

A Data Science and Machine Learning project focused on analyzing higher education metrics to support strategic decision-making.

📌 Project Overview

The Board of Directors of a higher education institution ("Volis") requested a comprehensive analysis of a dataset containing information from various institutions. The primary goal is to identify patterns and trends that influence academic success, financial sustainability, and institutional profiles.

This project leverages Exploratory Data Analysis (EDA), Feature Engineering, and Machine Learning algorithms to extract actionable insights and predict institutional types (Private vs. Public).

Author	Ricardo Daniel Teixeira Gonçalves
Date	March 2026

🎯 Key Objectives

Exploratory Data Analysis (EDA): Identify and resolve data quality issues, including missing values, logical inconsistencies, and extreme outliers.
Correlation & Trend Analysis: Discover significant relationships between institutional variables, with a special focus on graduation rates and application volumes.
Feature Engineering: Create derived metrics (e.g., Acceptance Rate, Enrollment Rate) to reduce the extreme multicollinearity among raw count variables and enhance predictive power.
Predictive Modeling: Develop and evaluate a supervised Machine Learning model to classify institutions (Private vs. Public) to support strategic benchmarking.
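
The derived rates named in the Feature Engineering objective can be sketched as follows. This is a minimal illustration, not the notebook's code: the column names `applications`, `accepted`, and `enrolled` are hypothetical stand-ins for whatever the dataset actually calls them, and the demo rows are made up.

```python
import pandas as pd


def add_rate_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with ratio features derived from raw counts.

    Column names are hypothetical; adapt them to the real dataset.
    """
    out = df.copy()
    # Mask zero denominators so the ratios become NaN instead of inf.
    applications = out["applications"].where(out["applications"] > 0)
    accepted = out["accepted"].where(out["accepted"] > 0)
    # Share of applicants who were accepted.
    out["acceptance_rate"] = out["accepted"] / applications
    # Share of accepted applicants who enrolled.
    out["enrollment_rate"] = out["enrolled"] / accepted
    return out


# Made-up rows, including an all-zero edge case.
demo = pd.DataFrame(
    {
        "applications": [1000, 500, 0],
        "accepted": [600, 450, 0],
        "enrolled": [300, 90, 0],
    }
)
features = add_rate_features(demo)
print(features[["acceptance_rate", "enrollment_rate"]])
```

Ratios of this kind carry the selectivity signal without repeating institution size, which is one plausible reason they are less collinear than the raw counts.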

🛠️ Methodologies & Tech Stack

Technologies

Category	Tools/Libraries
Language	Python 3.x
Data Processing & EDA	`pandas`, `numpy`, `matplotlib`, `seaborn`
Machine Learning	`scikit-learn` (Logistic Regression, Random Forest)

Data Pipeline

Data Cleaning & Integrity: IQR-based outlier detection, domain-constraint validation, and median imputation fitted on the training set only (after the train-test split) to prevent data leakage.
Feature Transformation: Logarithmic transformation (log1p) for skewed variables and StandardScaler for standardization (zero mean, unit variance).
Modeling Strategies:
- Logistic Regression (Baseline)
- Logistic Regression with class_weight='balanced' (for imbalanced data)
- Random Forest Classifier (Final selected model)
Evaluation Metrics: Accuracy, Precision, Recall, F1-Score, and Classification Reports.
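
As a rough illustration of two of the pipeline steps listed above, the sketch below flags outliers with Tukey's IQR rule and applies `log1p` to a skewed variable. The helper name `iqr_outlier_mask`, the multiplier `k=1.5`, and the sample values are assumptions for the example; the notebook may use different thresholds or columns.

```python
import numpy as np
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask of points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Made-up, right-skewed counts with one extreme value.
applications = pd.Series([120, 150, 180, 200, 220, 250, 9000], name="applications")

mask = iqr_outlier_mask(applications)
print(applications[mask])  # only the extreme value is flagged

# log1p(x) = log(1 + x) compresses the long right tail and is defined at x = 0.
logged = np.log1p(applications)
print(logged.round(3))
```

Flagging rather than dropping keeps the decision about each outlier (error versus genuine extreme) separate from its detection.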

📊 Key Findings & Results

1. Data Insights (EDA)

Insight	Description
Graduation Rate Drivers	Out-of-state tuition fees showed a moderately strong positive correlation with graduation rates ($r \approx 0.54$) after data cleaning. Purely academic metrics (e.g., expenditure per student, faculty with PhDs) showed weaker direct correlations.
Application Volume	The structural distinction between Public and Private institutions proved to be the strongest predictor of application volumes, overshadowing traditional academic prestige metrics.
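
For readers who want to see how a correlation such as the one reported above is obtained, here is a small sketch using pandas on synthetic data. The column names and the generated values are hypothetical; the sketch does not reproduce the project's $r \approx 0.54$.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic stand-in data: graduation rate rises with tuition, plus noise.
tuition = rng.normal(10_000, 3_000, size=200)
grad_rate = 40 + 0.002 * tuition + rng.normal(0, 8, size=200)
df = pd.DataFrame({"outstate_tuition": tuition, "grad_rate": grad_rate})

# Series.corr computes the Pearson coefficient by default.
r = df["outstate_tuition"].corr(df["grad_rate"])
print(f"Pearson r = {r:.2f}")

# Rank every other numeric column by its correlation with the target.
ranked = df.corr()["grad_rate"].drop("grad_rate").sort_values(ascending=False)
print(ranked)
```

A coefficient of this kind measures linear association only, so it says nothing on its own about whether tuition causes higher graduation rates.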

2. Machine Learning Performance (Institutional Classification)

Task: Classify institutions as Private (1) or Public (0).
Challenge: Class imbalance (280 Public vs. 637 Private).
Test Set: 276 unseen observations.
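
To make the imbalance concrete, the sketch below computes the weights that scikit-learn's `balanced` heuristic, `n_samples / (n_classes * count)`, would assign to the class counts quoted above. It uses the full-dataset counts for illustration; a fitted model would derive its weights from the training split, so the actual values may differ slightly.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Labels rebuilt from the reported counts: 280 Public (0), 637 Private (1).
y = np.array([0] * 280 + [1] * 637)

weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y
)
for label, weight in zip(("Public (0)", "Private (1)"), weights):
    print(f"{label}: weight = {weight:.4f}")
# Public:  917 / (2 * 280) = 1.6375
# Private: 917 / (2 * 637) ≈ 0.7198
```

Errors on the minority Public class are therefore penalized a little over twice as heavily as errors on the Private class.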

Metric	Logistic Regression (Balanced)	Random Forest Classifier
Accuracy	0.8370	0.8400
Precision	0.9298	0.9200
Recall	0.8281	0.8400
F1-Score	0.8760	0.8800

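Conclusion: The Random Forest Classifier was selected as the primary model. It slightly exceeded the balanced Logistic Regression on Accuracy (0.8400 vs. 0.8370), Recall (0.8400 vs. 0.8281), and F1-Score (0.8800 vs. 0.8760), at the cost of marginally lower Precision (0.9200 vs. 0.9298). As an ensemble of decision trees, it can capture non-linear relationships, and its balanced Precision and Recall indicate that it separates institutional types without a heavy bias toward the Private majority class.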
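
As a quick consistency check on the table above, F1 can be recomputed from the reported Precision and Recall using $F_1 = 2PR / (P + R)$. The helper name `f1_from` is introduced here only for illustration.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) taken from the results table.
reported = {
    "Logistic Regression (Balanced)": (0.9298, 0.8281, 0.8760),
    "Random Forest Classifier": (0.9200, 0.8400, 0.8800),
}

for name, (p, r, f1) in reported.items():
    print(f"{name}: recomputed F1 = {f1_from(p, r):.4f}, reported F1 = {f1:.4f}")
# Logistic Regression (Balanced): recomputed F1 = 0.8760, reported F1 = 0.8760
# Random Forest Classifier: recomputed F1 = 0.8782, reported F1 = 0.8800
```

The Random Forest figures appear to be rounded to two decimals, so the recomputed value agrees with the reported one only to that precision.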

🚀 How to Run the Project

To reproduce this analysis locally:

1. Clone the repository

git clone https://github.com/rdtg94/volis-case-study.git
cd volis-case-study

2. Install dependencies

(It is highly recommended to use a virtual environment)

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install pandas numpy scikit-learn matplotlib seaborn jupyter

3. Execute the analysis

Run the notebook in a Jupyter environment:

jupyter notebook volis_case_study_Ricardo_Goncalves.ipynb

📂 Project Structure

volis-case-study/
├── volis_case_study_Ricardo_Goncalves.ipynb  # Main analysis notebook
├── volis_dataset.csv                         # Raw dataset
└── README.md                                 # This file

📚 References

The technical development of this project was grounded in standard industry literature:

Géron, A. (2019). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (2nd Ed.). O'Reilly Media.
Zheng, A., & Casari, A. (2018). Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists. O'Reilly Media.
Tukey, J. W. (1977). Exploratory Data Analysis. Addison-Wesley.
Scikit-Learn Developers. (2024). User Guide: 1.11. Ensemble methods (Random Forests). https://scikit-learn.org/stable/modules/ensemble.html#random-forests

⚠️ Important Notes

Data Leakage Prevention: All imputation and scaling operations were fitted exclusively on the training set and then applied to the test set.
Class Imbalance Handling: The class_weight='balanced' parameter was used in Logistic Regression to mitigate bias toward the majority class.
Reproducibility: Random states were fixed (random_state=42) across all stochastic operations to ensure consistent results.
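
The three notes above can be combined in one scikit-learn pipeline, sketched below under stated assumptions. The data is synthetic, the step order (impute, `log1p`, scale), the 30% stratified split, and `n_estimators=200` are illustrative choices, and none of this is taken from the notebook itself.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

SEED = 42
rng = np.random.default_rng(SEED)

# Synthetic stand-in data: ~70% class 1, skewed positive features, 5% missing.
n = 400
y = (rng.random(n) < 0.7).astype(int)
X = rng.lognormal(mean=8.0 - 0.8 * y[:, None], sigma=0.6, size=(n, 3))
X[rng.random(X.shape) < 0.05] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=SEED
)


def preprocess():
    # Inside a pipeline, medians and scaling statistics are learned by fit()
    # on the training data only and merely applied to the test data.
    return make_pipeline(
        SimpleImputer(strategy="median"),
        FunctionTransformer(np.log1p),
        StandardScaler(),
    )


models = {
    "logreg_balanced": make_pipeline(
        preprocess(), LogisticRegression(class_weight="balanced", max_iter=1000)
    ),
    "random_forest": make_pipeline(
        preprocess(), RandomForestClassifier(n_estimators=200, random_state=SEED)
    ),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name)
    print(classification_report(y_test, model.predict(X_test), digits=4))
```

Wrapping preprocessing and estimator together means the same object can be passed to cross-validation utilities without reintroducing leakage.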
