A Data Science and Machine Learning project focused on analyzing higher education metrics to support strategic decision-making.
The Board of Directors of a higher education institution ("Volis") requested a comprehensive analysis of a dataset containing information from various institutions. The primary goal is to identify patterns and trends that influence academic success, financial sustainability, and institutional profiles.
This project leverages Exploratory Data Analysis (EDA), Feature Engineering, and Machine Learning algorithms to extract actionable insights and predict institutional types (Private vs. Public).
| Author | Ricardo Daniel Teixeira Gonçalves |
|---|---|
| Date | March 2026 |
- Exploratory Data Analysis (EDA): Identify and resolve data quality issues, including missing values, logical inconsistencies, and extreme outliers.
- Correlation & Trend Analysis: Discover significant relationships between institutional variables, with a special focus on graduation rates and application volumes.
- Feature Engineering: Create derived metrics (e.g., Acceptance Rate, Enrollment Rate) to prevent extreme multicollinearity and enhance predictive power.
- Predictive Modeling: Develop and evaluate a supervised Machine Learning model to classify institutions (Private vs. Public) to support strategic benchmarking.
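The data-quality and feature-engineering objectives above can be sketched roughly as follows. This is a minimal illustration, not the notebook's code: the column names (`applications`, `accepted`, `enrolled`), the toy values, and the definition of Enrollment Rate as enrolled over accepted are all assumptions.

```python
import pandas as pd

# Toy data with hypothetical column names; the real dataset may differ.
df = pd.DataFrame({
    "applications": [1000, 2500, 400, 1200, 900, 50000],
    "accepted":     [800, 1000, 300, 1500, 600, 20000],
    "enrolled":     [400, 300, 150, 500, 200, 6000],
})

# Logical consistency: an institution cannot accept more students than
# applied, nor enroll more than it accepted.
inconsistent = (df["accepted"] > df["applications"]) | (df["enrolled"] > df["accepted"])


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


df["applications_outlier"] = iqr_outliers(df["applications"])

# Derived ratios replace strongly correlated raw counts.
df["acceptance_rate"] = df["accepted"] / df["applications"]
df["enrollment_rate"] = df["enrolled"] / df["accepted"]

print(df[inconsistent])
print(df[["applications_outlier", "acceptance_rate", "enrollment_rate"]])
```

Ratios such as these are bounded and largely independent of institution size, which is why they tend to be less collinear than the raw counts they are built from.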
| Category | Tools/Libraries |
|---|---|
| Language | Python 3.x |
| Data Processing & EDA | pandas, numpy, matplotlib, seaborn |
| Machine Learning | scikit-learn (Logistic Regression, Random Forest) |
- Data Cleaning & Integrity: IQR-based outlier detection, domain-constraint validation, and strict median imputation applied post train-test split to prevent data leakage.
- Feature Transformation: Logarithmic transformation (`log1p`) for skewed variables and `StandardScaler` for standardization.
- Modeling Strategies:
  - Logistic Regression (Baseline)
  - Logistic Regression with `class_weight='balanced'` (for imbalanced data)
  - Random Forest Classifier (Final selected model)
- Evaluation Metrics: Accuracy, Precision, Recall, F1-Score, and Classification Reports.
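One way to wire the steps above together is a scikit-learn pipeline, which fits the imputer and scaler on the training data only. The sketch below runs on synthetic stand-in data, applies `log1p` to every feature for brevity (the project applies it to skewed variables only), and uses the same preprocessing for all three models; the helper `build` and these simplifications are assumptions, not the notebook's actual code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Synthetic stand-in for the dataset: imbalanced, non-negative, right-skewed.
X, y = make_classification(n_samples=917, n_features=8, weights=[0.305], random_state=42)
X = np.expm1(np.abs(X))
rng = np.random.default_rng(42)
X[rng.random(X.shape) < 0.05] = np.nan  # inject some missing values

# Split first, so nothing is learned from the test rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)


def build(classifier):
    """Median imputation -> log1p -> standardization -> classifier."""
    return make_pipeline(
        SimpleImputer(strategy="median"),
        FunctionTransformer(np.log1p),
        StandardScaler(),
        classifier,
    )


models = {
    "logreg_baseline": build(LogisticRegression(max_iter=1000)),
    "logreg_balanced": build(LogisticRegression(max_iter=1000, class_weight="balanced")),
    "random_forest": build(RandomForestClassifier(random_state=42)),
}

predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)  # medians, means and variances come from train only
    predictions[name] = model.predict(X_test)
    print(name)
    print(classification_report(y_test, predictions[name], digits=4))
```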
| Insight | Description |
|---|---|
| Graduation Rate Drivers | Out-of-state tuition fees showed a strong positive correlation with graduation rates ( |
| Application Volume | The structural distinction between Public and Private institutions proved to be the strongest predictor of application volumes, overshadowing traditional academic prestige metrics. |
Task: Classify institutions as Private (1) or Public (0).
Challenge: Class imbalance (280 Public vs. 637 Private).
Test Set: 276 unseen observations.
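A test set of 276 rows is what a stratified 70/30 split of the 917 institutions would produce, although the split ratio is not stated here and is an assumption. A minimal sketch with placeholder features, showing that stratification keeps the Private share similar in both partitions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels rebuilt from the stated class counts: 280 Public (0), 637 Private (1).
y = np.array([0] * 280 + [1] * 637)
# Placeholder feature matrix; the real features come from the dataset.
X = np.arange(len(y)).reshape(-1, 1)

# test_size=0.3 is an assumption; scikit-learn rounds the test size up.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

print(len(y_train), len(y_test))
# Share of Private institutions overall, in train, and in test.
print(y.mean(), y_train.mean(), y_test.mean())
```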
| Metric | Logistic Regression (Balanced) | Random Forest Classifier |
|---|---|---|
| Accuracy | 0.8370 | 0.8400 |
| Precision | 0.9298 | 0.9200 |
| Recall | 0.8281 | 0.8400 |
| F1-Score | 0.8760 | 0.8800 |
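For reference, the four metrics are computed with Private (1) as the positive class. The confusion-matrix counts in the sketch below are not taken from the notebook: they are back-calculated, assuming the 276 test rows contain 192 Private and 84 Public institutions, and are simply one set of counts that reproduces the Logistic Regression (Balanced) column.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Assumed counts (back-calculated, not from the source notebook).
tp, fn, fp, tn = 159, 33, 12, 72

# Rebuild label vectors that realize those counts.
y_true = np.array([1] * (tp + fn) + [0] * (fp + tn))
y_pred = np.array([1] * tp + [0] * fn + [1] * fp + [0] * tn)

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),    # (TP + TN) / N
    "precision": precision_score(y_true, y_pred),  # TP / (TP + FP)
    "recall": recall_score(y_true, y_pred),        # TP / (TP + FN)
    "f1": f1_score(y_true, y_pred),                # 2PR / (P + R)
}
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```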
Conclusion: The Random Forest Classifier was selected as the primary model. It edged out the balanced Logistic Regression on F1-Score (0.8800 vs. 0.8760) and Recall (0.8400 vs. 0.8281), at a slightly lower Precision (0.9200 vs. 0.9298). The margins are small, so the choice also rests on the ensemble's ability to capture non-linear relationships while separating institutional types without a heavy bias toward the Private majority class.
To reproduce this analysis locally:

    git clone https://github.com/rdtg94/volis-case-study.git
    cd volis-case-study

(It is highly recommended to use a virtual environment.)

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
    pip install pandas numpy scikit-learn matplotlib seaborn jupyter

Run the notebook in a Jupyter environment:

    jupyter notebook volis_case_study_Ricardo_Goncalves.ipynb

Repository layout:

    volis-case-study/
    ├── volis_case_study_Ricardo_Goncalves.ipynb  # Main analysis notebook
    ├── volis_dataset.csv                         # Raw dataset
    └── README.md                                 # This file
The technical development of this project was grounded in standard industry literature:
- Géron, A. (2019). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (2nd Ed.). O'Reilly Media.
- Zheng, A., & Casari, A. (2018). Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists. O'Reilly Media.
- Tukey, J. W. (1977). Exploratory Data Analysis. Addison-Wesley.
- Scikit-Learn Developers. (2024). User Guide: 1.11. Ensemble methods (Random Forests). https://scikit-learn.org/stable/modules/ensemble.html#random-forests
- Data Leakage Prevention: All imputation and scaling operations were fitted exclusively on the training set and then applied to the test set.
- Class Imbalance Handling: The `class_weight='balanced'` parameter was used in Logistic Regression to mitigate bias toward the majority class.
- Reproducibility: Random states were fixed (`random_state=42`) across all stochastic operations to ensure consistent results.
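As an illustration of the class-imbalance note, `class_weight='balanced'` weights each class by `n_samples / (n_classes * class_count)`. The sketch applies that formula to the full-dataset counts (280 Public, 637 Private); in the actual workflow the weights would be derived from the training labels only, so the values here are illustrative.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Labels rebuilt from the stated class counts: 280 Public (0), 637 Private (1).
y = np.array([0] * 280 + [1] * 637)

# Weights as scikit-learn computes them for class_weight='balanced'.
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)

# The same formula written out by hand: n_samples / (n_classes * class_count).
manual = len(y) / (2 * np.bincount(y))

print(dict(zip(["Public (0)", "Private (1)"], weights)))
```

The minority Public class receives the larger weight, so its misclassifications cost more in the Logistic Regression loss.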
