This project aims to forecast monthly U.S. stock returns using machine learning, leveraging financial characteristics observable at the time of prediction. The target variable is derived from the monthly CRSP dataset, while predictors come from quarterly Compustat fundamentals and JKP factor characteristics, both offering rich insights into firm behavior and asset pricing dynamics.
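To illustrate what "observable at the time of prediction" can mean when quarterly fundamentals are matched to monthly returns, here is a minimal sketch of a point-in-time lookup. It is not taken from data_preprocessing.py: the 90-day reporting lag, the function name `latest_available`, and the numbers are all hypothetical and serve only to show the idea.

```python
from bisect import bisect_right
from datetime import date, timedelta

# Hypothetical reporting lag: assume a quarterly figure becomes public
# 90 days after the fiscal quarter ends (an assumption, not the project's rule).
REPORTING_LAG = timedelta(days=90)


def latest_available(quarter_ends, values, prediction_date, lag=REPORTING_LAG):
    """Return the most recent quarterly value already public on prediction_date.

    quarter_ends must be sorted ascending; values[i] belongs to quarter_ends[i].
    Returns None when nothing has been published yet.
    """
    available_from = [q + lag for q in quarter_ends]
    idx = bisect_right(available_from, prediction_date)
    return values[idx - 1] if idx > 0 else None


quarter_ends = [date(2020, 3, 31), date(2020, 6, 30), date(2020, 9, 30)]
book_to_market = [0.50, 0.55, 0.60]  # made-up illustrative numbers

# On 2020-08-31 the Q2 figure (public from 2020-09-28 under the assumed lag)
# is not yet available, so the Q1 figure is used.
feature = latest_available(quarter_ends, book_to_market, date(2020, 8, 31))
```

Using the availability date rather than the fiscal quarter end is what keeps future information out of the predictors; the actual lag convention used in this project may differ.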
├── main.py                  # Main script to run the full pipeline
├── data_preprocessing.py    # Raw-to-ML dataset construction
├── linear_models.py         # Linear regression and shrinkage models (e.g., Ridge, Lasso)
├── mlp_model.py             # Feedforward neural network (MLP)
├── xgboost_model.py         # Gradient boosting model
├── requirements.txt         # Required dependencies
├── data/
│   ├── Predictors/          # Merged predictors and processed X
│   │   ├── ccmxpf_linktable.csv
│   │   ├── CompFirmCharac.csv
│   │   ├── jkp_characteristic.plk
│   │   ├── merged.pkl
│   │   └── X.pkl
│   └── Targets/
│       ├── monthly_crsp.csv
│       └── y.pkl
├── plots/                   # Output directory for plots
└── project_report.pdf       # Full project report and results
Ensure you are using Python 3.8+ and install the required packages:
pip install -r requirements.txt
It's recommended to use a virtual environment (venv or conda).
Download the raw data and place the files in the appropriate folders:
data/
├── Predictors/
│   ├── ccmxpf_linktable.csv
│   ├── CompFirmCharac.csv
│   └── jkp_characteristic.plk
└── Targets/
    └── monthly_crsp.csv
Processed datasets (merged.pkl, X.pkl, y.pkl) will be automatically generated and saved in the same directories.
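Before running the pipeline, it can help to confirm that the raw inputs are where the layout above expects them. The following is a small sketch using only the standard library; the helper `missing_raw_files` is hypothetical and not part of the repository, and the file names are copied as they appear in the layout.

```python
from pathlib import Path

# Raw inputs named in the layout above, relative to the data/ directory.
RAW_FILES = [
    "Predictors/ccmxpf_linktable.csv",
    "Predictors/CompFirmCharac.csv",
    "Predictors/jkp_characteristic.plk",
    "Targets/monthly_crsp.csv",
]


def missing_raw_files(data_dir="data"):
    """Return the raw input paths that are not present under data_dir."""
    root = Path(data_dir)
    return [name for name in RAW_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_raw_files()
    if missing:
        print("Missing raw files:", ", ".join(missing))
    else:
        print("All raw inputs found.")
```

Run from the repository root, this prints any raw file that still needs to be downloaded.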
To execute the full workflow from data preprocessing to model training and evaluation:
python main.py
This will train models, evaluate them, and save results and plots in the plots/ directory.
- OLS, Ridge, and Lasso Regression (linear_models.py)
- Multi-Layer Perceptron (MLP) (mlp_model.py)
- XGBoost Regressor (xgboost_model.py)
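To make the difference between the shrinkage models concrete, the sketch below shows how Ridge and Lasso modify an OLS coefficient in the textbook special case of an orthonormal design, where both have closed forms. This is an illustration only and not the code in linear_models.py; the function names, the penalty value, and the coefficients are made up.

```python
def ridge_shrink(beta_ols, lam):
    """Ridge coefficient for one predictor of an orthonormal design.

    Minimises 0.5 * (beta_ols - b)**2 + 0.5 * lam * b**2 over b.
    """
    return beta_ols / (1.0 + lam)


def lasso_shrink(beta_ols, lam):
    """Lasso coefficient (soft-thresholding) under the same orthonormal design.

    Minimises 0.5 * (beta_ols - b)**2 + lam * abs(b) over b.
    """
    magnitude = max(abs(beta_ols) - lam, 0.0)
    return magnitude if beta_ols >= 0 else -magnitude


ols_coefs = [0.8, -0.3, 0.05]  # made-up OLS estimates
lam = 0.1                      # made-up penalty strength
ridge = [ridge_shrink(b, lam) for b in ols_coefs]
lasso = [lasso_shrink(b, lam) for b in ols_coefs]
# Ridge scales every coefficient toward zero by the same factor;
# Lasso subtracts lam from each magnitude and zeroes out the smallest one.
```

With many noisy predictors, this is why Lasso tends to drop weak characteristics entirely while Ridge keeps all of them with reduced weight.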
- Plots of predicted vs. actual returns
- Information coefficient (IC) distribution and rolling IC
- All outputs saved in the plots/ folder
Mahe Velay, Elias Bourgon, Adélaïde Robert, Théodore Decaux, Benjamin Beretz