A minimal MLOps-style batch pipeline that demonstrates reproducibility, observability, and deployment readiness for OHLCV trading-signal pipelines.
```
.
├── run.py            # Main pipeline script
├── config.yaml       # Job configuration (seed, window, version)
├── data.csv          # 10,000-row OHLCV dataset
├── requirements.txt  # Python dependencies
├── Dockerfile        # Container definition
├── metrics.json      # Sample output from a successful run
├── run.log           # Sample log from a successful run
└── README.md         # This file
```
- Python 3.9+
- pip
```bash
pip install -r requirements.txt
```

```bash
python run.py \
  --input data.csv \
  --config config.yaml \
  --output metrics.json \
  --log-file run.log
```

After a successful run, `metrics.json` and `run.log` are written to the current directory, and the final metrics JSON is also printed to stdout.
```bash
docker build -t mlops-task .
docker run --rm mlops-task
```

The container includes `data.csv` and `config.yaml`, runs the pipeline, writes `metrics.json` and `run.log` inside the container, and prints the final metrics JSON to stdout.
To retrieve output files from the container, mount a host directory:
```bash
docker run --rm -v "$(pwd)/output:/app/output" \
  mlops-task \
  python run.py --input data.csv --config config.yaml \
    --output output/metrics.json --log-file output/run.log
```

Exit codes: 0 = success, non-zero = failure.
| Key | Type | Description |
|---|---|---|
| `seed` | int | NumPy random seed for determinism |
| `window` | int | Rolling mean window size (rows) |
| `version` | string | Pipeline version tag |
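The key and type checks in the table above can be sketched as a small validation step. This is illustrative, not the code in `run.py`; the names `REQUIRED_KEYS` and `validate_config` are assumptions, and the input is a dict as `yaml.safe_load` would return it.

```python
# Illustrative config validation for a parsed YAML mapping (hypothetical names).
REQUIRED_KEYS = {"seed": int, "window": int, "version": str}

def validate_config(cfg):
    """Raise ValueError if the config is not a dict with the required typed keys."""
    if not isinstance(cfg, dict):
        raise ValueError("Config must be a mapping")
    for key, expected in REQUIRED_KEYS.items():
        if key not in cfg:
            raise ValueError(f"Missing required config key: {key!r}")
        # bool is a subclass of int in Python, so reject it for int fields
        if expected is int and isinstance(cfg[key], bool):
            raise ValueError(f"Config key {key!r} must be int, got bool")
        if not isinstance(cfg[key], expected):
            raise ValueError(f"Config key {key!r} must be {expected.__name__}")
    return cfg
```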
```yaml
seed: 42
window: 5
version: "v1"
```

- Load & validate config — parse the YAML, assert required keys and types, set `numpy.random.seed`.
- Load & validate dataset — check the file exists, the CSV is parseable and non-empty, and the `close` column is present and numeric.
- Rolling mean — `close.rolling(window=window, min_periods=window).mean()`. The first `window - 1` rows produce `NaN` and are excluded from signal computation.
- Signal — `signal = 1` if `close > rolling_mean`, else `0`. Rows with a `NaN` rolling mean are excluded.
- Metrics — `rows_processed` (valid signal rows), `signal_rate` (mean of the signal), `latency_ms`.
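The rolling-mean, signal, and metrics steps above can be sketched in a few lines of pandas. This is a minimal illustration under the stated rules, not the actual `run.py` implementation; the function name `compute_signal_metrics` and the tiny example frame are assumptions.

```python
import pandas as pd

def compute_signal_metrics(df, window):
    """Rolling-mean crossover signal, mirroring the steps above (illustrative)."""
    rolling_mean = df["close"].rolling(window=window, min_periods=window).mean()
    valid = rolling_mean.notna()  # first window-1 rows are NaN and excluded
    signal = (df.loc[valid, "close"] > rolling_mean[valid]).astype(int)
    return {
        "rows_processed": int(valid.sum()),   # rows with a valid rolling mean
        "signal_rate": float(signal.mean()),  # fraction of rows with signal == 1
    }

# Example: 10 rows with window=5 -> the first 4 rows are excluded, 6 processed
df = pd.DataFrame({"close": [1, 2, 3, 4, 5, 6, 5, 4, 3, 2]})
metrics = compute_signal_metrics(df, window=5)
```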
```json
{
  "version": "v1",
  "rows_processed": 9996,
  "metric": "signal_rate",
  "value": 0.4973,
  "latency_ms": 47,
  "seed": 42,
  "status": "success"
}
```

Note: `rows_processed` is 9996 (not 10,000) because the first 4 rows have no valid rolling mean with `window=5` and are excluded from signal computation. This is deterministic and reproducible.
All validation errors produce an error-format `metrics.json` and a non-zero exit code:
```json
{
  "version": "v1",
  "status": "error",
  "error_message": "Required column 'close' not found. Available columns: [...]"
}
```

Handled error cases:
- Missing input file
- Invalid / malformed CSV
- Empty CSV
- Missing `close` column
- Invalid config structure or missing required fields
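The error path can be sketched as: build the error-format dict shown above, write it to `metrics.json`, and exit non-zero. This is a hedged illustration; `error_metrics` and `fail` are hypothetical names, not functions from `run.py`.

```python
import json
import sys

def error_metrics(version, message):
    """Build the error-format metrics dict shown above (illustrative helper)."""
    return {"version": version, "status": "error", "error_message": message}

def fail(version, message, output_path="metrics.json"):
    """Write error-format metrics and exit with a non-zero code (hypothetical)."""
    payload = error_metrics(version, message)
    with open(output_path, "w") as f:
        json.dump(payload, f, indent=2)
    print(json.dumps(payload), file=sys.stderr)
    sys.exit(1)
```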
Running the pipeline multiple times with the same `config.yaml` and `data.csv` always produces identical `metrics.json` values. The `seed` field in the config controls `numpy.random.seed`, ensuring deterministic output.
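A determinism check can be sketched as running the computation twice with the same seed and comparing the metric. Note this demo generates synthetic close prices from the seed for self-containment, whereas the real pipeline reads the fixed `data.csv`; `run_once` is an illustrative name.

```python
import numpy as np
import pandas as pd

def run_once(seed, n=1000, window=5):
    """Seed NumPy, build a synthetic price series, and compute signal_rate."""
    np.random.seed(seed)
    close = pd.Series(100 + np.random.randn(n).cumsum())
    rolling_mean = close.rolling(window=window, min_periods=window).mean()
    valid = rolling_mean.notna()
    signal = (close[valid] > rolling_mean[valid]).astype(int)
    return round(float(signal.mean()), 4)

# Two runs with the same seed yield identical metrics
assert run_once(42) == run_once(42)
```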