MLOps Batch Job — Rolling Mean Signal Generator

A minimal MLOps-style batch pipeline that demonstrates reproducibility, observability, and deployment readiness for OHLCV trading-signal pipelines.


Project Structure

.
├── run.py            # Main pipeline script
├── config.yaml       # Job configuration (seed, window, version)
├── data.csv          # 10,000-row OHLCV dataset
├── requirements.txt  # Python dependencies
├── Dockerfile        # Container definition
├── metrics.json      # Sample output from a successful run
├── run.log           # Sample log from a successful run
└── README.md         # This file

Local Run

Prerequisites

  • Python 3.9+
  • pip

Install dependencies

pip install -r requirements.txt

Run the pipeline

python run.py \
  --input    data.csv \
  --config   config.yaml \
  --output   metrics.json \
  --log-file run.log

After a successful run, metrics.json and run.log are written to the current directory and the final metrics JSON is also printed to stdout.


Docker Build & Run

Build

docker build -t mlops-task .

Run

docker run --rm mlops-task

The container includes data.csv and config.yaml, runs the pipeline, writes metrics.json and run.log inside the container, and prints the final metrics JSON to stdout.

To retrieve output files from the container, mount a host directory:

docker run --rm -v "$(pwd)/output:/app/output" \
  mlops-task \
  python run.py --input data.csv --config config.yaml \
                --output output/metrics.json --log-file output/run.log

Exit codes: 0 = success, non-zero = failure.


Configuration (config.yaml)

Key      Type    Description
seed     int     NumPy random seed for determinism
window   int     Rolling mean window size (rows)
version  string  Pipeline version tag

Example config.yaml:

seed: 42
window: 5
version: "v1"
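A minimal validation sketch for these keys, assuming the YAML has already been parsed into a dict (the function name is an assumption; run.py's actual validation may differ):

```python
import numpy as np

# Required config keys and their expected types (per the table above).
REQUIRED = {"seed": int, "window": int, "version": str}

def validate_config(cfg: dict) -> dict:
    """Check required keys/types, then seed NumPy for determinism."""
    for key, typ in REQUIRED.items():
        if key not in cfg:
            raise ValueError(f"Missing required config key: {key}")
        if not isinstance(cfg[key], typ):
            raise TypeError(f"Config key {key!r} must be {typ.__name__}")
    if cfg["window"] < 1:
        raise ValueError("window must be >= 1")
    np.random.seed(cfg["seed"])
    return cfg

validate_config({"seed": 42, "window": 5, "version": "v1"})
```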

Pipeline Logic

  1. Load & validate config — parse YAML, assert required keys and types, set numpy.random.seed.
  2. Load & validate dataset — check file exists, CSV is parseable, not empty, close column present and numeric.
  3. Rolling mean — close.rolling(window=window, min_periods=window).mean(). The first window-1 rows produce NaN and are excluded from signal computation.
  4. Signal — signal = 1 if close > rolling_mean, else 0. Rows with NaN rolling mean are excluded.
  5. Metrics — rows_processed (valid signal rows), signal_rate (mean of signal), latency_ms.
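Steps 3–5 can be sketched with pandas on a tiny synthetic series (compute_signal_metrics is a hypothetical helper for illustration, not the actual run.py code):

```python
import pandas as pd

def compute_signal_metrics(close: pd.Series, window: int) -> dict:
    # Step 3: rolling mean; the first window-1 rows are NaN.
    rolling_mean = close.rolling(window=window, min_periods=window).mean()
    # Step 4: signal = 1 where close > rolling mean; NaN rows excluded.
    valid = rolling_mean.notna()
    signal = (close[valid] > rolling_mean[valid]).astype(int)
    # Step 5: metrics over the valid rows only.
    return {
        "rows_processed": int(valid.sum()),
        "signal_rate": float(signal.mean()),
    }

# 10 rows with window=5 -> 6 valid rows.
close = pd.Series([1, 2, 3, 4, 5, 6, 5, 4, 3, 2], dtype=float)
print(compute_signal_metrics(close, window=5))
# → {'rows_processed': 6, 'signal_rate': 0.5}
```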

Example metrics.json

{
  "version": "v1",
  "rows_processed": 9996,
  "metric": "signal_rate",
  "value": 0.4973,
  "latency_ms": 47,
  "seed": 42,
  "status": "success"
}

Note: rows_processed is 9996 (not 10000) because the first 4 rows have no valid rolling mean with window=5 and are excluded from signal computation. This is deterministic and reproducible.
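The arithmetic behind that count can be written out explicitly (the helper name here is an assumption, not part of run.py):

```python
# With min_periods == window, the first window - 1 rows have no rolling mean,
# so they are excluded from signal computation.
def expected_rows_processed(total_rows: int, window: int) -> int:
    return total_rows - (window - 1)

print(expected_rows_processed(10_000, 5))  # 9996
```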


Error Handling

All validation errors produce an error-format metrics.json and a non-zero exit code:

{
  "version": "v1",
  "status": "error",
  "error_message": "Required column 'close' not found. Available columns: [...]"
}

Handled error cases:

  • Missing input file
  • Invalid / malformed CSV
  • Empty CSV
  • Missing close column
  • Invalid config structure or missing required fields
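The error path above can be sketched as follows (fail is a hypothetical helper, not run.py's actual function): write the error-format JSON, echo it to stdout like the success path, and exit non-zero.

```python
import json
import sys

def fail(output_path: str, version: str, message: str) -> None:
    """Write an error-format metrics.json, echo it, and exit non-zero."""
    payload = {"version": version, "status": "error", "error_message": message}
    with open(output_path, "w") as f:
        json.dump(payload, f, indent=2)
    print(json.dumps(payload))  # also echo to stdout, like the success path
    sys.exit(1)
```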

Reproducibility

Running the pipeline multiple times with the same config.yaml and data.csv always produces identical metrics.json values. The seed field in the config controls numpy.random.seed, ensuring deterministic output.
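This property can be demonstrated with a self-contained check (synthetic data stands in for data.csv; run_once is illustrative, not run.py itself):

```python
import numpy as np
import pandas as pd

def run_once(seed: int, window: int) -> dict:
    # Same seed -> same synthetic "close" series -> same metrics.
    np.random.seed(seed)
    close = pd.Series(np.random.rand(100).cumsum())
    rm = close.rolling(window=window, min_periods=window).mean()
    valid = rm.notna()
    return {
        "rows_processed": int(valid.sum()),
        "signal_rate": float((close[valid] > rm[valid]).mean()),
    }

assert run_once(42, 5) == run_once(42, 5)  # deterministic given the same seed
```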
