AkCodes23/GPU-scripts

GPU Runner Repository

A production-ready repository providing complete, tested, and documented tooling for running ML training and inference on NVIDIA GPUs. Includes support for single-GPU, multi-GPU, multi-node distributed training, GPU task scheduling, and optional cuTile (CUDA Tile) kernel integration.

🚀 Features

  • Single-GPU Training: Full training loop with mixed precision, checkpointing, and reproducibility
  • Multi-GPU Training: DataParallel (prototyping) and DistributedDataParallel (production)
  • Multi-Node Training: Ready for torchrun / torch.distributed.run
  • GPU Task Scheduler: Pack multiple independent jobs across GPUs on a single machine
  • Production Deployment: Docker, Kubernetes, and Slurm configurations
  • Monitoring & Logging: GPU monitoring, log collection, and metrics tracking
  • cuTile Integration: Optional support for CUDA Tile Python kernels

πŸ“ Repository Structure

gpu-runner-repo/
├── README.md                    # This file
├── LICENSE                      # MIT License
├── .gitignore                   # Git ignore patterns
├── requirements.txt             # Python dependencies
├── docker/
│   ├── Dockerfile               # CUDA-ready container
│   └── docker-compose.yml       # Local development setup
├── scripts/
│   ├── run_single_gpu.sh        # Single GPU launch script
│   ├── run_dataparallel.sh      # DataParallel launch script
│   ├── run_ddp.sh               # DDP/torchrun launch script
│   ├── multi_task_scheduler.py  # GPU task packer
│   ├── slurm_submit.sbatch      # Slurm job submission
│   └── k8s_job.yaml             # Kubernetes Job manifest
├── src/
│   ├── train/
│   │   ├── train_single_gpu.py  # Single GPU training
│   │   ├── train_dataparallel.py  # DataParallel training
│   │   ├── train_ddp.py         # DDP training
│   │   └── utils.py             # Shared utilities
│   ├── inference/
│   │   └── inference.py         # Batch inference
│   ├── cutile_examples/
│   │   ├── cutile_vector_add.py # cuTile vector add example
│   │   └── cutile_integration_example.py  # PyTorch integration
│   └── examples/
│       └── simple_dataset.py    # Sample dataset
├── configs/
│   ├── default.yaml             # Default training config
│   └── jobs.json                # Multi-task scheduler jobs
├── ops/
│   ├── monitor_gpu.sh           # GPU monitoring
│   └── collect_logs.sh          # Log collection
├── tests/
│   ├── test_single_gpu.py       # CPU-compatible tests
│   └── test_ddp_local.py        # DDP smoke tests
└── .github/
    └── workflows/
        └── ci.yml               # GitHub Actions CI

⚡ Quick Start

Prerequisites

  • Python 3.8+
  • PyTorch 2.0+ (with CUDA support for GPU training)
  • NVIDIA GPU with CUDA drivers (optional; the training scripts fall back to CPU)
  • NVIDIA CUDA Toolkit 13.1+ (required only for cuTile)

Installation

# Clone the repository
git clone https://github.com/yourusername/gpu-runner-repo.git
cd gpu-runner-repo

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Linux/Mac
# or: venv\Scripts\activate  # Windows

# Install dependencies
pip install -r requirements.txt

# Optional: Install cuTile for CUDA Tile kernel support
pip install cuda-tile  # Requires CUDA Toolkit 13.1+

Running Training

Single GPU Training

# Run on GPU 0
bash scripts/run_single_gpu.sh 0

# Or directly with Python
CUDA_VISIBLE_DEVICES=0 python src/train/train_single_gpu.py \
    --config configs/default.yaml \
    --epochs 10
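
The core of a single-GPU script is a standard loop plus mixed precision. The sketch below shows that pattern with a placeholder model and random data; it is illustrative, not the repository's actual train_single_gpu.py:

```python
# Minimal single-GPU training step with mixed precision (illustrative sketch).
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(16, 4).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
# GradScaler guards fp16 gradients against underflow; disabled on CPU
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

def train_step(x, y):
    optimizer.zero_grad(set_to_none=True)
    # autocast runs the forward pass in reduced precision on GPU
    with torch.autocast(device_type=device, enabled=(device == "cuda")):
        loss = loss_fn(model(x), y)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()

x = torch.randn(8, 16, device=device)
y = torch.randint(0, 4, (8,), device=device)
print(train_step(x, y))
```

The same step works unchanged on CPU, which is what makes the tests in tests/ runnable without a GPU.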

DataParallel (Multi-GPU, Simple)

# Use all available GPUs
bash scripts/run_dataparallel.sh

# Or specify GPUs
CUDA_VISIBLE_DEVICES=0,1 python src/train/train_dataparallel.py \
    --config configs/default.yaml
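
Under the hood, DataParallel is a one-line wrapper: it replicates the module onto each visible GPU and splits every batch along dimension 0 inside a single process. A minimal sketch with a placeholder model (not the repository's script):

```python
# DataParallel: single-process, multi-GPU data splitting (illustrative sketch).
import torch
import torch.nn as nn

model = nn.Linear(16, 4)
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)  # replicates the module on each forward pass
    model = model.cuda()
out = model(torch.randn(8, 16))  # batch dim 0 is scattered across GPUs
```

Because replication happens every forward pass and all gradients funnel through one process, DDP is the faster choice for real workloads; DataParallel is for prototyping.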

DistributedDataParallel (Production)

# Run on 2 GPUs on one node
bash scripts/run_ddp.sh 2

# Or use torchrun directly
torchrun --nproc_per_node=2 src/train/train_ddp.py \
    --config configs/default.yaml \
    --epochs 10

Multi-Node DDP

# On node 0 (master):
torchrun --nproc_per_node=4 --nnodes=2 --node_rank=0 \
    --master_addr="192.168.1.1" --master_port=29500 \
    src/train/train_ddp.py --config configs/default.yaml

# On node 1:
torchrun --nproc_per_node=4 --nnodes=2 --node_rank=1 \
    --master_addr="192.168.1.1" --master_port=29500 \
    src/train/train_ddp.py --config configs/default.yaml
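
torchrun exports RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT for every worker, so the script itself only needs to initialize the process group. A sketch of that boilerplate (illustrative; the repository's train_ddp.py may differ):

```python
# Process-group setup that a typical train_ddp.py contains (sketch).
import os
import torch
import torch.distributed as dist

def setup_distributed():
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend)  # reads rank/world size from env
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    if torch.cuda.is_available():
        torch.cuda.set_device(local_rank)
    return local_rank

if __name__ == "__main__":
    # Single-process defaults so the sketch also runs without torchrun
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    os.environ.setdefault("RANK", "0")
    os.environ.setdefault("WORLD_SIZE", "1")
    local_rank = setup_distributed()
    model = torch.nn.Linear(16, 4)
    if torch.cuda.is_available():
        model = model.cuda(local_rank)
    ddp_model = torch.nn.parallel.DistributedDataParallel(model)
    # ... training loop here; pair the DataLoader with a DistributedSampler ...
    dist.destroy_process_group()
```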

Multi-Task GPU Scheduler

Run multiple independent tasks packed across available GPUs:

# Edit configs/jobs.json with your tasks
python scripts/multi_task_scheduler.py --config configs/jobs.json

Example jobs.json:

{
  "jobs": [
    {"name": "job1", "command": "python train.py --exp 1", "gpus": 1},
    {"name": "job2", "command": "python train.py --exp 2", "gpus": 2}
  ]
}

🐳 Docker

Build and Run

# Build the image
docker build -t gpu-runner -f docker/Dockerfile .

# Run with GPU access
docker run --gpus all -v $(pwd):/workspace gpu-runner \
    python src/train/train_single_gpu.py --config configs/default.yaml

Docker Compose

# Start development environment
docker-compose -f docker/docker-compose.yml up -d

# Attach to container
docker-compose -f docker/docker-compose.yml exec gpu-runner bash

☸️ Kubernetes

Deploy a training job to Kubernetes:

# Apply the job manifest
kubectl apply -f scripts/k8s_job.yaml

# Monitor the job
kubectl logs -f job/gpu-training-job

🖥️ Slurm

Submit a job to a Slurm cluster:

# Submit the job
sbatch scripts/slurm_submit.sbatch

# Check job status
squeue -u $USER

🔧 cuTile (CUDA Tile) Integration

This repository includes optional support for cuTile (CUDA Tile Python), which lets you write high-performance, tile-based CUDA kernels directly in Python.

Requirements

Important: cuTile requires CUDA Toolkit 13.1+. Ensure your system or container has the appropriate CUDA Toolkit version installed.

Installation

# Install cuTile (optional dependency)
pip install cuda-tile

# Verify installation
python -c "import cuda.tile as ct; print('cuTile version:', ct.__version__)"

Check cuTile Availability

from src.train.utils import check_cutile_available

info = check_cutile_available()
print(f"cuTile available: {info['available']}")
print(f"CUDA Toolkit required: {info['min_cuda_version']}")
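
One plausible shape for such a check is a find_spec probe that never hard-imports the optional dependency. This is a sketch of the idea, not the repository's actual utils.py:

```python
# Probe for the optional cuda.tile module without importing it at module load.
import importlib.util

def check_cutile_available():
    """Report whether cuda.tile can be imported, plus the toolkit floor."""
    return {
        # short-circuit: only look for the submodule if the parent package exists
        "available": importlib.util.find_spec("cuda") is not None
                     and importlib.util.find_spec("cuda.tile") is not None,
        "min_cuda_version": "13.1",
    }
```

Keeping the probe lazy means the training scripts import cleanly on machines without cuTile and can fall back to plain PyTorch.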

Example: Vector Addition Kernel

python src/cutile_examples/cutile_vector_add.py

PyTorch Integration

# Run integration example (falls back to PyTorch if cuTile unavailable)
python src/cutile_examples/cutile_integration_example.py

📊 Monitoring

GPU Monitoring

# Continuous GPU monitoring
bash ops/monitor_gpu.sh

# Or use nvidia-smi directly
watch -n 1 nvidia-smi

Log Collection

# Collect logs from a training run
bash ops/collect_logs.sh ./logs ./collected_logs

🧪 Testing

# Run CPU tests (works without GPU)
pytest -q tests/

# Run specific test
pytest tests/test_single_gpu.py -v

# Run DDP test (requires 2+ GPUs)
pytest tests/test_ddp_local.py -v

⚙️ Configuration

Training Configuration (configs/default.yaml)

model:
  name: resnet18
  num_classes: 10

training:
  epochs: 10
  batch_size: 32
  learning_rate: 0.001
  weight_decay: 0.0001

data:
  num_workers: 4
  pin_memory: true

checkpoint:
  save_dir: ./checkpoints
  save_every: 5
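
A training script typically consumes this file as a plain nested dict. A minimal loader sketch (assumes PyYAML is among the dependencies in requirements.txt):

```python
# Load a YAML config into a nested dict (sketch; assumes PyYAML is installed).
import yaml

def load_config(path):
    """Parse a YAML config file; returns nested dicts/lists/scalars."""
    with open(path) as f:
        return yaml.safe_load(f)

# e.g. cfg = load_config("configs/default.yaml")
#      cfg["training"]["batch_size"]  -> 32
```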

Environment Variables

For production deployments, consider setting these environment variables:

# NCCL settings (for multi-GPU/multi-node)
export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=0
export NCCL_SOCKET_IFNAME=eth0

# OpenMP settings
export OMP_NUM_THREADS=4

# MKL settings (for Intel CPUs)
export MKL_NUM_THREADS=4

# PyTorch settings
export TORCH_DISTRIBUTED_DEBUG=DETAIL
export PYTHONUNBUFFERED=1

πŸ“ Best Practices

Reproducibility

  1. Set random seeds using utils.set_seed(seed)
  2. Set torch.backends.cudnn.deterministic = True (and torch.backends.cudnn.benchmark = False) for deterministic cuDNN kernels
  3. Document all hyperparameters in config files
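
A seed helper along the lines of utils.set_seed covers Python, the hash seed, and PyTorch in one call (a sketch; the repository's helper may differ, e.g. by also seeding NumPy):

```python
# Seed every RNG the training loop touches (illustrative sketch).
import os
import random
import torch

def set_seed(seed: int):
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    torch.manual_seed(seed)  # seeds CPU and all CUDA devices
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(42)
a = torch.randn(3)
set_seed(42)
b = torch.randn(3)  # identical draw after reseeding
```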

Checkpointing

  1. Save model, optimizer, scheduler, and epoch state
  2. Only save on rank 0 in distributed training
  3. Implement checkpoint resume functionality
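
The three points above reduce to a pair of helpers; in DDP, guard the save with a rank check such as `dist.get_rank() == 0`. A sketch with a placeholder model (not the repository's checkpointing code):

```python
# Save and restore full training state for resumable runs (sketch).
import torch
import torch.nn as nn

def save_checkpoint(path, model, optimizer, epoch):
    # In DDP, call this only on rank 0 to avoid concurrent writes
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "epoch": epoch,
    }, path)

def load_checkpoint(path, model, optimizer):
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["epoch"]  # resume from the next epoch

model = nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
save_checkpoint("ckpt.pt", model, opt, epoch=5)
resumed = load_checkpoint("ckpt.pt", model, opt)
```

Saving the scheduler's state_dict alongside these keeps learning-rate schedules correct across restarts.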

Memory Management

  1. Use gradient accumulation for large models
  2. Enable mixed precision training with torch.cuda.amp
  3. Release cached memory between phases with torch.cuda.empty_cache() (use sparingly; frequent calls slow allocation)
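
Gradient accumulation (point 1) trades steps for memory: run several small forward/backward passes, letting gradients add up in .grad, and step the optimizer once. A minimal sketch with a placeholder model:

```python
# Gradient accumulation: emulate batch 16 while holding only batch 4 in memory.
import torch
import torch.nn as nn

model = nn.Linear(16, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
accum_steps = 4

for step in range(8):
    x, y = torch.randn(4, 16), torch.randn(4, 1)
    # divide so the accumulated gradient matches the large-batch average
    loss = nn.functional.mse_loss(model(x), y) / accum_steps
    loss.backward()  # gradients accumulate in p.grad across micro-batches
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
```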

Monitoring

  1. Log GPU memory usage during training
  2. Monitor GPU utilization to detect data loading bottlenecks
  3. Track training metrics with TensorBoard or Weights & Biases
