Releases: HelpingAI/llm-trainer

LLM Trainer v0.2.7

Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

[0.2.7] - 2026-01-08

Overview

This release introduces Mixed Precision (fp16/bf16) support, tqdm progress tracking, and a major reorganization of notebooks. The update enhances training efficiency, provides better visual feedback during training, and improves the overall structure of the project's educational resources.

Features

  • Mixed Precision Training

    • Added fp16 and bf16 boolean flags to TrainingConfig for explicit precision selection.
    • Implemented automatic precision selection logic with should_use_amp() and get_amp_dtype().
    • Enhanced Trainer with torch.amp.autocast and GradScaler support for both fp16 and bf16.
    • Support for hardware-specific precision (e.g., bf16 on NVIDIA Ampere+); see the first sketch after this list.
  • Progress Tracking & Logging

    • Replaced excessive console logging with tqdm progress bars for training and evaluation.
    • Real-time display of loss, learning rate, and step in the progress bar.
    • Reduced console clutter while maintaining detailed logging to TensorBoard and Weights & Biases; see the second sketch after this list.
  • Notebook Reorganization

    • Created a structured notebooks/ directory with subfolders: tokenizers/, training/, and generation/.
    • Renamed all notebooks to more descriptive, consistent names.
    • Added new comprehensive notebooks:
      • notebooks/tokenizers/all_tokenizers_demo.ipynb: Demonstrates all available tokenizer types.
      • notebooks/training/train_classification_model.ipynb: Guide for training text classification models.
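
A minimal sketch of the new precision flags, assuming the import path below and that the should_use_amp() and get_amp_dtype() helpers named above are exposed as methods on TrainingConfig; exact signatures may differ:

```python
import torch

from llm_trainer.config import TrainingConfig  # import path assumed

# Prefer bf16 where the hardware supports it (Ampere+); otherwise fall back to fp16.
use_bf16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()
config = TrainingConfig(bf16=use_bf16, fp16=not use_bf16)

if config.should_use_amp():             # helper named in these notes
    amp_dtype = config.get_amp_dtype()  # torch.bfloat16 or torch.float16
```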
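
And a sketch of the kind of tqdm loop that replaces per-step console logging, with loss, learning rate, and step shown as a live postfix; this is generic tqdm usage, with train_dataloader, training_step, and scheduler standing in for the trainer's internals:

```python
from tqdm.auto import tqdm

progress = tqdm(train_dataloader, desc="Epoch 1")
for step, batch in enumerate(progress):
    loss = training_step(batch)  # placeholder for the trainer's real step
    progress.set_postfix(
        loss=f"{loss:.4f}",
        lr=f"{scheduler.get_last_lr()[0]:.2e}",  # any torch LR scheduler
        step=step,
    )
```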

Improvements

  • Tokenizer API

    • Standardized tokenizer training API: .train() is now the primary method for all data sources.
    • Deprecated .train_from_texts() and .train_from_dataset() in favor of the unified .train() method.
    • Updated all examples, scripts, and documentation to use the new unified API; a usage sketch follows this list.
  • Tooling & Environment

    • Bumped requires-python to >=3.9 in pyproject.toml for better dependency compatibility.
    • Removed apex from optional dependencies to avoid build failures on Windows.
    • Added uv extra-build-dependencies configuration for improved environment setup.
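
A before/after sketch of the unified tokenizer API; the BPETokenizer class name and import path are assumptions, and the .train() signature is inferred from these notes:

```python
from llm_trainer.tokenizers import BPETokenizer  # class name and path assumed

tokenizer = BPETokenizer()

# Deprecated source-specific methods:
# tokenizer.train_from_texts(texts, vocab_size=32000)
# tokenizer.train_from_dataset(dataset, vocab_size=32000)

# Unified entry point for all data sources:
tokenizer.train(texts, vocab_size=32000)
tokenizer.train(dataset, vocab_size=32000)
```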

Technical Changes

  • Configuration System

    • Refactored TrainingConfig to handle fp16/bf16 flags and map them to internal use_amp logic.
    • Improved device-aware validation for mixed precision settings.
  • Trainer Implementation

    • Refactored _train_epoch and _evaluate to use tqdm for progress tracking.
    • Updated _training_step and _backward_step to support torch.amp mixed precision; the standard pattern is sketched after this list.
    • Unified dataloader setup in train_from_config.
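
For reference, the standard torch.amp pattern that _training_step and _backward_step now follow looks roughly like this; it is a generic PyTorch sketch, not the trainer's exact code:

```python
import torch

scaler = torch.amp.GradScaler("cuda", enabled=use_fp16)  # use_fp16 from config; no-op scaler for bf16

def training_step(model, batch, optimizer, use_amp, amp_dtype):
    with torch.amp.autocast("cuda", dtype=amp_dtype, enabled=use_amp):
        loss = model(**batch).loss        # forward pass in reduced precision
    scaler.scale(loss).backward()         # scales fp16 gradients to avoid underflow
    scaler.step(optimizer)                # unscales, then steps if grads are finite
    scaler.update()
    optimizer.zero_grad(set_to_none=True)
    return loss.detach()
```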

[0.2.5] - 2025-09-05

Overview

This release focuses on performance optimizations and code cleanup. All Unsloth-related code has been removed to streamline the codebase and focus on core functionality. The trainer now fully integrates patching and kernel optimizations for enhanced memory efficiency and faster training.

Features

  • Enhanced Trainer Integration

    • Full integration of patching and kernel optimizations into the main trainer
    • Automatic application of memory-efficient techniques during initialization
    • Support for fused layers, gradient checkpointing, and efficient attention
    • Factory method for creating memory-efficient trainers; see the sketch after this list
  • Kernel Optimizations

    • Fused linear layers for 10-30% performance improvement
    • Memory-efficient attention mechanisms (Flash Attention)
    • Gradient checkpointing support for low VRAM training
    • Optimizer state offloading for reduced GPU memory usage
  • Code Cleanup

    • Complete removal of Unsloth-related code and dependencies
    • Streamlined codebase with focus on core functionality
    • Updated documentation and examples
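
A hedged sketch of the factory entry point named above; the keyword arguments are inferred from this feature list (fuse_layers appears under Improvements below) rather than from a confirmed signature:

```python
from llm_trainer.training import create_memory_efficient_trainer  # path assumed

trainer = create_memory_efficient_trainer(
    model=model,                  # an existing model instance
    config=config,                # an existing TrainingConfig
    fuse_layers=True,             # fused linear layers (10-30% speedup per these notes)
    gradient_checkpointing=True,  # trade recomputation for lower VRAM
)
trainer.train()
```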

Improvements

  • Performance Enhancements

    • Automatic fused layer application when fuse_layers=True
    • Efficient attention setup for PyTorch 2.0+ compatibility
    • Memory optimization techniques integrated into training loop
    • Periodic cache emptying for sustained performance (sketched after this list)
  • Developer Experience

    • Comprehensive documentation for fused layers vs linear layers
    • Clear configuration options for performance optimizations
    • Better error handling and fallback mechanisms
    • Improved logging for optimization features
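
"Periodic cache emptying" refers to the standard PyTorch pattern below; the interval is illustrative:

```python
import torch

EMPTY_CACHE_EVERY = 500  # illustrative interval, in training steps

if step % EMPTY_CACHE_EVERY == 0 and torch.cuda.is_available():
    torch.cuda.empty_cache()  # release unused cached blocks back to the driver
```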

Technical Changes

  • Trainer Architecture

    • Integrated patching system into trainer initialization
    • Added kernel optimization methods to training workflow
    • Enhanced memory management during training steps
    • Support for hardware-specific optimizations
  • API Changes

    • New create_memory_efficient_trainer() factory method
    • Enhanced apply_fused_layers() method for layer optimization
    • Automatic patching application during trainer setup
    • Backward-compatible configuration options

Removed

  • Unsloth Integration
    • Removed all Unsloth-related optimizers and utilities
    • Cleaned up Unsloth-specific documentation and examples
    • Removed Unsloth dependencies from project metadata
    • Streamlined codebase by removing unused components

Documentation

  • New Documentation
    • Added comprehensive guide for fused layers vs linear layers
    • Performance benchmarks and usage recommendations
    • Configuration examples for optimization features
    • Troubleshooting guide for common issues

[0.2.4] - 2025-08-29

Overview

This release introduces full TRL (Transformer Reinforcement Learning) integration with familiar APIs for SFT, DPO, PPO, and Reward Modeling training. The update provides memory-efficient training techniques and complete compatibility with the HuggingFace ecosystem while maintaining backward compatibility with existing APIs.

This release also introduces a patching system with kernel optimizations for faster, more memory-efficient training.

Features

  • TRL-Style Training APIs

    • Added SFTTrainer, DPOTrainer, PPOTrainer, and RewardTrainer classes with familiar .train() methods.
    • Implemented TRL-style configuration classes: SFTConfig, DPOConfig, PPOConfig, and RewardConfig.
    • Full compatibility with HuggingFace model architectures and training workflows; a combined usage sketch follows this list.
    • Affected files:
      • src/llm_trainer/config/sft_config.py
      • src/llm_trainer/config/dpo_config.py
      • src/llm_trainer/config/ppo_config.py
      • src/llm_trainer/config/reward_config.py
      • src/llm_trainer/training/enhanced_trainer.py
  • Kernel Optimizations for Fast Training

    • Added kernels module with fused operations for better performance.
    • Implemented memory-efficient operations for low VRAM training.
    • Added gradient checkpointing and cache management utilities.
    • Affected files:
      • src/llm_trainer/kernels/__init__.py
      • src/llm_trainer/kernels/fused_ops.py
      • src/llm_trainer/kernels/memory_efficient.py
  • Patching System for Transformers/TRL

    • Added patching module to enhance existing Transformers and TRL classes.
    • Implemented monkey-patching for memory-efficient optimizations.
    • Added optimization methods to existing trainer classes; hypothetical entry points are sketched after this list.
    • Affected files:
      • src/llm_trainer/patching/__init__.py
      • src/llm_trainer/patching/patch_transformers.py
      • src/llm_trainer/patching/patch_trl.py
  • Enhanced Trainer Functionality

    • Extended EnhancedTrainer with TRL-style training methods.
    • Added HuggingFace-style APIs: .save_model(), .save_pretrained(), .from_pretrained().
    • Implemented parameter efficiency reporting with .print_trainable_parameters().
    • Affected files:
      • src/llm_trainer/training/enhanced_trainer.py
  • PEFT Integration

    • Full support for LoRA and other PEFT adapters.
    • Automatic PEFT adapter application during trainer initialization (see the combined sketch after this list).
    • Integration with PEFT preparation functions.
    • Affected files:
      • src/llm_trainer/training/enhanced_trainer.py
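
Taken together, the TRL-style surface might look like the sketch below. SFTConfig, SFTTrainer, .print_trainable_parameters(), and .save_pretrained() are named in these notes; the constructor arguments and LoraConfig fields follow TRL/PEFT conventions and are assumptions, not confirmed signatures:

```python
from peft import LoraConfig

from llm_trainer import SFTConfig, SFTTrainer  # import path assumed

config = SFTConfig(output_dir="./sft-out", per_device_train_batch_size=4)
peft_config = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")

trainer = SFTTrainer(
    model=model,              # any HuggingFace causal LM
    args=config,
    train_dataset=dataset,
    peft_config=peft_config,  # adapter applied at init, per these notes
)
trainer.print_trainable_parameters()  # parameter-efficiency report
trainer.train()
trainer.save_pretrained("./sft-out")  # HuggingFace-style save
```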
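
The patching module is described here only at the file level, so the entry-point names below are hypothetical, inferred from src/llm_trainer/patching/patch_transformers.py and patch_trl.py:

```python
# Hypothetical entry points inferred from the patching module's file names.
from llm_trainer.patching import patch_transformers, patch_trl

patch_transformers()  # monkey-patch Transformers classes with memory-efficient ops
patch_trl()           # apply the same optimizations to TRL trainer classes

# From here on, stock Transformers/TRL trainers pick up the patched behavior.
```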

Improvements

  • API Compatibility

    • Complete backward compatibility with existing Trainer and TrainingConfig.
    • Seamless integration with HuggingFace Transformers models and tokenizers.
    • Familiar TRL-style parameter names and configurations.
    • Support for all HuggingFace training arguments and configurations.
  • Memory Efficiency

    • Memory-efficient optimizers reduce training memory footprint.
    • Parameter-efficient training with LoRA adapters reduces trainable parameters by 99%+.
    • Gradient checkpointing support for large model training.
    • Low VRAM linear layers and attention mechanisms.
  • Performance Optimizations

    • Fused operations in optimizers for faster training.
    • Efficient parameter updates with reduced computational overhead.
    • Optimized data loading and preprocessing pipelines.
    • Kernel-level optimizations for common operations.
  • Documentation

    • Added comprehensive documentation for TRL integration.
    • Updated README with examples and usage instructions.
    • Created detailed API reference and best practices guide.

Technical Implementation

  • TRL-Style Configuration Classes
    • SFTConfig, DPOConfig, PPOConfig, and RewardConfig follow TRL conventions.
    • Support all familiar TRL par...