# Changelog: HelpingAI/llm-trainer (LLM Trainer)
All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog,
and this project adheres to Semantic Versioning.
## [0.2.7] - 2026-01-08

### Overview

This release introduces mixed-precision (fp16/bf16) support, tqdm progress tracking, and a major reorganization of the notebooks. The update enhances training efficiency, provides better visual feedback during training, and improves the overall structure of the project's educational resources.
### Features

- **Mixed Precision Training**
  - Added `fp16` and `bf16` boolean flags to `TrainingConfig` for explicit precision selection.
  - Implemented automatic precision selection logic with `should_use_amp()` and `get_amp_dtype()`.
  - Enhanced `Trainer` with `torch.amp.autocast` and `GradScaler` support for both `fp16` and `bf16`.
  - Added support for hardware-specific precision (e.g., `bf16` on NVIDIA Ampere and newer GPUs).
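  As a rough illustration of the feature above, here is a minimal mixed-precision training step using `torch.amp.autocast`. This is a sketch, not the project's actual `Trainer` code: the model, optimizer wiring, and the `use_bf16` flag are illustrative stand-ins, and the `GradScaler` needed for fp16 is only noted in a comment.

  ```python
  import torch
  import torch.nn as nn

  def training_step(model, batch, targets, optimizer, use_bf16=True):
      """One mixed-precision step. bf16 needs no GradScaler; fp16 does."""
      device_type = "cuda" if torch.cuda.is_available() else "cpu"
      dtype = torch.bfloat16 if use_bf16 else torch.float16
      optimizer.zero_grad()
      # The forward pass runs in reduced precision inside the autocast region.
      with torch.amp.autocast(device_type=device_type, dtype=dtype):
          pred = model(batch)
      # Compute the loss in fp32 for numerical stability.
      loss = nn.functional.mse_loss(pred.float(), targets)
      loss.backward()   # with fp16 you would use scaler.scale(loss).backward()
      optimizer.step()
      return loss.item()

  model = nn.Linear(4, 1)
  opt = torch.optim.SGD(model.parameters(), lr=0.1)
  x, y = torch.randn(8, 4), torch.randn(8, 1)
  loss_value = training_step(model, x, y, opt)
  ```

  Note that bf16 works with plain `loss.backward()` because its exponent range matches fp32; fp16 is the case that needs gradient scaling.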
- **Progress Tracking & Logging**
  - Replaced excessive console logging with `tqdm` progress bars for training and evaluation.
  - Real-time display of loss, learning rate, and step in the progress bar.
  - Reduced console clutter while maintaining detailed logging to TensorBoard and Weights & Biases.
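  The loss/lr/step display described above can be wired up with `tqdm.set_postfix`, roughly as sketched below. The loop body here is a placeholder for the real forward/backward pass.

  ```python
  from tqdm import tqdm

  losses = []
  lr = 3e-4
  pbar = tqdm(range(100), desc="train", leave=False)
  for step, _ in enumerate(pbar, start=1):
      loss = 1.0 / step            # placeholder for the real training loss
      losses.append(loss)
      # set_postfix renders key=value pairs at the end of the bar line,
      # replacing per-step print() calls.
      pbar.set_postfix(loss=f"{loss:.4f}", lr=f"{lr:.2e}", step=step)
  pbar.close()
  ```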
- **Notebook Reorganization**
  - Created a structured `notebooks/` directory with subfolders: `tokenizers/`, `training/`, and `generation/`.
  - Renamed all notebooks to more descriptive, consistent names.
  - Added new comprehensive notebooks:
    - `notebooks/tokenizers/all_tokenizers_demo.ipynb`: demonstrates all available tokenizer types.
    - `notebooks/training/train_classification_model.ipynb`: a guide for training text classification models.
### Improvements

- **Tokenizer API**
  - Standardized the tokenizer training API: `.train()` is now the primary method for all data sources.
  - Deprecated `.train_from_texts()` and `.train_from_dataset()` in favor of the unified `.train()` method.
  - Updated all examples, scripts, and documentation to use the new unified API.
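  A common pattern for this kind of unification is to keep the old entry points working but route them through `.train()` with a `DeprecationWarning`. The sketch below assumes that pattern; `MyTokenizer` and its internals are hypothetical, not the project's actual classes.

  ```python
  import warnings

  class MyTokenizer:
      def train(self, data, vocab_size=1000):
          """Single entry point for every data source (texts, files, datasets)."""
          self.vocab_size = vocab_size
          self.trained_on = len(data)
          return self

      def train_from_texts(self, texts, **kwargs):
          # Deprecated alias: warn, then delegate to the unified method.
          warnings.warn("train_from_texts() is deprecated; use train()",
                        DeprecationWarning, stacklevel=2)
          return self.train(texts, **kwargs)

  tok = MyTokenizer()
  with warnings.catch_warnings(record=True) as caught:
      warnings.simplefilter("always")
      tok.train_from_texts(["hello world", "foo bar"])
  ```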
- **Tooling & Environment**
  - Bumped `requires-python` to `>=3.9` in `pyproject.toml` for better dependency compatibility.
  - Removed `apex` from the optional dependencies to avoid build failures on Windows.
  - Added `uv` extra-build-dependencies for improved environment setup.
### Technical Changes

- **Configuration System**
  - Refactored `TrainingConfig` to handle the `fp16`/`bf16` flags and map them to the internal `use_amp` logic.
  - Improved device-aware validation for mixed-precision settings.
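  A minimal sketch of how such a config might map the flags and validate them against the device. This is not the project's actual class: the dtype strings, the `device` field, and the validation rules are illustrative assumptions (the real code would return torch dtypes).

  ```python
  from dataclasses import dataclass

  @dataclass
  class TrainingConfig:
      fp16: bool = False
      bf16: bool = False
      device: str = "cuda"

      def __post_init__(self):
          # Device-aware validation of the mixed-precision flags.
          if self.fp16 and self.bf16:
              raise ValueError("Choose at most one of fp16/bf16")
          if self.fp16 and self.device == "cpu":
              # CPU autocast only supports bf16, so reject fp16 here.
              raise ValueError("fp16 requires a CUDA device; use bf16 on CPU")

      def should_use_amp(self) -> bool:
          return self.fp16 or self.bf16

      def get_amp_dtype(self) -> str:
          # Strings for illustration; real code would return torch.float16 etc.
          return "float16" if self.fp16 else "bfloat16" if self.bf16 else "float32"

  cfg = TrainingConfig(bf16=True, device="cpu")
  ```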
- **Trainer Implementation**
  - Refactored `_train_epoch` and `_evaluate` to use `tqdm` for progress tracking.
  - Updated `_training_step` and `_backward_step` to support `torch.amp` mixed precision.
  - Unified dataloader setup in `train_from_config`.
## [0.2.5] - 2025-09-05

### Overview

This release focuses on performance optimizations and code cleanup. All Unsloth-related code has been removed to streamline the codebase and focus on core functionality. The trainer now fully integrates patching and kernel optimizations for enhanced memory efficiency and faster training.
### Features

- **Enhanced Trainer Integration**
  - Full integration of patching and kernel optimizations into the main trainer
  - Automatic application of memory-efficient techniques during initialization
  - Support for fused layers, gradient checkpointing, and efficient attention
  - Factory method for creating memory-efficient trainers
- **Kernel Optimizations**
  - Fused linear layers for a 10-30% performance improvement
  - Memory-efficient attention mechanisms (Flash Attention)
  - Gradient checkpointing support for low-VRAM training
  - Optimizer state offloading for reduced GPU memory usage
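  The algebra behind fusing linear layers can be shown in a few lines: two stacked bias-free linear maps collapse into a single matrix multiply, saving one kernel launch and one intermediate activation. The toy `matmul` below is only for illustration; the project's `kernels` module presumably does the equivalent at the CUDA level.

  ```python
  def matmul(A, B):
      # Naive dense matrix multiply, enough to show the identity.
      return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
              for row in A]

  W1 = [[1.0, 2.0], [0.5, -1.0]]   # first layer weights
  W2 = [[2.0, 0.0], [1.0, 1.0]]    # second layer weights
  x  = [[3.0], [4.0]]              # column-vector input

  unfused = matmul(W2, matmul(W1, x))   # two sequential "kernels"
  fused_W = matmul(W2, W1)              # fold the weights once, offline
  fused   = matmul(fused_W, x)          # one "kernel" at inference time
  ```

  Both paths produce identical outputs; the fused path simply does less work per input.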
- **Code Cleanup**
  - Complete removal of Unsloth-related code and dependencies
  - Streamlined codebase with a focus on core functionality
  - Updated documentation and examples
### Improvements

- **Performance Enhancements**
  - Automatic fused-layer application when `fuse_layers=True`
  - Efficient attention setup for PyTorch 2.0+ compatibility
  - Memory optimization techniques integrated into the training loop
  - Periodic cache emptying for sustained performance
- **Developer Experience**
  - Comprehensive documentation for fused layers vs. plain linear layers
  - Clear configuration options for the performance optimizations
  - Better error handling and fallback mechanisms
  - Improved logging for the optimization features
### Technical Changes

- **Trainer Architecture**
  - Integrated the patching system into trainer initialization
  - Added kernel optimization methods to the training workflow
  - Enhanced memory management during training steps
  - Support for hardware-specific optimizations
- **API Changes**
  - New `create_memory_efficient_trainer()` factory method
  - Enhanced `apply_fused_layers()` method for layer optimization
  - Automatic patching application during trainer setup
  - Backward-compatible configuration options
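  The factory method likely amounts to pre-setting the memory-oriented options before constructing the trainer. The sketch below assumes that shape; the `Trainer` and `TrainingConfig` bodies are stand-ins, and the field names are hypothetical.

  ```python
  from dataclasses import dataclass

  @dataclass
  class TrainingConfig:
      fuse_layers: bool = False
      gradient_checkpointing: bool = False
      offload_optimizer_state: bool = False

  class Trainer:
      def __init__(self, config: TrainingConfig):
          self.config = config

  def create_memory_efficient_trainer(**overrides) -> Trainer:
      # Start from memory-friendly defaults, then let callers opt out.
      cfg = TrainingConfig(fuse_layers=True,
                           gradient_checkpointing=True,
                           offload_optimizer_state=True)
      for key, value in overrides.items():
          setattr(cfg, key, value)
      return Trainer(cfg)

  trainer = create_memory_efficient_trainer(offload_optimizer_state=False)
  ```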
### Removed

- **Unsloth Integration**
  - Removed all Unsloth-related optimizers and utilities
  - Cleaned up Unsloth-specific documentation and examples
  - Removed Unsloth dependencies from the project metadata
  - Streamlined the codebase by removing unused components
### Documentation

- **New Documentation**
  - Added a comprehensive guide to fused layers vs. plain linear layers
  - Performance benchmarks and usage recommendations
  - Configuration examples for the optimization features
  - A troubleshooting guide for common issues
## [0.2.4] - 2025-08-29

### Overview

This release introduces full TRL (Transformer Reinforcement Learning) integration with familiar APIs for SFT, DPO, PPO, and reward-modeling training. The update provides memory-efficient training techniques and complete compatibility with the HuggingFace ecosystem, while maintaining backward compatibility with existing APIs.

It also introduces a patching system with kernel optimizations for fast, memory-efficient training.
### Features

- **TRL-Style Training APIs**
  - Added `SFTTrainer`, `DPOTrainer`, `PPOTrainer`, and `RewardTrainer` classes with familiar `.train()` methods.
  - Implemented TRL-style configuration classes: `SFTConfig`, `DPOConfig`, `PPOConfig`, and `RewardConfig`.
  - Full compatibility with HuggingFace model architectures and training workflows.
  - Affected files:
    - `src/llm_trainer/config/sft_config.py`
    - `src/llm_trainer/config/dpo_config.py`
    - `src/llm_trainer/config/ppo_config.py`
    - `src/llm_trainer/config/reward_config.py`
    - `src/llm_trainer/training/enhanced_trainer.py`
- **Memory Optimizations**
  - Added memory-efficient operations for low-VRAM training.
- **Kernel Optimizations for Fast Training**
  - Added a `kernels` module with fused operations for better performance.
  - Implemented memory-efficient operations for low-VRAM training.
  - Added gradient checkpointing and cache management utilities.
  - Affected files:
    - `src/llm_trainer/kernels/__init__.py`
    - `src/llm_trainer/kernels/fused_ops.py`
    - `src/llm_trainer/kernels/memory_efficient.py`
- **Patching System for Transformers/TRL**
  - Added a `patching` module to enhance existing Transformers and TRL classes.
  - Implemented monkey-patching for memory-efficient optimizations.
  - Added methods to existing trainer classes.
  - Affected files:
    - `src/llm_trainer/patching/__init__.py`
    - `src/llm_trainer/patching/patch_transformers.py`
    - `src/llm_trainer/patching/patch_trl.py`
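  A toy version of the monkey-patching approach: wrap an existing class's `training_step` so that every N steps some cleanup runs, without touching the original source. `HFTrainerStandIn` stands in for a real Transformers/TRL trainer class, and the cache-freeing call is only indicated in a comment.

  ```python
  class HFTrainerStandIn:
      def training_step(self, batch):
          return sum(batch) / len(batch)   # pretend loss

  freed = []

  def patch_trainer(cls, empty_cache_every=2):
      original = cls.training_step

      def patched(self, batch, _orig=original):
          loss = _orig(self, batch)        # still run the original logic
          self._steps = getattr(self, "_steps", 0) + 1
          if self._steps % empty_cache_every == 0:
              freed.append(self._steps)    # real code: torch.cuda.empty_cache()
          return loss

      cls.training_step = patched          # in-place patch on the class

  patch_trainer(HFTrainerStandIn)
  t = HFTrainerStandIn()
  results = [t.training_step([1.0, 2.0, 3.0]) for _ in range(4)]
  ```

  Because the patch wraps rather than replaces the original method, existing callers keep working unchanged.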
- **Enhanced Trainer Functionality**
  - Extended `EnhancedTrainer` with TRL-style training methods.
  - Added HuggingFace-style APIs: `.save_model()`, `.save_pretrained()`, `.from_pretrained()`.
  - Implemented parameter efficiency reporting with `.print_trainable_parameters()`.
  - Affected files:
    - `src/llm_trainer/training/enhanced_trainer.py`
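  The parameter-efficiency report boils down to counting trainable vs. total parameters. The sketch below assumes a model represented as a list of `(shape, requires_grad)` pairs mimicking a LoRA-adapted model, where only the small adapter matrices train; the function and its output format are illustrative, not the project's exact implementation.

  ```python
  from math import prod

  params = [
      ((4096, 4096), False),  # frozen base-model weight
      ((4096, 4096), False),  # frozen base-model weight
      ((4096, 8), True),      # LoRA A adapter (rank 8)
      ((8, 4096), True),      # LoRA B adapter (rank 8)
  ]

  def print_trainable_parameters(params):
      total = sum(prod(shape) for shape, _ in params)
      trainable = sum(prod(shape) for shape, grad in params if grad)
      pct = 100 * trainable / total
      print(f"trainable params: {trainable:,} || all params: {total:,} "
            f"|| trainable%: {pct:.4f}")
      return trainable, total

  trainable, total = print_trainable_parameters(params)
  ```

  With these toy shapes, under 0.2% of the parameters are trainable, which is where "reduces trainable parameters by 99%+" claims for LoRA come from.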
- **PEFT Integration**
  - Full support for LoRA and other PEFT adapters.
  - Automatic PEFT adapter application during trainer initialization.
  - Integration with PEFT preparation functions.
  - Affected files:
    - `src/llm_trainer/training/enhanced_trainer.py`
### Improvements

- **API Compatibility**
  - Complete backward compatibility with the existing `Trainer` and `TrainingConfig`.
  - Seamless integration with HuggingFace Transformers models and tokenizers.
  - Familiar TRL-style parameter names and configurations.
  - Support for all HuggingFace training arguments and configurations.
- **Memory Efficiency**
  - Memory-efficient optimizers reduce the training memory footprint.
  - Parameter-efficient training with LoRA adapters reduces trainable parameters by 99%+.
  - Gradient checkpointing support for large-model training.
  - Low-VRAM linear layers and attention mechanisms.
- **Performance Optimizations**
  - Fused operations in optimizers for faster training.
  - Efficient parameter updates with reduced computational overhead.
  - Optimized data loading and preprocessing pipelines.
  - Kernel-level optimizations for common operations.
- **Documentation**
  - Added comprehensive documentation for the TRL integration.
  - Updated the README with examples and usage instructions.
  - Created a detailed API reference and best-practices guide.
### Technical Implementation

- **TRL-Style Configuration Classes**
  - `SFTConfig`, `DPOConfig`, `PPOConfig`, and `RewardConfig` follow TRL conventions.
  - Support all familiar TRL par...