A supervised fine-tuning (SFT) dataset generator that converts conversation logs from multiple AI assistant platforms into standardized training datasets.
SFTizer extracts and transforms conversation data from Claude, Gemini, Cursor, and Cline into high-quality SFT datasets compatible with popular training frameworks. It handles tool calls, conversation branching, chunking, and privacy protection out of the box.
- Multi-source ingestion - Parse logs from Claude CLI, Gemini CLI, Cursor AI editor, and Cline VS Code extension
- Multiple output formats - Export to OpenAI, LangChain, Hermes, or Llama3 formats
- Intelligent chunking - Token-based (via HuggingFace tokenizers) or character-based splitting with configurable overlap
- Tool call normalization - Map 30+ source tools to unified target tools with parameter translation
- Conversation branching - Extract all branches from tree-structured conversations (Claude)
- PII protection - Configurable regex-based stripping of sensitive information
- Parallel processing - Multi-process execution for large-scale dataset generation
- Interleaved content - Preserve or flatten text/tool_use ordering based on target format
| Source | Format | Location |
|---|---|---|
| Claude | JSONL | User-specified directories |
| Gemini CLI | JSON | ~/.gemini/tmp/*/chats/session-*.json |
| Cursor | SQLite | ~/.config/Cursor/User/globalStorage/state.vscdb |
| Cline | JSON | VS Code globalStorage task directories |
| Format | Description |
|---|---|
| OpenAI | Standard Chat Completions API structure |
| LangChain | Native LangChain/LangGraph format with dict arguments |
| Hermes | NousResearch Hermes with XML tool tags |
| Llama3 | Meta Llama 3.1+ compatible format |
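The main practical difference between these formats is how tool calls are serialized. As a rough, hypothetical illustration (the real templates live in each format's `format_mapping.yml`), converting an OpenAI-style tool call into Hermes-style XML tags might look like:

```python
import json

def openai_to_hermes(tool_call):
    """Render an OpenAI-style tool call as a Hermes-style <tool_call> tag.
    Illustrative sketch only -- SFTizer's actual templates are defined in
    configs/<format>/format_mapping.yml."""
    payload = {
        "name": tool_call["function"]["name"],
        # OpenAI stores arguments as a JSON string; Hermes embeds a dict
        "arguments": json.loads(tool_call["function"]["arguments"]),
    }
    return f"<tool_call>\n{json.dumps(payload)}\n</tool_call>"
```
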
# Clone the repository
git clone https://github.com/thad0ctor/SFTizer.git
cd SFTizer
# Install dependencies
pip install -r requirements.txt
# For development/testing
pip install -r requirements-test.txt
- Python 3.10+
- PyYAML
- tiktoken or transformers (for tokenization)
- Configure input sources in config.yml:
io_settings:
input_log_directories: "~/.claude/logs"
gemini_enabled: true
gemini_input_directories:
- "~/.gemini/tmp"
cursor_enabled: true
cursor_config_dir: "~/.config/Cursor"
cline_enabled: true
cline_input_directories:
- "~/.config/Code/User/globalStorage/saoudrizwan.claude-dev/tasks"
- Run the generator:
python main.py
- Select the output format when prompted (OpenAI, LangChain, Hermes, or Llama3)
- Output is written to the configured output_sft_file (default: sft_dataset.jsonl)
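After a run completes, a quick sanity check of the generated file takes a few lines of Python (hypothetical helper; the `messages`/`role` field names follow the output format documented in this README):

```python
import json

def summarize_dataset(path="sft_dataset.jsonl"):
    """Count records and per-role message totals in a generated SFT dataset.
    A sanity-check sketch, not part of SFTizer itself."""
    records, roles = 0, {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            record = json.loads(line)
            records += 1
            for msg in record.get("messages", []):
                roles[msg["role"]] = roles.get(msg["role"], 0) + 1
    return records, roles
```
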
# Tokenizer settings
tokenizer_settings:
use_tokenizer: true
tokenizer_path: "meta-llama/Llama-3.1-8B"
# SFT generation settings
sft_settings:
max_tokens: 4096
reserved_tokens: 64
overlap_tokens: 512
oversized_message_strategy: "split" # keep, truncate, split, exclude
# Continuation markers for split conversations
continuation_markers:
enabled: true
continuation_prefix: "[Continued from previous chunk {chunk_index}/{total_chunks}]"
truncation_suffix: "[Continues in next chunk...]"
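A simplified sketch of how max_tokens and overlap_tokens interact (hypothetical helper, not the project's actual chunker; reserved_tokens, continuation markers, and oversized messages are left out):

```python
def chunk_messages(token_counts, max_tokens=4096, overlap_tokens=512):
    """Greedily pack message indices into chunks; each new chunk starts by
    repeating enough trailing messages to cover ~overlap_tokens of context.
    Simplified sketch of the configured chunking behavior."""
    chunks, current, current_total = [], [], 0
    for i, n in enumerate(token_counts):
        if current and current_total + n > max_tokens:
            chunks.append(current)
            # carry trailing messages back as overlap for the next chunk
            overlap, carried = [], 0
            for j in reversed(current):
                if carried + token_counts[j] > overlap_tokens:
                    break
                overlap.insert(0, j)
                carried += token_counts[j]
            current, current_total = overlap[:], carried
        current.append(i)
        current_total += n
    if current:
        chunks.append(current)
    return chunks
```
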
# Metadata to include in output
metadata:
enabled: true
include_source_file: true
include_conversation_id: true
include_branch_index: true
include_chunk_index: true
# PII stripping
pii_stripping:
enabled: true
strip_filepaths: true
filepath_replacement_prefix: "/home/user"
patterns:
- pattern: '\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
replacement_type: "static"
replacement: "[EMAIL]"
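Applied in code, a static pattern rule behaves roughly like this (illustrative sketch, not the project's pii_stripper.py):

```python
import re

def strip_pii(text, patterns):
    """Apply configured regex replacements, mirroring the pii_stripping
    config above. Simplified sketch: 'static' replacements only."""
    for entry in patterns:
        if entry.get("replacement_type", "static") == "static":
            text = re.sub(entry["pattern"], entry["replacement"], text)
    return text

# the email rule from the example config
email_rule = {
    "pattern": r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b",
    "replacement_type": "static",
    "replacement": "[EMAIL]",
}
```
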
# Processing
processing_settings:
num_workers: 4
# I/O settings
io_settings:
output_sft_file: "sft_dataset.jsonl"
unmapped_tools_file: "unmapped_tools.yml"
exclude_patterns:
- "**/test_*"
- "**/.archive/**"
Map source tools to unified target tools:
tool_mappings:
# Claude tools
- source_tool_name: "Bash"
enabled: true
action: "map"
target_tool_name: "shell"
parameter_mapping:
command: code
- source_tool_name: "Read"
enabled: true
action: "map"
target_tool_name: "file_reader"
parameter_mapping:
file_path: path
offset: start_line
limit: line_count
# Disable internal tools
- source_tool_name: "Task"
enabled: false
action: "ignore"
Each format defines its message structure:
# OpenAI example
roles:
assistant: "assistant"
user: "user"
tool_result: "tool"
fields:
assistant_tool_calls:
enabled: true
key_name: "tool_calls"
tool_result_tool_call_id:
enabled: true
key_name: "tool_call_id"
tool_call_structure: "openai"
Log Files (Claude/Gemini/Cursor/Cline)
│
▼
┌─────────────────────────────────────┐
│ Parsers │
│ - conversation_parser.py (Claude) │
│ - gemini_parser.py │
│ - cursor_parser.py │
│ - cline_parser.py │
└─────────────────────────────────────┘
│
▼ Standardized Turn Format
┌─────────────────────────────────────┐
│ Branch Extraction │
│ (Tree → Linear paths) │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ Message Formatter │
│ - Tool mapping │
│ - Format conversion │
│ - Interleaved/flattened content │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ Chunker │
│ - Token/character limits │
│ - Overlap handling │
│ - Continuation markers │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ PII Stripper │
│ - Pattern matching │
│ - Filepath redaction │
└─────────────────────────────────────┘
│
▼
JSONL Output (SFT Dataset)
SFTizer/
├── main.py # Entry point
├── discover_tools.py # Tool discovery utility
├── config.yml # Main configuration
├── configs/
│ ├── openai/
│ │ └── format_mapping.yml
│ ├── langchain/
│ │ └── format_mapping.yml
│ ├── hermes/
│ │ └── format_mapping.yml
│ ├── llama3/
│ │ └── format_mapping.yml
│ └── shared/
│ └── tool_mapping.yml
├── utils/
│ ├── config_loader.py
│ ├── config_validator.py
│ ├── conversation_parser.py
│ ├── gemini_parser.py
│ ├── cursor_parser.py
│ ├── cline_parser.py
│ ├── message_formatter.py
│ ├── chunker.py
│ ├── pii_stripper.py
│ └── tokenizer_loader.py
└── tests/
├── conftest.py
├── fixtures/
├── validators/
└── test_*.py
# Run all tests
pytest
# Run with coverage
pytest --cov=utils --cov-report=html
# Run specific test file
pytest tests/test_chunker.py
# Run in parallel
pytest -n auto
# Skip slow tests
pytest -m "not slow"
The test suite includes 348+ tests covering:
- Unit tests for all parsers and utilities
- Integration tests for end-to-end workflows
- Format validation for all output formats
- Performance benchmarks
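As an illustration of the style of check involved (a hypothetical example, not taken from the actual suite), a format-validation test might assert the documented output structure:

```python
import json

def validate_sft_line(line):
    """Check one JSONL line against the documented output structure.
    Hypothetical helper in the style of a format-validation test."""
    record = json.loads(line)
    assert isinstance(record.get("messages"), list) and record["messages"]
    for msg in record["messages"]:
        assert msg["role"] in {"user", "assistant", "tool", "system"}
    return True
```
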
Find all unique tools used in your logs:
python discover_tools.py
This helps populate tool_mapping.yml with tools found in your conversation logs.
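Conceptually, discovery amounts to scanning logs for tool-call names. A minimal sketch for a Claude-style JSONL log (the content-block schema here is an assumption; the real discover_tools.py handles every supported source):

```python
import json
from collections import Counter

def discover_tools_in_jsonl(path):
    """Collect tool names from a Claude-style JSONL log.
    Assumes tool calls appear as {"type": "tool_use", "name": ...}
    content blocks -- an illustrative assumption, not the exact schema."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            entry = json.loads(line)
            content = entry.get("message", {}).get("content", [])
            if isinstance(content, list):
                for block in content:
                    if isinstance(block, dict) and block.get("type") == "tool_use":
                        counts[block.get("name", "?")] += 1
    return counts
```
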
Each line in the output JSONL contains:
{
"messages": [
{"role": "user", "content": "..."},
{"role": "assistant", "content": "...", "tool_calls": [...]},
{"role": "tool", "tool_call_id": "...", "content": "..."}
],
"source_file": "/path/to/log.jsonl",
"conversation_id": "abc123",
"branch_index": 0,
"chunk_index": 0,
"timestamp": "2025-01-15T10:30:00Z"
}
When a single message exceeds the chunk limit:
| Strategy | Behavior |
|---|---|
| keep | Include the message as-is (may exceed limits) |
| truncate | Cut to fit, with a truncation marker |
| split | Split into multiple chunks, with continuation markers |
| exclude | Skip the message entirely |
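The four strategies can be sketched over a single message's token list (simplified illustration; the real chunker also applies overlap and continuation markers):

```python
def handle_oversized(tokens, limit, strategy):
    """Apply an oversized_message_strategy to one message's tokens.
    Simplified sketch of the four documented strategies."""
    if len(tokens) <= limit or strategy == "keep":
        return [tokens]                      # keep: may exceed the limit
    if strategy == "truncate":
        return [tokens[:limit]]              # cut to fit
    if strategy == "split":
        return [tokens[i:i + limit] for i in range(0, len(tokens), limit)]
    if strategy == "exclude":
        return []                            # drop the message entirely
    raise ValueError(f"unknown strategy: {strategy}")
```
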
- Fork the repository
- Create a feature branch
- Run tests: pytest
- Submit a pull request
MIT License