
SFTizer

A supervised fine-tuning (SFT) dataset generator that converts conversation logs from multiple AI assistant platforms into standardized training datasets.

Overview

SFTizer extracts and transforms conversation data from Claude, Gemini, Cursor, and Cline into high-quality SFT datasets compatible with popular training frameworks. It handles tool calls, conversation branching, chunking, and privacy protection out of the box.

Features

  • Multi-source ingestion - Parse logs from Claude CLI, Gemini CLI, Cursor AI editor, and Cline VS Code extension
  • Multiple output formats - Export to OpenAI, LangChain, Hermes, or Llama3 formats
  • Intelligent chunking - Token-based (via HuggingFace tokenizers) or character-based splitting with configurable overlap
  • Tool call normalization - Map 30+ source tools to unified target tools with parameter translation
  • Conversation branching - Extract all branches from tree-structured conversations (Claude)
  • PII protection - Configurable regex-based stripping of sensitive information
  • Parallel processing - Multi-process execution for large-scale dataset generation
  • Interleaved content - Preserve or flatten text/tool_use ordering based on target format
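The character-based splitting with overlap mentioned above can be sketched as follows. This is a minimal illustration of the overlap idea, not the project's actual chunker in utils/chunker.py; the function name and signature are hypothetical.

```python
def chunk_text(text: str, max_chars: int = 4096, overlap: int = 512) -> list[str]:
    """Split text into chunks of at most max_chars characters, where each
    chunk begins with the last `overlap` characters of the previous one."""
    if len(text) <= max_chars:
        return [text]
    chunks = []
    start = 0
    step = max_chars - overlap  # advance by less than a full chunk to create overlap
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break
        start += step
    return chunks
```

Token-based chunking works the same way, but counts tokenizer tokens instead of characters.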

Supported Sources

| Source | Format | Location |
| --- | --- | --- |
| Claude | JSONL | User-specified directories |
| Gemini CLI | JSON | ~/.gemini/tmp/*/chats/session-*.json |
| Cursor | SQLite | ~/.config/Cursor/User/globalStorage/state.vscdb |
| Cline | JSON | VS Code globalStorage task directories |

Output Formats

| Format | Description |
| --- | --- |
| OpenAI | Standard Chat Completions API structure |
| LangChain | Native LangChain/LangGraph format with dict arguments |
| Hermes | NousResearch Hermes format with XML tool tags |
| Llama3 | Meta Llama 3.1+ compatible format |

Installation

# Clone the repository
git clone https://github.com/thad0ctor/SFTizer.git
cd SFTizer

# Install dependencies
pip install -r requirements.txt

# For development/testing
pip install -r requirements-test.txt

Dependencies

  • Python 3.10+
  • PyYAML
  • tiktoken or transformers (for tokenization)

Quick Start

  1. Configure input sources in config.yml:
io_settings:
  input_log_directories: "~/.claude/logs"

  gemini_enabled: true
  gemini_input_directories:
    - "~/.gemini/tmp"

  cursor_enabled: true
  cursor_config_dir: "~/.config/Cursor"

  cline_enabled: true
  cline_input_directories:
    - "~/.config/Code/User/globalStorage/saoudrizwan.claude-dev/tasks"
  2. Run the generator:
python main.py
  3. Select an output format when prompted (OpenAI, LangChain, Hermes, or Llama3)

  4. Output is written to the configured output_sft_file (default: sft_dataset.jsonl)

Configuration

Main Configuration (config.yml)

# Tokenizer settings
tokenizer_settings:
  use_tokenizer: true
  tokenizer_path: "meta-llama/Llama-3.1-8B"

# SFT generation settings
sft_settings:
  max_tokens: 4096
  reserved_tokens: 64
  overlap_tokens: 512
  oversized_message_strategy: "split"  # keep, truncate, split, exclude

# Continuation markers for split conversations
continuation_markers:
  enabled: true
  continuation_prefix: "[Continued from previous chunk {chunk_index}/{total_chunks}]"
  truncation_suffix: "[Continues in next chunk...]"

# Metadata to include in output
metadata:
  enabled: true
  include_source_file: true
  include_conversation_id: true
  include_branch_index: true
  include_chunk_index: true

# PII stripping
pii_stripping:
  enabled: true
  strip_filepaths: true
  filepath_replacement_prefix: "/home/user"
  patterns:
    - pattern: '\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
      replacement_type: "static"
      replacement: "[EMAIL]"

# Processing
processing_settings:
  num_workers: 4

# I/O settings
io_settings:
  output_sft_file: "sft_dataset.jsonl"
  unmapped_tools_file: "unmapped_tools.yml"
  exclude_patterns:
    - "**/test_*"
    - "**/.archive/**"

Tool Mapping (configs/shared/tool_mapping.yml)

Map source tools to unified target tools:

tool_mappings:
  # Claude tools
  - source_tool_name: "Bash"
    enabled: true
    action: "map"
    target_tool_name: "shell"
    parameter_mapping:
      command: code

  - source_tool_name: "Read"
    enabled: true
    action: "map"
    target_tool_name: "file_reader"
    parameter_mapping:
      file_path: path
      offset: start_line
      limit: line_count

  # Disable internal tools
  - source_tool_name: "Task"
    enabled: false
    action: "ignore"

Format Configuration (configs/{format}/format_mapping.yml)

Each format defines its message structure:

# OpenAI example
roles:
  assistant: "assistant"
  user: "user"
  tool_result: "tool"

fields:
  assistant_tool_calls:
    enabled: true
    key_name: "tool_calls"
  tool_result_tool_call_id:
    enabled: true
    key_name: "tool_call_id"

tool_call_structure: "openai"
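A format config like the one above drives how internal turns are rendered: roles are renamed and optional fields are emitted under the format's key names. A minimal sketch under the assumption that internal turns are flat dicts (names and shapes are illustrative, not the project's message_formatter.py API):

```python
# Mirrors the OpenAI format_mapping.yml example above
FORMAT_CONFIG = {
    "roles": {"assistant": "assistant", "user": "user", "tool_result": "tool"},
    "fields": {
        "assistant_tool_calls": {"enabled": True, "key_name": "tool_calls"},
        "tool_result_tool_call_id": {"enabled": True, "key_name": "tool_call_id"},
    },
}

def format_turn(turn: dict, cfg: dict = FORMAT_CONFIG) -> dict:
    """Translate an internal turn into the target format's role and field names."""
    msg = {"role": cfg["roles"][turn["role"]], "content": turn.get("content", "")}
    if turn.get("tool_calls") and cfg["fields"]["assistant_tool_calls"]["enabled"]:
        msg[cfg["fields"]["assistant_tool_calls"]["key_name"]] = turn["tool_calls"]
    if turn.get("tool_call_id") and cfg["fields"]["tool_result_tool_call_id"]["enabled"]:
        msg[cfg["fields"]["tool_result_tool_call_id"]["key_name"]] = turn["tool_call_id"]
    return msg
```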

Architecture

Log Files (Claude/Gemini/Cursor/Cline)
    │
    ▼
┌─────────────────────────────────────┐
│  Parsers                            │
│  - conversation_parser.py (Claude)  │
│  - gemini_parser.py                 │
│  - cursor_parser.py                 │
│  - cline_parser.py                  │
└─────────────────────────────────────┘
    │
    ▼ Standardized Turn Format
┌─────────────────────────────────────┐
│  Branch Extraction                  │
│  (Tree → Linear paths)              │
└─────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────┐
│  Message Formatter                  │
│  - Tool mapping                     │
│  - Format conversion                │
│  - Interleaved/flattened content    │
└─────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────┐
│  Chunker                            │
│  - Token/character limits           │
│  - Overlap handling                 │
│  - Continuation markers             │
└─────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────┐
│  PII Stripper                       │
│  - Pattern matching                 │
│  - Filepath redaction               │
└─────────────────────────────────────┘
    │
    ▼
JSONL Output (SFT Dataset)

Project Structure

SFTizer/
├── main.py                 # Entry point
├── discover_tools.py       # Tool discovery utility
├── config.yml              # Main configuration
├── configs/
│   ├── openai/
│   │   └── format_mapping.yml
│   ├── langchain/
│   │   └── format_mapping.yml
│   ├── hermes/
│   │   └── format_mapping.yml
│   ├── llama3/
│   │   └── format_mapping.yml
│   └── shared/
│       └── tool_mapping.yml
├── utils/
│   ├── config_loader.py
│   ├── config_validator.py
│   ├── conversation_parser.py
│   ├── gemini_parser.py
│   ├── cursor_parser.py
│   ├── cline_parser.py
│   ├── message_formatter.py
│   ├── chunker.py
│   ├── pii_stripper.py
│   └── tokenizer_loader.py
└── tests/
    ├── conftest.py
    ├── fixtures/
    ├── validators/
    └── test_*.py

Testing

# Run all tests
pytest

# Run with coverage
pytest --cov=utils --cov-report=html

# Run specific test file
pytest tests/test_chunker.py

# Run in parallel
pytest -n auto

# Skip slow tests
pytest -m "not slow"

The test suite includes 348+ tests covering:

  • Unit tests for all parsers and utilities
  • Integration tests for end-to-end workflows
  • Format validation for all output formats
  • Performance benchmarks

Utility Scripts

Discover Tools

Find all unique tools used in your logs:

python discover_tools.py

This helps populate tool_mapping.yml with tools found in your conversation logs.

Output Format

Each line in the output JSONL contains:

{
  "messages": [
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "...", "tool_calls": [...]},
    {"role": "tool", "tool_call_id": "...", "content": "..."}
  ],
  "source_file": "/path/to/log.jsonl",
  "conversation_id": "abc123",
  "branch_index": 0,
  "chunk_index": 0,
  "timestamp": "2025-01-15T10:30:00Z"
}
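Since the output is standard JSONL (one JSON object per line), it can be loaded with nothing but the standard library, for example to spot-check a generated dataset before training:

```python
import json

def load_sft_dataset(path: str) -> list[dict]:
    """Read one training example per non-empty line from an SFT JSONL file."""
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines
                examples.append(json.loads(line))
    return examples
```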

Oversized Message Strategies

When a single message exceeds the chunk limit:

| Strategy | Behavior |
| --- | --- |
| keep | Include as-is (may exceed limits) |
| truncate | Cut to fit, appending a truncation marker |
| split | Split into multiple chunks with continuation markers |
| exclude | Skip the message entirely |
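The four strategies can be sketched as a simple dispatch, here using character counts for brevity (the real chunker can count tokens; function name and signature are illustrative):

```python
def handle_oversized(message: str, limit: int, strategy: str,
                     marker: str = "[Continues in next chunk...]") -> list[str]:
    """Apply an oversized_message_strategy to a message over the chunk limit,
    returning zero or more chunks."""
    if len(message) <= limit or strategy == "keep":
        return [message]               # keep: pass through even if over the limit
    if strategy == "exclude":
        return []                      # exclude: drop the message entirely
    if strategy == "truncate":
        return [message[: limit - len(marker)] + marker]
    if strategy == "split":
        return [message[i : i + limit] for i in range(0, len(message), limit)]
    raise ValueError(f"unknown oversized_message_strategy: {strategy}")
```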

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Run tests: pytest
  4. Submit a pull request

License

MIT License

About

Parsing of Claude Code (and other AI assistant) logs for SFT dataset generation
