
SFTizer

A supervised fine-tuning (SFT) dataset generator that converts conversation logs from multiple AI assistant platforms into standardized training datasets.

Overview

SFTizer extracts and transforms conversation data from Claude, Gemini, Cursor, and Cline into high-quality SFT datasets compatible with popular training frameworks. It handles tool calls, conversation branching, chunking, and privacy protection out of the box.

Features

  • Multi-source ingestion - Parse logs from Claude CLI, Gemini CLI, Cursor AI editor, and Cline VS Code extension
  • Multiple output formats - Export to OpenAI, LangChain, Hermes, or Llama3 formats
  • Intelligent chunking - Token-based (via HuggingFace tokenizers) or character-based splitting with configurable overlap
  • Tool call normalization - Map 30+ source tools to unified target tools with parameter translation
  • Conversation branching - Extract all branches from tree-structured conversations (Claude)
  • PII protection - Configurable regex-based stripping of sensitive information
  • Parallel processing - Multi-process execution for large-scale dataset generation
  • Interleaved content - Preserve or flatten text/tool_use ordering based on target format
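The character-based splitting with overlap mentioned above can be sketched as follows. This is a minimal illustration of the overlap idea, not the project's actual chunker in utils/chunker.py; the function name and signature are hypothetical.

```python
def chunk_text(text: str, max_chars: int = 4096, overlap: int = 512) -> list[str]:
    """Split text into chunks of at most max_chars characters, where each
    chunk begins with the last `overlap` characters of the previous one."""
    if len(text) <= max_chars:
        return [text]
    chunks = []
    start = 0
    step = max_chars - overlap  # advance by less than a full chunk to create overlap
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break
        start += step
    return chunks
```

Token-based chunking works the same way, but counts tokenizer tokens instead of characters.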

Supported Sources

| Source | Format | Location |
| --- | --- | --- |
| Claude | JSONL | User-specified directories |
| Gemini CLI | JSON | ~/.gemini/tmp/*/chats/session-*.json |
| Cursor | SQLite | ~/.config/Cursor/User/globalStorage/state.vscdb |
| Cline | JSON | VS Code globalStorage task directories |

Output Formats

| Format | Description |
| --- | --- |
| OpenAI | Standard Chat Completions API structure |
| LangChain | Native LangChain/LangGraph format with dict arguments |
| Hermes | NousResearch Hermes format with XML tool tags |
| Llama3 | Meta Llama 3.1+ compatible format |

Installation

# Clone the repository
git clone https://github.com/thad0ctor/SFTizer.git
cd SFTizer

# Install dependencies
pip install -r requirements.txt

# For development/testing
pip install -r requirements-test.txt

Dependencies

  • Python 3.10+
  • PyYAML
  • tiktoken or transformers (for tokenization)

Quick Start

  1. Configure input sources in config.yml:
io_settings:
  input_log_directories: "~/.claude/logs"

  gemini_enabled: true
  gemini_input_directories:
    - "~/.gemini/tmp"

  cursor_enabled: true
  cursor_config_dir: "~/.config/Cursor"

  cline_enabled: true
  cline_input_directories:
    - "~/.config/Code/User/globalStorage/saoudrizwan.claude-dev/tasks"
  2. Run the generator:
python main.py
  3. Select an output format when prompted (OpenAI, LangChain, Hermes, or Llama3)

  4. Output is written to the configured output_sft_file (default: sft_dataset.jsonl)

Configuration

Main Configuration (config.yml)

# Tokenizer settings
tokenizer_settings:
  use_tokenizer: true
  tokenizer_path: "meta-llama/Llama-3.1-8B"

# SFT generation settings
sft_settings:
  max_tokens: 4096
  reserved_tokens: 64
  overlap_tokens: 512
  oversized_message_strategy: "split"  # keep, truncate, split, exclude

# Continuation markers for split conversations
continuation_markers:
  enabled: true
  continuation_prefix: "[Continued from previous chunk {chunk_index}/{total_chunks}]"
  truncation_suffix: "[Continues in next chunk...]"

# Metadata to include in output
metadata:
  enabled: true
  include_source_file: true
  include_conversation_id: true
  include_branch_index: true
  include_chunk_index: true

# PII stripping
pii_stripping:
  enabled: true
  strip_filepaths: true
  filepath_replacement_prefix: "/home/user"
  patterns:
    - pattern: '\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
      replacement_type: "static"
      replacement: "[EMAIL]"

# Processing
processing_settings:
  num_workers: 4

# I/O settings
io_settings:
  output_sft_file: "sft_dataset.jsonl"
  unmapped_tools_file: "unmapped_tools.yml"
  exclude_patterns:
    - "**/test_*"
    - "**/.archive/**"

Tool Mapping (configs/shared/tool_mapping.yml)

Map source tools to unified target tools:

tool_mappings:
  # Claude tools
  - source_tool_name: "Bash"
    enabled: true
    action: "map"
    target_tool_name: "shell"
    parameter_mapping:
      command: code

  - source_tool_name: "Read"
    enabled: true
    action: "map"
    target_tool_name: "file_reader"
    parameter_mapping:
      file_path: path
      offset: start_line
      limit: line_count

  # Disable internal tools
  - source_tool_name: "Task"
    enabled: false
    action: "ignore"

Format Configuration (configs/{format}/format_mapping.yml)

Each format defines its message structure:

# OpenAI example
roles:
  assistant: "assistant"
  user: "user"
  tool_result: "tool"

fields:
  assistant_tool_calls:
    enabled: true
    key_name: "tool_calls"
  tool_result_tool_call_id:
    enabled: true
    key_name: "tool_call_id"

tool_call_structure: "openai"
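A format config like the one above drives how internal turns are rendered: roles are renamed and optional fields are emitted under the format's key names. A minimal sketch under the assumption that internal turns are flat dicts (names and shapes are illustrative, not the project's message_formatter.py API):

```python
# Mirrors the OpenAI format_mapping.yml example above
FORMAT_CONFIG = {
    "roles": {"assistant": "assistant", "user": "user", "tool_result": "tool"},
    "fields": {
        "assistant_tool_calls": {"enabled": True, "key_name": "tool_calls"},
        "tool_result_tool_call_id": {"enabled": True, "key_name": "tool_call_id"},
    },
}

def format_turn(turn: dict, cfg: dict = FORMAT_CONFIG) -> dict:
    """Translate an internal turn into the target format's role and field names."""
    msg = {"role": cfg["roles"][turn["role"]], "content": turn.get("content", "")}
    if turn.get("tool_calls") and cfg["fields"]["assistant_tool_calls"]["enabled"]:
        msg[cfg["fields"]["assistant_tool_calls"]["key_name"]] = turn["tool_calls"]
    if turn.get("tool_call_id") and cfg["fields"]["tool_result_tool_call_id"]["enabled"]:
        msg[cfg["fields"]["tool_result_tool_call_id"]["key_name"]] = turn["tool_call_id"]
    return msg
```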

Architecture

Log Files (Claude/Gemini/Cursor/Cline)
    │
    ▼
┌─────────────────────────────────────┐
│  Parsers                            │
│  - conversation_parser.py (Claude)  │
│  - gemini_parser.py                 │
│  - cursor_parser.py                 │
│  - cline_parser.py                  │
└─────────────────────────────────────┘
    │
    ▼ Standardized Turn Format
┌─────────────────────────────────────┐
│  Branch Extraction                  │
│  (Tree → Linear paths)              │
└─────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────┐
│  Message Formatter                  │
│  - Tool mapping                     │
│  - Format conversion                │
│  - Interleaved/flattened content    │
└─────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────┐
│  Chunker                            │
│  - Token/character limits           │
│  - Overlap handling                 │
│  - Continuation markers             │
└─────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────┐
│  PII Stripper                       │
│  - Pattern matching                 │
│  - Filepath redaction               │
└─────────────────────────────────────┘
    │
    ▼
JSONL Output (SFT Dataset)

Project Structure

SFTizer/
├── main.py                 # Entry point
├── discover_tools.py       # Tool discovery utility
├── config.yml              # Main configuration
├── configs/
│   ├── openai/
│   │   └── format_mapping.yml
│   ├── langchain/
│   │   └── format_mapping.yml
│   ├── hermes/
│   │   └── format_mapping.yml
│   ├── llama3/
│   │   └── format_mapping.yml
│   └── shared/
│       └── tool_mapping.yml
├── utils/
│   ├── config_loader.py
│   ├── config_validator.py
│   ├── conversation_parser.py
│   ├── gemini_parser.py
│   ├── cursor_parser.py
│   ├── cline_parser.py
│   ├── message_formatter.py
│   ├── chunker.py
│   ├── pii_stripper.py
│   └── tokenizer_loader.py
└── tests/
    ├── conftest.py
    ├── fixtures/
    ├── validators/
    └── test_*.py

Testing

# Run all tests
pytest

# Run with coverage
pytest --cov=utils --cov-report=html

# Run specific test file
pytest tests/test_chunker.py

# Run in parallel
pytest -n auto

# Skip slow tests
pytest -m "not slow"

The test suite includes 348+ tests covering:

  • Unit tests for all parsers and utilities
  • Integration tests for end-to-end workflows
  • Format validation for all output formats
  • Performance benchmarks

Utility Scripts

Discover Tools

Find all unique tools used in your logs:

python discover_tools.py

This helps populate tool_mapping.yml with tools found in your conversation logs.

Output Format

Each line in the output JSONL contains:

{
  "messages": [
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "...", "tool_calls": [...]},
    {"role": "tool", "tool_call_id": "...", "content": "..."}
  ],
  "source_file": "/path/to/log.jsonl",
  "conversation_id": "abc123",
  "branch_index": 0,
  "chunk_index": 0,
  "timestamp": "2025-01-15T10:30:00Z"
}
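Since the output is standard JSONL (one JSON object per line), it can be loaded with nothing but the standard library, for example to spot-check a generated dataset before training:

```python
import json

def load_sft_dataset(path: str) -> list[dict]:
    """Read one training example per non-empty line from an SFT JSONL file."""
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines
                examples.append(json.loads(line))
    return examples
```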

Oversized Message Strategies

When a single message exceeds the chunk limit:

| Strategy | Behavior |
| --- | --- |
| keep | Include as-is (may exceed limits) |
| truncate | Cut to fit, appending a truncation marker |
| split | Split into multiple chunks with continuation markers |
| exclude | Skip the message entirely |
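The four strategies can be sketched as a simple dispatch, here using character counts for brevity (the real chunker can count tokens; function name and signature are illustrative):

```python
def handle_oversized(message: str, limit: int, strategy: str,
                     marker: str = "[Continues in next chunk...]") -> list[str]:
    """Apply an oversized_message_strategy to a message over the chunk limit,
    returning zero or more chunks."""
    if len(message) <= limit or strategy == "keep":
        return [message]               # keep: pass through even if over the limit
    if strategy == "exclude":
        return []                      # exclude: drop the message entirely
    if strategy == "truncate":
        return [message[: limit - len(marker)] + marker]
    if strategy == "split":
        return [message[i : i + limit] for i in range(0, len(message), limit)]
    raise ValueError(f"unknown oversized_message_strategy: {strategy}")
```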

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Run tests: pytest
  4. Submit a pull request

License

MIT License

About

Parsing of Claude Code (and other AI assistant) logs for SFT dataset generation
