34 changes: 34 additions & 0 deletions .claude/backends.md
@@ -0,0 +1,34 @@
# Backends

| Backend | Platform | Hardware | Location |
|---------|----------|----------|----------|
| XNNPACK | All | CPU | `backends/xnnpack/` |
| CUDA | Linux, Windows | GPU | `backends/cuda/` |
| CoreML | iOS, macOS | NPU/GPU/CPU | `backends/apple/coreml/` |
| MPS | iOS, macOS | GPU | `backends/apple/mps/` |
| Vulkan | Android | GPU | `backends/vulkan/` |
| QNN | Android | NPU | `backends/qualcomm/` |
| MediaTek | Android | NPU | `backends/mediatek/` |
| Arm Ethos-U | Embedded | NPU | `backends/arm/` |
| OpenVINO | Embedded | CPU/GPU/NPU | `backends/openvino/` |
| Cadence | Embedded | DSP | See `backends-cadence.md` |
| Samsung | Android | NPU | `backends/samsung/` |

## Partitioner imports
```python
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
from executorch.backends.apple.coreml.partition.coreml_partitioner import CoreMLPartitioner
from executorch.backends.qualcomm.partition.qnn_partitioner import QnnPartitioner
from executorch.backends.vulkan.partition.vulkan_partitioner import VulkanPartitioner
```

## Usage pattern
```python
from executorch.exir import to_edge

edge = to_edge(exported_program)
edge = edge.to_backend(XnnpackPartitioner()) # or other partitioner
exec_prog = edge.to_executorch()
```

Unsupported ops fall back to portable CPU kernels. Pass multiple partitioners for priority fallback, as sketched below.
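
A minimal sketch of priority fallback, assuming the partitioner imports above; `to_edge_transform_and_lower` applies partitioners in list order, so earlier entries claim ops first:

```python
from executorch.exir import to_edge_transform_and_lower

# Ops CoreML can't handle fall through to XNNPACK;
# anything still unclaimed runs on portable CPU kernels.
edge = to_edge_transform_and_lower(
    exported_program,
    partitioner=[CoreMLPartitioner(), XnnpackPartitioner()],
)
exec_prog = edge.to_executorch()
```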
35 changes: 35 additions & 0 deletions .claude/faq.md
@@ -0,0 +1,35 @@
# Common Errors

## Error Codes
Error codes are defined in `runtime/core/error.h`.

| Code | Name | Common Cause |
|------|------|--------------|
| 0x10 | InvalidArgument | Input shape mismatch: runtime inputs don't match the shapes used at export. Use dynamic shapes if needed. |
| 0x14 | OperatorMissing | Selective build omitted an operator. Regenerate `et_operator_library` from the current model. |
| 0x20 | NotFound | Backend not registered. Link with `--whole-archive`: `-Wl,--whole-archive libxnnpack_backend.a -Wl,--no-whole-archive` |

## Export Issues

**Missing out variants**: Custom ops need an ExecuTorch out-variant implementation. See `kernel-library-custom-aten-kernel.md`.

**RuntimeError: convert function not implemented**: Unsupported operator. File a GitHub issue.

## Runtime Issues

**Slow inference**:
1. Build with `-DCMAKE_BUILD_TYPE=Release`
2. Ensure model is delegated (use `XnnpackPartitioner`)
3. Set thread count: `threadpool::get_threadpool()->_unsafe_reset_threadpool(num_threads)`

**Numerical accuracy**: Use devtools to debug. See `/profile` skill.

**Error setting input 0x10**: Input shape mismatch. Specify dynamic shapes at export, as sketched below.
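
A hedged sketch of exporting with a dynamic batch dimension via `torch.export.Dim` (the argument name `x`, shape, and bounds are illustrative):

```python
import torch
from torch.export import Dim, export

batch = Dim("batch", min=1, max=8)
exported = export(
    model.eval(),
    (torch.randn(2, 3, 224, 224),),    # example input
    dynamic_shapes={"x": {0: batch}},  # dim 0 of forward's arg "x" may vary
)
```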

**Duplicate kernel registration abort**: Multiple `gen_operators_lib` libraries linked into the same binary. Link only one per target.

## Installation

**Missing python-dev**: `sudo apt install python<version>-dev`

**Missing pytorch_tokenizers**: `pip install -e ./extension/llm/tokenizers/`
65 changes: 65 additions & 0 deletions .claude/llm-export.md
@@ -0,0 +1,65 @@
# LLM Export

High-level API for exporting LLMs to .pte format.

## Supported Models
Llama 2/3/3.1/3.2, Qwen 2.5/3, Phi 3.5/4-mini, SmolLM2

Full list: `extension/llm/export/config/llm_config.py`

For other models (Gemma, Mistral, BERT, Whisper): use optimum-executorch (see `/setup` skill).

## Basic Usage

```bash
python -m executorch.extension.llm.export.export_llm \
--config path/to/config.yaml
```
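
Individual options can also be set as hydra-style command-line overrides instead of a config file; a sketch, assuming the config keys shown below:

```bash
python -m executorch.extension.llm.export.export_llm \
    base.model_class=llama3_2 \
    base.checkpoint=path/to/consolidated.00.pth \
    model.use_kv_cache=True \
    backend.xnnpack.enabled=True
```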

## Config Structure

```yaml
base:
  model_class: llama3_2
  checkpoint: path/to/consolidated.00.pth
  params: path/to/params.json
  metadata: '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}'

model:
  use_kv_cache: True            # recommended
  use_sdpa_with_kv_cache: True  # recommended
  use_attention_sink: False     # allows generation beyond max context length
  quantize_kv_cache: False      # int8 KV cache

quantization:
  qmode: 8da4w                  # int8 dynamic activation + int4 weight
  group_size: 32
  embedding_quantize: 4,32

backend:
  xnnpack:
    enabled: True
    extended_ops: True

debug:
  verbose: True                 # show delegation table
  generate_etrecord: True       # for devtools profiling
```

## Quantization Modes

**TorchAO (XNNPACK)**:
- `8da4w`: int8 dynamic activation + int4 weight
- `int8`: int8 weight-only
- `torchao:8da4w`: low-bit kernels for Arm

**pt2e (QNN, CoreML, Vulkan)**: Use for non-CPU backends.

## Config Classes
All options in `extension/llm/export/config/llm_config.py`:
- `LlmConfig` - top level
- `ExportConfig` - max_seq_length, max_context_length
- `ModelConfig` - model optimizations
- `QuantizationConfig` - quantization options
- `BackendConfig` - backend settings
- `DebugConfig` - verbose, etrecord, profiling
13 changes: 13 additions & 0 deletions .claude/quantization.md
@@ -0,0 +1,13 @@
# Quantization

Docs: https://docs.pytorch.org/ao/main/pt2e_quantization/index.html

## Backend quantizers
| Backend | Quantizer |
|---------|-----------|
| XNNPACK | `XNNPACKQuantizer` |
| Qualcomm | `QnnQuantizer` |
| CoreML | `CoreMLQuantizer` |
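
A hedged end-to-end sketch of the pt2e flow with the XNNPACK quantizer (import paths have moved between releases; verify them against your install):

```python
import torch
from torch.ao.quantization.quantize_pt2e import convert_pt2e, prepare_pt2e
from executorch.backends.xnnpack.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)

quantizer = XNNPACKQuantizer()
quantizer.set_global(get_symmetric_quantization_config())

# Capture, insert observers, calibrate, then convert to a quantized graph.
captured = torch.export.export_for_training(model.eval(), example_inputs).module()
prepared = prepare_pt2e(captured, quantizer)
prepared(*example_inputs)  # calibration pass with representative data
quantized = convert_pt2e(prepared)
# Export and lower the quantized module as usual (export -> to_edge -> ...).
```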

## LLM modes
See `examples/models/llama/source_transformation/quantize.py`: `int8`, `8da4w`, `4w`
28 changes: 28 additions & 0 deletions .claude/runtime-api.md
@@ -0,0 +1,28 @@
# Runtime API

## executorch.runtime (preferred)
```python
from pathlib import Path

from executorch.runtime import Runtime

runtime = Runtime.get()
program = runtime.load_program(Path("model.pte"))
outputs = program.load_method("forward").execute(inputs)  # inputs: list of input tensors
```

## portable_lib (low-level)
```python
from executorch.extension.pybindings.portable_lib import _load_for_executorch
module = _load_for_executorch("model.pte")
outputs = module.forward(inputs)
```

## Missing kernel fixes

If the runtime reports missing kernel errors, import the corresponding kernel module before loading the program:

```python
# Missing quantized kernels (e.g., quantized_decomposed::embedding_byte.out)
from executorch.kernels import quantized

# Missing LLM custom ops (e.g., llama::custom_sdpa.out, llama::update_cache.out)
from executorch.extension.llm.custom_ops import custom_ops
```
23 changes: 23 additions & 0 deletions .claude/skills/building/SKILL.md
@@ -0,0 +1,23 @@
---
name: building
description: Build ExecuTorch runners or C++ libraries. Use when compiling runners for Llama, Whisper, or other models, or building the C++ runtime.
---

# Building

## Runners (Makefile)
```bash
make help # list all targets
make llama-cpu # Llama
make whisper-metal # Whisper on Metal
make gemma3-cuda # Gemma3 on CUDA
```

Output: `cmake-out/examples/models/<model>/<runner>`

## C++ Libraries (CMake)
```bash
cmake --list-presets # list presets
cmake --workflow --preset llm-release # LLM CPU
cmake --workflow --preset llm-release-metal # LLM Metal
```
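
If no preset fits, a manual configure works too; a sketch with commonly used options (flag names are assumptions to check against the top-level `CMakeLists.txt`):

```bash
cmake -S . -B cmake-out \
    -DCMAKE_BUILD_TYPE=Release \
    -DEXECUTORCH_BUILD_XNNPACK=ON \
    -DEXECUTORCH_BUILD_EXTENSION_MODULE=ON
cmake --build cmake-out -j$(nproc)
```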
28 changes: 28 additions & 0 deletions .claude/skills/export/SKILL.md
@@ -0,0 +1,28 @@
---
name: export
description: Export a PyTorch model to .pte format for ExecuTorch. Use when converting models, lowering to edge, or generating .pte files.
---

# Export

## Basic pattern
```python
from executorch.exir import to_edge_transform_and_lower
from torch.export import export

exported = export(model.eval(), example_inputs)
# Optionally pass partitioner=[XnnpackPartitioner()] to delegate supported ops.
edge = to_edge_transform_and_lower(exported)
with open("model.pte", "wb") as f:
    f.write(edge.to_executorch().buffer)
```

## Model-specific scripts
| Model | Script |
|-------|--------|
| Llama | `examples/models/llama/export_llama.py` |
| Whisper | `examples/models/whisper/` |
| Parakeet | `examples/models/parakeet/export_parakeet_tdt.py` |

## Debugging
- Non-strict export (more permissive tracing): `export(model, inputs, strict=False)`
- tlparse: `TORCH_LOGS="+dynamo,+export" python script.py 2>&1 | tlparse`
24 changes: 24 additions & 0 deletions .claude/skills/profile/SKILL.md
@@ -0,0 +1,24 @@
---
name: profile
description: Profile ExecuTorch model execution. Use when measuring performance, analyzing operator timing, or debugging slow models.
---

# Profile

## 1. Enable ETDump when loading
```python
program = runtime.load_program("model.pte", enable_etdump=True, debug_buffer_size=int(1e7))
```

## 2. Execute and save
```python
outputs = program.load_method("forward").execute(inputs)
program.write_etdump_result_to_file("etdump.etdp", "debug.bin")
```

## 3. Analyze with Inspector
```python
from executorch.devtools import Inspector
inspector = Inspector(etrecord="model.etrecord", etdump_path="etdump.etdp")
inspector.print_data_tabular()
```
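
The ETRecord referenced above is produced at export time; a hedged sketch using `executorch.devtools.generate_etrecord` (the argument order is an assumption to check against the devtools docs):

```python
import copy

from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
from executorch.devtools import generate_etrecord
from executorch.exir import to_edge

edge = to_edge(exported_program)
edge_copy = copy.deepcopy(edge)  # keep a pre-delegation copy for the record
exec_prog = edge.to_backend(XnnpackPartitioner()).to_executorch()
generate_etrecord("model.etrecord", edge_copy, exec_prog)
```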
15 changes: 15 additions & 0 deletions .claude/skills/setup/SKILL.md
@@ -0,0 +1,15 @@
---
name: setup
description: Set up ExecuTorch development environment. Use when installing dependencies, setting up conda environments, or preparing to develop with ExecuTorch.
---

# Setup

1. Activate conda: `conda activate executorch`
   - If not found: `conda env list | grep -E "(executorch|et)"`

2. Install executorch: `./install_executorch.sh`

3. (Optional) For Huggingface integration:
   - Read commit from `.ci/docker/ci_commit_pins/optimum-executorch.txt`
   - Install: `pip install git+https://github.com/huggingface/optimum-executorch.git@<COMMIT>`
54 changes: 54 additions & 0 deletions .claude/tokenizers.md
@@ -0,0 +1,54 @@
# Tokenizers

C++ tokenizer implementations with Python bindings. Located in `extension/llm/tokenizers/`.

## Installation
```bash
pip install -e ./extension/llm/tokenizers/
```

## Python API

```python
from pytorch_tokenizers import get_tokenizer

# Auto-detect tokenizer type from file
tokenizer = get_tokenizer("path/to/tokenizer.model") # or .json

# Encode/decode
tokens = tokenizer.encode("Hello world")
text = tokenizer.decode(tokens)
```

## Available Tokenizers

| Class | Format | Use Case |
|-------|--------|----------|
| `HuggingFaceTokenizer` | `.json` | HuggingFace models |
| `TiktokenTokenizer` | `.model` | OpenAI/Llama 3 |
| `Llama2cTokenizer` | `.model` | Llama 2, SentencePiece |
| `CppSPTokenizer` | `.model` | SentencePiece (C++) |

## Direct Usage

```python
from pytorch_tokenizers import HuggingFaceTokenizer, TiktokenTokenizer, Llama2cTokenizer

# HuggingFace (tokenizer.json)
tokenizer = HuggingFaceTokenizer("tokenizer.json", "tokenizer_config.json")

# Tiktoken (Llama 3, etc.)
tokenizer = TiktokenTokenizer(model_path="tokenizer.model")

# Llama2c/SentencePiece
tokenizer = Llama2cTokenizer(model_path="tokenizer.model")
```

## C++ Tokenizers

For C++ runners, include headers from `extension/llm/tokenizers/include/`:
- `hf_tokenizer.h` - HuggingFace
- `tiktoken.h` - Tiktoken
- `sentencepiece.h` - SentencePiece
- `llama2c_tokenizer.h` - Llama2c
- `tekken.h` - Mistral Tekken v7