diff --git a/.claude/backends.md b/.claude/backends.md
new file mode 100644
index 00000000000..5112b2e29e0
--- /dev/null
+++ b/.claude/backends.md
@@ -0,0 +1,34 @@
+# Backends
+
+| Backend | Platform | Hardware | Location |
+|---------|----------|----------|----------|
+| XNNPACK | All | CPU | `backends/xnnpack/` |
+| CUDA | Linux/Windows | GPU | `backends/cuda/` |
+| CoreML | iOS, macOS | NPU/GPU/CPU | `backends/apple/coreml/` |
+| MPS | iOS, macOS | GPU | `backends/apple/mps/` |
+| Vulkan | Android | GPU | `backends/vulkan/` |
+| QNN | Android | NPU | `backends/qualcomm/` |
+| MediaTek | Android | NPU | `backends/mediatek/` |
+| Arm Ethos-U | Embedded | NPU | `backends/arm/` |
+| OpenVINO | Embedded | CPU/GPU/NPU | `backends/openvino/` |
+| Cadence | Embedded | DSP | See `backends-cadence.md` |
+| Samsung | Android | NPU | `backends/samsung/` |
+
+## Partitioner imports
+```python
+from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
+from executorch.backends.apple.coreml.partition.coreml_partitioner import CoreMLPartitioner
+from executorch.backends.qualcomm.partition.qnn_partitioner import QnnPartitioner
+from executorch.backends.vulkan.partition.vulkan_partitioner import VulkanPartitioner
+```
+
+## Usage pattern
+```python
+from executorch.exir import to_edge
+
+edge = to_edge(exported_program)
+edge = edge.to_backend(XnnpackPartitioner())  # or another backend's partitioner
+exec_prog = edge.to_executorch()
+```
+
+Ops the partitioner does not claim fall back to the portable CPU kernels. Apply multiple partitioners to set a fallback priority order.
diff --git a/.claude/faq.md b/.claude/faq.md
new file mode 100644
index 00000000000..1aa415e4ce9
--- /dev/null
+++ b/.claude/faq.md
@@ -0,0 +1,35 @@
+# Common Errors
+
+## Error Codes
+Error codes are defined in `runtime/core/error.h`.
+
+| Code | Name | Common Cause |
+|------|------|--------------|
+| 0x10 | InvalidArgument | Input shape mismatch - inputs don't match the shapes used at export. Use dynamic shapes if needed. |
+| 0x14 | OperatorMissing | Selective build is missing an operator. Regenerate `et_operator_library` from the current model. |
+| 0x20 | NotFound | Missing backend. Link with `--whole-archive`: `-Wl,--whole-archive libxnnpack_backend.a -Wl,--no-whole-archive` |
+
+## Export Issues
+
+**Missing out variants**: Custom ops need an ExecuTorch implementation. See `kernel-library-custom-aten-kernel.md`.
+
+**RuntimeError: convert function not implemented**: Unsupported operator. File a GitHub issue.
+
+## Runtime Issues
+
+**Slow inference**:
+1. Build with `-DCMAKE_BUILD_TYPE=Release`
+2. Ensure the model is delegated (use `XnnpackPartitioner`)
+3. Set thread count: `threadpool::get_threadpool()->_unsafe_reset_threadpool(num_threads)`
+
+**Numerical accuracy**: Use devtools to debug. See the `/profile` skill.
+
+**Error setting input 0x10**: Input shape mismatch. Specify dynamic shapes at export (see the sketch at the end of this file).
+
+**Duplicate kernel registration abort**: Multiple `gen_operators_lib` outputs linked. Use only one per target.
+
+## Installation
+
+**Missing python-dev**: `sudo apt install python-dev`
+
+**Missing pytorch_tokenizers**: `pip install -e ./extension/llm/tokenizers/`
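+
+## Dynamic Shapes Sketch
+
+A minimal sketch of exporting with a dynamic batch dimension, assuming the first input dimension is the one that varies; `model`, `example_input`, and the argument name `"x"` are placeholders for your own module:
+
+```python
+import torch
+from torch.export import Dim, export
+
+batch = Dim("batch", min=1, max=8)  # allowed range for dim 0
+exported = export(
+    model.eval(),
+    (example_input,),
+    dynamic_shapes={"x": {0: batch}},  # "x" must match the forward() argument name
+)
+```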
diff --git a/.claude/llm-export.md b/.claude/llm-export.md
new file mode 100644
index 00000000000..65e3414b421
--- /dev/null
+++ b/.claude/llm-export.md
@@ -0,0 +1,65 @@
+# LLM Export
+
+High-level API for exporting LLMs to .pte format.
+
+## Supported Models
+Llama 2/3/3.1/3.2, Qwen 2.5/3, Phi 3.5/4-mini, SmolLM2
+
+Full list: `extension/llm/export/config/llm_config.py`
+
+For other models (Gemma, Mistral, BERT, Whisper): use optimum-executorch (see the `/setup` skill).
+
+## Basic Usage
+
+```bash
+python -m executorch.extension.llm.export.export_llm \
+    --config path/to/config.yaml
+```
+
+## Config Structure
+
+```yaml
+base:
+  model_class: llama3_2
+  checkpoint: path/to/consolidated.00.pth
+  params: path/to/params.json
+  metadata: '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}'
+
+model:
+  use_kv_cache: True  # recommended
+  use_sdpa_with_kv_cache: True  # recommended
+  use_attention_sink: False  # extend generation beyond the context window
+  quantize_kv_cache: False  # int8 KV cache
+
+quantization:
+  qmode: 8da4w  # int8 dynamic activation + int4 weight
+  group_size: 32
+  embedding_quantize: 4,32
+
+backend:
+  xnnpack:
+    enabled: True
+    extended_ops: True
+
+debug:
+  verbose: True  # show delegation table
+  generate_etrecord: True  # for devtools profiling
+```
+
+## Quantization Modes
+
+**TorchAO (XNNPACK)**:
+- `8da4w`: int8 dynamic activation + int4 weight
+- `int8`: int8 weight-only
+- `torchao:8da4w`: low-bit kernels for Arm
+
+**pt2e (QNN, CoreML, Vulkan)**: Use for non-CPU backends.
+
+## Config Classes
+All options in `extension/llm/export/config/llm_config.py`:
+- `LlmConfig` - top level
+- `ExportConfig` - max_seq_length, max_context_length
+- `ModelConfig` - model optimizations
+- `QuantizationConfig` - quantization options
+- `BackendConfig` - backend settings
+- `DebugConfig` - verbose, etrecord, profiling
diff --git a/.claude/quantization.md b/.claude/quantization.md
new file mode 100644
index 00000000000..94abee0431e
--- /dev/null
+++ b/.claude/quantization.md
@@ -0,0 +1,13 @@
+# Quantization
+
+Docs: https://docs.pytorch.org/ao/main/pt2e_quantization/index.html
+
+## Backend quantizers
+| Backend | Quantizer |
+|---------|-----------|
+| XNNPACK | `XNNPACKQuantizer` |
+| Qualcomm | `QnnQuantizer` |
+| CoreML | `CoreMLQuantizer` |
+
+## LLM modes
+See `examples/models/llama/source_transformation/quantize.py`: `int8`, `8da4w`, `4w`
diff --git a/.claude/runtime-api.md b/.claude/runtime-api.md
new file mode 100644
index 00000000000..4078ec68daa
--- /dev/null
+++ b/.claude/runtime-api.md
@@ -0,0 +1,28 @@
+# Runtime API
+
+## executorch.runtime (preferred)
+```python
+from pathlib import Path
+from executorch.runtime import Runtime, Program, Method
+runtime = Runtime.get()
+program = runtime.load_program(Path("model.pte"))
+outputs = program.load_method("forward").execute(inputs)
+```
+
+## portable_lib (low-level)
+```python
+from executorch.extension.pybindings.portable_lib import _load_for_executorch
+module = _load_for_executorch("model.pte")
+outputs = module.forward(inputs)
+```
+
+## Missing kernel fixes
+
+If the runtime reports missing kernel errors, import the kernel module before loading:
+
+```python
+# Missing quantized kernels (e.g., quantized_decomposed::embedding_byte.out)
+from executorch.kernels import quantized
+
+# Missing LLM custom ops (e.g., llama::custom_sdpa.out, llama::update_cache.out)
+from executorch.extension.llm.custom_ops import custom_ops
+```
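+
+## Sanity-checking outputs
+
+A rough sketch for comparing ExecuTorch output against the original eager module; `eager_model`, the input shape, and the tolerances are illustrative placeholders:
+
+```python
+import torch
+from executorch.runtime import Runtime
+
+method = Runtime.get().load_program("model.pte").load_method("forward")
+x = torch.randn(1, 3, 224, 224)  # replace with a real example input for your model
+et_out = method.execute([x])[0]  # execute takes a sequence of inputs
+ref_out = eager_model(x)  # the original nn.Module, if still available
+print(torch.allclose(et_out, ref_out, atol=1e-4, rtol=1e-4))
+```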
diff --git a/.claude/skills/building/SKILL.md b/.claude/skills/building/SKILL.md
new file mode 100644
index 00000000000..7ff7be38df1
--- /dev/null
+++ b/.claude/skills/building/SKILL.md
@@ -0,0 +1,23 @@
+---
+name: building
+description: Build ExecuTorch runners or C++ libraries. Use when compiling runners for Llama, Whisper, or other models, or building the C++ runtime.
+---
+
+# Building
+
+## Runners (Makefile)
+```bash
+make help  # list all targets
+make llama-cpu  # Llama on CPU
+make whisper-metal  # Whisper on Metal
+make gemma3-cuda  # Gemma3 on CUDA
+```
+
+Output: `cmake-out/examples/models/<model>/`
+
+## C++ Libraries (CMake)
+```bash
+cmake --list-presets  # list presets
+cmake --workflow --preset llm-release  # LLM CPU
+cmake --workflow --preset llm-release-metal  # LLM Metal
+```
diff --git a/.claude/skills/export/SKILL.md b/.claude/skills/export/SKILL.md
new file mode 100644
index 00000000000..c075e9403c6
--- /dev/null
+++ b/.claude/skills/export/SKILL.md
@@ -0,0 +1,28 @@
+---
+name: export
+description: Export a PyTorch model to .pte format for ExecuTorch. Use when converting models, lowering to edge, or generating .pte files.
+---
+
+# Export
+
+## Basic pattern
+```python
+from executorch.exir import to_edge_transform_and_lower
+from torch.export import export
+
+exported = export(model.eval(), example_inputs)
+edge = to_edge_transform_and_lower(exported)
+with open("model.pte", "wb") as f:
+    f.write(edge.to_executorch().buffer)
+```
+
+## Model-specific scripts
+| Model | Script |
+|-------|--------|
+| Llama | `examples/models/llama/export_llama.py` |
+| Whisper | `examples/models/whisper/` |
+| Parakeet | `examples/models/parakeet/export_parakeet_tdt.py` |
+
+## Debugging
+- Draft export: `export(model, inputs, strict=False)`
+- tlparse: `TORCH_LOGS="+dynamo,+export" python script.py 2>&1 | tlparse`
diff --git a/.claude/skills/profile/SKILL.md b/.claude/skills/profile/SKILL.md
new file mode 100644
index 00000000000..b118a8a61e4
--- /dev/null
+++ b/.claude/skills/profile/SKILL.md
@@ -0,0 +1,24 @@
+---
+name: profile
+description: Profile ExecuTorch model execution. Use when measuring performance, analyzing operator timing, or debugging slow models.
+---
+
+# Profile
+
+## 1. Enable ETDump when loading
+```python
+program = runtime.load_program("model.pte", enable_etdump=True, debug_buffer_size=int(1e7))
+```
+
+## 2. Execute and save
+```python
+outputs = program.load_method("forward").execute(inputs)
+program.write_etdump_result_to_file("etdump.etdp", "debug.bin")
+```
+
+## 3. Analyze with Inspector
+```python
+from executorch.devtools import Inspector
+inspector = Inspector(etrecord="model.etrecord", etdump_path="etdump.etdp")
+inspector.print_data_tabular()
+```
diff --git a/.claude/skills/setup/SKILL.md b/.claude/skills/setup/SKILL.md
new file mode 100644
index 00000000000..3b5e2955357
--- /dev/null
+++ b/.claude/skills/setup/SKILL.md
@@ -0,0 +1,15 @@
+---
+name: setup
+description: Set up ExecuTorch development environment. Use when installing dependencies, setting up conda environments, or preparing to develop with ExecuTorch.
+---
+
+# Setup
+
+1. Activate conda: `conda activate executorch`
+   - If not found: `conda env list | grep -E "(executorch|et)"`
+
+2. Install executorch: `./install_executorch.sh`
+
+3. (Optional) For Huggingface integration:
+   - Read the pinned commit from `.ci/docker/ci_commit_pins/optimum-executorch.txt`
+   - Install: `pip install git+https://github.com/huggingface/optimum-executorch.git@<commit>`
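+
+A quick smoke test (an illustrative sketch, not a required step) to confirm the install can export a model end to end:
+
+```python
+import torch
+from torch.export import export
+from executorch.exir import to_edge
+
+class Tiny(torch.nn.Module):
+    def forward(self, x):
+        return x + 1
+
+prog = to_edge(export(Tiny(), (torch.ones(2),))).to_executorch()
+print(f"exported {len(prog.buffer)} bytes")  # a non-zero size means the toolchain works
+```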
diff --git a/.claude/tokenizers.md b/.claude/tokenizers.md
new file mode 100644
index 00000000000..bcdc540136b
--- /dev/null
+++ b/.claude/tokenizers.md
@@ -0,0 +1,54 @@
+# Tokenizers
+
+C++ tokenizer implementations with Python bindings. Located in `extension/llm/tokenizers/`.
+
+## Installation
+```bash
+pip install -e ./extension/llm/tokenizers/
+```
+
+## Python API
+
+```python
+from pytorch_tokenizers import get_tokenizer
+
+# Auto-detect tokenizer type from file
+tokenizer = get_tokenizer("path/to/tokenizer.model")  # or .json
+
+# Encode/decode
+tokens = tokenizer.encode("Hello world")
+text = tokenizer.decode(tokens)
+```
+
+## Available Tokenizers
+
+| Class | Format | Use Case |
+|-------|--------|----------|
+| `HuggingFaceTokenizer` | `.json` | HuggingFace models |
+| `TiktokenTokenizer` | `.model` | OpenAI/Llama 3 |
+| `Llama2cTokenizer` | `.model` | Llama 2, SentencePiece |
+| `CppSPTokenizer` | `.model` | SentencePiece (C++) |
+
+## Direct Usage
+
+```python
+from pytorch_tokenizers import HuggingFaceTokenizer, TiktokenTokenizer, Llama2cTokenizer
+
+# HuggingFace (tokenizer.json)
+tokenizer = HuggingFaceTokenizer("tokenizer.json", "tokenizer_config.json")
+
+# Tiktoken (Llama 3, etc.)
+tokenizer = TiktokenTokenizer(model_path="tokenizer.model")
+
+# Llama2c/SentencePiece
+tokenizer = Llama2cTokenizer(model_path="tokenizer.model")
+```
+
+## C++ Tokenizers
+
+For C++ runners, include headers from `extension/llm/tokenizers/include/`:
+- `hf_tokenizer.h` - HuggingFace
+- `tiktoken.h` - Tiktoken
+- `sentencepiece.h` - SentencePiece
+- `llama2c_tokenizer.h` - Llama2c
+- `tekken.h` - Mistral Tekken v7
diff --git a/CLAUDE.md b/CLAUDE.md
index 67b29b9c652..8d0b6675186 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -1,67 +1,48 @@
-# Repo and framework name
+# ExecuTorch
-Refer to the repo/framework/runtime "executorch" (in lower cases) or "ExecuTorch" (in
-camel cases), not "ExecutorTorch". With limited code or comment length, maybe refer
-to the framework "ET" but consider it as very unofficial and not recommended.
+## Skills
+- `/setup` - Set up environment
+- `/export` - Export model to .pte
+- `/building` - Build runners or C++ libs
+- `/profile` - Profile execution
-# Install
+Reference docs in `.claude/`: backends, runtime-api, quantization, llm-export, faq, tokenizers
-## Python
+## Quick Reference
-If the user is mostly importing `executorch` module and experimenting with Ahead-Of-Time
-export flow, installation means installing `executorch` python package.
+**Install Python package:**
+```bash
+./install_executorch.sh  # first time (or .bat on Windows)
+pip install -e . --no-build-isolation  # subsequent installs
+```
-Python virtual environment or conda environment is highly recommended for installing
-executorch from source. Double check if the user wants to enable virtual enablement before
-building from source.
+**Build C++ libraries:** see `CMakeLists.txt`; for LLM/ASR runners use `Makefile` and `CMakePresets.json`
-First time install: run `install_executorch.sh` (or `install_executorch.bat` for Windows).
+**Run tests:** `pytest -n auto` (Python), `ctest --output-on-failure` (C++)
-This script handles dependencies properly (since `executorch` depends on nightly versions
-of `torch`, those packages won't be available in pip so need special index url).
+**Lint:** `lintrunner init && lintrunner -a`
-Subsequent install: run `pip install . -v --no-build-isolation` inside `executorch`
-directory.
+Details: [docs/source/using-executorch-building-from-source.md](docs/source/using-executorch-building-from-source.md)
-Editable mode is avilable (either through `install_executorch.sh` script or `pip install . -e`.
+## Naming
-Refer to more details in this [doc](docs/source/using-executorch-building-from-source.md).
+- Use "executorch" (lowercase) or "ExecuTorch" (camel case) +- Never "ExecutorTorch" +- "ET" only when space-constrained (unofficial) -## C++ -If the user is building basic executorch C++ libraries, refer to root level [CMakeLists.txt](CMakeLists.txt). +## Commits -If working with LLM/ASR runners, prefer to use [Makefile](Makefile) and cmake [presets](CMakePresets.json). +- Only commit when explicitly asked +- No bullet lists of changes; explain review order for large PRs, or omit for small ones +- Disclose PR was authored with Claude -Again refer to this [doc](docs/source/using-executorch-building-from-source.md#building-the-c-runtime) -for more details. +## Code Style -# Commit messages +- Minimal comments; code should be self-documenting +- Comments only for non-obvious global context +- No trivial (1-2 LOC) single-use helpers unless significantly improving readability +- Explicit state management; no dynamic `setattr`/`getattr` patterns +- Match existing style and architecture +- Assume reader knows ExecuTorch/PyTorch basics -Don't commit unless the user explicitly asks you to. - -When writing a commit message, don't make a bullet list of the individual -changes. Instead, if the PR is large, explain the order to review changes -(e.g., the logical progression), or if it's short just omit the bullet list -entirely. - -Disclose that the PR was authored with Claude. - -# Coding Style Guidelines - -Follow these rules for all code changes in this repository: - -- Minimize comments; be concise; code should be self-explanatory and self-documenting. -- Comments should be useful, for example, comments that remind the reader about - some global context that is non-obvious and can't be inferred locally. -- Don't make trivial (1-2 LOC) helper functions that are only used once unless - it significantly improves code readability. -- Prefer clear abstractions. State management should be explicit. - For example, if managing state in a Python class: there should be a clear - class definition that has all of the members: don't dynamically `setattr` - a field on an object and then dynamically `getattr` the field on the object. -- Match existing code style and architectural patterns. -- Assume the reader has familiarity with ExecuTorch and PyTorch. They may not be the expert - on the code that is being read, but they should have some experience in the - area. - -If uncertain, choose the simpler, more concise implementation. +**When uncertain: choose simpler, more concise.**