Merged
27 changes: 21 additions & 6 deletions CLAUDE.md
@@ -1,6 +1,8 @@
# ltx.cpp

C++ inference engine for LTX-Video (LTX 2.3) — text-to-video and image-to-video generation using GGML backends (Metal, CUDA, CPU).
C++ inference engine for LTX-Video (LTX 2.3) — text-to-video, image-to-video, and **audio-video (AV)** generation using GGML backends (Metal, CUDA, CPU).

**Branch `audio-video`**: same DiT sees concatenated video+audio latent; one denoise loop; output is video frames + WAV. See `docs/AV_PIPELINE.md` and README “Audio-video (AV)” section.

## Build

@@ -23,15 +25,17 @@ cmake -B build -DLTX_HIP=ON # ROCm/AMD

## Models

Download with `./models.sh` (requires `huggingface-cli`):
Download with `./models.sh` (requires `curl` or `wget`):
```bash
./models.sh # DiT Q4_K_M + T5 Q8_0 + VAE + extras
./models.sh # Dev DiT (default) + T5 + VAE + extras
./models.sh --distilled # Distilled DiT (4–8 steps, CFG=1) from the same repo
./models.sh --minimal # DiT + T5 + VAE only
./models.sh --quant Q8_0 # different DiT quant
```

Models land flat under `models/`. Key files:
- `models/ltx-2.3-22b-dev-Q4_K_M.gguf` — DiT weights
- `models/ltx-2.3-22b-dev-Q4_K_M.gguf` — Dev DiT (default)
- `models/ltx-2.3-22b-distilled-Q4_K_M.gguf` — Distilled DiT (with `--distilled`)
- `models/ltx-2.3-22b-dev_video_vae.safetensors` — VAE
- `models/t5-v1_1-xxl-encoder-Q8_0.gguf` — T5 text encoder

@@ -47,9 +51,14 @@ build/ltx-generate \
--steps 20 --out output/frame
```

**Audio-video (AV):** add `--av` and optionally `--out-wav path.wav` to get video frames + WAV from the same run. Mux with ffmpeg: `ffmpeg -framerate 24 -i out_%04d.ppm -i out.wav -c:v libx264 -c:a aac -shortest out.mp4`.

Useful flags:
- `-v` — verbose per-step logging
- `--perf` — print CPU%/RSS/free-RAM/GPU-MB to stderr every 10 s
- `--av` — enable audio+video path (concat latent → DiT → split → decode both)
- `--audio-vae path` — optional; reserved for the full audio VAE decoder (not yet implemented)
- `--out-wav path` — WAV output when `--av` (default: `<out prefix>.wav`)
- `--start-frame img.png` — image-to-video (I2V)
- `--end-frame img.png` — keyframe interpolation
- `--seed N`, `--cfg F`, `--shift F`, `--threads N`
@@ -65,14 +74,15 @@ BIN=build_debug/ltx-generate bash scripts/test-gpu-migration.sh

| File | Purpose |
|------|---------|
| `src/ltx-generate.cpp` | Main binary: arg parsing, model loading, denoising loop |
| `src/ltx_dit.hpp` | DiT transformer (forward pass, block loop, Metal/CPU paths) |
| `src/ltx-generate.cpp` | Main binary: arg parsing, model loading, denoising loop; AV path (concat/split, WAV output) |
| `src/ltx_dit.hpp` | DiT transformer (forward pass, block loop); `patchify_audio` / `unpatchify_audio` for AV |
| `src/video_vae.hpp` | VAE encoder/decoder (safetensors) |
| `src/t5_encoder.hpp` | T5-XXL text encoder (GGUF) |
| `src/scheduler.hpp` | RF flow scheduler (timesteps, Euler step, CFG) |
| `src/ltx_perf.hpp` | Background perf monitor thread (CPU/RAM stats) |
| `src/ltx_common.hpp` | Shared macros (`LTX_LOG`, `LTX_ERR`), GGML helpers |
| `src/safetensors_loader.cpp` | safetensors file loader |
| `docs/AV_PIPELINE.md` | AV pipeline design (token concat, shapes, CLI) |

## Architecture notes

@@ -87,3 +97,8 @@ BIN=build_debug/ltx-generate bash scripts/test-gpu-migration.sh
| Variable | Default | Effect |
|----------|---------|--------|
| `LTX_MIGRATE_MAX_TENSOR_MB` | `6144` | Max single-tensor size for GPU migration |

## Branch: audio-video

- **AV path**: with `--av`, video and audio latents are patchified and concatenated (video tokens first, then audio), passed through a single DiT forward, then split. An Euler step is applied to both; video is decoded with the existing VAE, and audio is turned into a WAV via a latent→waveform fallback.
- **Full audio VAE** (safetensors decoder) is not yet implemented; audio quality is limited by the fallback. See `docs/AV_PIPELINE.md` and `DEV.md` §5.
93 changes: 62 additions & 31 deletions DEV.md
@@ -16,7 +16,8 @@ navigate the known limitations.
- [Build configurations](#build-configurations)
- [Obtaining model files](#obtaining-model-files)
4. [End-to-end data flow](#4-end-to-end-data-flow)
5. [Source file reference](#5-source-file-reference)
5. [Audio-video (AV) pipeline](#5-audio-video-av-pipeline)
6. [Source file reference](#6-source-file-reference)
- [ltx\_common.hpp](#ltx_commonhpp)
- [scheduler.hpp](#schedulerhpp)
- [t5\_encoder.hpp](#t5_encoderhpp)
@@ -25,25 +26,25 @@ navigate the known limitations.
- [ltx-generate.cpp](#ltx-generatecpp)
- [ltx-quantize.cpp](#ltx-quantizecpp)
- [convert.py](#convertpy)
6. [GGUF model format conventions](#6-gguf-model-format-conventions)
7. [GGUF model format conventions](#7-gguf-model-format-conventions)
- [DiT GGUF](#dit-gguf)
- [VAE GGUF](#vae-gguf)
- [T5 GGUF](#t5-gguf)
7. [Image-to-video (I2V) design](#7-image-to-video-i2v-design)
8. [Image-to-video (I2V) design](#8-image-to-video-i2v-design)
- [VaeEncoder](#vaeencoder)
- [Frame-conditioning schedule](#frame-conditioning-schedule)
- [Hard-pinning at t=0](#hard-pinning-at-t0)
8. [Key algorithms and design decisions](#8-key-algorithms-and-design-decisions)
9. [Key algorithms and design decisions](#9-key-algorithms-and-design-decisions)
- [Rectified Flow (RF) scheduling](#rectified-flow-rf-scheduling)
- [Classifier-free guidance](#classifier-free-guidance)
- [Patchify / unpatchify](#patchify--unpatchify)
- [Latent dimension formulas](#latent-dimension-formulas)
- [Tokenizer](#tokenizer)
9. [Adding a new backend (GPU/Metal/Vulkan)](#9-adding-a-new-backend-gpumetalvulkan)
10. [Known limitations and open tasks](#10-known-limitations-and-open-tasks)
11. [Coding conventions](#11-coding-conventions)
12. [Testing](#12-testing)
13. [Contributing](#13-contributing)
10. [Adding a new backend (GPU/Metal/Vulkan)](#10-adding-a-new-backend-gpumetalvulkan)
11. [Known limitations and open tasks](#11-known-limitations-and-open-tasks)
12. [Coding conventions](#12-coding-conventions)
13. [Testing](#13-testing)
14. [Contributing](#14-contributing)

---

@@ -90,6 +91,9 @@ ltx.cpp/
├── checkpoints.sh Download raw HF safetensors checkpoints
├── models.sh Download pre-quantised GGUF models from Unsloth/HF
├── quantize.sh Shell wrapper: run ltx-quantize on all BF16 GGUFs
├── docs/
│ ├── AV_PIPELINE.md Audio-video pipeline design (token concat, shapes, CLI)
│ └── LTX_COMFY_REFERENCE.md ComfyUI workflow reference
└── ggml/ Git submodule — GGML tensor library
```
@@ -246,12 +250,32 @@ CLI args
→ start_lat / end_lat [H_lat × W_lat × 128]

These latents are blended into the live denoising latent after each Euler step
(see §7 for the full schedule).
(see §8 for the full schedule).
```

---

## 5. Source file reference
## 5. Audio-video (AV) pipeline

**Branch: `audio-video`.** The LTX 2.3 GGUF DiT is a full **audio-video** model: it expects a single sequence of **concatenated video + audio** tokens and outputs a combined velocity that is split back into video and audio.

**Data flow when `--av`:**

1. **Latent init**: Video latent `[T_lat, H_lat, W_lat, C]` (as today) plus audio latent `[T_lat, 8, 16]` (C_audio=8, mel_bins=16), both filled with noise.
2. **Per step**:
- `patchify()` → video tokens `[n_video_tok, 128]`; `patchify_audio()` → audio tokens `[T_lat, 128]`.
   - Concat → `[n_tok_total, 128]`, where `n_tok_total = n_video_tok + T_lat`.
- `LtxDiT::forward(combined, n_tok_total, …)` → combined velocity.
- Split: first `n_video_tok` tokens → video velocity; remainder → audio velocity.
- Unpatchify both; Euler step on video latent and on audio latent.
- (Optional) frame conditioning on video only (unchanged).
3. **Decode**: Video VAE decode → PPM frames (unchanged). Audio: denoised audio latent → waveform via a **latent-to-waveform** path (fake mel + overlap-add sinusoids) → 16-bit WAV (16 kHz). A full **audio VAE decoder** (safetensors) can be integrated later for higher-quality audio.
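The WAV container in the decode step is simple enough to sketch. The following is a minimal canonical RIFF/PCM mono writer, not the repo's actual `write_wav()` (whose signature is not shown here); only the 16-bit / 16 kHz parameters come from the text, and a little-endian host is assumed:

```cpp
#include <cassert>   // for the usage check below
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>

// Minimal 16-bit PCM mono WAV writer (canonical 44-byte RIFF header).
// Sketch only: the repo's write_wav() may differ. Assumes a little-endian
// host, since RIFF fields are stored little-endian.
static bool write_wav_sketch(const char *path,
                             const std::vector<int16_t> &samples,
                             uint32_t sample_rate = 16000) {
    FILE *f = fopen(path, "wb");
    if (!f) return false;
    const uint32_t data_bytes = (uint32_t)(samples.size() * sizeof(int16_t));
    const uint32_t riff_size  = 36 + data_bytes;  // bytes after "RIFF"+size
    const uint32_t fmt_size   = 16;
    const uint16_t pcm = 1, channels = 1, bits = 16;
    const uint32_t byte_rate   = sample_rate * channels * bits / 8;
    const uint16_t block_align = channels * bits / 8;
    fwrite("RIFF", 1, 4, f); fwrite(&riff_size, 4, 1, f);
    fwrite("WAVE", 1, 4, f);
    fwrite("fmt ", 1, 4, f); fwrite(&fmt_size, 4, 1, f);
    fwrite(&pcm, 2, 1, f);   fwrite(&channels, 2, 1, f);
    fwrite(&sample_rate, 4, 1, f); fwrite(&byte_rate, 4, 1, f);
    fwrite(&block_align, 2, 1, f); fwrite(&bits, 2, 1, f);
    fwrite("data", 1, 4, f); fwrite(&data_bytes, 4, 1, f);
    fwrite(samples.data(), sizeof(int16_t), samples.size(), f);
    fclose(f);
    return true;
}
```

Any player that accepts plain PCM WAV will open the result, which makes this an easy format to validate by inspecting the first 44 bytes.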

**Code:** `patchify_audio` / `unpatchify_audio` in `ltx_dit.hpp`; combined patch buffer, split, and dual Euler step in `ltx-generate.cpp`; `write_wav()` and `latent_to_waveform()` in `ltx-generate.cpp`. Design details: [docs/AV_PIPELINE.md](docs/AV_PIPELINE.md).
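Sketched with flat buffers, the per-step concat / forward / split / dual-Euler sequence looks like this (assumptions: patch_dim is 128 for both modalities as stated above, and `dit_forward` is a placeholder standing in for `LtxDiT::forward()`, here the identity so the sketch is runnable):

```cpp
#include <cassert>  // for the usage check below
#include <cstddef>
#include <vector>

// One AV denoise step, sketched. Tokens are row-major [n_tok, 128] flat
// buffers; dit_forward is a placeholder for LtxDiT::forward().
using Tokens = std::vector<float>;

static Tokens dit_forward(const Tokens &combined) { return combined; }

static void av_step(Tokens &video_tok, Tokens &audio_tok, float dt) {
    const size_t pd = 128;                       // patch_dim, both modalities
    const size_t n_video = video_tok.size() / pd;
    // 1. Concat: video tokens first, then audio tokens.
    Tokens combined = video_tok;
    combined.insert(combined.end(), audio_tok.begin(), audio_tok.end());
    // 2. Single DiT forward over the combined sequence -> combined velocity.
    Tokens vel = dit_forward(combined);
    // 3. Split (first n_video rows = video velocity) and Euler step on both:
    //    x <- x + dt * v.
    for (size_t i = 0; i < video_tok.size(); ++i)
        video_tok[i] += dt * vel[i];
    for (size_t i = 0; i < audio_tok.size(); ++i)
        audio_tok[i] += dt * vel[n_video * pd + i];
}
```

In the real loop `dt` is the signed timestep difference from the RF scheduler, and unpatchify runs between the split and the Euler step so the update is applied on the latent volumes rather than on tokens.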

---

## 6. Source file reference

### `ltx_common.hpp`

@@ -332,7 +356,7 @@ Weights layout expected in the GGUF (prefix `vae.decoder.*`):

`decode(latents, T_lat, H_lat, W_lat)` runs a simplified per-frame 2-D decode
with nearest-neighbour temporal upsampling. Full causal 3-D conv decode is a
planned improvement (see §10).
planned improvement (see §11).

#### `VaeEncoder`

@@ -368,10 +392,12 @@ model.diffusion_model.norm_out.linear.{weight,bias}
```
Fallback names with prefix `dit.*` are also tried.

**Audio (AV pipeline)**: `patchify_audio(lat, T, C, F)` and `unpatchify_audio(tok, T, C, F)` in the same header convert audio latent `[T, 8, 16]` ↔ `[T, 128]` tokens for concatenation with video tokens before the single DiT forward.

**Forward pass** (per call to `LtxDiT::forward()`):
1. Sinusoidal timestep embedding → MLP → `hidden_size` vector
2. AdaLN-single linear → `6 × hidden_size` (scale/shift params; currently
stored but not yet fully applied per-block — see §10)
stored but not yet fully applied per-block — see §11)
3. Patchify projection: `[N_tok, patch_dim]` → `[N_tok, hidden_size]`
4. Caption projection: `[S, 4096]` → `[S, hidden_size]`
5. N × transformer blocks:
@@ -411,11 +437,13 @@ Orchestrates the full inference pipeline.
| `start_frame_path` | `--start-frame` | `""` (disabled) |
| `end_frame_path` | `--end-frame` | `""` (disabled) |
| `frame_strength` | `--frame-strength` | 1.0 |
| `av` | `--av` | false (enable audio+video path) |
| `audio_vae_path` | `--audio-vae` | `""` (optional; reserved for the full audio VAE decoder) |
| `out_wav` | `--out-wav` | `""` (default: `<out prefix>.wav` when `--av`) |
| `threads` | `--threads` | 4 |
| `verbose` | `-v` | false |

**Output**: frames are written as `{out_prefix}_{NNNN}.ppm`. The output
directory is created automatically (including intermediate directories).
**Output**: frames are written as `{out_prefix}_{NNNN}.ppm`. When `--av`, a WAV file is also written (default `{out_prefix}.wav`). The output directory is created automatically (including intermediate directories).

---

@@ -430,7 +458,7 @@
- Everything else → quantised to `target_type`

All GGUF KV metadata is copied verbatim. String arrays (e.g. the tokenizer
vocabulary) are not currently copied — this is a known limitation (see §10).
vocabulary) are not currently copied — this is a known limitation (see §11).

Supported quant types: `Q4_K_M`, `Q5_K_M`, `Q6_K`, `Q8_0`, `BF16`, `F32`, `F16`.

@@ -458,7 +486,7 @@ For T5, the HF tokenizer vocabulary can be embedded into the GGUF via

---

## 6. GGUF model format conventions
## 7. GGUF model format conventions

### DiT GGUF

@@ -501,7 +529,7 @@ Architecture string: `"t5"`

---

## 7. Image-to-video (I2V) design
## 8. Image-to-video (I2V) design

The I2V implementation does not modify the DiT architecture. Instead it
works by conditioning the *latent* directly at the boundary frames before and
@@ -553,7 +581,7 @@ appearance, regardless of any residual denoising drift.

---

## 8. Key algorithms and design decisions
## 9. Key algorithms and design decisions

### Rectified Flow (RF) scheduling

Expand Down Expand Up @@ -587,17 +615,19 @@ The unconditional embedding is computed by encoding the `--neg` prompt

### Patchify / unpatchify

The DiT operates on *tokens*, not on the raw latent volume. The latent
The DiT operates on *tokens*, not on the raw latent volume. The **video** latent
`[T_lat, H_lat, W_lat, C]` is chunked into non-overlapping patches of size
`(pt=1, ph=2, pw=2)` along the temporal, height, and width dimensions:

```
patch_dim = pt * ph * pw * C = 1 * 2 * 2 * 128 = 512
patch_dim = pt * ph * pw * C = 1 * 2 * 2 * 128 = 512 (or 128 for C=32)
N_tok = (T_lat/pt) * (H_lat/ph) * (W_lat/pw)
```

`patchify()` and `unpatchify()` are helper functions called from
`ltx-generate.cpp`. Both are pure memory rearrangements with no arithmetic.
`patchify()` and `unpatchify()` are helper functions in `ltx_dit.hpp` called from
`ltx-generate.cpp`. For the **audio-video** path, `patchify_audio()` and
`unpatchify_audio()` convert audio latent `[T, 8, 16]` to/from `[T, 128]` tokens;
video and audio token sequences are concatenated before the DiT forward and split after. All are pure memory rearrangements with no arithmetic.
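As a concrete sketch of the video-side rearrangement (hypothetical small dims; the intra-patch ordering chosen here is row-then-column and may differ from the repo's `patchify()`), with the audio case trivially a reshape since 8 * 16 = 128:

```cpp
#include <cassert>  // for the usage check below
#include <cstddef>
#include <vector>

// Rearrange a row-major [T, H, W, C] latent (C innermost) into
// [N_tok, ph*pw*C] tokens with patch size (pt=1, ph=2, pw=2).
// Pure copies, no arithmetic. Sketch only; intra-patch order is assumed.
static std::vector<float> patchify_sketch(const std::vector<float> &lat,
                                          size_t T, size_t H, size_t W,
                                          size_t C) {
    const size_t ph = 2, pw = 2;
    std::vector<float> tok;
    tok.reserve(lat.size());
    for (size_t t = 0; t < T; ++t)
        for (size_t h = 0; h < H; h += ph)
            for (size_t w = 0; w < W; w += pw)
                for (size_t dh = 0; dh < ph; ++dh)
                    for (size_t dw = 0; dw < pw; ++dw) {
                        const size_t base =
                            ((t * H + h + dh) * W + (w + dw)) * C;
                        tok.insert(tok.end(), lat.begin() + base,
                                   lat.begin() + base + C);
                    }
    return tok;  // N_tok = T * (H/2) * (W/2) tokens of ph*pw*C floats each
}
```

Unpatchify is the exact inverse walk, writing each token's values back to the same offsets.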

### Latent dimension formulas

Expand Down Expand Up @@ -643,7 +673,7 @@ Scores are written by `convert.py --tokenizer` (via

---

## 9. Backends (GPU: Metal, CUDA, Vulkan, ROCm)
## 10. Backends (GPU: Metal, CUDA, Vulkan, ROCm)

We follow the same pattern as [acestep.cpp](https://github.com/ServeurpersoCom/acestep.cpp): **the build command determines the backend**. One backend per build; no platform-specific divergence in code.

@@ -658,7 +688,7 @@ The main performance bottleneck is the DiT `forward()` call, which rebuilds a `g

---

## 10. Known limitations and open tasks
## 11. Known limitations and open tasks

These are the main areas where the implementation is deliberately simplified
and where contributions are most welcome.
@@ -677,10 +707,11 @@ and where contributions are most welcome.
| 10 | **Threading** | `--threads` is parsed but not passed to `ggml_graph_compute_with_ctx` | Wire the thread count through to `ggml_graph_compute_with_ctx(ctx, gf, n_threads)` |
| 11 | **Output formats** | Only binary PPM (P6) output | Add JPEG/PNG output via stb_image_write or a similar library |
| 12 | **Windows `_mkdir`** | Only one level of directory is created on Windows | Implement recursive mkdir for Windows |
| 13 | **Audio VAE decoder** | With `--av`, audio is synthesized from the denoised latent via a fallback (fake mel + overlap-add); no full audio VAE decode yet | Load `ltx-2.3-22b-dev_audio_vae.safetensors` and implement 2D conv decoder (see docs/AV_PIPELINE.md) |

---

## 11. Coding conventions
## 12. Coding conventions

- **Language**: C++17 throughout; no exceptions (use return codes).
- **Headers only**: all modules live in `src/*.hpp`. Only the two `main()`
@@ -704,7 +735,7 @@ and where contributions are most welcome.

---

## 12. Testing
## 13. Testing

There is no formal test suite yet. Validation is currently done by:

@@ -726,16 +757,16 @@ There is no formal test suite yet. Validation is currently done by:

---

## 13. Contributing
## 14. Contributing

1. **Fork** the repository and create a branch from `main`.
2. **Read §10** to find where help is most needed.
2. **Read §11** to find where help is most needed.
3. **Keep PRs focused** — one feature or fix per PR.
4. **Match the style** described in §11.
4. **Match the style** described in §12.
5. **Document** any new CLI flag in both `print_usage()` (in
`ltx-generate.cpp`) and `README.md`.
6. **Update this file** (`DEV.md`) if you add a new module, change the GGUF
schema, or significantly alter the data flow.
schema, or significantly alter the data flow (e.g. AV pipeline in §5).
7. **No model weights** should ever be committed to the repo.

For questions, open a GitHub Discussion or issue in the