ltx.cpp — Developer Guide

This document collects everything a new contributor needs to understand the codebase, set up their development environment, extend the implementation, and navigate the known limitations.

Project overview
Repository layout
Getting started
End-to-end data flow
Audio-video (AV) pipeline
Source file reference
GGUF model format conventions
Image-to-video (I2V) design
Key algorithms and design decisions
Adding a new backend (GPU/Metal/Vulkan)
Known limitations and open tasks
Coding conventions
Testing
Contributing

1. Project overview

ltx.cpp is a self-contained C++17 inference engine for LTX-Video (Lightricks), built on top of GGML.

Goals:

No Python at runtime — all inference is done from a single compiled binary.
Cross-platform — CPU (any OS), CUDA, ROCm/HIP, Metal (macOS), Vulkan.
Memory-efficient — weights stored and computed in quantised GGUF format (Q4_K_M through BF16).
Three generation modes: text-to-video (T2V), image-to-video (I2V), and keyframe interpolation.

The project is intentionally not a 1:1 port of the original diffusers/PyTorch code; instead it provides a minimal, readable C++ implementation that is easy to extend.

2. Repository layout

ltx.cpp/
├── CMakeLists.txt        Build system (C++17 + GGML)
├── README.md             End-user documentation
├── DEV.md                ← this file
│
├── src/
│   ├── ltx_common.hpp    Shared utilities: GGUF loading, logging, VideoBuffer,
│   │                     image loading (stb_image), bilinear resize
│   ├── scheduler.hpp     Rectified-Flow Euler scheduler + CFG
│   ├── t5_encoder.hpp    T5-XXL text encoder (GGML graph)
│   ├── video_vae.hpp     CausalVideoVAE decoder + VaeEncoder (I2V)
│   ├── ltx_dit.hpp       LTX-Video DiT forward pass (GGML graph)
│   ├── ltx-generate.cpp  Main binary: argument parsing + inference orchestration
│   ├── ltx-quantize.cpp  Re-quantize GGUF files (BF16 → Q4_K_M / Q8_0 / …)
│   └── stb_image.h       Vendored stb_image v2.28 (public domain)
│
├── convert.py            Python: safetensors → GGUF conversion
├── checkpoints.sh        Download raw HF safetensors checkpoints
├── models.sh             Download pre-quantised GGUF models from Unsloth/HF
├── quantize.sh           Shell wrapper: run ltx-quantize on all BF16 GGUFs
├── docs/
│   ├── AV_PIPELINE.md    Audio-video pipeline design (token concat, shapes, CLI)
│   └── LTX_COMFY_REFERENCE.md  ComfyUI workflow reference
│
└── ggml/                 Git submodule — GGML tensor library

Key design rule: every module is a single header-only file (*.hpp). There is no separate src/ library — headers are included directly by ltx-generate.cpp. This keeps the build trivial and avoids link-time complexity.

3. Getting started

Prerequisites

Tool	Purpose	Minimum version
`cmake`	Build system	3.16
C++ compiler	Build	C++17 (GCC 9+, Clang 10+, MSVC 19.29+)
`git`	Submodule checkout	any
`python3` + `pip`	Model conversion (optional at inference time)	3.9+
`ffmpeg`	PPM → MP4 conversion (optional)	any
CUDA toolkit	GPU inference via CUDA (optional)	11.8+

Clone & initialise submodules

git clone https://github.com/audiohacking/ltx.cpp
cd ltx.cpp
git submodule update --init    # pulls the ggml submodule (~10 MB)

Build configurations

All options are passed as -D flags to CMake:

mkdir build && cd build

# ── CPU only (default) ───────────────────────────────────────────────────────
cmake ..

# ── NVIDIA GPU (CUDA) ────────────────────────────────────────────────────────
cmake .. -DLTX_CUDA=ON

# ── AMD GPU (ROCm/HIP) ───────────────────────────────────────────────────────
cmake .. -DLTX_HIP=ON

# ── Apple Silicon / macOS (Metal) ────────────────────────────────────────────
cmake .. -DLTX_METAL=ON

# ── Vulkan ───────────────────────────────────────────────────────────────────
cmake .. -DLTX_VULKAN=ON

# ── Build ────────────────────────────────────────────────────────────────────
cmake --build . --config Release -j$(nproc)

The CMake options (LTX_CUDA, LTX_HIP, LTX_METAL, LTX_VULKAN) forward to the corresponding GGML_* options in the ggml submodule — no extra wiring is needed.

Output binaries appear in build/:

ltx-generate — inference
ltx-quantize — quantization utility

Obtaining model files

Option A – pre-quantised GGUF (recommended for first run)

./models.sh             # downloads Q8_0 (~7 GB) into ./models/
./models.sh --quant Q4_K_M   # smaller, faster

Option B – convert from safetensors

pip install gguf safetensors transformers
./checkpoints.sh        # downloads raw HF checkpoints

python3 convert.py --model dit \
    --input  checkpoints/ltxv-2b-0.9.6-dev.safetensors \
    --output models/ltxv-2b-BF16.gguf

python3 convert.py --model vae \
    --input  checkpoints/ltxv-vae.safetensors \
    --output models/ltxv-vae-BF16.gguf

python3 convert.py --model t5 \
    --input  checkpoints/t5-xxl/ \
    --output models/t5-xxl-BF16.gguf

./quantize.sh Q8_0      # re-quantise BF16 → Q8_0

4. End-to-end data flow

Text-to-video (T2V)

CLI args
  │
  ├─ --prompt  →  T5Encoder::encode_text()
  │                  tokenise → GGML graph → float[seq_len × 4096]
  │
  ├─ --dit / --vae / --t5  →  LtxGgufModel::open()
  │                              gguf_init_from_file() loads tensors into ggml_context
  │
  │  latent dims:  T_lat = (frames-1)/4 + 1
  │               H_lat = height / 8
  │               W_lat = width  / 8
  │
  ├─ LtxRng::fill()  →  random noise latent  [T_lat × H_lat × W_lat × 128]
  │
  └─ denoising loop  (steps times):
       │
       ├─ patchify()        [T,H,W,C] → [N_tok, patch_dim]    (patch_size=1×2×2)
       │
       ├─ LtxDiT::forward() [N_tok, Pd] + text_emb + timestep → velocity [N_tok, Pd]
       │    └─ GGML graph: patchify proj → N×(self-attn + cross-attn + SwiGLU FFN) → proj_out
       │
       ├─ (if CFG) second forward() with uncond_emb → apply_cfg()
       │
       ├─ unpatchify()      velocity [N_tok, Pd] → [T,H,W,C]
       │
       ├─ RFScheduler::euler_step()   x_t += dt * v
       │
       └─ (if I2V) frame conditioning blend  (see §7)
            │
            └─ after final step: hard-pin reference frame latents
  │
  └─ VaeDecoder::decode()  [T_lat, H_lat, W_lat, 128] → [T_vid, H_vid, W_vid, 3]
       │
       └─ write_video_frames()  →  output/frame_NNNN.ppm

Image-to-video (I2V) additions

--start-frame / --end-frame  (PNG/JPG/BMP/TGA/PPM)
  │
  ├─ load_image()  →  VideoBuffer (stb_image, 8-bit RGB)
  │
  └─ VaeEncoder::encode_frame()
       ├─ resize_bilinear()  pixel [H×W×3] → latent spatial [H_lat×W_lat×3]
       ├─ normalise to [-1,1]
       └─ project 3-ch → 128-ch latent
            ├─ if conv_in_w present in GGUF: learned 1×1 conv projection
            └─ else: pseudo-encoding (channel tiling × 3.0 scale)

  → start_lat / end_lat  [H_lat × W_lat × 128]

  These latents are blended into the live denoising latent after each Euler step
  (see §8 for the full schedule).

5. Audio-video (AV) pipeline

Branch: audio-video. The LTX 2.3 GGUF DiT is a full audio-video model: it expects a single sequence of concatenated video + audio tokens and outputs a combined velocity that is split back into video and audio.

Data flow when --av:

Latent init: Video latent [T_lat, H_lat, W_lat, C] (as today) plus audio latent [T_lat, 8, 16] (C_audio=8, mel_bins=16), both filled with noise.
Per step:
- patchify() → video tokens [n_video_tok, 128]; patchify_audio() → audio tokens [T_lat, 128].
- Concat → [n_video_tok + T_lat, 128] = [n_tok_total, Pd].
- LtxDiT::forward(combined, n_tok_total, …) → combined velocity.
- Split: first n_video_tok tokens → video velocity; remainder → audio velocity.
- Unpatchify both; Euler step on video latent and on audio latent.
- (Optional) frame conditioning on video only (unchanged).
Decode: Video VAE decode → PPM frames (unchanged). Audio: denoised audio latent → waveform via a latent-to-waveform path (fake mel + overlap-add sinusoids) → 16-bit WAV (16 kHz). A full audio VAE decoder (safetensors) can be integrated later for higher-quality audio.

Code: patchify_audio / unpatchify_audio in ltx_dit.hpp; combined patch buffer, split, and dual Euler step in ltx-generate.cpp; write_wav() and latent_to_waveform() in ltx-generate.cpp. Design details: docs/AV_PIPELINE.md.

6. Source file reference

`ltx_common.hpp`

Shared include pulled by every other module.

Symbol	Description
`LTX_LOG / LTX_ERR / LTX_ABORT`	`fprintf(stderr,…)` logging macros
`LtxGgufModel`	Wrapper around `gguf_context` + `ggml_context`. Opened with `open(path)`, tensor lookup with `get_tensor(name)`, metadata with `kv_str/kv_i64/kv_u32/kv_f32`
`f32_data(t)`	Cast `ggml_tensor::data` to `float*`
`LtxRng`	`std::mt19937` + `std::normal_distribution<float>` seeded from `--seed`
`sigmoid / gelu`	Inline CPU helpers for small activations
`VideoBuffer`	`uint8_t` frame store `[F×H×W×3]`; `clamp_u8(float)` maps `[-1,1]→[0,255]`
`write_ppm / write_video_frames`	Binary P6 PPM output
`load_image(path)`	stb_image-backed loader; returns `VideoBuffer(0,0,0)` on failure
`resize_bilinear(src, …)`	In-place bilinear resize of uint8 RGB data

stb_image integration: STB_IMAGE_IMPLEMENTATION is defined once inside ltx_common.hpp. Only the decoders actually used are compiled in (STBI_ONLY_PNG, STBI_ONLY_JPEG, STBI_ONLY_BMP, STBI_ONLY_TGA, STBI_ONLY_PNM). Because ltx_common.hpp is included by exactly one translation unit (ltx-generate.cpp), there is no ODR violation.

`scheduler.hpp`

Implements the Rectified Flow Euler sampler.

RFScheduler(steps, shift, cfg)
  .timesteps()      → vector<float> of length steps+1, from 1.0 → 0.0
  ::euler_step()    → x_t += (t_next - t_cur) * v
  ::apply_cfg()     → v_out = v_uncond + scale * (v_cond - v_uncond)

The flow-shift rescales the linear schedule so that more steps are spent near t=0 (fine detail), which is important for the distilled LTX-Video model:

alpha = (steps - i) / steps        # linear 1→0
t     = alpha * shift / (1 + (shift-1) * alpha)

With shift=3.0 (default), the schedule is compressed toward t=0.

`t5_encoder.hpp`

Minimal T5-XXL encoder (encoder stack only — no decoder).

Symbol	Description
`T5Config`	`d_model=4096`, `num_heads=64`, `d_ff=10240`, `num_layers=24`, `vocab_size=32128` — read from GGUF KV at runtime
`T5Tokenizer`	Naïve whitespace + SentencePiece `▁`-prefix tokenizer loaded from `tokenizer.ggml.tokens` array in the GGUF; unk fallback is per-character
`T5Encoder::load()`	Reads weights named `encoder.block.{i}.layer.0.SelfAttention.{q,k,v,o}.weight` etc.
`T5Encoder::encode(ids)`	Builds a GGML graph: embedding lookup → N × (RMSNorm + self-attn + RMSNorm + SwiGLU FFN) → final RMSNorm. Returns `float[S × d_model]`
`T5Encoder::encode_text(str)`	Tokenises then calls `encode()`

Known limitation: the tokenizer is naive (whitespace-split + character fallback). Rare or multi-byte tokens may be mishandled. A proper SentencePiece unigram model should replace it for production use.

`video_vae.hpp`

`VaeDecoder`

Weights layout expected in the GGUF (prefix vae.decoder.*):

conv_in.weight / .bias — post-quant conv (latent_channels → mid_channels)
mid_block.resnets.{0,1}.* — two residual blocks
mid_block.attentions.0.* — self-attention (simplified)
up_blocks.{0..3}.resnets.{0,1}.* — four upsample stages
up_blocks.{b}.upsamplers.0.conv.* — spatial upsamplers
conv_norm_out.* / conv_out.* — final group-norm + output conv

decode(latents, T_lat, H_lat, W_lat) runs a simplified per-frame 2-D decode with nearest-neighbour temporal upsampling. Full causal 3-D conv decode is a planned improvement (see §11).

`VaeEncoder`

Added for I2V conditioning. Only vae.encoder.conv_in.weight/bias are currently loaded. When present, a 1×1 learned projection is used; otherwise a pseudo-encoding tiles normalised RGB across the 128 latent channels.

`ltx_dit.hpp`

The main diffusion transformer.

Config (read from GGUF KV):

Key	Default
`ltxv.hidden_size`	2048
`ltxv.num_hidden_layers`	28
`ltxv.num_attention_heads`	32
`ltxv.in_channels`	128

Tensor naming (primary, from Lightricks diffusers export):

model.diffusion_model.patchify_proj.{weight,bias}
model.diffusion_model.adaln_single.emb.timestep_embedder.linear_{1,2}.{weight,bias}
model.diffusion_model.adaln_single.linear.{weight,bias}
model.diffusion_model.caption_projection.{weight,bias}
model.diffusion_model.transformer_blocks.{i}.attn1.to_{q,k,v,out.0}.{weight,bias}
model.diffusion_model.transformer_blocks.{i}.attn2.to_{q,k,v,out.0}.{weight,bias}
model.diffusion_model.transformer_blocks.{i}.ff.net.{0.proj,2}.{weight,bias}
model.diffusion_model.proj_out.{weight,bias}
model.diffusion_model.norm_out.linear.{weight,bias}

Fallback names with prefix dit.* are also tried.

Audio (AV pipeline): patchify_audio(lat, T, C, F) and unpatchify_audio(tok, T, C, F) in the same header convert audio latent [T, 8, 16] ↔ [T, 128] tokens for concatenation with video tokens before the single DiT forward.

Forward pass (per call to LtxDiT::forward()):

Sinusoidal timestep embedding → MLP → hidden_size vector
AdaLN-single linear → 6 × hidden_size (scale/shift params; currently stored but not yet fully applied per-block — see §11)
Patchify projection: [N_tok, patch_dim] → [N_tok, hidden_size]
Caption projection: [S, 4096] → [S, hidden_size]
N × transformer blocks:
- Pre-norm (RMSNorm) + self-attention (multi-head, scaled dot-product)
- Pre-norm + cross-attention (latent queries, text keys/values)
- Pre-norm + SwiGLU FFN (gate×up → down)
Final RMSNorm + output projection → [N_tok, patch_dim]
GGML graph execution (ggml_graph_compute_with_ctx)

Note on scratch memory: each forward call allocates 1 GB of scratch via ggml_init. This is safe for a single call but not ideal for batching. A planned improvement is to pre-allocate a persistent scratch context.

`ltx-generate.cpp`

Orchestrates the full inference pipeline.

Args struct — all CLI parameters with defaults:

Field	Flag	Default
`dit_path`	`--dit`	required
`vae_path`	`--vae`	required
`t5_path`	`--t5`	required
`prompt`	`--prompt` / `-p`	`"A beautiful scenic landscape…"`
`negative_prompt`	`--neg` / `-n`	`""`
`frames`	`--frames`	25
`height`	`--height`	480
`width`	`--width`	704
`steps`	`--steps`	40
`cfg_scale`	`--cfg`	3.0
`shift`	`--shift`	3.0
`seed`	`--seed`	42
`out_prefix`	`--out`	`"output/frame"`
`start_frame_path`	`--start-frame`	`""` (disabled)
`end_frame_path`	`--end-frame`	`""` (disabled)
`frame_strength`	`--frame-strength`	1.0
`av`	`--av`	false (enable audio+video path)
`audio_vae_path`	`--audio-vae`	`""` (optional; for full decoder when implemented)
`out_wav`	`--out-wav`	`""` (default: `<out prefix>.wav` when `--av`)
`threads`	`--threads`	4
`verbose`	`-v`	false

Output: frames are written as {out_prefix}_{NNNN}.ppm. When --av, a WAV file is also written (default {out_prefix}.wav). The output directory is created automatically (including intermediate directories).

`ltx-quantize.cpp`

Standalone quantizer that reads a BF16/F32 GGUF and writes a new GGUF with all 2-D+ weight tensors quantised to the requested type.

Rules:

1-D tensors (biases, norms) → kept as F32
Embedding weights → kept as F32
Everything else → quantised to target_type

All GGUF KV metadata is copied verbatim. String arrays (e.g. the tokenizer vocabulary) are not currently copied — this is a known limitation (see §11).

Supported quant types: Q4_K_M, Q5_K_M, Q6_K, Q8_0, BF16, F32, F16.

`convert.py`

Python script that reads HuggingFace safetensors checkpoints and writes GGUF files that ltx.cpp can load.

Converter	`--model`	Input	Output arch
`convert_dit()`	`dit`	single `.safetensors`	`ltxv`
`convert_vae()`	`vae`	single `.safetensors`	`ltxv-vae`
`convert_t5()`	`t5`	directory of shards	`t5`

The DiT converter passes tensor names through unchanged from the safetensors file. The VAE converter prefixes all names with "vae." if not already present. The T5 converter remaps encoder.embed_tokens.weight → token_emb.weight and skips decoder tensors.

For T5, the HF tokenizer vocabulary can be embedded into the GGUF via --tokenizer <path>, which runs transformers.T5Tokenizer and writes tokenizer.ggml.tokens as a string array.

7. GGUF model format conventions

DiT GGUF

Architecture string: "ltxv"

KV key	Type	Description
`general.architecture`	string	`"ltxv"`
`ltxv.hidden_size`	uint32	transformer hidden dim
`ltxv.num_hidden_layers`	uint32	number of transformer blocks
`ltxv.num_attention_heads`	uint32	attention heads
`ltxv.in_channels`	uint32	VAE latent channels (128)
`ltxv.cross_attention_dim`	uint32	text encoder dim (4096)
`ltxv.patch_size`	uint32	spatial patch size (2)

VAE GGUF

Architecture string: "ltxv-vae"

KV key	Type	Description
`general.architecture`	string	`"ltxv-vae"`
`vae.latent_channels`	uint32	128
`vae.spatial_scale`	uint32	8 (8× spatial downsampling)
`vae.temporal_scale`	uint32	4 (4× temporal downsampling)

T5 GGUF

Architecture string: "t5"

KV key	Type	Description
`general.architecture`	string	`"t5"`
`t5.block_count`	uint32	encoder layers (24 for XXL)
`t5.embedding_length`	uint32	d_model (4096)
`t5.feed_forward_length`	uint32	d_ff (10240)
`t5.attention.head_count`	uint32	num_heads (64)
`t5.vocab_size`	uint32	32128
`tokenizer.ggml.tokens`	string[]	SentencePiece vocabulary

8. Image-to-video (I2V) design

The I2V implementation does not modify the DiT architecture. Instead it works by conditioning the latent directly at the boundary frames before and after each denoising step.

VaeEncoder

VaeEncoder::encode_frame(img_u8, H_pix, W_pix, H_lat, W_lat):

Bilinear resize the image to [H_lat, W_lat, 3] using resize_bilinear() (in ltx_common.hpp).
Normalise pixels uint8 [0,255] → float [-1,1]: norm = pixel / 127.5 - 1.0
Project 3 channels → C=128 latent channels:
- With encoder weights (vae.encoder.conv_in.weight in the GGUF):
  Apply the learned 1×1 convolution as a [C, 3] matrix multiply.
- Without encoder weights (pseudo-encoding):
  Assign each latent channel to one of the three colour channels (R/G/B, C/3 channels each), scaled by 3.0 to match typical latent statistics.

Frame-conditioning schedule

After every Euler denoising step the first and/or last temporal latent frames are blended toward the encoded reference:

blend = clamp(frame_strength * (1 - t_next), 0, 1)
lat[T=0]    = lat[T=0]    * (1 - blend) + start_lat * blend
lat[T=T-1]  = lat[T=T-1]  * (1 - blend) + end_lat   * blend

At the start of denoising (t=1), blend=0 — the reference is not imposed yet so the DiT can form global structure freely.
As denoising progresses toward t=0, blend increases linearly to frame_strength, pulling the frame latents toward the reference.

Hard-pinning at t=0

When frame_strength >= 1.0 (default), after all denoising steps finish the reference latent is copied verbatim into the output latent buffer:

memcpy(latents.data(), start_lat.data(), frame_lat_size * sizeof(float));

This guarantees the decoded output frame exactly matches the reference image appearance, regardless of any residual denoising drift.

9. Key algorithms and design decisions

Rectified Flow (RF) scheduling

LTX-Video was trained with Rectified Flow. The forward process is:

x_t = (1 - t) * x_0 + t * noise    t ∈ [0, 1]

The model predicts the velocity v = dx/dt = noise - x_0. The Euler ODE solver steps backward from t=1 to t=0:

x_{t-dt} = x_t + dt * v_θ(x_t, t)     (dt < 0)

Classifier-free guidance

With --cfg > 1.0, the DiT is called twice per step:

Once with the text embedding (v_cond)
Once with the empty-string embedding (v_uncond)

The guided velocity is:

v = v_uncond + cfg_scale * (v_cond - v_uncond)

The unconditional embedding is computed by encoding the --neg prompt (default: empty string).

Patchify / unpatchify

The DiT operates on tokens, not on the raw latent volume. The video latent [T_lat, H_lat, W_lat, C] is chunked into non-overlapping patches of size (pt=1, ph=2, pw=2) along the temporal, height, and width dimensions:

patch_dim = pt * ph * pw * C = 1 * 2 * 2 * 128 = 512  (or 128 for C=32)
N_tok     = (T_lat/pt) * (H_lat/ph) * (W_lat/pw)

patchify() and unpatchify() are helper functions in ltx_dit.hpp called from ltx-generate.cpp. For the audio-video path, patchify_audio() and unpatchify_audio() convert audio latent [T, 8, 16] to/from [T, 128] tokens; video and audio token sequences are concatenated before the DiT forward and split after. All are pure memory rearrangements with no arithmetic.

Latent dimension formulas

Video dimension	Latent dimension	Formula
`frames`	`T_lat`	`(frames − 1) / 4 + 1`
`height`	`H_lat`	`height / 8`
`width`	`W_lat`	`width / 8`
`T_vid` (decoded)	—	`(T_lat − 1) * 4 + 1`

The temporal scale is 4× and the spatial scale is 8×. These values are read from the VAE GGUF (vae.temporal_scale, vae.spatial_scale).

Tokenizer

The T5 tokenizer implements the SentencePiece unigram algorithm in pure C++ with no external library dependency. The vocabulary and optional log-probability scores are loaded from the GGUF metadata at model-load time:

GGUF key	Type	Description
`tokenizer.ggml.tokens`	string[]	id → piece (UTF-8, ▁-prefixed)
`tokenizer.ggml.scores`	float32[]	id → unigram log-probability (optional)

Preprocessing (T5Tokenizer::preprocess):

Collapse runs of whitespace to a single space; strip leading/trailing.
Prepend ▁ (U+2581) to the beginning; replace each remaining space with ▁.

Segmentation — two modes depending on whether scores are in the GGUF:

Mode	Condition	Algorithm
Viterbi	`tokenizer.ggml.scores` present	DP over byte positions; maximises sum of log-probs; O(n × max_piece_len)
Greedy	scores absent	Longest-match scan from left; O(n × max_piece_len)

In both modes an unk fallback advances one full UTF-8 character (not one byte) when no vocabulary piece covers the current position, preventing split multi-byte sequences from producing garbage tokens.

Scores are written by convert.py --tokenizer (via tok.sp_model.GetScore(i)) and preserved through quantization by ltx-quantize (via gguf_set_kv).

10. Backends (GPU: Metal, CUDA, Vulkan, ROCm)

We follow the same pattern as acestep.cpp: the build command determines the backend. One backend per build; no platform-specific divergence in code.

macOS: cmake .. defaults to Metal (Apple MPS). Use -DLTX_METAL=OFF for CPU-only.
Linux (NVIDIA): cmake .. -DLTX_CUDA=ON
Linux (AMD): cmake .. -DLTX_HIP=ON
Linux / Windows: cmake .. -DLTX_VULKAN=ON for Vulkan.

Currently inference uses the CPU path (ggml_graph_compute_with_ctx). To run on GPU when a backend is enabled, the next step is to use the GGML backend API in a backend-agnostic way: e.g. ggml_backend_sched_new with the chosen backend(s), ggml_gallocr_new(backend_buffer_type), ggml_gallocr_alloc_graph, then ggml_backend_graph_compute (or ggml_backend_sched_graph_compute). That keeps a single code path for all platforms (Metal, CUDA, Vulkan, HIP).

The main performance bottleneck is the DiT forward() call, which rebuilds a ggml_cgraph on every step. A future improvement is to build the graph once and re-use it across steps by parameterising the timestep embedding.

11. Known limitations and open tasks

These are the main areas where the implementation is deliberately simplified and where contributions are most welcome.

#	Area	Current state	What needs doing
1	VAE decoder	Per-frame 2-D decode + nearest-neighbour temporal upsampling	Implement full causal 3-D conv decode using `ggml_conv_1d / 2d`; wire temporal upsampling via transposed conv
2	VAE encoder	Only the first `conv_in` layer is used; pseudo-encoding fallback	Implement full encoder stack for accurate I2V latent inversion
3	AdaLN-single	Timestep embedding is computed but per-block scale/shift is not fully applied	Apply `ada_params` chunks as scale/shift in each block's norms
4	3-D RoPE	Positional embeddings are not yet applied	Add rotary embeddings along (t, h, w) axes to Q and K tensors
5	T5 tokenizer	~~Whitespace-split + per-char fallback~~ Fixed: full SentencePiece unigram Viterbi DP (when scores in GGUF) or greedy longest-match	—
6	`ltx-quantize` metadata	~~String arrays (tokenizer vocab) are skipped during quantization~~ Fixed: `gguf_set_kv` copies all KV pairs including arrays	—
7	Persistent scratch	~~DiT allocates 1 GB of ggml scratch per forward call~~ Fixed: single buffer (90% RAM) reused each forward (gpt-2 style)	—
8	Batch size > 1	Only batch=1 is implemented	Add batch dimension to enable parallel generation
9	CFG single-pass	CFG requires two full forward passes	Implement single-pass CFG by duplicating the batch
10	Threading	`--threads` is parsed but not passed to `ggml_graph_compute_with_ctx`	Wire the thread count through to `ggml_graph_compute_with_ctx(ctx, gf, n_threads)`
11	Output formats	Only binary PPM (P6) output	Add JPEG/PNG output via stb_image_write or a similar library
12	Windows `_mkdir`	Only one level of directory is created on Windows	Implement recursive mkdir for Windows
13	Audio VAE decoder	With `--av`, audio is synthesized from the denoised latent via a fallback (fake mel + overlap-add); no full audio VAE decode yet	Load `ltx-2.3-22b-dev_audio_vae.safetensors` and implement 2D conv decoder (see docs/AV_PIPELINE.md)

12. Coding conventions

Language: C++17 throughout; no exceptions (use return codes).
Headers only: all modules live in src/*.hpp. Only the two main() translation units are .cpp files.
No STL containers with new/delete: use std::vector<float> for all large buffers; GGML tensors are owned by the ggml_context they were created in.
Logging: use LTX_LOG(fmt, …) for info and LTX_ERR(fmt, …) for errors. Both write to stderr. Progress during the denoising loop uses \r overwrite for a clean single-line display.
Error handling: functions return bool or an empty/zero-frames VideoBuffer on failure. LTX_ABORT for truly unrecoverable conditions.
Naming: snake_case for variables and functions; PascalCase for structs; UPPER_CASE for macros.
Comments: section headers use the // ── … ─── style; inline comments explain why, not what.
Third-party code: vendored in src/ (currently only stb_image.h). Keep separate from project code; suppress vendor warnings at the CMake level, not inside the header.
No #pragma once in .cpp files: only in *.hpp.

13. Testing

There is no formal test suite yet. Validation is currently done by:

Build smoke test — cmake --build . -j$(nproc) must produce zero errors and zero warnings (except those from vendored third-party headers).
Argument parsing — run ./build/ltx-generate --help and verify the usage text is correct.
Image loading — write a short C++ snippet that calls load_image() with a PNG, a JPEG, a PPM, and a missing file, and assert the results.
End-to-end generation — run ltx-generate with real model files and check that the output PPM frames are non-zero and have the expected dimensions.

Planned: a tests/ directory with:

Unit tests for RFScheduler::timesteps() (known values)
Unit tests for patchify / unpatchify round-trip
Unit tests for resize_bilinear
An integration test that runs ltx-generate with tiny synthetic GGUF stubs

14. Contributing

Fork the repository and create a branch from main.
Read §11 to find where help is most needed.
Keep PRs focused — one feature or fix per PR.
Match the style described in §12.
Document any new CLI flag in both print_usage() (in ltx-generate.cpp) and README.md.
Update this file (DEV.md) if you add a new module, change the GGUF schema, or significantly alter the data flow (e.g. AV pipeline in §5).
No model weights should ever be committed to the repo.

For questions, open a GitHub Discussion or issue in the audiohacking/ltx.cpp repository.

FilesExpand file tree

DEV.md

Latest commit

History

DEV.md

File metadata and controls

ltx.cpp — Developer Guide

Table of Contents

1. Project overview

2. Repository layout

3. Getting started

Prerequisites

Clone & initialise submodules

Build configurations

Obtaining model files

4. End-to-end data flow

Text-to-video (T2V)

Image-to-video (I2V) additions

5. Audio-video (AV) pipeline

6. Source file reference

ltx_common.hpp

scheduler.hpp

t5_encoder.hpp

video_vae.hpp

VaeDecoder

VaeEncoder

ltx_dit.hpp

ltx-generate.cpp

ltx-quantize.cpp

convert.py

7. GGUF model format conventions

DiT GGUF

VAE GGUF

T5 GGUF

8. Image-to-video (I2V) design

VaeEncoder

Frame-conditioning schedule

Hard-pinning at t=0

9. Key algorithms and design decisions

Rectified Flow (RF) scheduling

Classifier-free guidance

Patchify / unpatchify

Latent dimension formulas

Tokenizer

10. Backends (GPU: Metal, CUDA, Vulkan, ROCm)

11. Known limitations and open tasks

12. Coding conventions

13. Testing

14. Contributing

`ltx_common.hpp`

`scheduler.hpp`

`t5_encoder.hpp`

`video_vae.hpp`

`VaeDecoder`

`VaeEncoder`

`ltx_dit.hpp`

`ltx-generate.cpp`

`ltx-quantize.cpp`

`convert.py`