SpawnDev.ILGPU extends ILGPU with three browser GPU backends. It transpiles .NET IL into GPU shader languages (WGSL, GLSL, Wasm binary) at runtime.
```shell
dotnet build SpawnDev.ILGPU/SpawnDev.ILGPU.csproj   # Main library (~2s)
dotnet build SpawnDev.ILGPU.slnx                    # Full solution
dotnet run --project SpawnDev.ILGPU.DemoConsole     # Desktop tests (CUDA, OpenCL, CPU)
dotnet run --project SpawnDev.ILGPU.Demo            # Browser tests (Blazor WASM → /tests)
```

Target: net10.0. PublishTrimmed and RunAOTCompilation must remain false — ILGPU relies on IL reflection at runtime.
Detailed constraints live in each directory's own CLAUDE.md. Read the relevant one when working in that area.
| Directory | What | Context File |
|---|---|---|
| `SpawnDev.ILGPU/WebGPU/` | WGSL transpiler, dispatch, buffers | `WebGPU/CLAUDE.md` |
| `SpawnDev.ILGPU/Wasm/` | Wasm binary compiler, worker dispatch | `Wasm/CLAUDE.md` |
| `SpawnDev.ILGPU/WebGL/` | GLSL transpiler, Transform Feedback | `WebGL/CLAUDE.md` |
| `ILGPU/` | Forked ILGPU core (IR, types, runtime) | `ILGPU/CLAUDE.md` |
| `ILGPU.Algorithms/` | Forked algorithms (Scan, RadixSort) | `ILGPU.Algorithms/CLAUDE.md` |
| `SpawnDev.ILGPU.P2P/` | Distributed GPU compute via WebRTC | `SpawnDev.ILGPU.P2P/CLAUDE.md` |
| `PlaywrightMultiTest/` | Unified test runner | `PlaywrightMultiTest/CLAUDE.md` |
| `.claude/skills/ilgpu_transpiler/` | Hard-won transpiler mapping rules | `SKILL.md` |
| Backend | Target | Shader Language | Key Constraint |
|---|---|---|---|
| WebGPU | Browser | WGSL | 4-byte alignment, uniformity analysis |
| WebGL | Browser | GLSL ES 3.0 | No shared memory/atomics/barriers |
| Wasm | Browser | WebAssembly binary | SharedArrayBuffer + multi-worker dispatch |
| CUDA | Desktop | PTX | Via upstream ILGPU |
| OpenCL | Desktop | OpenCL C | Via upstream ILGPU |
| CPU | Desktop | .NET | Via upstream ILGPU |
Tests in SpawnDev.ILGPU.Demo.Shared/UnitTests/BackendTestBase*.cs (~211 tests, Tests1-10). Backend-specific classes inherit and override unsupported tests. See PlaywrightMultiTest/CLAUDE.md for running tests.
Current version: 4.9.2-rc.10 (April 2026, locally published at D:\users\SpawnDevPackages; nuget.org tops out at 4.9.2-rc.9).

- rc.10 headline: LocalMemory(N>=32) WGSL codegen 5-layer fix (unblocks Tuvok's Vp9Idct8x8Kernel plus the 16x16/32x32/iADST/iHT kernel family on WebGPU, bit-exact), AcceleratorRequirements capability-gating API + extension methods, and the UnsupportedKernelFeatureException typed exception wired at the WebGL GenericAtomic + AtomicCAS codegen sites.
- Regression test LocalMemoryRepro_Int64_ShortByteViews locks down both the WebGPU fix and the WebGL architectural varying-count ceiling. rc.10 docs landed in README "What's New in 4.9.2."
- Current P2P sibling: 4.9.2-rc.22 (WebTorrent 3.1.4, full 6-gap audit closed, binary wire framing for BufferSend/BufferData via P2PBinaryFrame + ConfigureHighThroughputSctp helper; a 10MB WebRTC dispatch passes in 25s).
- Pending at HEAD (historical context; shipped in rc.7+): f16 emulation Phases 1-3. Capabilities.Float16 is always true across WebGPU/WebGL/Wasm/OpenCL; Capabilities.Float16Native distinguishes native vs. emulated on backends where both paths exist (WebGPU, OpenCL); WebGPUBackend.ForceEmulatedF16 test flag.
- Emulation paths: WebGPU WGSL _f16_to_f32/_f32_to_f16 helpers + packed u16 storage; WebGL GLSL helpers + Transform Feedback uint output; OpenCL vload_half/vstore_half built-ins (no extension required) + f32 arithmetic.
- Full hardwareConcurrency multi-worker barrier dispatch with wait/notify barriers (memory.atomic.wait32/notify with a spurious-wakeup defense loop) and an in-Wasm phase dispatcher (no JS-Wasm boundary crossings between phases).
- All large sort tests (260K-4M) passing, including SpawnSceneSimulation (1.4M elements, multi-frame).
rc.7 key fixes:

- WGSL spinlock-key refactor (tuple-keyed array<atomic<u32>> for f64 Min/Max/Exchange)
- Wasm cascade-safe Dispose (per-worker TCS fault on dispose)
- wait/notify barrier with wakeup loop (replaced pure spin after diagnosing a spurious-wakeup bug; not a V8 bug)
- Shared memory alloca overlap (same-size dedup), IR address space aliasing (InferAddressSpaces guards), struct/scratch overlap
- Multi-pass scan, Float16, unsigned ops, 256 threads, memory.grow(), ViewSourceSequencer, subViewByteOffset
- Atomic RMW opcode table, CopyFromBuffer, onesComplementMask .tt template, per-worker scratch
- atomic.fence at 3 sync points, float atomic stores, broadcast atomic store/load, barrier counter zeroing between groups
Every kernel compilation auto-dumps generated code to a local folder via ShaderDebugService (registered in the demo's Program.cs). Use this — do NOT ask TJ to manually run tests or capture output.
- Run the demo, go to `/tests`
- Click "Set Debug Folder" → pick a local folder (e.g., `_debugdump`)
- Folder persists in IndexedDB across sessions — set once, works forever
```
debugfolder/
├── _DEBUG_README.md
├── latest.json                        ← live test results (updated each test)
├── test-run-YYYY-MM-DD_HH-mm-ss.json  ← permanent test run history
├── wgsl/                              ← WebGPU shaders with metadata headers
│   └── NNN_KernelName.wgsl
├── glsl/                              ← WebGL shaders with metadata headers
│   └── NNN_KernelName.glsl
└── wasm/                              ← Wasm binaries + compilation info
    ├── NNN_KernelName.wasm            ← disassemble: wasm2wat --enable-threads
    └── NNN_KernelName.txt             ← params, locals, barriers, shared mem size
```
- Find a kernel: grep the `.txt` files for `hasBarriers=True`, `helpers=1`, etc.
- Disassemble Wasm: `wasm2wat --enable-threads NNN_kernel.wasm > kernel.wat`
- Read WGSL/GLSL: files include metadata headers (kernel name, workgroup size, shared mem, bindings, timestamp)
- Track test results: `latest.json` updates after every test. Compare `test-run-*.json` across runs.
- The files are on disk. Do NOT ask TJ to capture output or run tests manually. Read the dump folder.
UnitTestsView writes results to the same debug folder via the ResultsDirectory parameter. latest.json is overwritten after EVERY test completion — it contains the full test suite state in real-time: pass/fail/skip/pending counts and per-test details (class, method, result, error, duration, stack trace). A timestamped test-run-*.json is written when the full run finishes.
During test runs, read latest.json to see results as they happen. Don't wait for the run to finish. Parse it with node -e to find failures:
```shell
node -e "const d=JSON.parse(require('fs').readFileSync('path/to/latest.json','utf8')); console.log('Pass:',d.passed,'Fail:',d.failed,'Skip:',d.skipped,'Pending:',d.pending); d.tests.filter(t=>t.result==='Error').forEach(t=>console.log('FAIL:',t.className+'.'+t.method,'-',(t.error||'?').substring(0,200)));"
```

- Bugs found here are HIGHEST PRIORITY. SpawnDev.ILGPU is the foundation for SpawnDev.ILGPU.ML, SpawnScene, and every project that uses GPU compute. A bug here is a bug in everything. When a consuming project discovers a SpawnDev.ILGPU bug, stop all other work and fix it here first — with unit tests. No workarounds in consumers. No "fix it later." Treat every release as the final release.
- Correctness is non-negotiable. Performance is a close second. Kernels dispatch thousands of times/sec.
- No workarounds that mask problems. Fix root causes.
- Cross-backend impact — changes to `ILGPU/` affect all 6 backends. Consider all of them.
- No quick fixes — plan before implementing complex changes.
- Do not hardcode evolving hardware limits — preserve full i64 index paths.
These apply everywhere, not just one directory:
- No backend-specific kernel variants. NEVER create backend-specific copies of algorithm kernels (e.g., `WasmRadixSortKernel1`) to work around bugs. The same kernel must work on all 6 backends. Fix bugs in the codegen, dispatch, or memory management — not by duplicating the algorithm. Only acceptable if it is absolutely IMPOSSIBLE to fix any other way.
- Blazor WASM is single-threaded — all async, no blocking calls
- T4 Templates in `ILGPU/` — check for `.tt` before editing `.cs`. Generated files are silently overwritten.
- Device loss detection — WebGPU: `device.lost` promise. WebGL: `webglcontextlost` event. Guards on dispatch/synchronize. Intentional disposal filtered out.
Kernels that use features some backends can't implement (atomics on WebGL, native f64 on WebGPU) will silently produce wrong output if they land on the wrong backend. Declare requirements up-front and the selection path filters out incapable backends:
```csharp
using SpawnDev.ILGPU;

using var acc = context.CreatePreferredAccelerator(
    new AcceleratorRequirements
    {
        RequiresAtomics = true,       // rules out WebGL
        RequiresFloat64Native = true, // rules out WebGPU + WebGL
    });
// -> on desktop: CUDA > OpenCL > CPU
// -> on browser with the above combo: only Wasm survives
// -> throws NotSupportedException naming the requirements when nothing matches
```

Other entry points: `context.EnumerateCompatibleDevices(requirements)` for ranking your own pick, `device.Satisfies(requirements)` for per-device checks.
All flags (mirror the 6-backend feature matrix below): RequiresAtomics, RequiresSharedMemory, RequiresBarriers, RequiresFloat16, RequiresFloat16Native, RequiresFloat64, RequiresFloat64Native, RequiresInt64, RequiresInt64Native, RequiresInt64Atomics, RequiresSubGroups. AcceleratorRequirements.None = no filter (accepts every backend).
Use this INSTEAD of hand-rolling `if (backend == WebGL) skip;` in consuming projects. The logic belongs in one place; consumers declare intent, not backend knowledge.
A follow-up pass will add UnsupportedKernelFeatureException thrown at kernel compile time (not just selection time) for the "consumer pinned to WebGL anyway" case. Not shipped yet.
maxStorageBuffersPerShaderStage = 10 (Chrome). WebGPU spec minimum is 8. Every ArrayView kernel parameter uses one storage buffer binding. Scalar parameters (int, float, etc.) are packed into a single _scalar_params buffer.
Total bindings = (number of ArrayView params) + 1 (_scalar_params) + (any struct params)
If total > 10: InvalidOperationException at dispatch time (v4.9.1+). Before v4.9.1, this silently produced "Invalid BindGroupLayout due to a previous error."
How to stay under the limit:
- Combine related ArrayViews using struct packing (e.g., `ArrayView<MyStruct>` with multiple fields instead of separate arrays)
- Maximum safe ArrayView count: 9 (leaves room for _scalar_params)
- Check `accelerator.MaxStorageBufferBindings` at runtime
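The binding formula above can be written down directly. A tiny sketch — the helper names are illustrative (the library itself throws at dispatch time), and the default limit of 10 is the Chrome figure quoted above:

```javascript
// Storage-buffer binding budget:
// total = ArrayView params + 1 (_scalar_params) + struct params.
function storageBindingsNeeded(arrayViewParams, structParams = 0) {
  return arrayViewParams + 1 + structParams;
}

// True when the kernel signature fits within the per-stage limit.
function fitsBindingBudget(arrayViewParams, structParams = 0, limit = 10) {
  return storageBindingsNeeded(arrayViewParams, structParams) <= limit;
}
```

For example, 9 ArrayViews plus the implicit `_scalar_params` buffer is exactly 10 bindings — the last signature that fits on Chrome.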
ArrayView<byte>, ArrayView<sbyte>, ArrayView<short>, ArrayView<ushort>, ArrayView<Half> (ILGPU.Half) supported on all 6 backends.
Use ILGPU.Half, NOT System.Half in kernel signatures. Implicit conversion operators exist for interop.
Per-backend implementation:
- WebGPU: Packed into `array<atomic<u32>>`. Load via atomicLoad + shift + mask. Store via atomicAnd + atomicOr (thread-safe sub-word writes). Float16 load/store calls `_f16_to_f32`/`_f32_to_f16` helpers from `WGSLEmulationLibrary.F16Functions` when `!shader-f16`; native WGSL `f16` type otherwise. `WebGPUBackend.ForceEmulatedF16` test flag forces the emulation path for verification.
- Wasm: Native `i32.load8_s/u`, `i32.load16_s/u`, `i32.store8`, `i32.store16`. Float16 emulated via inline IEEE 754 bit conversion at load/store.
- WebGL: texelFetch from R32I with shift+mask in GLSL. Float16 load/store calls `_f16_to_f32`/`_f32_to_f16` helpers from `GLSLEmulationLibrary.F16Functions`; capability reports true (always emulated on WebGL).
- OpenCL: `Capabilities.Float16` always true. When `cl_khr_fp16` present: native `half` type. When absent: Float16 promoted to `float` for arithmetic, with `vload_half`/`vstore_half` built-ins (available without the extension) handling buffer load/store. `Capabilities.Float16Native` selects the path.
- CUDA/CPU: Native support.
Gotchas:
- WGSL requires explicit parenthesization for mixed-precedence shift/mask expressions
- WebGPU sub-word stores use atomic RMW (data race if non-atomic when threads write different halves of same u32)
- `arrayLength()` on sub-word buffers returns u32 count; multiply by elements-per-word for actual element count
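The core bit logic of the Float16 emulation can be sketched in plain JS. The shipped helpers are WGSL/GLSL in the emulation libraries named above; this is an illustrative IEEE 754 binary16 ↔ binary32 conversion with simplified rounding (round-half-up, not ties-to-even), not the library's code:

```javascript
// Scratch buffer to reinterpret f32 bits.
const scratch = new ArrayBuffer(4);
const f32View = new Float32Array(scratch);
const u32View = new Uint32Array(scratch);

// binary16 bits -> JS number (what _f16_to_f32 computes on the GPU side).
function f16ToF32(h) {
  const s = (h & 0x8000) ? -1 : 1;
  const e = (h >> 10) & 0x1F;
  const m = h & 0x3FF;
  if (e === 0) return s * Math.pow(2, -14) * (m / 1024);   // zero / subnormal
  if (e === 31) return m ? NaN : s * Infinity;             // Inf / NaN
  return s * Math.pow(2, e - 15) * (1 + m / 1024);         // normal
}

// JS number -> binary16 bits (what _f32_to_f16 computes), simplified rounding.
function f32ToF16(v) {
  f32View[0] = v;
  const x = u32View[0];
  const sign = (x >>> 16) & 0x8000;
  const exp = (x >>> 23) & 0xFF;
  let mant = x & 0x7FFFFF;
  if (exp === 0xFF) return sign | 0x7C00 | (mant ? 0x200 : 0); // Inf / NaN
  const e = exp - 127 + 15;                                    // re-bias 127 -> 15
  if (e >= 31) return sign | 0x7C00;                           // overflow -> Inf
  if (e <= 0) {                                                // subnormal / underflow
    if (e < -10) return sign;
    mant = (mant | 0x800000) >> (1 - e);
    return sign | ((mant + 0x1000) >> 13);
  }
  return sign + (e << 10) + ((mant + 0x1000) >> 13);           // carry may bump exponent
}
```

Every value exactly representable in binary16 round-trips losslessly through this pair, which is why the emulated path can be bit-exact with native f16 for storage.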
Zero-copy JS TypedArray/ArrayBuffer to GPU buffer transfer. Available on all 3 browser backends.
```csharp
var jsArray = new Int16Array(data);
((IBrowserMemoryBuffer)buffer).CopyFromJS(jsArray);
// or
((IBrowserMemoryBuffer)buffer).CopyFromJS(arrayBuffer);
```

Backend notes:
- WebGPU: Uses `queue.WriteBuffer` directly
- WebGL: Copies to backing array, sets `NeedsUpload = true` (data uploaded on next dispatch, NOT immediately on GPU)
- Wasm: Pure JS-to-JS copy within SharedArrayBuffer
- `CopyFromHost(sourceArray)`: source.Length must be <= buffer.Length - targetOffset. Throws if too large. Partial fills allowed.
- Buffer sizes are padded to 4-byte alignment at creation (WebGPU requirement)
- Use `EnsureBuffer` pattern for grow-only reallocation (avoid Dispose+Allocate churn)
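The sizing rules above (4-byte padding plus grow-only reuse) can be sketched as pure decisions. `padTo4` and `ensureBuffer` are illustrative stand-ins, not the library API — a real implementation would dispose the old GPU buffer and allocate a new one where this returns a plain object:

```javascript
// WebGPU pads buffer sizes up to the next multiple of 4 bytes.
function padTo4(bytes) {
  return (bytes + 3) & ~3;
}

// Grow-only policy: reuse the current buffer when it is big enough,
// otherwise "allocate" a larger one. Never shrink.
function ensureBuffer(current, neededBytes) {
  const needed = padTo4(neededBytes);
  if (current && current.byteLength >= needed) return current;
  return { byteLength: needed };
}
```

The point of the pattern is that repeated dispatches with fluctuating sizes hit the reuse branch almost every time, avoiding Dispose+Allocate churn.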
Captured scalar values (int, float, etc.) are automatically passed to GPU. ArrayViews CANNOT be captured - they must be explicit kernel parameters.
```csharp
int multiplier = 5;
var kernel = accelerator.LoadAutoGroupedStreamKernel<Index1D, ArrayView<float>>(
    (index, buf) => { buf[index] = index * multiplier; });
```

| Feature | WebGPU | WebGL | Wasm | CUDA | OpenCL | CPU |
|---|---|---|---|---|---|---|
| Shared Memory | Yes | No | Yes | Yes | Yes | Yes |
| Barriers | Yes | No | Yes | Yes | Yes | Yes |
| Atomics | Yes | No | Yes | Yes | Yes | Yes |
| Sub-word types | Yes | Yes | Yes | Yes | Yes | Yes |
| CopyFromJS | Yes | Yes | Yes | N/A | N/A | N/A |
| ILGPU Algorithms | Yes | No | Yes | Yes | Yes | Yes |
| Subgroups | Yes* | No | No | Yes | Yes* | N/A |
| f64 native | No (emulated) | No (emulated) | Yes | Yes | Yes | Yes |
| i64 native | No (emulated) | No (emulated) | Yes | Yes | Yes | Yes |
| f16 native | Native or emulated** | No (emulated)*** | No (emulated) | Yes | Native or emulated**** | Yes |
*Subgroups: WebGPU requires browser support + adapter feature. OpenCL: device-dependent.
**WebGPU f16: native WGSL f16 when the adapter exposes shader-f16, otherwise emulated in WGSL via _f16_to_f32 / _f32_to_f16 helpers with f32 arithmetic + packed u16 storage. Capabilities.Float16 always true; Capabilities.Float16Native distinguishes. Emulation is lossless.
***WebGL f16: emulated via _f16_to_f32 / _f32_to_f16 GLSL helpers. Load through texelFetch on R32I + bit-extract, store through Transform Feedback uint. Algorithm-family Half tests (RadixSort/Scan/Reduce) continue to skip (WebGL has no shared memory/barriers); the 5 non-algorithm Half tests run. Capabilities.Float16Native always false on WebGL.
****OpenCL f16: native half type when cl_khr_fp16 is available; emulated via vload_half / vstore_half (OpenCL built-ins that work without the extension) + f32 arithmetic otherwise. Capabilities.Float16 always true on OpenCL; Capabilities.Float16Native reflects the cl_khr_fp16 extension and selects the codegen path.