Skip to content

[Misc] Enable dataclasses.dataclass with more than 31 ndarrays on Metal#392

Merged
hughperkins merged 16 commits intomainfrom
hp/metal-31-arg-workaround
Mar 4, 2026
Merged

[Misc] Enable dataclasses.dataclass with more than 31 ndarrays on Metal#392
hughperkins merged 16 commits intomainfrom
hp/metal-31-arg-workaround

Conversation

@hughperkins
Copy link
Copy Markdown
Collaborator

Issue: #

Brief Summary

copilot:summary

Walkthrough

copilot:walkthrough

Use @test_utils.test(arch=qd.metal) instead of @pytest.mark.parametrize,
remove manual qd.init() calls and __main__ block per project test guidelines.

Made-with: Cursor
When ndarrays are accessed via physical GPU addresses (to bypass Metal's
31 kernel parameter limit), the actual data buffers are not bound to the
command encoder. Metal therefore cannot track read/write dependencies
between consecutive kernel dispatches for these buffers.

Fix by calling [encoder useResource:usage:] for each physically-addressed
buffer before dispatch. Also add gpuAddress availability guard for
macOS 13.0+.

Made-with: Cursor
Ensures pending Metal command buffers are flushed before exposing
ndarray GPU memory via DLPack, preventing stale reads.

Made-with: Cursor
Made-with: Cursor
The sync() call in to_dlpack() is only needed on Metal where GPU
commands must be explicitly flushed before DLPack access. Cache the
Arch.metal enum at module level for fast lookup.

Made-with: Cursor

uint64_t MetalDevice::get_memory_physical_pointer(DeviceAllocation handle) {
const MetalMemory &memory = get_memory(handle.alloc_id);
if (__builtin_available(macOS 13.0, iOS 16.0, *)) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WTF is this syntax !?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure. By the way, my preferred aproach would be to remove this, and stop supporting macos < 13. Thoughts?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes please go ahead. It is not supported by Genesis either.

When physical storage buffer is enabled (Metal), load_buffer and
store_buffer were skipping the u1→u8/i8 conversion, causing SPIRV-Cross
to emit 2-byte (short) accesses for booleans instead of 1-byte. This
produced wrong strides and corrupted data for boolean arrays.

Reorder the conditionals so u1 handling takes priority over the u64
physical pointer path, matching the behavior of the non-physical path.

Made-with: Cursor
The Metal shader compiler on Apple Silicon miscompiles stores through
physical GPU pointers when the byte offset involves a runtime stride
multiply and the stored value is loop-invariant (only the first row of a
2D+ array gets written).

Change the SPIRV codegen to emit OpConvertUToPtr on the base address once
and OpPtrAccessChain for element indexing, instead of per-element
OpConvertUToPtr from ulong arithmetic.  SPIRV-Cross translates this to
base[index] rather than per-element reinterpret_cast, which the Metal
compiler handles correctly.

The struct-wrapper fallback is preserved for code paths without decomposed
pointer components (e.g. atomics).

Includes regression tests covering const stores, row-dependent stores,
various 2D/3D shapes, and multi-argument kernels.

Made-with: Cursor
Covers atomic_add (i32, f32), atomic_sub, atomic_min, atomic_max,
and a histogram pattern exercising concurrent writes from multiple
threads to the same ndarray through physical pointers.

Made-with: Cursor
Two-part fix for a miscompilation where SPIRV-Cross inlines physical
storage buffer OpLoad results as pointer dereference expressions at each
use site, causing re-reads from modified memory.

1. spirv_ir_builder.cpp: Emit MemoryAccessVolatileMask on all OpLoad
   instructions from physical storage buffer pointers.

2. SPIRV-Cross (spirv_glsl.cpp): Respect MemoryAccessVolatileMask on
   OpLoad by disabling expression forwarding, forcing a local temporary.

This fixes Genesis broadphase insertion sort corruption where the sort
key was re-read after array elements were shifted in place.

Made-with: Cursor
Update the SPIRV-Cross submodule to use the Genesis-Embodied-AI fork,
which includes a fix to respect MemoryAccessVolatileMask on OpLoad
(preventing expression forwarding for volatile physical pointer loads)
and regression tests for it.
@hughperkins hughperkins merged commit eff73c3 into main Mar 4, 2026
97 of 123 checks passed
@hughperkins hughperkins deleted the hp/metal-31-arg-workaround branch March 4, 2026 02:22
@hughperkins hughperkins changed the title [Misc] Enable dataclasses.dataclass with more than 31 ndarrays [Misc] Enable dataclasses.dataclass with more than 31 ndarrays on Metal Mar 4, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants