Added MLP test for baseline and VEGETA engine. #316
base: veg
Conversation
Pull request overview
Adds new regression tests for a 4-layer MLP, including a baseline implementation and a VEGETA (tile GEMM) implementation with optional sparse modes.
Changes:
- Added `tests/regression/mlp` host+kernel for baseline MLP inference with CPU reference verification
- Added `tests/regression/mlp_vegeta` host+kernel for batched VEGETA MLP inference, including (intended) 2:4 and 1:4 weight compression paths
- Added per-test `common.h` and `Makefile` scaffolding for both tests
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| tests/regression/mlp/main.cpp | Baseline MLP host-side setup, device execution, CPU reference, and verification |
| tests/regression/mlp/kernel.cpp | Baseline MLP device kernel (FC layers + softmax) |
| tests/regression/mlp/common.h | Baseline MLP shared constants and argument structs |
| tests/regression/mlp/Makefile | Build rules for baseline MLP regression |
| tests/regression/mlp_vegeta/main.cpp | VEGETA MLP host-side setup, (intended) sparsity compression, device execution, CPU reference |
| tests/regression/mlp_vegeta/kernel.cpp | VEGETA tile-GEMM based MLP kernel with TGEMM/UGEMM/VGEMM paths |
| tests/regression/mlp_vegeta/common.h | VEGETA MLP shared constants (tile sizes, modes) and argument structs |
| tests/regression/mlp_vegeta/Makefile | Build rules for VEGETA MLP regression |
```cpp
// Layer configuration structure
typedef struct {
    uint32_t input_dim;
    uint32_t output_dim;
    uint64_t weights_addr;  // Weight matrix: output_dim x input_dim
    uint64_t bias_addr;     // Bias vector: output_dim
} layer_config_t;
```
Copilot AI commented on Feb 4, 2026:
This header uses uint32_t/uint64_t but doesn't include <stdint.h>/<cstdint>. Relying on transitive includes makes builds fragile; include the standard integer header directly from common.h.
tests/regression/mlp_vegeta/main.cpp (outdated)
```cpp
// Store metadata (simplified - in real impl would be tile-based)
int tile_in = i / TILE_SIZE;
int tile_out = j / TILE_SIZE;
int meta_offset = (tile_in * out_tiles + tile_out) * M_TILE_BYTES;
int row_in_tile = (i % TILE_SIZE) / 4;
int col_in_tile = j % TILE_SIZE;
int byte_offset = meta_offset + row_in_tile * TILE_SIZE + col_in_tile;
if (byte_offset < (int)metadata.size()) {
    metadata[byte_offset] = mask;
}
```
Copilot AI commented on Feb 4, 2026:
The metadata layout written here (row_in_tile * TILE_SIZE + col_in_tile) does not match the 128-byte 16×16 nibble format used by other VEGETA sparse tests (8 bytes per row, 2 nibbles per byte; see tests/regression/sgemm_tile/main.cpp:124-132). As written, only 64 bytes/tile can be addressed (row_in_tile is 0..3) and no nibble packing is performed, so the device will interpret masks incorrectly.
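For reference, the 128-byte layout described in that comment works out to 16 rows of 8 bytes each, with two 4-bit masks packed per byte. A small host-side sketch of that addressing (helper names are illustrative, not from the PR):

```cpp
#include <cassert>

// Hypothetical helpers: locate the 4-bit mask for (row, col) within one
// 16x16 tile's metadata, assuming 8 bytes per row and two nibbles
// (a column pair) per byte -> 16 * 8 = 128 bytes per tile.
inline int meta_byte_offset(int row_in_tile, int col_in_tile) {
    return row_in_tile * 8 + col_in_tile / 2;
}

inline bool meta_high_nibble(int col_in_tile) {
    return (col_in_tile & 1) != 0;  // odd columns use the high nibble
}
```

Under this scheme the largest reachable offset is 127, i.e. the full 128 bytes are addressable, unlike the 64-byte range reachable with `row_in_tile` in 0..3.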
```cpp
for (uint32_t row = 0; row < BATCH_SIZE; ++row) {
    for (uint32_t col = 0; col < TILE_SIZE; ++col) {
        output[row * out_dim + out_t * TILE_SIZE + col] =
```
Copilot AI commented on Feb 4, 2026:
These scalar stores write g_C_tile into output in plain row-major order. However, subsequent layers pass layer_output back into VEGETA GEMM paths where tiles are loaded via vx_lt, which expects activations laid out as contiguous 16×16 tiles. Without a repack step (or storing the result in tile-major layout), later layers will load incorrect activation tiles.
Suggested change:
```diff
-            output[row * out_dim + out_t * TILE_SIZE + col] =
+            uint32_t tile_index = out_t * TILE_SIZE * TILE_SIZE + row * TILE_SIZE + col;
+            output[tile_index] =
```
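Assuming `TILE_SIZE` is 16 as in the other VEGETA tests, the suggested tile-major index gives each output tile a contiguous 256-element block, with rows and columns stored row-major inside it. A quick sketch of the mapping (function name is illustrative):

```cpp
#include <cassert>
#include <cstdint>

// Tile-major index: tile out_t occupies one contiguous TILE_SIZE*TILE_SIZE
// run; (row, col) is row-major within the tile. TILE_SIZE = 16 is an
// assumption matching the other VEGETA regression tests.
constexpr uint32_t TILE_SIZE = 16;

inline uint32_t tile_major_index(uint32_t out_t, uint32_t row, uint32_t col) {
    return out_t * TILE_SIZE * TILE_SIZE + row * TILE_SIZE + col;
}
```

With this layout a subsequent `vx_lt`-style contiguous 1KB load of tile `out_t` starts at element `out_t * 256` and reads exactly one 16×16 tile.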
tests/regression/mlp/kernel.cpp (outdated)
```cpp
// Thread function: compute exp in parallel, thread 0 sums
void softmax_exp_sum_thread(softmax_args_t* __UNIFORM__ args) {
    uint32_t i = blockIdx.x;

    // All threads compute their exp value
    if (i < args->size) {
        args->data[i] = expf(args->data[i] - *(args->max_val));
    }

    // Thread 0 does the sum reduction serially
    if (i == 0) {
        TYPE sum_exp = 0.0f;
        for (uint32_t j = 0; j < args->size; ++j) {
            sum_exp += args->data[j];
        }
        *(args->sum_exp) = sum_exp;
    }
```
Copilot AI commented on Feb 4, 2026:
softmax_exp_sum_thread computes exp for one element per work-item (indexed by blockIdx.x) and then work-item 0 immediately sums args->data[j] without any cross-work-item synchronization. With vx_spawn_threads(..., grid_dim=&size, block_dim=nullptr), these work-items run in parallel and there is no ordering guarantee, so the sum can read uninitialized/stale values. Split into two launches (one to compute exp in parallel, then a single-thread launch to sum), or launch with a non-NULL block_dim and use __syncthreads() for proper synchronization.
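As a CPU-side illustration of the suggested split (not actual device code; on the device these would be two separate `vx_spawn_threads` launches), the exp pass and the sum pass become two phases with a completion boundary between them:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// CPU sketch of the two-phase structure the review suggests:
// phase 1 (parallelizable) computes exp(x - max) per element;
// phase 2 (single thread / second launch) sums the results,
// and runs only after phase 1 has fully completed.
inline float softmax_denominator(std::vector<float>& data, float max_val) {
    // Phase 1: each "work-item" handles one element independently.
    for (auto& x : data) {
        x = std::exp(x - max_val);
    }
    // Phase 2: serial reduction, ordered after all phase-1 writes.
    float sum_exp = 0.0f;
    for (float x : data) {
        sum_exp += x;
    }
    return sum_exp;
}
```

Separating the launches makes the host-side kernel-completion wait serve as the missing synchronization point.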
tests/regression/mlp_vegeta/main.cpp (outdated)
```cpp
for (int k = 0; k < 2; ++k) {
    int pos = vals[k].second;
    mask |= (1 << pos);
    compressed[(comp_idx / 2) * out_dim + j] = dense[(i + pos) * out_dim + j];
```
Copilot AI commented on Feb 4, 2026:
In compress_2_4_sparse, the compressed write index uses (comp_idx / 2) * out_dim + j while comp_idx is incremented for each kept value. This causes the two kept values from the same 4-element group to be written to the same slot (second overwrites the first), corrupting the compressed weights.
Suggested change:
```diff
-    compressed[(comp_idx / 2) * out_dim + j] = dense[(i + pos) * out_dim + j];
+    compressed[comp_idx * out_dim + j] = dense[(i + pos) * out_dim + j];
```
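With the corrected index, the two kept values of each 4-element group land in consecutive compressed rows instead of overwriting each other. A standalone host-side sketch of that behavior for a single column (helper name and magnitude-ordered writes are illustrative, not the PR's exact code):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

// 2:4 sketch for one column j: from each group of 4 consecutive rows,
// keep the two largest-magnitude values and write them to consecutive
// compressed rows (comp_idx advances once per kept value, matching the
// corrected index in the suggestion above).
void compress_2_4_column(const std::vector<float>& dense, int rows,
                         int out_dim, int j, std::vector<float>& compressed) {
    int comp_idx = 0;
    for (int i = 0; i < rows; i += 4) {
        std::vector<std::pair<float, int>> vals;
        for (int k = 0; k < 4; ++k) {
            vals.push_back({std::fabs(dense[(i + k) * out_dim + j]), k});
        }
        std::sort(vals.begin(), vals.end(),
                  [](const auto& a, const auto& b) { return a.first > b.first; });
        for (int k = 0; k < 2; ++k) {
            int pos = vals[k].second;
            compressed[comp_idx * out_dim + j] = dense[(i + pos) * out_dim + j];
            ++comp_idx;  // one compressed row per kept value
        }
    }
}
```

Each input group of 4 rows thus contributes exactly 2 compressed rows, halving the row count as 2:4 sparsity requires.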
tests/regression/mlp_vegeta/main.cpp (outdated)
```cpp
// Store metadata
uint8_t mask = (1 << max_pos);
int tile_in = i / TILE_SIZE;
int tile_out = j / TILE_SIZE;
int meta_offset = (tile_in * out_tiles + tile_out) * M_TILE_BYTES;
int row_in_tile = (i % TILE_SIZE) / 4;
int col_in_tile = j % TILE_SIZE;
int byte_offset = meta_offset + row_in_tile * TILE_SIZE + col_in_tile;
if (byte_offset < (int)metadata.size()) {
    metadata[byte_offset] = mask;
```
Copilot AI commented on Feb 4, 2026:
The 1:4 compression metadata is written using the same row_in_tile * TILE_SIZE + col_in_tile byte addressing scheme as 2:4. This doesn't match the expected 128-byte packed-nibble format (8 bytes/row, two 4-bit masks per byte), so the VGEMM path will consume incorrect masks.
Suggested change:
```diff
- // Store metadata
- uint8_t mask = (1 << max_pos);
- int tile_in = i / TILE_SIZE;
- int tile_out = j / TILE_SIZE;
- int meta_offset = (tile_in * out_tiles + tile_out) * M_TILE_BYTES;
- int row_in_tile = (i % TILE_SIZE) / 4;
- int col_in_tile = j % TILE_SIZE;
- int byte_offset = meta_offset + row_in_tile * TILE_SIZE + col_in_tile;
- if (byte_offset < (int)metadata.size()) {
-     metadata[byte_offset] = mask;
+ // Store metadata using 128-byte packed-nibble format:
+ // - 8 bytes per row
+ // - each byte holds two 4-bit masks (low/high nibble) for a pair of columns
+ uint8_t mask = (1 << max_pos) & 0x0F;
+ int tile_in = i / TILE_SIZE;
+ int tile_out = j / TILE_SIZE;
+ int meta_offset = (tile_in * out_tiles + tile_out) * M_TILE_BYTES;
+ int row_in_tile = i % TILE_SIZE;
+ int col_in_tile = j % TILE_SIZE;
+ int col_pair = col_in_tile / 2;
+ int byte_offset = meta_offset + row_in_tile * 8 + col_pair;
+ if (byte_offset < (int)metadata.size()) {
+     uint8_t &entry = metadata[byte_offset];
+     if (col_in_tile & 1) {
+         // odd column: store in high nibble
+         entry = (uint8_t)((entry & 0x0F) | (mask << 4));
+     } else {
+         // even column: store in low nibble
+         entry = (uint8_t)((entry & 0xF0) | mask);
+     }
```
```cpp
for (uint32_t in_t = 0; in_t < in_tiles; ++in_t) {
    TYPE* A_tile = input + in_t * TILE_SIZE;
    TYPE* B_tile = weights + in_t * TILE_SIZE * out_dim + out_t * TILE_SIZE;
    vegeta_tgemm_tile_accumulate(A_tile, B_tile, g_C_tile);
```
Copilot AI commented on Feb 4, 2026:
vx_lt/vx_lu/vx_lv load a contiguous 1KB (or larger) tile from memory (see kernel/include/vx_intrinsics.h:288-293). The pointers computed here (input + in_t * TILE_SIZE, weights + in_t * TILE_SIZE * out_dim + ...) assume plain row-major matrices with strides, so each tile load will read unrelated data rather than a 16×16 packed tile. Either pack activations/weights into a tile-major contiguous layout on the host (and keep intermediates in that layout), or change the kernel to explicitly gather/scatter into contiguous tile buffers before issuing VEGETA ops.
```cpp
vx_lt(1, (size_t)A_tile, 0);  // Load A to T1
vx_lm(1, (size_t)M_tile, 0);  // Load metadata to M1
vx_lu(2, (size_t)B_tile, 0);  // Load B (2KB) to U2
vx_ugemm(0, 1, 2);            // T0 += T1 × U2 (with M1 metadata)
```
Copilot AI commented on Feb 4, 2026:
In vegeta_ugemm_tile_accumulate, metadata is loaded into M1 and then vx_ugemm is invoked. Per vx_intrinsics.h, UGEMM/VGEMM interpret the first operand tile register as sparse using the metadata. Here the first operand is loaded from A_tile (the activation/input), but the metadata being passed corresponds to the weights. Unless activations are also compressed in the same sparse format (they aren't in the host code), this will produce incorrect results. To use sparse weights, the sparse operand should be the weights tile (and dimensions/layout adjusted accordingly).
Suggested change:
```diff
- vx_lt(1, (size_t)A_tile, 0); // Load A to T1
- vx_lm(1, (size_t)M_tile, 0); // Load metadata to M1
- vx_lu(2, (size_t)B_tile, 0); // Load B (2KB) to U2
- vx_ugemm(0, 1, 2);           // T0 += T1 × U2 (with M1 metadata)
+ vx_lt(1, (size_t)B_tile, 0); // Load sparse weights B to T1
+ vx_lm(1, (size_t)M_tile, 0); // Load metadata for weights into M1
+ vx_lu(2, (size_t)A_tile, 0); // Load dense activations A to U2
+ vx_ugemm(0, 1, 2);           // T0 += T1(sparse weights) × U2(dense activations) with M1 metadata
```
```cpp
vx_lt(1, (size_t)A_tile, 0);  // Load A to T1
vx_lm(1, (size_t)M_tile, 0);  // Load metadata to M1
vx_lv(1, (size_t)B_tile, 0);  // Load B (4KB) to V1
vx_vgemm(0, 1, 1);            // T0 += T1 × V1 (with M1 metadata)
```
Copilot AI commented on Feb 4, 2026:
Same issue as UGEMM: vx_vgemm treats the first operand (loaded from A_tile) as sparse using the metadata in M1, but A_tile is taken from activations while the metadata is generated for weights. This mismatch will corrupt the multiply unless the kernel/host are changed so the sparse operand is the compressed weights tile.
Suggested change:
```diff
- vx_lt(1, (size_t)A_tile, 0); // Load A to T1
- vx_lm(1, (size_t)M_tile, 0); // Load metadata to M1
- vx_lv(1, (size_t)B_tile, 0); // Load B (4KB) to V1
- vx_vgemm(0, 1, 1);           // T0 += T1 × V1 (with M1 metadata)
+ vx_lt(1, (size_t)B_tile, 0); // Load compressed weights (sparse) to T1
+ vx_lm(1, (size_t)M_tile, 0); // Load metadata for sparse weights to M1
+ vx_lv(1, (size_t)A_tile, 0); // Load activations (dense) to V1
+ vx_vgemm(0, 1, 1);           // T0 += T1 × V1 (with M1 metadata applied to T1)
```
```cpp
// Upload weights (compressed for sparse modes, dense for TGEMM)
std::vector<TYPE>& weights_to_upload =
    (mlp_mode == MLP_MODE_TGEMM) ? h_weights_dense[layer] : h_weights_compressed[layer];

size_t weight_size = weights_to_upload.size() * sizeof(TYPE);
RT_CHECK(vx_mem_alloc(device, weight_size, VX_MEM_READ, &weight_buffers[layer]));
RT_CHECK(vx_mem_address(weight_buffers[layer], &layer_configs[layer].weights_addr));
RT_CHECK(vx_copy_to_dev(weight_buffers[layer], weights_to_upload.data(), 0, weight_size));
```
Copilot AI commented on Feb 4, 2026:
weights_to_upload is uploaded as a flat row-major matrix (dense[i*out_dim+j] or similarly for compressed). The VEGETA kernel uses vx_lt/vx_lu/vx_lv, which load contiguous tiles, so weights (and activations) generally need to be packed into tile-major contiguous 16×16 blocks (and for U/V, packed exactly as the instruction expects). As-is, the device kernel will read incorrect tiles from these buffers.
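A minimal sketch of the host-side repacking this comment suggests, converting a row-major matrix into contiguous TS×TS tiles so that a contiguous tile load reads one whole tile (function name is an assumption, not part of the PR):

```cpp
#include <cassert>
#include <vector>

// Repack a rows x cols row-major matrix into tile-major order: each
// TS x TS tile becomes one contiguous TS*TS run, tiles ordered
// row-tile-major. Assumes rows and cols are multiples of TS
// (TS = 16 would match the VEGETA tile size).
std::vector<float> pack_tile_major(const std::vector<float>& rm,
                                   int rows, int cols, int TS) {
    std::vector<float> tm(rm.size());
    int col_tiles = cols / TS;
    for (int i = 0; i < rows; ++i) {
        for (int j = 0; j < cols; ++j) {
            int tile = (i / TS) * col_tiles + (j / TS);      // which tile
            int within = (i % TS) * TS + (j % TS);           // row-major inside tile
            tm[tile * TS * TS + within] = rm[i * cols + j];
        }
    }
    return tm;
}
```

The same packing would need to be applied to intermediate activations between layers (or the kernel's stores kept tile-major) so every layer's tile loads see the layout they expect.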
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 8 comments.
```cpp
// Print outputs
// std::cout << "Device output: ";
// for (int i = 0; i < OUTPUT_DIM; ++i) {
//   std::cout << h_output[i] << " ";
// }
// std::cout << std::endl;

// std::cout << "CPU reference: ";
// for (int i = 0; i < OUTPUT_DIM; ++i) {
//   std::cout << h_ref_output[i] << " ";
// }
// std::cout << std::endl;
```
Copilot AI commented on Feb 5, 2026:
Commented-out debug code should be removed before merging to keep the codebase clean.
Suggested change:
```diff
- // Print outputs
- // std::cout << "Device output: ";
- // for (int i = 0; i < OUTPUT_DIM; ++i) {
- //   std::cout << h_output[i] << " ";
- // }
- // std::cout << std::endl;
- // std::cout << "CPU reference: ";
- // for (int i = 0; i < OUTPUT_DIM; ++i) {
- //   std::cout << h_ref_output[i] << " ";
- // }
- // std::cout << std::endl;
```
No description provided.