Added MLP test for baseline and VEGETA engine. #316
base: veg
Conversation
Pull request overview
Adds new regression tests for a 4-layer MLP, including a baseline implementation and a VEGETA (tile GEMM) implementation with optional sparse modes.
Changes:
- Added `tests/regression/mlp` host+kernel for baseline MLP inference with CPU reference verification
- Added `tests/regression/mlp_vegeta` host+kernel for batched VEGETA MLP inference, including (intended) 2:4 and 1:4 weight compression paths
- Added per-test `common.h` and `Makefile` scaffolding for both tests
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| tests/regression/mlp/main.cpp | Baseline MLP host-side setup, device execution, CPU reference, and verification |
| tests/regression/mlp/kernel.cpp | Baseline MLP device kernel (FC layers + softmax) |
| tests/regression/mlp/common.h | Baseline MLP shared constants and argument structs |
| tests/regression/mlp/Makefile | Build rules for baseline MLP regression |
| tests/regression/mlp_vegeta/main.cpp | VEGETA MLP host-side setup, (intended) sparsity compression, device execution, CPU reference |
| tests/regression/mlp_vegeta/kernel.cpp | VEGETA tile-GEMM based MLP kernel with TGEMM/UGEMM/VGEMM paths |
| tests/regression/mlp_vegeta/common.h | VEGETA MLP shared constants (tile sizes, modes) and argument structs |
| tests/regression/mlp_vegeta/Makefile | Build rules for VEGETA MLP regression |
```cpp
// Layer configuration structure
typedef struct {
    uint32_t input_dim;
    uint32_t output_dim;
    uint64_t weights_addr;  // Weight matrix: output_dim x input_dim
    uint64_t bias_addr;     // Bias vector: output_dim
} layer_config_t;
```
Copilot AI commented on Feb 4, 2026:
This header uses uint32_t/uint64_t but doesn't include <stdint.h>/<cstdint>. Relying on transitive includes makes builds fragile; include the standard integer header directly from common.h.
tests/regression/mlp_vegeta/main.cpp (outdated)
```cpp
// Store metadata (simplified - in real impl would be tile-based)
int tile_in = i / TILE_SIZE;
int tile_out = j / TILE_SIZE;
int meta_offset = (tile_in * out_tiles + tile_out) * M_TILE_BYTES;
int row_in_tile = (i % TILE_SIZE) / 4;
int col_in_tile = j % TILE_SIZE;
int byte_offset = meta_offset + row_in_tile * TILE_SIZE + col_in_tile;
if (byte_offset < (int)metadata.size()) {
    metadata[byte_offset] = mask;
}
```
Copilot AI commented on Feb 4, 2026:
The metadata layout written here (row_in_tile * TILE_SIZE + col_in_tile) does not match the 128-byte 16×16 nibble format used by other VEGETA sparse tests (8 bytes per row, 2 nibbles per byte; see tests/regression/sgemm_tile/main.cpp:124-132). As written, only 64 bytes/tile can be addressed (row_in_tile is 0..3) and no nibble packing is performed, so the device will interpret masks incorrectly.
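For reference, the 128-byte layout described in that comment works out to 16 rows of 8 bytes each, with two 4-bit masks packed per byte. A small host-side sketch of that addressing (helper names are illustrative, not from the PR):

```cpp
#include <cassert>

// Hypothetical helpers: locate the 4-bit mask for (row, col) within one
// 16x16 tile's metadata, assuming 8 bytes per row and two nibbles
// (a column pair) per byte -> 16 * 8 = 128 bytes per tile.
inline int meta_byte_offset(int row_in_tile, int col_in_tile) {
    return row_in_tile * 8 + col_in_tile / 2;
}

inline bool meta_high_nibble(int col_in_tile) {
    return (col_in_tile & 1) != 0;  // odd columns use the high nibble
}
```

Under this scheme the largest reachable offset is 127, i.e. the full 128 bytes are addressable, unlike the 64-byte range reachable with `row_in_tile` in 0..3.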
```cpp
for (uint32_t row = 0; row < BATCH_SIZE; ++row) {
    for (uint32_t col = 0; col < TILE_SIZE; ++col) {
        output[row * out_dim + out_t * TILE_SIZE + col] =
```
Copilot AI commented on Feb 4, 2026:
These scalar stores write g_C_tile into output in plain row-major order. However, subsequent layers pass layer_output back into VEGETA GEMM paths where tiles are loaded via vx_lt, which expects activations laid out as contiguous 16×16 tiles. Without a repack step (or storing the result in tile-major layout), later layers will load incorrect activation tiles.
Suggested change:
```diff
-            output[row * out_dim + out_t * TILE_SIZE + col] =
+            uint32_t tile_index = out_t * TILE_SIZE * TILE_SIZE + row * TILE_SIZE + col;
+            output[tile_index] =
```
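Assuming `TILE_SIZE` is 16 as in the other VEGETA tests, the suggested tile-major index gives each output tile a contiguous 256-element block, with rows and columns stored row-major inside it. A quick sketch of the mapping (function name is illustrative):

```cpp
#include <cassert>
#include <cstdint>

// Tile-major index: tile out_t occupies one contiguous TILE_SIZE*TILE_SIZE
// run; (row, col) is row-major within the tile. TILE_SIZE = 16 is an
// assumption matching the other VEGETA regression tests.
constexpr uint32_t TILE_SIZE = 16;

inline uint32_t tile_major_index(uint32_t out_t, uint32_t row, uint32_t col) {
    return out_t * TILE_SIZE * TILE_SIZE + row * TILE_SIZE + col;
}
```

With this layout a subsequent `vx_lt`-style contiguous 1KB load of tile `out_t` starts at element `out_t * 256` and reads exactly one 16×16 tile.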
tests/regression/mlp/kernel.cpp (outdated)
```cpp
// Thread function: compute exp in parallel, thread 0 sums
void softmax_exp_sum_thread(softmax_args_t* __UNIFORM__ args) {
    uint32_t i = blockIdx.x;

    // All threads compute their exp value
    if (i < args->size) {
        args->data[i] = expf(args->data[i] - *(args->max_val));
    }

    // Thread 0 does the sum reduction serially
    if (i == 0) {
        TYPE sum_exp = 0.0f;
        for (uint32_t j = 0; j < args->size; ++j) {
            sum_exp += args->data[j];
        }
        *(args->sum_exp) = sum_exp;
    }
```
Copilot AI commented on Feb 4, 2026:
softmax_exp_sum_thread computes exp for one element per work-item (indexed by blockIdx.x) and then work-item 0 immediately sums args->data[j] without any cross-work-item synchronization. With vx_spawn_threads(..., grid_dim=&size, block_dim=nullptr), these work-items run in parallel and there is no ordering guarantee, so the sum can read uninitialized/stale values. Split into two launches (one to compute exp in parallel, then a single-thread launch to sum), or launch with a non-NULL block_dim and use __syncthreads() for proper synchronization.
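As a CPU-side illustration of the suggested split (not actual device code; on the device these would be two separate `vx_spawn_threads` launches), the exp pass and the sum pass become two phases with a completion boundary between them:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// CPU sketch of the two-phase structure the review suggests:
// phase 1 (parallelizable) computes exp(x - max) per element;
// phase 2 (single thread / second launch) sums the results,
// and runs only after phase 1 has fully completed.
inline float softmax_denominator(std::vector<float>& data, float max_val) {
    // Phase 1: each "work-item" handles one element independently.
    for (auto& x : data) {
        x = std::exp(x - max_val);
    }
    // Phase 2: serial reduction, ordered after all phase-1 writes.
    float sum_exp = 0.0f;
    for (float x : data) {
        sum_exp += x;
    }
    return sum_exp;
}
```

Separating the launches makes the host-side kernel-completion wait serve as the missing synchronization point.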
tests/regression/mlp_vegeta/main.cpp (outdated)
```cpp
for (int k = 0; k < 2; ++k) {
    int pos = vals[k].second;
    mask |= (1 << pos);
    compressed[(comp_idx / 2) * out_dim + j] = dense[(i + pos) * out_dim + j];
```
Copilot AI commented on Feb 4, 2026:
In compress_2_4_sparse, the compressed write index uses (comp_idx / 2) * out_dim + j while comp_idx is incremented for each kept value. This causes the two kept values from the same 4-element group to be written to the same slot (second overwrites the first), corrupting the compressed weights.
Suggested change:
```diff
-    compressed[(comp_idx / 2) * out_dim + j] = dense[(i + pos) * out_dim + j];
+    compressed[comp_idx * out_dim + j] = dense[(i + pos) * out_dim + j];
```
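With the corrected index, the two kept values of each 4-element group land in consecutive compressed rows instead of overwriting each other. A standalone host-side sketch of that behavior for a single column (helper name and magnitude-ordered writes are illustrative, not the PR's exact code):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

// 2:4 sketch for one column j: from each group of 4 consecutive rows,
// keep the two largest-magnitude values and write them to consecutive
// compressed rows (comp_idx advances once per kept value, matching the
// corrected index in the suggestion above).
void compress_2_4_column(const std::vector<float>& dense, int rows,
                         int out_dim, int j, std::vector<float>& compressed) {
    int comp_idx = 0;
    for (int i = 0; i < rows; i += 4) {
        std::vector<std::pair<float, int>> vals;
        for (int k = 0; k < 4; ++k) {
            vals.push_back({std::fabs(dense[(i + k) * out_dim + j]), k});
        }
        std::sort(vals.begin(), vals.end(),
                  [](const auto& a, const auto& b) { return a.first > b.first; });
        for (int k = 0; k < 2; ++k) {
            int pos = vals[k].second;
            compressed[comp_idx * out_dim + j] = dense[(i + pos) * out_dim + j];
            ++comp_idx;  // one compressed row per kept value
        }
    }
}
```

Each input group of 4 rows thus contributes exactly 2 compressed rows, halving the row count as 2:4 sparsity requires.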
tests/regression/mlp_vegeta/main.cpp (outdated)
```cpp
// Store metadata
uint8_t mask = (1 << max_pos);
int tile_in = i / TILE_SIZE;
int tile_out = j / TILE_SIZE;
int meta_offset = (tile_in * out_tiles + tile_out) * M_TILE_BYTES;
int row_in_tile = (i % TILE_SIZE) / 4;
int col_in_tile = j % TILE_SIZE;
int byte_offset = meta_offset + row_in_tile * TILE_SIZE + col_in_tile;
if (byte_offset < (int)metadata.size()) {
    metadata[byte_offset] = mask;
```
Copilot AI commented on Feb 4, 2026:
The 1:4 compression metadata is written using the same row_in_tile * TILE_SIZE + col_in_tile byte addressing scheme as 2:4. This doesn't match the expected 128-byte packed-nibble format (8 bytes/row, two 4-bit masks per byte), so the VGEMM path will consume incorrect masks.
Suggested change:
```diff
- // Store metadata
- uint8_t mask = (1 << max_pos);
- int tile_in = i / TILE_SIZE;
- int tile_out = j / TILE_SIZE;
- int meta_offset = (tile_in * out_tiles + tile_out) * M_TILE_BYTES;
- int row_in_tile = (i % TILE_SIZE) / 4;
- int col_in_tile = j % TILE_SIZE;
- int byte_offset = meta_offset + row_in_tile * TILE_SIZE + col_in_tile;
- if (byte_offset < (int)metadata.size()) {
-     metadata[byte_offset] = mask;
+ // Store metadata using 128-byte packed-nibble format:
+ // - 8 bytes per row
+ // - each byte holds two 4-bit masks (low/high nibble) for a pair of columns
+ uint8_t mask = (1 << max_pos) & 0x0F;
+ int tile_in = i / TILE_SIZE;
+ int tile_out = j / TILE_SIZE;
+ int meta_offset = (tile_in * out_tiles + tile_out) * M_TILE_BYTES;
+ int row_in_tile = i % TILE_SIZE;
+ int col_in_tile = j % TILE_SIZE;
+ int col_pair = col_in_tile / 2;
+ int byte_offset = meta_offset + row_in_tile * 8 + col_pair;
+ if (byte_offset < (int)metadata.size()) {
+     uint8_t &entry = metadata[byte_offset];
+     if (col_in_tile & 1) {
+         // odd column: store in high nibble
+         entry = (uint8_t)((entry & 0x0F) | (mask << 4));
+     } else {
+         // even column: store in low nibble
+         entry = (uint8_t)((entry & 0xF0) | mask);
+     }
```
```cpp
for (uint32_t in_t = 0; in_t < in_tiles; ++in_t) {
    TYPE* A_tile = input + in_t * TILE_SIZE;
    TYPE* B_tile = weights + in_t * TILE_SIZE * out_dim + out_t * TILE_SIZE;
    vegeta_tgemm_tile_accumulate(A_tile, B_tile, g_C_tile);
```
Copilot AI commented on Feb 4, 2026:
vx_lt/vx_lu/vx_lv load a contiguous 1KB (or larger) tile from memory (see kernel/include/vx_intrinsics.h:288-293). The pointers computed here (input + in_t * TILE_SIZE, weights + in_t * TILE_SIZE * out_dim + ...) assume plain row-major matrices with strides, so each tile load will read unrelated data rather than a 16×16 packed tile. Either pack activations/weights into a tile-major contiguous layout on the host (and keep intermediates in that layout), or change the kernel to explicitly gather/scatter into contiguous tile buffers before issuing VEGETA ops.
```cpp
vx_lt(1, (size_t)A_tile, 0);  // Load A to T1
vx_lm(1, (size_t)M_tile, 0);  // Load metadata to M1
vx_lu(2, (size_t)B_tile, 0);  // Load B (2KB) to U2
vx_ugemm(0, 1, 2);            // T0 += T1 × U2 (with M1 metadata)
```
Copilot AI commented on Feb 4, 2026:
In vegeta_ugemm_tile_accumulate, metadata is loaded into M1 and then vx_ugemm is invoked. Per vx_intrinsics.h, UGEMM/VGEMM interpret the first operand tile register as sparse using the metadata. Here the first operand is loaded from A_tile (the activation/input), but the metadata being passed corresponds to the weights. Unless activations are also compressed in the same sparse format (they aren't in the host code), this will produce incorrect results. To use sparse weights, the sparse operand should be the weights tile (and dimensions/layout adjusted accordingly).
Suggested change:
```diff
- vx_lt(1, (size_t)A_tile, 0); // Load A to T1
- vx_lm(1, (size_t)M_tile, 0); // Load metadata to M1
- vx_lu(2, (size_t)B_tile, 0); // Load B (2KB) to U2
- vx_ugemm(0, 1, 2);           // T0 += T1 × U2 (with M1 metadata)
+ vx_lt(1, (size_t)B_tile, 0); // Load sparse weights B to T1
+ vx_lm(1, (size_t)M_tile, 0); // Load metadata for weights into M1
+ vx_lu(2, (size_t)A_tile, 0); // Load dense activations A to U2
+ vx_ugemm(0, 1, 2);           // T0 += T1(sparse weights) × U2(dense activations) with M1 metadata
```
```cpp
vx_lt(1, (size_t)A_tile, 0);  // Load A to T1
vx_lm(1, (size_t)M_tile, 0);  // Load metadata to M1
vx_lv(1, (size_t)B_tile, 0);  // Load B (4KB) to V1
vx_vgemm(0, 1, 1);            // T0 += T1 × V1 (with M1 metadata)
```
Copilot AI commented on Feb 4, 2026:
Same issue as UGEMM: vx_vgemm treats the first operand (loaded from A_tile) as sparse using the metadata in M1, but A_tile is taken from activations while the metadata is generated for weights. This mismatch will corrupt the multiply unless the kernel/host are changed so the sparse operand is the compressed weights tile.
Suggested change:
```diff
- vx_lt(1, (size_t)A_tile, 0); // Load A to T1
- vx_lm(1, (size_t)M_tile, 0); // Load metadata to M1
- vx_lv(1, (size_t)B_tile, 0); // Load B (4KB) to V1
- vx_vgemm(0, 1, 1);           // T0 += T1 × V1 (with M1 metadata)
+ vx_lt(1, (size_t)B_tile, 0); // Load compressed weights (sparse) to T1
+ vx_lm(1, (size_t)M_tile, 0); // Load metadata for sparse weights to M1
+ vx_lv(1, (size_t)A_tile, 0); // Load activations (dense) to V1
+ vx_vgemm(0, 1, 1);           // T0 += T1 × V1 (with M1 metadata applied to T1)
```
```cpp
// Upload weights (compressed for sparse modes, dense for TGEMM)
std::vector<TYPE>& weights_to_upload =
    (mlp_mode == MLP_MODE_TGEMM) ? h_weights_dense[layer] : h_weights_compressed[layer];

size_t weight_size = weights_to_upload.size() * sizeof(TYPE);
RT_CHECK(vx_mem_alloc(device, weight_size, VX_MEM_READ, &weight_buffers[layer]));
RT_CHECK(vx_mem_address(weight_buffers[layer], &layer_configs[layer].weights_addr));
RT_CHECK(vx_copy_to_dev(weight_buffers[layer], weights_to_upload.data(), 0, weight_size));
```
Copilot AI commented on Feb 4, 2026:
weights_to_upload is uploaded as a flat row-major matrix (dense[i*out_dim+j] or similarly for compressed). The VEGETA kernel uses vx_lt/vx_lu/vx_lv, which load contiguous tiles, so weights (and activations) generally need to be packed into tile-major contiguous 16×16 blocks (and for U/V, packed exactly as the instruction expects). As-is, the device kernel will read incorrect tiles from these buffers.
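A minimal sketch of the host-side repacking this comment suggests, converting a row-major matrix into contiguous TS×TS tiles so that a contiguous tile load reads one whole tile (function name is an assumption, not part of the PR):

```cpp
#include <cassert>
#include <vector>

// Repack a rows x cols row-major matrix into tile-major order: each
// TS x TS tile becomes one contiguous TS*TS run, tiles ordered
// row-tile-major. Assumes rows and cols are multiples of TS
// (TS = 16 would match the VEGETA tile size).
std::vector<float> pack_tile_major(const std::vector<float>& rm,
                                   int rows, int cols, int TS) {
    std::vector<float> tm(rm.size());
    int col_tiles = cols / TS;
    for (int i = 0; i < rows; ++i) {
        for (int j = 0; j < cols; ++j) {
            int tile = (i / TS) * col_tiles + (j / TS);      // which tile
            int within = (i % TS) * TS + (j % TS);           // row-major inside tile
            tm[tile * TS * TS + within] = rm[i * cols + j];
        }
    }
    return tm;
}
```

The same packing would need to be applied to intermediate activations between layers (or the kernel's stores kept tile-major) so every layer's tile loads see the layout they expect.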
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 8 comments.
```cpp
// Print outputs
// std::cout << "Device output: ";
// for (int i = 0; i < OUTPUT_DIM; ++i) {
//   std::cout << h_output[i] << " ";
// }
// std::cout << std::endl;

// std::cout << "CPU reference: ";
// for (int i = 0; i < OUTPUT_DIM; ++i) {
//   std::cout << h_ref_output[i] << " ";
// }
// std::cout << std::endl;
```
Copilot AI commented on Feb 5, 2026:
Commented-out debug code should be removed before merging to keep the codebase clean.
Suggested change:
```diff
- // Print outputs
- // std::cout << "Device output: ";
- // for (int i = 0; i < OUTPUT_DIM; ++i) {
- //   std::cout << h_output[i] << " ";
- // }
- // std::cout << std::endl;
- // std::cout << "CPU reference: ";
- // for (int i = 0; i < OUTPUT_DIM; ++i) {
- //   std::cout << h_ref_output[i] << " ";
- // }
- // std::cout << std::endl;
```
No description provided.