⚡️ Optimizing GPMA #53
-
BaseDataset: `Foorah_large_8`

The code below attempts to evaluate the amount of time the …

The output for the above code is …
-
**🤔 What is the need for a graph datastructure?**

**What does Naive do**
Naive keeps CSR representations of the graph for the different timestamps on the CPU. At every timestamp, the following operations are performed:

- [Move to GPU step] The row_offset, col_indices, and eids of the timestamp have to be moved to the GPU.

**What does GPMA do**
GPMA maintains a CSR structure on the GPU. At every timestamp, the following operations are performed:

- [Move to GPU step] For an update (insertion, deletion, value update) to be performed, the data has to be organized into three lists of src, dst, and value. These then have to be moved to the GPU and operated upon.

**Conclusion**
The point of using the graph datastructure is to trade the movement of large CSR arrays to the GPU (in Naive) for the much faster movement of small updates to the GPU (in GPMA). That means GPMA should ideally be faster than the Naive approach; a sketch contrasting the two transfer patterns is given below. It may be noted that in the case of Naive it is possible to pin the memory of each of those CSR arrays, but given the limitations on pinning large amounts of memory, it is probably better to pin just the updates.
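A minimal sketch of the two transfer patterns, assuming PyTorch tensors with made-up sizes; none of the names below come from the actual codebase:

```python
import torch

# Illustrative tensors with made-up sizes; purely a stand-in for the
# repo's actual Naive/GPMA structures.

# Naive: the full CSR arrays of each timestamp live on the CPU.
row_offset = torch.arange(0, 10_000_001, 10, dtype=torch.int64)  # ~1M nodes
col_indices = torch.randint(0, 1_000_000, (10_000_000,), dtype=torch.int64)
eids = torch.arange(10_000_000, dtype=torch.int64)

def naive_step(device="cuda"):
    # Three large host-to-device copies at every timestamp.
    return row_offset.to(device), col_indices.to(device), eids.to(device)

# GPMA: the CSR stays resident on the GPU; only this timestamp's
# (src, dst, value) update triples are moved across.
src = torch.randint(0, 1_000_000, (50_000,), dtype=torch.int64)
dst = torch.randint(0, 1_000_000, (50_000,), dtype=torch.int64)
val = torch.rand(50_000)

def gpma_step(device="cuda"):
    # Much smaller host-to-device copies at every timestamp.
    return src.to(device), dst.to(device), val.to(device)
```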
-
**📌 Pinned Memory**

We will store the updates in pinned memory for fast transfer to the GPU. I was able to achieve this, and it seemed to perform on par with PyTorch's own pinned-memory movement.

**Implementation**
I have implemented a …

Note: We will modify the approach to storing redundant …
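A minimal sketch of the pinned-buffer idea, assuming updates arrive as (src, dst, value) tensors; the helper names and buffer sizes are illustrative, not the actual implementation:

```python
import torch

def make_pinned_buffers(max_updates):
    # Page-locked host buffers, allocated once and reused across timestamps.
    src = torch.empty(max_updates, dtype=torch.int64, pin_memory=True)
    dst = torch.empty(max_updates, dtype=torch.int64, pin_memory=True)
    val = torch.empty(max_updates, dtype=torch.float32, pin_memory=True)
    return src, dst, val

def transfer_updates(pinned, updates, device="cuda"):
    # Stage this timestamp's updates into the pinned buffers, then issue
    # asynchronous host-to-device copies (non_blocking=True only overlaps
    # the copy when the source tensor is pinned).
    out = []
    for buf, upd in zip(pinned, updates):
        n = upd.numel()
        buf[:n].copy_(upd)
        out.append(buf[:n].to(device, non_blocking=True))
    return out
```

Reusing one set of page-locked buffers avoids re-pinning at every timestamp, which matters because pinning itself is an expensive OS-level operation.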
-
FINALLY YESS 🥳
-
**🫙 GPMA Storage**

Ran a test for GPMA on …

Some additional info is given below.

**How much data does the graph comprise of?**
I ran a separate script to verify how much GPU data is used to store the graph, and it seems that the variable that stores edges_lst takes only 71 MB (a rough way to check this is sketched below). While it does feel a bit pointless to try to save this amount of space, I'm hoping we can see gains in certain usecases, and it always helps to optimize 😉 So our GPMA needs to take less than 71 MB.
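For reference, one rough way to check such sizes; `edges_lst` here is a stand-in built from random data, not the actual variable from the run above:

```python
import torch

def size_mb(t):
    # Bytes occupied by the tensor's elements, in MB.
    return t.element_size() * t.nelement() / (1024 ** 2)

# Stand-in for the real edges_lst: one (2, E) edge tensor per timestamp.
edges_lst = [torch.randint(0, 1_000_000, (2, 100_000)) for _ in range(30)]
print(f"edges_lst takes ~{sum(size_mb(t) for t in edges_lst):.1f} MB")
```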
-
Edge labelling & Faster backward graph generationLabelling of edges and building of a labelled backward graph in an efficient manner is discussed #56. We noticed that there was a considerable speed-up with the introduction of this feature, we believe it is primarily due to the speed of generating the labelled backward graph. The results are as shown below.
-
**One last bug 🐛**
GPMA goes out of memory for …
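A small helper that can help localize where the allocation spikes; the checkpoint tags are illustrative, not functions from this codebase:

```python
import torch

def log_gpu_mem(tag):
    # Snapshot of CUDA memory at a named checkpoint.
    alloc = torch.cuda.memory_allocated() / (1024 ** 2)
    reserved = torch.cuda.memory_reserved() / (1024 ** 2)
    print(f"[{tag}] allocated={alloc:.1f} MB, reserved={reserved:.1f} MB")

# e.g. call log_gpu_mem("before insert") / log_gpu_mem("after rebalance")
# around the suspect GPMA steps.
```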







-
Here we will try to optimize the GPMA implementation for better speedup. A couple of areas to look at:

- [NEW IDEA] A possible new approach is to insert all possible edges into the GPMA up front and then only update the values at every timestamp; that way we could avoid the rebalancing and reallocation overhead of GPMA (see the sketch below).
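A sketch of this pre-insertion idea, assuming hypothetical `gpma.insert` / `gpma.set_values` kernels (the real GPMA API may differ):

```python
import torch

def preinsert_then_update(gpma, snapshots):
    # snapshots: list of (src, dst, val) tensors, one triple per timestamp.
    all_src = torch.cat([s for s, _, _ in snapshots])
    all_dst = torch.cat([d for _, d, _ in snapshots])
    edges = torch.unique(torch.stack([all_src, all_dst]), dim=1)

    # One-time structural build: every edge that will ever appear goes in,
    # with a placeholder value. (gpma.insert is a hypothetical kernel.)
    gpma.insert(edges[0], edges[1], torch.zeros(edges.shape[1]))

    for src, dst, val in snapshots:
        # Value-only updates: the keys already exist, so the PMA layout
        # stays stable; no rebalancing or reallocation in the time loop.
        # (gpma.set_values is a hypothetical kernel.)
        gpma.set_values(src, dst, val)
```

The trade-off is a larger resident structure on the GPU (every edge is stored even at timestamps where it is absent), in exchange for removing all structural work from the per-timestamp path.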