9 changes: 4 additions & 5 deletions docs/conceptual/ck_tile/CK-tile-index.rst
@@ -1,14 +1,13 @@
 .. _ck_tile_index:
 
-************************
-CK Tile Index
-************************
-
-CK Tile documentation structure:
+****************************************************
+CK Tile conceptual documentation table of contents
+****************************************************
 
 .. toctree::
    :maxdepth: 2
 
+   index
    introduction_motivation
    buffer_views
    tensor_views
156 changes: 0 additions & 156 deletions docs/conceptual/ck_tile/MERMAID_DIAGRAMS.md

This file was deleted.

5 changes: 2 additions & 3 deletions docs/conceptual/ck_tile/adaptors.rst
@@ -59,8 +59,8 @@ A TensorAdaptor encapsulates a sequence of :ref:`coordinate transformations <ck_
.. image:: diagrams/adaptors_1.svg
:alt: Diagram
:align: center
-Core Components
+
Core Components
~~~~~~~~~~~~~~~

Each TensorAdaptor contains:
@@ -115,7 +115,7 @@ Custom adaptors can be created by specifying which transforms to use and how the
make_tuple(sequence<0>{}) // to single dim 0
);

-// The adaptor is embedded in the :ref:`descriptor <ck_tile_descriptors>`
+// The adaptor is embedded in the descriptor
// To use it:
multi_index<1> top_coord{5}; // 1D coordinate
// This internally calculates: row = 5/3 = 1, col = 5%3 = 2
@@ -309,7 +309,6 @@ A practical example showing how adaptors create efficient :ref:`GPU memory acces
// - Dimension 0,1: Thread indices
// - Dimension 2,3: Vector indices within thread
// Enables coalesced memory access on GPU
-// See :ref:`ck_tile_thread_mapping` for thread mapping details

Common Transform Chains
-----------------------
20 changes: 20 additions & 0 deletions docs/conceptual/ck_tile/buffer_views.rst
@@ -1,6 +1,25 @@
.. _ck_tile_buffer_views:

**********************************
Buffer Views - Raw Memory Access
**********************************

+Overview
+--------
+
+At the foundation of the CK Tile system lies BufferView, a compile-time abstraction that provides structured access to raw memory regions within GPU kernels. It serves as the bridge between the hardware's physical memory model and the higher-level abstractions that enable efficient GPU programming. BufferView encapsulates the complexity of GPU memory hierarchies while exposing a unified interface that works seamlessly across different memory address spaces: global memory shared across the entire device, local data share (LDS) memory shared within a workgroup, and the registers private to each thread.
+
+BufferView is the foundation for :ref:`ck_tile_tensor_views`, which add multi-dimensional structure on top of raw memory access. Understanding BufferView is essential before moving on to more complex abstractions like :ref:`ck_tile_distribution` and :ref:`ck_tile_tile_window`.
+
+By providing compile-time knowledge of buffer properties through template metaprogramming, BufferView enables the compiler to generate optimal machine code for each specific use case. This zero-overhead abstraction ensures that the convenience of a high-level interface comes with no runtime performance penalty.
+
+One of BufferView's most important features is its handling of out-of-bounds memory access. Unlike CPU programming, where such accesses typically result in segmentation faults or undefined behavior, GPU programming must gracefully handle cases where threads attempt to access memory beyond allocated boundaries. BufferView provides configurable strategies for these scenarios: developers can choose to return either zero or a custom sentinel value for invalid accesses. This flexibility is important for algorithms that naturally extend beyond data boundaries, such as convolutions with padding or matrix operations with non-aligned dimensions.
+
+The abstraction extends beyond simple memory access to encompass both scalar and vector data types. GPUs achieve their highest efficiency when loading or storing multiple data elements in a single instruction. BufferView supports these vectorized operations, automatically selecting the appropriate hardware instructions based on the data type and access pattern. This capability transforms what would be multiple memory transactions into single, efficient operations that fully utilize the available memory bandwidth.
+
+BufferView also incorporates AMD GPU-specific optimizations that leverage unique hardware features. The AMD buffer addressing mode, for instance, provides hardware-accelerated bounds checking that ensures memory safety without the performance overhead of software-based checks. Similarly, BufferView exposes atomic operations that are crucial for parallel algorithms requiring thread-safe updates to shared data structures. These hardware-specific optimizations are abstracted behind a portable interface, ensuring that code remains maintainable while achieving optimal performance.
+
+Memory coherence and caching policies represent another layer of complexity that BufferView manages transparently. Different GPU memory spaces have different coherence guarantees and caching behaviors: global memory accesses can be cached in the L1 and L2 caches with various coherence protocols, while LDS provides workgroup-level coherence with specialized banking structures (see :ref:`ck_tile_lds_bank_conflicts` for details on avoiding bank conflicts). BufferView encapsulates these details, automatically applying the appropriate memory ordering constraints and cache control directives based on the target address space and operation type.
+
+Address Space Usage Patterns
+----------------------------
@@ -51,6 +70,7 @@ Address Space Usage Patterns
.. image:: diagrams/buffer_views_1.svg
:alt: Diagram
:align: center

C++ Implementation
------------------

14 changes: 2 additions & 12 deletions docs/conceptual/ck_tile/convolution_example.rst
@@ -59,10 +59,6 @@ The key insight is that convolution can be transformed from a complex nested loo



-.. image:: diagrams/convolution_example.svg
-   :alt: Diagram
-   :align: center
-
.. image:: diagrams/convolution_example.svg
:alt: Diagram
:align: center
@@ -88,7 +84,6 @@ Non-overlapping tiles:

// Original matrix: shape=(6, 6), strides=(6, 1)
// Tiled view: shape=(3, 3, 2, 2), strides=(12, 2, 6, 1)
-// See :ref:`ck_tile_descriptors` for descriptor details
using TileDescriptor = TensorDescriptor<
Sequence<kNumTiles, kNumTiles, kTileSize, kTileSize>,
Sequence<12, 2, 6, 1>
@@ -243,7 +238,6 @@ The im2col transformation converts the 4D windows tensor into a 2D matrix suitab
>;

// Step 2: Apply merge transforms to create 2D im2col layout
-// See :ref:`ck_tile_transforms` for transform operations
using Im2colDescriptor = decltype(
transform_tensor_descriptor(
WindowsDescriptor{},
@@ -312,7 +306,6 @@ Combining all components into an optimized convolution implementation:
>;

// Tile distribution for matrix multiplication
-// See :ref:`ck_tile_tile_distribution` for details
using ATileDist = TileDistribution<
Sequence<TileM, TileK>,
Sequence<BlockM, 1>
@@ -327,7 +320,6 @@
>;

// Thread-local accumulator
-// See :ref:`ck_tile_static_distributed_tensor`
StaticDistributedTensor<DataType, CTileDist> c_accumulator;

// Initialize accumulator
@@ -339,7 +331,6 @@
// Main GEMM loop over K dimension
for (index_t k_tile = 0; k_tile < PatchSize; k_tile += TileK) {
// Create tile windows for im2col matrix and kernel
-// See :ref:`ck_tile_tile_window` for window operations
auto a_window = make_tile_window<ATileDist>(
input, Im2colDesc{H, W, K},
{blockIdx.y * TileM, k_tile}
@@ -350,7 +341,7 @@
{k_tile, 0}
);

-// Load tiles - see :ref:`ck_tile_load_store_traits` for optimization
+// Load tiles
auto a_tile = a_window.load();
auto b_tile = b_window.load();

@@ -476,7 +467,6 @@ CK Tile enables several optimizations for convolution:
__shared__ float smem_b[TileK][TileN];

// Collaborative loading with proper bank conflict avoidance
-// See :ref:`ck_tile_lds_bank_conflicts` for optimization
auto load_tile_to_smem = [&](auto& window, float smem[][TileK]) {
#pragma unroll
for (index_t i = threadIdx.y; i < TileM; i += blockDim.y) {
@@ -560,7 +550,7 @@ This example demonstrates how CK Tile transforms convolution from a memory-bound

- **Sliding windows** can be efficiently represented using tensor descriptors with appropriate strides
- **Im2col transformation** converts convolution to matrix multiplication without data copies
-- **Tile distribution** enables optimal work distribution across GPU threads (see :ref:`ck_tile_tile_distribution`)
+- **Tile distribution** enables optimal work distribution across GPU threads (see :ref:`ck_tile_distribution`)
- **Multi-channel support** extends naturally through higher-dimensional descriptors
- **Performance optimizations** like vectorization and shared memory are seamlessly integrated (see :ref:`ck_tile_gemm_optimization` for similar techniques)

2 changes: 1 addition & 1 deletion docs/conceptual/ck_tile/coordinate_movement.rst
@@ -317,7 +317,7 @@ Movement Through Adaptors
Advanced Movement Patterns
==========================

-Real-world applications use advanced movement patterns for optimal memory access. These patterns often relate to :ref:`ck_tile_tile_window` operations and :ref:`ck_tile_tile_distribution` concepts:
+Real-world applications use advanced movement patterns for optimal memory access. These patterns often relate to :ref:`ck_tile_tile_window` operations and :ref:`ck_tile_distribution` concepts:

Tiled Access Pattern
--------------------
24 changes: 12 additions & 12 deletions docs/conceptual/ck_tile/descriptors.rst
@@ -315,18 +315,18 @@ Padding for Convolution

.. code-block:: cpp

-// Add padding to spatial dimensions
-auto padded = transform_tensor_descriptor(
-    input_tensor,
-    make_tuple(
-        make_pass_through_transform(N),       // Batch
-        make_pass_through_transform(C),       // Channel
-        make_pad_transform(H, pad_h, pad_h),  // Height
-        make_pad_transform(W, pad_w, pad_w)   // Width
-    ),
-    make_tuple(sequence<0>{}, sequence<1>{}, sequence<2>{}, sequence<3>{}),
-    make_tuple(sequence<0>{}, sequence<1>{}, sequence<2>{}, sequence<3>{})
-);
+   // Add padding to spatial dimensions
+   auto padded = transform_tensor_descriptor(
+       input_tensor,
+       make_tuple(
+           make_pass_through_transform(N),       // Batch
+           make_pass_through_transform(C),       // Channel
+           make_pad_transform(H, pad_h, pad_h),  // Height
+           make_pad_transform(W, pad_w, pad_w)   // Width
+       ),
+       make_tuple(sequence<0>{}, sequence<1>{}, sequence<2>{}, sequence<3>{}),
+       make_tuple(sequence<0>{}, sequence<1>{}, sequence<2>{}, sequence<3>{})
+   );

For a complete convolution example, see :ref:`ck_tile_convolution_example`.
