9 changes: 4 additions & 5 deletions docs/conceptual/ck_tile/CK-tile-index.rst
@@ -1,14 +1,13 @@
 .. _ck_tile_index:
 
-************************
-CK Tile Index
-************************
-
-CK Tile documentation structure:
+****************************************************
+CK Tile conceptual documentation table of contents
+****************************************************
 
 .. toctree::
    :maxdepth: 2
 
+   index
    introduction_motivation
    buffer_views
    tensor_views
156 changes: 0 additions & 156 deletions docs/conceptual/ck_tile/MERMAID_DIAGRAMS.md

This file was deleted.

5 changes: 2 additions & 3 deletions docs/conceptual/ck_tile/adaptors.rst
@@ -59,8 +59,8 @@ A TensorAdaptor encapsulates a sequence of :ref:`coordinate transformations <ck_
.. image:: diagrams/adaptors_1.svg
:alt: Diagram
:align: center
-Core Components
+
Core Components
~~~~~~~~~~~~~~~

Each TensorAdaptor contains:
@@ -115,7 +115,7 @@ Custom adaptors can be created by specifying which transforms to use and how the
make_tuple(sequence<0>{}) // to single dim 0
);

-// The adaptor is embedded in the :ref:`descriptor <ck_tile_descriptors>`
+// The adaptor is embedded in the descriptor
// To use it:
multi_index<1> top_coord{5}; // 1D coordinate
// This internally calculates: row = 5/3 = 1, col = 5%3 = 2
@@ -309,7 +309,6 @@ A practical example showing how adaptors create efficient :ref:`GPU memory acces
// - Dimension 0,1: Thread indices
// - Dimension 2,3: Vector indices within thread
// Enables coalesced memory access on GPU
-// See :ref:`ck_tile_thread_mapping` for thread mapping details

Common Transform Chains
-----------------------
20 changes: 20 additions & 0 deletions docs/conceptual/ck_tile/buffer_views.rst
@@ -1,6 +1,25 @@
.. _ck_tile_buffer_views:

**********************************
Buffer Views - Raw Memory Access
**********************************

+Overview
+--------
+
+At the foundation of the CK Tile system lies BufferView, a compile-time abstraction that provides structured access to raw memory regions within GPU kernels. It serves as the bridge between the hardware's physical memory model and the higher-level abstractions that enable efficient GPU programming. BufferView encapsulates the complexity of GPU memory hierarchies while exposing a unified interface that works seamlessly across different memory address spaces: global memory shared across the entire device, local data share (LDS) memory shared within a workgroup, and the registers private to each thread.
+
+BufferView is the foundation for :ref:`ck_tile_tensor_views`, which add multi-dimensional structure on top of raw memory access. Understanding BufferView is essential before moving on to more complex abstractions like :ref:`ck_tile_distribution` and :ref:`ck_tile_tile_window`.
+
+By providing compile-time knowledge of buffer properties through template metaprogramming, BufferView enables the compiler to generate optimal machine code for each specific use case. This zero-overhead abstraction ensures that the convenience of a high-level interface comes with no runtime performance penalty.
+
+One of BufferView's most important features is its handling of out-of-bounds memory access. Unlike CPU programming, where such accesses typically result in segmentation faults or undefined behavior, GPU programming must gracefully handle cases where threads attempt to access memory beyond allocated boundaries. BufferView provides configurable strategies for these scenarios: developers can choose to return either zero or a custom sentinel value for invalid accesses. This flexibility is important for algorithms that naturally extend beyond data boundaries, such as convolutions with padding or matrix operations with non-aligned dimensions.
+
+The abstraction extends beyond simple memory access to encompass both scalar and vector data types. GPUs achieve their highest efficiency when loading or storing multiple data elements in a single instruction. BufferView supports these vectorized operations, automatically selecting the appropriate hardware instructions based on the data type and access pattern. This capability transforms what would be multiple memory transactions into single, efficient operations that fully utilize the available memory bandwidth.
+
+BufferView also incorporates AMD GPU-specific optimizations that leverage unique hardware features. The AMD buffer addressing mode, for instance, provides hardware-accelerated bounds checking that ensures memory safety without the performance overhead of software-based checks. Similarly, BufferView exposes atomic operations that are crucial for parallel algorithms requiring thread-safe updates to shared data structures. These hardware-specific optimizations are abstracted behind a portable interface, ensuring that code remains maintainable while achieving optimal performance.
+
+Memory coherence and caching policies represent another layer of complexity that BufferView manages transparently. Different GPU memory spaces have different coherence guarantees and caching behaviors: global memory accesses can be cached in the L1 and L2 caches with various coherence protocols, while LDS provides workgroup-level coherence with specialized banking structures (see :ref:`ck_tile_lds_bank_conflicts` for details on avoiding bank conflicts). BufferView encapsulates these details, automatically applying the appropriate memory ordering constraints and cache control directives based on the target address space and operation type.
+
+Address Space Usage Patterns
+----------------------------
@@ -51,6 +70,7 @@ Address Space Usage Patterns
.. image:: diagrams/buffer_views_1.svg
:alt: Diagram
:align: center

C++ Implementation
------------------

14 changes: 2 additions & 12 deletions docs/conceptual/ck_tile/convolution_example.rst
@@ -59,10 +59,6 @@ The key insight is that convolution can be transformed from a complex nested loo



-.. image:: diagrams/convolution_example.svg
-   :alt: Diagram
-   :align: center
-
.. image:: diagrams/convolution_example.svg
:alt: Diagram
:align: center
@@ -88,7 +84,6 @@ Non-overlapping tiles:

// Original matrix: shape=(6, 6), strides=(6, 1)
// Tiled view: shape=(3, 3, 2, 2), strides=(12, 2, 6, 1)
-// See :ref:`ck_tile_descriptors` for descriptor details
using TileDescriptor = TensorDescriptor<
Sequence<kNumTiles, kNumTiles, kTileSize, kTileSize>,
Sequence<12, 2, 6, 1>
@@ -243,7 +238,6 @@ The im2col transformation converts the 4D windows tensor into a 2D matrix suitab
>;

// Step 2: Apply merge transforms to create 2D im2col layout
-// See :ref:`ck_tile_transforms` for transform operations
using Im2colDescriptor = decltype(
transform_tensor_descriptor(
WindowsDescriptor{},
@@ -312,7 +306,6 @@ Combining all components into an optimized convolution implementation:
>;

// Tile distribution for matrix multiplication
-// See :ref:`ck_tile_tile_distribution` for details
using ATileDist = TileDistribution<
Sequence<TileM, TileK>,
Sequence<BlockM, 1>
@@ -327,7 +320,6 @@
>;

// Thread-local accumulator
-// See :ref:`ck_tile_static_distributed_tensor`
StaticDistributedTensor<DataType, CTileDist> c_accumulator;

// Initialize accumulator
@@ -339,7 +331,6 @@
// Main GEMM loop over K dimension
for (index_t k_tile = 0; k_tile < PatchSize; k_tile += TileK) {
// Create tile windows for im2col matrix and kernel
-// See :ref:`ck_tile_tile_window` for window operations
auto a_window = make_tile_window<ATileDist>(
input, Im2colDesc{H, W, K},
{blockIdx.y * TileM, k_tile}
@@ -350,7 +341,7 @@
{k_tile, 0}
);

-// Load tiles - see :ref:`ck_tile_load_store_traits` for optimization
+// Load tiles
auto a_tile = a_window.load();
auto b_tile = b_window.load();

@@ -476,7 +467,6 @@ CK Tile enables several optimizations for convolution:
__shared__ float smem_b[TileK][TileN];

// Collaborative loading with proper bank conflict avoidance
-// See :ref:`ck_tile_lds_bank_conflicts` for optimization
auto load_tile_to_smem = [&](auto& window, float smem[][TileK]) {
#pragma unroll
for (index_t i = threadIdx.y; i < TileM; i += blockDim.y) {
@@ -560,7 +550,7 @@ This example demonstrates how CK Tile transforms convolution from a memory-bound

- **Sliding windows** can be efficiently represented using tensor descriptors with appropriate strides
- **Im2col transformation** converts convolution to matrix multiplication without data copies
-- **Tile distribution** enables optimal work distribution across GPU threads (see :ref:`ck_tile_tile_distribution`)
+- **Tile distribution** enables optimal work distribution across GPU threads (see :ref:`ck_tile_distribution`)
- **Multi-channel support** extends naturally through higher-dimensional descriptors
- **Performance optimizations** like vectorization and shared memory are seamlessly integrated (see :ref:`ck_tile_gemm_optimization` for similar techniques)

2 changes: 1 addition & 1 deletion docs/conceptual/ck_tile/coordinate_movement.rst
@@ -317,7 +317,7 @@ Movement Through Adaptors
Advanced Movement Patterns
==========================

-Real-world applications use advanced movement patterns for optimal memory access. These patterns often relate to :ref:`ck_tile_tile_window` operations and :ref:`ck_tile_tile_distribution` concepts:
+Real-world applications use advanced movement patterns for optimal memory access. These patterns often relate to :ref:`ck_tile_tile_window` operations and :ref:`ck_tile_distribution` concepts:

Tiled Access Pattern
--------------------
24 changes: 12 additions & 12 deletions docs/conceptual/ck_tile/descriptors.rst
@@ -315,18 +315,18 @@ Padding for Convolution

.. code-block:: cpp

-// Add padding to spatial dimensions
-auto padded = transform_tensor_descriptor(
-    input_tensor,
-    make_tuple(
-        make_pass_through_transform(N),       // Batch
-        make_pass_through_transform(C),       // Channel
-        make_pad_transform(H, pad_h, pad_h),  // Height
-        make_pad_transform(W, pad_w, pad_w)   // Width
-    ),
-    make_tuple(sequence<0>{}, sequence<1>{}, sequence<2>{}, sequence<3>{}),
-    make_tuple(sequence<0>{}, sequence<1>{}, sequence<2>{}, sequence<3>{})
-);
+   // Add padding to spatial dimensions
+   auto padded = transform_tensor_descriptor(
+       input_tensor,
+       make_tuple(
+           make_pass_through_transform(N),       // Batch
+           make_pass_through_transform(C),       // Channel
+           make_pad_transform(H, pad_h, pad_h),  // Height
+           make_pad_transform(W, pad_w, pad_w)   // Width
+       ),
+       make_tuple(sequence<0>{}, sequence<1>{}, sequence<2>{}, sequence<3>{}),
+       make_tuple(sequence<0>{}, sequence<1>{}, sequence<2>{}, sequence<3>{})
+   );

For a complete convolution example, see :ref:`ck_tile_convolution_example`.
