
Initial TBR chapter. #338

Open
gpx1000 wants to merge 5 commits into KhronosGroup:main from gpx1000:TBR-and-IBR

Conversation

@gpx1000
Contributor

@gpx1000 gpx1000 commented Aug 4, 2025

NB, fix the TBR link to the Simple Game Engine tutorial when it is published.

@cforfang

cforfang commented Aug 5, 2025

Can I ask if a lot of this was written by AI? I'm very surprised by a lot of the text. Also the notion that one might as a SW developer 'choose' between TBR vs IMR, and have code trying to determine what to pick (ref the PowerConsumptionAnalyzer code) is very strange to me.

@gpx1000
Contributor Author

gpx1000 commented Aug 5, 2025

Parts were written by AI, specifically the power consumption analyzer code as that is outside my normal wheelhouse; but looking at the references they looked solid and I edited it to make it read mostly correct to me. So, genesis by AI sure, but heavy human editing.

@cforfang

cforfang commented Aug 5, 2025

I'm not able to give point-by-point feedback, but for example the VK_EXT_robustness2 section seems like total nonsense to me. And the claims about use of VK_KHR_dynamic_rendering_local_read by Unity and Unreal are, as far as I know, also not true. As I scroll through the guide there are in general a lot of strange claims and commentary, I think.

@gpx1000
Contributor Author

gpx1000 commented Aug 5, 2025

Okay, no worries, I'll rewrite it.

@cforfang

cforfang commented Aug 5, 2025

Probably not too useful to drop more 'random' drive-by comments like this, but for another example I think none of the use-cases mentioned for VK_EXT_shader_tile_image (bloom, edge-detection, FXAA, SSR) make sense, as the extension only gives access to the current pixel while all of these effects need access to other pixels.

FWIW I've pinged some folks here at Arm to see if we can help review and support development of the guide -- I think it's a great initiative to be clear, but it probably needs some close review, especially as there is not too much good and up-to-date public info about current mobile GPUs to pull from (hence also why the idea of the guide is good, of course) :)

@gpx1000
Contributor Author

gpx1000 commented Aug 5, 2025

Thanks, it's MUCH appreciated. I'm far from the best expert at TBR, and I really want to get updated information out there. There's a reason I read all of the research articles linked and tried to put as much research into this chapter as I could. If we could get more details and more review, I'm much happier. As soon as I get a chance, I'm going to update from the comments already generated here.

@solidpixel

solidpixel commented Aug 5, 2025

The chapter title is Tile-Based Rendering Best Practices, but most of what it talks about is nothing to do with tile-based rendering but related to other aspects of vendor-specific implementation detail or orthogonal mobile GPU issues (constant registers, coherent memory, thermal, etc). For a Vulkan guide I'd probably split this up - having a topic focused only on the effects of being tile based is useful and the rest is somewhat a distraction.

The most important things for tilers (good use of loadOp/storeOp) seems to be buried right at the end, and the second most important (good use of pipeline barriers to get pipelining) isn't mentioned at all.

@SaschaWillems
Collaborator

Not that much of a hardware guy, but isn't lazily allocated memory / transient attachments an important Vulkan concept for TBRs? If so, it might be good to add that.
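For readers unfamiliar with the pattern, a transient attachment is typically set up along these lines (a sketch against the core Vulkan API; the format, extent, and usage combination here are placeholder assumptions, not text from the chapter):

```c
// Sketch: a transient color attachment that ideally never leaves tile memory.
// TRANSIENT_ATTACHMENT tells the driver the image is only used within a
// render pass; a LAZILY_ALLOCATED memory type lets the implementation skip
// backing it with real memory on tilers.
VkImageCreateInfo imageInfo = {
    .sType       = VK_STRUCTURE_TYPE_IMAGE_CREATE_INFO,
    .imageType   = VK_IMAGE_TYPE_2D,
    .format      = VK_FORMAT_R8G8B8A8_UNORM,     /* placeholder format */
    .extent      = { width, height, 1 },          /* placeholder extent */
    .mipLevels   = 1,
    .arrayLayers = 1,
    .samples     = VK_SAMPLE_COUNT_1_BIT,
    .tiling      = VK_IMAGE_TILING_OPTIMAL,
    .usage       = VK_IMAGE_USAGE_COLOR_ATTACHMENT_BIT |
                   VK_IMAGE_USAGE_TRANSIENT_ATTACHMENT_BIT |
                   VK_IMAGE_USAGE_INPUT_ATTACHMENT_BIT, /* readable via subpassLoad */
};
// When allocating, prefer a memory type that reports
// VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT, if one exists.
```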

@SaschaWillems
Collaborator

And I second the remarks about the power consumption part of that chapter. I tried to understand the code and data, but felt kinda lost. Wouldn't stuff like that require querying vendor-specific APIs to get real-world power usage? I didn't see that mentioned anywhere.

@SaschaWillems
Collaborator

SaschaWillems commented Aug 5, 2025

Also some of the links don't point to anything useful, e.g. these:

Imagination PowerVR Architecture Guide: Shows tile memory providing 10-20x bandwidth compared to external memory

Qualcomm Adreno Performance Guide: Demonstrates GMEM (tile memory) efficiency in mobile gaming scenarios

NVIDIA Tegra TBR Analysis: Research paper showing 60% power reduction through bandwidth optimization

IEEE Computer Graphics and Applications: Tile-Based Rendering analysis and improvements research

IEEE Transactions on Computers: Thermal management in mobile graphics processing research

They either point to or redirect to a (company) landing page instead of the linked "research papers" or documents.

@SaschaWillems
Collaborator

And other links don't make sense, e.g. this:

Vulkan-Hpp: Modern C++ bindings with TBR optimization examples

That links to the Vulkan-Hpp headers, I don't see why or how that relates to TBR optimizations?

@gpx1000
Contributor Author

gpx1000 commented Aug 5, 2025

I'm going to rewrite this. Sorry not ready for prime time.

@ZehuiLin-Huawei

Huawei Maleoon GPU Guide: Maleoon GPU Rendering Optimization


- **Attachment Configuration**: Final attachments use `VK_ATTACHMENT_STORE_OP_STORE`, intermediate attachments use `VK_ATTACHMENT_STORE_OP_DONT_CARE`
- **Load Operations**: Use `VK_ATTACHMENT_LOAD_OP_CLEAR` for new content, `VK_ATTACHMENT_LOAD_OP_DONT_CARE` for intermediate results
- **MSAA Efficiency**: TBR handles 4x MSAA efficiently due to tile memory resolve capabilities


Not sure I'd call out 4x specifically - makes it sound like you should prefer it over 2x, or 8x for example - which I'd not say is generic advice. Though tile memory resolve can be a good source of performance gain if you are going to be using MSAA.

Contributor Author


Removed that advice to ensure it's clear.


**Tile Memory Management Strategies:**

- **Memory Calculation**: Typical tile memory 512KB, calculate usage based on tile size (32x32 pixels), format size, and sample count


This calculation is not easy to perform for a developer as the determinations are not quite as simple as that. Different formats might not be stored in tile memory in the way you might naively expect. Also how MSAA affects tile size is also not widely documented.

Contributor Author


Good call, removed

=== Half-Precision Float Optimization

Using half-precision floats in shaders can speed up execution and reduce bandwidth on mobile TBR devices. Use low-precision numbers in fragment and compute shaders when visual quality permits:


Might be worth pointing out that mediump should be checked on as many devices as possible, as it acts as something of a hint - using mediump and testing only on one device that may under the hood still be using F32 can be very misleading and lead to visual issues on devices actually employing mediump.

I'd still recommend using it whenever possible, but it might be a worthwhile note/pointer.

Contributor Author


Removed it to ensure the information I have in here is correct and something I've been able to verify. If you recommend adding it back, I'd be happy to.

…ines, and implementation-agnostic practices.
**Bandwidth Optimization Strategies:**

- **Attachment configuration**: Final attachments use `VK_ATTACHMENT_STORE_OP_STORE`; intermediate attachments use `VK_ATTACHMENT_STORE_OP_DONT_CARE` when you do not need the results.
- **Load operations**: Use `VK_ATTACHMENT_LOAD_OP_CLEAR` for new content; `VK_ATTACHMENT_LOAD_OP_DONT_CARE` for intermediate results you overwrite.


I am not sure here. I think that the loadOp can also be set to dont_care when rendering opaque objects, even for new content.


**Advanced TBR considerations:**

- Use subpasses and `VK_DEPENDENCY_BY_REGION_BIT` to enable local data reuse where beneficial; always measure on target devices.


I think it's worth mentioning here the subpassLoad operator to read pixel value from tile memory.
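For reference, reading a pixel written by an earlier subpass from tile memory looks like this in GLSL (a sketch; the binding numbers and the invert operation are assumptions for illustration):

```glsl
// Fragment shader of a later subpass: read the value this pixel was given
// by an earlier subpass directly from tile memory, without a texture fetch.
layout(input_attachment_index = 0, set = 0, binding = 0)
uniform subpassInput prevColor;

layout(location = 0) out vec4 outColor;

void main() {
    vec4 c = subpassLoad(prevColor); // only the current pixel is accessible
    outColor = vec4(1.0) - c;        // example: invert the previous result
}
```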


- No explicit on-chip tile memory model exposed to applications.
- Overdraw tends to generate more external memory traffic than on tilers; minimizing overdraw is important.
- Applications should rely on standard Vulkan techniques (early depth/stencil, appropriate load/store ops, and subpasses where helpful) and profile on target devices.


I am seeing "profile on target devices", "measure on target devices", "profiling results on target hardware" many times in this documentation. These kinds of redundant phrases should be cleaned up.

@ZehuiLin-Huawei

Currently the information is scattered in various corners, and the same information appears a few times, including things like "profile on target devices" or "Tile size not exposed by core Vulkan".
The documentation structure could be improved by establishing a main line of reasoning and developing the content within the framework of that logic. The current version does not seem to be really useful for developers.

@marty-johnson59 marty-johnson59 added the "Priority: Important to do this as soon as possible" label Jan 22, 2026
…ation, attachment management, and mobile-specific techniques
== Understanding Tiler Architectures

Mobile GPUs operate in power-constrained environments, which makes bandwidth efficiency critical.
Since Vulkan hides many of the internal hardware details—like the exact size of a tile or how the GPU schedules work—the best way to optimize is to provide the driver with clear "intent."


Why quotes around "intent"? Doesn't seem to need them for this usage.

Tilers usually process geometry twice: once to determine which triangles fall into which tiles (binning), and a second time to actually render the pixels.
To speed up the binning pass, consider storing your vertex positions in a separate buffer from other attributes like UVs or normals.
This allows the GPU to read only the data it needs to calculate tile coverage, significantly reducing unnecessary bandwidth.
Ideally, position data should be stored as `highp` to ensure accuracy during this critical phase.
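The split described above can be expressed with two vertex bindings (a sketch; the exact attribute layout is an assumption for illustration):

```c
// Binding 0: positions only. The binning pass touches just this stream.
// Binding 1: everything else (here normal + UV), read only when shading.
VkVertexInputBindingDescription bindings[2] = {
    { .binding = 0, .stride = 3 * sizeof(float),
      .inputRate = VK_VERTEX_INPUT_RATE_VERTEX },
    { .binding = 1, .stride = 5 * sizeof(float),   /* normal (3) + uv (2) */
      .inputRate = VK_VERTEX_INPUT_RATE_VERTEX },
};
VkVertexInputAttributeDescription attrs[3] = {
    { .location = 0, .binding = 0, .format = VK_FORMAT_R32G32B32_SFLOAT,
      .offset = 0 },                               /* position */
    { .location = 1, .binding = 1, .format = VK_FORMAT_R32G32B32_SFLOAT,
      .offset = 0 },                               /* normal */
    { .location = 2, .binding = 1, .format = VK_FORMAT_R32G32_SFLOAT,
      .offset = 3 * sizeof(float) },               /* uv */
};
```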


Disagree that input position needs to be "highp".

The actual calculation in the shader definitely needs to be done as fp32, but vertex position data only needs to be precise enough in memory to maintain sufficient accuracy in the model object-space coordinates. Using unorm16 coordinates for a typical 10-meter real-world-equivalent model gives 0.15 mm quantization accuracy, which is plenty good enough for most use cases.
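The figure quoted here is easy to verify: a 16-bit unorm coordinate over a 10 m extent quantizes to about 0.15 mm per step (a quick sanity check on the reviewer's arithmetic, not part of the guide):

```python
# Quantization step of unorm16 vertex positions over a 10-meter object space.
object_extent_m = 10.0
steps = 2**16 - 1                      # unorm16 encodes 65535 equal steps
step_mm = object_extent_m / steps * 1000.0
print(round(step_mm, 3))               # prints 0.153
```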


Your primary tool for controlling bandwidth is the render pass attachment configuration.
Whether you are using traditional `VkRenderPass` objects or the modern `VK_KHR_dynamic_rendering` extension, the principles are the same.
The `loadOp` and `storeOp` settings are not just "cleanup" steps; they are direct instructions to the hardware.


why quotes around cleanup?

If you know you are going to completely overwrite the tile's contents—for example, by rendering opaque geometry that covers the entire screen—you can use `VK_ATTACHMENT_LOAD_OP_DONT_CARE`.
This tells the GPU it doesn't need to waste time loading the previous frame's data from memory or performing a clear.

Similarly, use `VK_ATTACHMENT_STORE_OP_DONT_CARE` for any attachment you don't need after the pass is finished.
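The two rules can be sketched as follows (formats and layouts are placeholder assumptions):

```c
// Color attachment that is fully overwritten each frame and then presented:
// skip the load, keep the store.
VkAttachmentDescription color = {
    .format        = VK_FORMAT_B8G8R8A8_UNORM,     /* placeholder */
    .samples       = VK_SAMPLE_COUNT_1_BIT,
    .loadOp        = VK_ATTACHMENT_LOAD_OP_DONT_CARE, /* no readback from memory */
    .storeOp       = VK_ATTACHMENT_STORE_OP_STORE,    /* needed after the pass */
    .initialLayout = VK_IMAGE_LAYOUT_UNDEFINED,
    .finalLayout   = VK_IMAGE_LAYOUT_PRESENT_SRC_KHR,
};

// Depth attachment that is only meaningful within the pass:
// clear on load, discard on store; it never reaches external memory.
VkAttachmentDescription depth = {
    .format        = VK_FORMAT_D32_SFLOAT,
    .samples       = VK_SAMPLE_COUNT_1_BIT,
    .loadOp        = VK_ATTACHMENT_LOAD_OP_CLEAR,
    .storeOp       = VK_ATTACHMENT_STORE_OP_DONT_CARE, /* not needed afterwards */
    .initialLayout = VK_IMAGE_LAYOUT_UNDEFINED,
    .finalLayout   = VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL,
};
```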


or use STORE_OP_NONE if you want to indicate "don't write", but also don't want to logically discard the existing content in memory.

This tells the GPU it doesn't need to waste time loading the previous frame's data from memory or performing a clear.

Similarly, use `VK_ATTACHMENT_STORE_OP_DONT_CARE` for any attachment you don't need after the pass is finished.
Depth buffers and multisampled "resolve" sources are the most common candidates here.

@solidpixel solidpixel Mar 9, 2026


Mentioning resolve here is likely going to make this more confusing to read because it immediately makes you think of resolve attachments, whereas this is about the other kind of attachment. Just "depth attachments, stencil attachments, and multisampled color attachments"?

This extension simplifies your code by removing the need for render pass and framebuffer objects, but the hardware logic remains identical.
You must remain disciplined about your load and store operations to avoid performance regressions.
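With dynamic rendering the same load/store discipline is expressed per attachment (a sketch; `cmd`, `colorView`, `width`, and `height` are placeholders):

```c
// The loadOp/storeOp fields move into VkRenderingAttachmentInfo,
// but they mean exactly the same thing to the tiler.
VkRenderingAttachmentInfoKHR colorAttachment = {
    .sType       = VK_STRUCTURE_TYPE_RENDERING_ATTACHMENT_INFO_KHR,
    .imageView   = colorView,                      /* placeholder */
    .imageLayout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL,
    .loadOp      = VK_ATTACHMENT_LOAD_OP_DONT_CARE, /* content fully overwritten */
    .storeOp     = VK_ATTACHMENT_STORE_OP_STORE,
};
VkRenderingInfoKHR renderingInfo = {
    .sType                = VK_STRUCTURE_TYPE_RENDERING_INFO_KHR,
    .renderArea           = { {0, 0}, {width, height} }, /* placeholders */
    .layerCount           = 1,
    .colorAttachmentCount = 1,
    .pColorAttachments    = &colorAttachment,
};
vkCmdBeginRenderingKHR(cmd, &renderingInfo);
```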

When using traditional render passes, try to structure them so that the driver can "merge" subpasses.


This is a bit of a throw away paragraph that really isn't explained anywhere. Subpass merging could be worth a bit more of a treatment than this, although obviously a bit of an evolutionary dead end.

[[overdraw-and-sorting]]
=== Managing Overdraw and Depth Logic

Overdraw is one of the biggest bottlenecks on mobile GPUs.


Overdraw cost has nothing to do with being tile based.

It's a problem on less performant mobile GPUs because the GPU is less powerful, so lots of blended overdraw means you can more easily become fragment bound, but that's nothing to do with the GPU being tile based. Delete?


Overdraw is one of the biggest bottlenecks on mobile GPUs.
Even though writes are deferred, executing a fragment shader multiple times for the same pixel consumes valuable execution unit (EU) cycles and power.
Sorting your opaque objects front-to-back is the most effective way to combat this.


As above, nothing to do with being tile based, so feels out of place.

* Avoid using `discard` or writing to `gl_FragDepth` in your shaders unless absolutely necessary, as these operations can force the GPU to disable "early" depth testing and wait for the fragment shader to finish before it can determine visibility.

[[shader-concurrency]]
=== Shader Complexity and Concurrency


Also nothing to do with being tile based. Delete?



I would personally keep this section with some changes, assuming we are focusing on TBR and mobile GPUs. However, I have no problem deleting it if we want to focus exclusively on TBR optimizations.

Complex shaders are more likely to cause register spilling on mobile GPUs, but this is somewhat orthogonal to tile-based rendering.

Note that this recommendation also applies to desktop GPUs.


Mobile optimization is a whole different topic (and huge) - the Arm best practices guide is > 100 pages because it's a hard problem. Keep this on topic, and add other parallel topics on other best practices if enough vendors agree on them.



I agree it is better to keep them as separate topics, but not a strong opinion

This version already has mobile optimizations orthogonal to TBR, so I am not sure if it is intended.

[[precision-and-prefetch]]
=== Precision and Texture Optimization

Using `mediump` (16-bit) instead of `highp` (32-bit) in your shaders is a classic mobile optimization.
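In GLSL the optimization is simply a precision qualifier (a sketch; which math actually survives 16 bits is application-dependent, and as noted elsewhere in this thread, mediump is only a hint on some devices):

```glsl
precision highp float;   // keep the default high for safety

// Color math usually tolerates 16-bit precision well; position and
// depth-related math usually does not.
mediump vec3 applyTint(mediump vec3 color, mediump vec3 tint) {
    return color * tint;
}
```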


Also nothing to do with being tile based. Delete?



Similar, I would keep it with some changes.

I think this is orthogonal to tilers, but it can still benefit bandwidth, so using mediump is recommended on mobile GPUs.

I personally prefer explicit types such as float16.

Note that some desktop GPUs can also benefit from float16.


Keep it. As a different topic parallel to this one.

To keep the hardware busy, you want these two stages to overlap as much as possible—the GPU should be binning the next frame while it is still shading the current one.

Incorrect use of pipeline barriers can break this overlap.
If you use a barrier that is too broad—like `VK_PIPELINE_STAGE_ALL_COMMANDS_BIT`—you might force the GPU to finish all pending fragment work before it can even start the binning pass for the next set of draws.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure how this renders - is this missing spaces around the "—" in "broad—like", etc.? (This occurs in other places in the doc too.)

@iagoCL

iagoCL commented Mar 10, 2026

A bit orthogonal to tilers, but it might be useful to explain that compute shaders can have lower effective bandwidth compared to the vertex and fragment stages on mobile GPUs.

Developers may want to move work to fragment and avoid enabling USAGE_STORAGE on images that do not need it, as this can disable compression: https://developer.arm.com/documentation/101897/0304/Buffers-and-textures/AFBC-textures-for-Vulkan
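A minimal sketch of the advice in the linked Arm documentation: request only the usage bits an image actually needs (the format and the rest of the create info are placeholders):

```c
// Adding VK_IMAGE_USAGE_STORAGE_BIT "just in case" can force the driver
// to disable framebuffer compression (e.g. AFBC on Mali), so list only
// the usages this texture really has.
VkImageCreateInfo texInfo = {
    .sType = VK_STRUCTURE_TYPE_IMAGE_CREATE_INFO,
    /* imageType, format, extent, mipLevels, etc. as usual */
    .usage = VK_IMAGE_USAGE_SAMPLED_BIT |
             VK_IMAGE_USAGE_TRANSFER_DST_BIT,
    // no VK_IMAGE_USAGE_STORAGE_BIT unless a compute pass writes this image
};
```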

This extension provides improved robustness when dangerous undefined behavior occurs, such as out-of-bounds array access. This is particularly important for TBR architectures where tile memory constraints can make buffer overruns more problematic.

**Mobile developer guidance:**
Mobile developers are strongly encouraged to use VK_EXT_robustness2 when targeting TBR GPUs, as tile memory constraints make out-of-bounds access more likely to cause visible artifacts or crashes.


No.

Mobile developers targeting Mali GPUs are encouraged not to use bounds checking.

Robust buffer access is a debugging feature, and we recommend it be enabled only temporarily to investigate application crashes or visual artifacts. Enabling it in production will negatively impact performance.

Enabling bounds checking causes loss in performance for accesses to uniform buffers and shader storage buffers.

https://developer.arm.com/documentation/101897/0304/Buffers-and-textures/Robust-buffer-access?lang=en
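That recommendation could be expressed as a build-time switch (a sketch; `ENABLE_GPU_DEBUG` is a hypothetical flag, not a real define):

```c
// Keep robustBufferAccess off in production builds; enable it only in
// debug builds to chase crashes or visual artifacts.
VkPhysicalDeviceFeatures features = {0};
#ifdef ENABLE_GPU_DEBUG              /* hypothetical project flag */
features.robustBufferAccess = VK_TRUE;
#endif
// Pass `features` via VkDeviceCreateInfo::pEnabledFeatures at device creation.
```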

* **IMR GPUs** typically process triangles and write the resulting fragments to memory almost immediately. They rely on high-bandwidth memory and large caches to handle the traffic. Overdraw on an IMR is expensive because every pixel written potentially triggers a memory write.
* **TBR GPUs** defer those writes. By "binning" the geometry and processing by tile, they can perform many operations—like blending and depth testing—entirely within the tile memory. The memory write only happens once the tile is finished.

While you shouldn't try to build a renderer that switches between TBR and IMR logic at runtime, understanding the difference helps you write code that is efficient for both. Good attachment management and avoiding unnecessary overdraw benefit every architecture, but they are absolutely essential for performance on a tiler.


Nitpick:
I am not sure what "build a renderer that switches between TBR and IMR logic at runtime" means.

Note that some GPUs can change between IMR and TBR.

I would change the sentence to something like:
It is not necessary to write a custom path for TBR and IMR GPUs. In general, understanding how a tiler works will help you write code that is efficient for both architectures.


Overdraw is one of the biggest bottlenecks on mobile GPUs.
Even though writes are deferred, executing a fragment shader multiple times for the same pixel consumes valuable execution unit (EU) cycles and power.
Sorting your opaque objects front-to-back is the most effective way to combat this.


Sorting your opaque objects front-to-back is the most effective way to combat this.

This was recommended on mobile for longer than on PC. While it is relevant for older GPUs, it is no longer always recommended for newer architectures.

See: https://developer.arm.com/community/arm-community-blogs/b/mobile-graphics-and-gaming-blog/posts/immortalis-g925-the-fragment-prepass


[[synchronization-and-subpasses]]
=== Synchronization and Pipeline Flow

Frequent synchronization points—like calling `vkQueueWaitIdle`—can cause the GPU to stall while waiting for the CPU, or vice versa.


Is this tiler specific? I think it's a best practice for all GPUs?

Incorrect use of pipeline barriers can break this overlap.
If you use a barrier that is too broad—like `VK_PIPELINE_STAGE_ALL_COMMANDS_BIT`—you might force the GPU to finish all pending fragment work before it can even start the binning pass for the next set of draws.
Instead, use the most specific stages and access masks possible.
For example, if a compute shader produces data for a vertex buffer, the barrier should only synchronize the compute stage with the vertex input stage.
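That example can be sketched against the core barrier API (`cmd` and `vertexBuffer` are placeholders):

```c
// A compute pass writes a vertex buffer; only the vertex-input stage of
// subsequent draws needs to wait. Fragment work from earlier draws can
// keep running, preserving the binning/shading overlap.
VkBufferMemoryBarrier barrier = {
    .sType               = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER,
    .srcAccessMask       = VK_ACCESS_SHADER_WRITE_BIT,
    .dstAccessMask       = VK_ACCESS_VERTEX_ATTRIBUTE_READ_BIT,
    .srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
    .dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
    .buffer              = vertexBuffer,          /* placeholder */
    .offset              = 0,
    .size                = VK_WHOLE_SIZE,
};
vkCmdPipelineBarrier(cmd,
    VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,   /* producer stage only */
    VK_PIPELINE_STAGE_VERTEX_INPUT_BIT,     /* consumer stage only */
    0, 0, NULL, 1, &barrier, 0, NULL);
```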


I would mention that `ALL_GRAPHICS` should generally be avoided. It's better to separate VERTEX_SHADER_BIT and FRAGMENT_SHADER_BIT, and mark resources according to the stages that actually use them.
