
Conversation

@vlad-lesin
Contributor

In one of the practical cloud MariaDB setups, a server node accesses its
datadir over the network, but also has fast local SSD storage for
temporary data. The content of such temporary storage is lost when the
server container is destroyed.

The commit uses this ephemeral fast local storage (SSD) as an extension of
the portion of the InnoDB buffer pool (DRAM) that caches persistent data
pages. This cache is separate from the persistent storage of data files
and ib_logfile0 and is ignored during backup.

The following system variables were introduced:

innodb_extended_buffer_pool_size - the size of the external buffer pool
file; if it equals 0, the external buffer pool is not used;

innodb_extended_buffer_pool_path - the path to the external buffer pool
file.

If innodb_extended_buffer_pool_size is not equal to 0, the external buffer
pool file will be (re)created on startup.
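
For illustration, the feature would be enabled at server startup roughly
like this (the size and path below are made-up example values, not
defaults from this patch):

  --innodb-extended-buffer-pool-size=2G --innodb-extended-buffer-pool-path=/mnt/fast_ssd/ib_ext_buffer_pool

With innodb_extended_buffer_pool_size=0 the file is not created and the
external buffer pool is not used.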

Only clean pages will be flushed to the external buffer pool file. There is
no need to flush dirty pages, as such pages become clean after flushing and
will then be evicted when they reach the tail of the LRU list.

The general idea of this commit is to flush clean pages to the external
buffer pool file when they are evicted.

A page can be evicted either by a transaction thread or by the background
page cleaner thread. In some cases a transaction thread is waiting for the
page cleaner thread to finish its job. We can't flush to the external
buffer pool file while transaction threads are waiting for eviction;
that would hurt performance. That's why the only case for flushing is
when the page cleaner thread evicts pages in the background and there are
no waiters. For this purpose the buf_pool_t::done_flush_list_waiters_count
variable was introduced; we flush evicted clean pages only if the
variable is zero.

Clean pages are evicted in buf_flush_LRU_list_batch() to keep a certain
number of pages in the buffer pool's free list. That's why we flush only
every second page to the external buffer pool file; otherwise there could
be too few pages in the free list for transaction threads to allocate
buffer pool pages without waiting for the page cleaner. This might not be
a good solution, but it is enough for prototyping.
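
Roughly, the eviction path then contains a gate like the following sketch
(illustrative only, not the actual patch; ext_buf_flush_page() and
n_evicted are hypothetical names):

  /* Inside the eviction loop of buf_flush_LRU_list_batch(): flush only when
  nobody is waiting for the page cleaner, and only every second clean page. */
  if (!buf_pool.done_flush_list_waiters_count && (n_evicted & 1))
    ext_buf_flush_page(bpage);       /* asynchronous write to the external file */
  else
    buf_LRU_free_page(bpage, true);  /* plain eviction to the free list */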

An external buffer pool page type is introduced to record in the buffer
pool page hash that a certain page can be read from the external buffer
pool file. The first several members of such a page must be the same as the
members of an internal page. The external page's frame must be equal to a
certain value to distinguish an external page from an internal one. External
buffer pages are preallocated on startup in an array of external pages. We
could get rid of the frame in the external page and instead check whether
the page's address belongs to that array to distinguish external from
internal pages.
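
A minimal sketch of the shared-prefix idea (simplified; the member names
follow the patch's buf_page_base_t, but the exact types and definition are
in the diff, and EXT_BUF_FRAME is the marker value mentioned above):

  /* Common prefix shared by internal and external page descriptors. */
  struct buf_page_base_t
  {
    page_id_t id_;     /* page identifier */
    buf_page_t *hash;  /* next element in the page hash chain */
    byte *frame;       /* equals EXT_BUF_FRAME for an external page */

    bool external() const noexcept
    { return reinterpret_cast<std::uintptr_t>(frame) == EXT_BUF_FRAME; }
  };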

There are also free and LRU lists of external pages. When an internal page
is chosen to be flushed to the external buffer pool file, a new external
page is allocated either from the head of the external free list or from
the tail of the external LRU list. Both lists are protected by
buf_pool.mutex. That makes sense, because a page is removed from the
internal LRU list during eviction under buf_pool.mutex.
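
As a sketch, the allocation step under buf_pool.mutex might look like this
(the list names ext_free and ext_LRU are illustrative placeholders, not
necessarily the names used in the patch):

  /* Allocate an external page for the clean page being evicted. */
  ext_buf_page_t *ext= UT_LIST_GET_FIRST(buf_pool.ext_free);
  if (ext)
    UT_LIST_REMOVE(buf_pool.ext_free, ext);
  else
  {
    /* No free external page: reuse the least recently used one
    (its old page hash mapping has to be detached as well). */
    ext= UT_LIST_GET_LAST(buf_pool.ext_LRU);
    UT_LIST_REMOVE(buf_pool.ext_LRU, ext);
  }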

Then the internal page is locked and the allocated external page is attached
to the I/O request for the external buffer pool file; when the write request
completes, the internal page is replaced with the external one in the page
hash, the external page is pushed to the head of the external LRU list, and
the internal page is unlocked. After the external page is removed from the
external free list it is not placed in the external LRU list until the write
completes, so the page can't be used by other threads until the write is
completed.

The page hash chain's get-element function has an additional template
parameter that tells the function whether external pages must be ignored. We
don't ignore external pages in the page hash in two cases: when a page is
initialized for a read and when one is reinitialized for creating a new page.
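
A sketch of the lookup shape (illustrative; the actual function name and
signature are in the patch):

  /* Walk a page hash chain; skip external pages unless the caller asked
  for them (reads and new-page creation pass allow_external=true). */
  template<bool allow_external>
  buf_page_t *chain_get(buf_page_t *first, const page_id_t id) noexcept
  {
    for (buf_page_t *p= first; p; p= p->hash)
      if (p->id() == id && (allow_external || !p->external()))
        return p;
    return nullptr;
  }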

When an internal page is initialized for a read and an external page with the
same page id is found in the page hash, the internal page is locked,
the external page is replaced with the newly initialized internal page in the
page hash chain, and the external page is removed from the external LRU list
and attached to an I/O request to the external buffer pool file. When the I/O
request completes, the external page is returned to the external free list
and the internal page is unlocked. So during the read the external page is
absent from both the external LRU and free lists and can't be reused.
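
The completion side of that read could look roughly like this (a sketch
only; ext_free, ext_buf_read_complete() and read_unfix_and_unlock() are
hypothetical names, not from the patch):

  /* Called when the asynchronous read from the external file completes;
  the external page becomes reusable only after the copy has finished. */
  void ext_buf_read_complete(buf_page_t *bpage, ext_buf_page_t *ext)
  {
    mysql_mutex_lock(&buf_pool.mutex);
    UT_LIST_ADD_FIRST(buf_pool.ext_free, ext);  /* back to the external free list */
    mysql_mutex_unlock(&buf_pool.mutex);
    bpage->read_unfix_and_unlock();  /* hypothetical: release the latch taken
                                        when the read was started */
  }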

When an internal page is initialized for creating a new page and an external
page with the same page id is found in the page hash, we just remove the
external page from the page hash chain and the external LRU list and push it
to the head of the external free list, so the external page can be reused for
future flushing.

There is also some magic with watch sentinels. If a watch is going to be
set for an external page with the same page id in the page hash, we replace
the external page with the sentinel in the page hash and attach the external
page to the sentinel's frame. When such a sentinel with an attached external
page should be removed, we replace it with the attached external page in the
page hash instead of just removing the sentinel. This idea is not fully
implemented, as the function that allocates external pages does not
take into account that the page hash can contain sentinels with attached
external pages. In any case, this code must be removed if we cherry-pick
the commit to an 11.* branch, as the change buffer has already been removed
in those versions.

Pages are flushed to and read from the external buffer pool file in the
same manner as they are flushed to their tablespaces, i.e. compressed and
encrypted pages stay compressed and encrypted in the external buffer pool
file.

  • The Jira issue number for this PR is: MDEV-______

Description

TODO: fill description here

Release Notes

TODO: What should the release notes say about this change?
Include any changed system variables, status variables or behaviour. Optionally list any https://mariadb.com/kb/ pages that need changing.

How can this PR be tested?

TODO: modify the automated test suite to verify that the PR causes MariaDB to behave as intended.
Consult the documentation on "Writing good test cases".

If the changes are not amenable to automated testing, please explain why not and carefully describe how to test manually.

Basing the PR against the correct MariaDB version

  • This is a new feature or a refactoring, and the PR is based against the main branch.
  • This is a bug fix, and the PR is based against the earliest maintained branch in which the bug can be reproduced.

PR quality check

  • I checked the CODING_STANDARDS.md file and my PR conforms to this where appropriate.
  • For any trivial modifications to the PR, I am ok with the reviewer making the changes themselves.

@vlad-lesin vlad-lesin requested a review from dr-m January 5, 2026 13:32

@vlad-lesin vlad-lesin force-pushed the 11.8-MDEV-31956-ext_buf_pool branch 2 times, most recently from ee7d993 to bc5b76d Compare January 11, 2026 20:19
Contributor

@dr-m dr-m left a comment

Before I review this deeper, could you please fix the build and the test innodb.ext_buf_pool on Microsoft Windows?

}
else
{
free_page:
Contributor

warning C4102: 'free_page': unreferenced label [C:\buildbot\workers\prod\amd64-windows-packages\build\storage\innobase\innobase.vcxproj]

This label needs to be enclosed in #ifndef DBUG_OFF.
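
That is, something along these lines:

#ifndef DBUG_OFF
free_page:
#endif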

Contributor

Originally, some implementations of the C preprocessor did not allow any white space before the # character. It is customary to start the preprocessor #if and #endif at the first column.

@vlad-lesin vlad-lesin force-pushed the 11.8-MDEV-31956-ext_buf_pool branch 2 times, most recently from 96a3b95 to c13adf1 Compare January 14, 2026 09:20
@vlad-lesin vlad-lesin force-pushed the 11.8-MDEV-31956-ext_buf_pool branch from c13adf1 to 99921cf Compare January 14, 2026 13:41
Fix some tests. Make the ext_buf_pool test more stable by avoiding race
conditions on the read/write counters.
@vlad-lesin vlad-lesin force-pushed the 11.8-MDEV-31956-ext_buf_pool branch from 99921cf to 64eb035 Compare January 15, 2026 10:04
Fix Windows and liburing issues.
Comment on lines -35 to +37
`UNCOMPRESS_CURRENT` bigint(21) unsigned NOT NULL
`UNCOMPRESS_CURRENT` bigint(21) unsigned NOT NULL,
`NUMBER_PAGES_WRITTEN_TO_EXTERNAL_BUFFER_POOL` bigint(21) unsigned NOT NULL,
`NUMBER_PAGES_READ_FROM_EXTERNAL_BUFFER_POOL` bigint(21) unsigned NOT NULL
Contributor

The test rocksdb.innodb_i_s_tables_disabled needs to be adjusted as well.

Comment on lines +1 to +3
--source include/have_innodb.inc
--source include/have_debug.inc
--source include/have_debug_sync.inc
Contributor

Could we have a non-debug version of this test as well?

Contributor Author

No, we can't. We have to inject debug code to be able to test it.

@@ -0,0 +1 @@
--innodb-buffer-pool-size=21M --innodb-extended-buffer-pool-size=1M
Contributor

Typically, the local SSD is larger than the RAM, and therefore it would be more interesting to test a scenario where the buffer pool is smaller than the extended buffer pool. I understand that our current CI setup is not suitable for any performance testing. But, currently we only cover this functionality in debug-instrumented executables.

Would it be possible to write some sort of a test, or extend an existing one (such as encryption.innochecksum) with a variant that would use the minimal innodb_buffer_pool_size and a larger innodb_extended_buffer_pool_size? Currently, that test is running with innodb_buffer_pool_size=64M.

Contributor Author

The test flushes a small number of pages into the extended buffer pool file,
which is why there was no reason to set a big file size. It probably makes
sense to develop a separate test where we set the minimal
innodb_buffer_pool_size and a larger innodb_extended_buffer_pool_size, but
what exactly are we going to test in that separate test?

Contributor

I was thinking that it would be useful to mimic a typical deployment scenario where we expect some pages to be written to the local SSD and read back from there. Sure, mtr tests are limited in size and not good for performance testing. Should there be any race conditions, having a test run on multiple platforms for every push would help catch them over time.

Comment on lines +90 to +93
/* Disable external buffer pool flushing for the duration of double write
buffer creating, as double write pages will be removed from LRU */
++buf_pool.done_flush_list_waiters_count;
SCOPE_EXIT([]() { --buf_pool.done_flush_list_waiters_count; });
Contributor

Instead of doing this, could we create the doublewrite buffer in a single atomic mini-transaction, like I did in 1d1699e so that the #4405 innodb_log_archive=ON recovery can work from the very beginning? I could create a separate pull request for that.

Contributor

I filed #4554 for the refactoring, which I hope will remove the need for this work-around.

Contributor Author

Yes, it makes sense to remove the workaround, but I would prefer to do this
after RQG testing shows good results for the current code version, to exclude
possible influence of the patches on each other.

}
else
{
free_page:
Contributor

Originally, some implementations of the C preprocessor did not allow any white space before the # character. It is customary to start the preprocessor #if and #endif at the first column.

@vlad-lesin vlad-lesin force-pushed the 11.8-MDEV-31956-ext_buf_pool branch from 30a83bf to db5856a Compare January 19, 2026 09:08
Use persistent named files for the external buffer pool instead of a
temporary one.
@vlad-lesin vlad-lesin force-pushed the 11.8-MDEV-31956-ext_buf_pool branch from db5856a to c78f5ac Compare January 19, 2026 09:11
Squash it.

Fix for the following RQG test failures:

2. Scenario: The server is under load (9 concurrent sessions).
At some point in time it crashes with
    mariadbd: 11.8-MDEV-31956-ext_buf_pool/storage/innobase/buf/buf0flu.cc:294: void buf_page_t::write_complete(buf_page_t::space_type, bool, uint32_t): Assertion `persistent == (om > 2)' failed.

4. Scenario: The server was under load for some time (one connection).
Intentional SIGKILL of the DB server followed by restart and running certain checks.
None of that showed any error, but the shutdown hung.
Fragment of rqg.log:

     # 2026-01-16T13:15:57 [1467965] INFO: DBServer_e::MySQL::MySQLd::stopServer: server[1]: Stopping server on port 25140
     ...
     # 2026-01-16T13:28:22 [1467965] ERROR: DBServer_e::MySQL::MySQLd::stopServer: server[1]: Did not shut down properly. Terminate it
                     == RQG loses the "patience" and sends finally SIGABRT to the process of the DB server.

The server error log shows
      2026-01-16 13:15:58 0 [Note] /data/Server_bin/11.8-MDEV-31956-ext_buf_pool_debug_Og/bin/mariadbd (initiated by: root[root] @ localhost [127.0.0.1]): Normal shutdown
       ...
    2026-01-16 13:15:58 0 [Note] InnoDB: FTS optimize thread exiting.
    2026-01-16 13:16:01 0 [Note] InnoDB: Starting shutdown...
    ....
    2026-01-16 13:16:01 0 [Note] InnoDB: Buffer pool(s) dump completed at 260116 13:16:01
    2026-01-16 13:18:37 0 [Note] InnoDB: Waiting for page cleaner thread to exit
    ....
    2026-01-16 13:26:24 0 [Note] InnoDB: Waiting for page cleaner thread to exit
Evict the page on write completion if its space was removed.

Lock external buffer pool file on Linux.
Contributor

@dr-m dr-m left a comment

This is a partial review.

Comment on lines +500 to +508
buf_page_base_t(const buf_page_base_t &b)
: id_(b.id_), hash(b.hash), frame(b.frame)
#ifdef UNIV_DEBUG
,
in_LRU_list(b.in_LRU_list), in_page_hash(b.in_page_hash),
in_free_list(b.in_free_list)
#endif /* UNIV_DEBUG */
{
}
Contributor

Would the following work:

  buf_page_base_t(const buf_page_base_t &)= default;

There does not seem to be anything fancy here, just a straight copy.

Comment on lines +510 to +515
bool external() const noexcept
{
/* TODO: we could just compare the address of the page, as it is done for
sentinel pages, and use *frame for something else */
return reinterpret_cast<std::uintptr_t>(frame) == EXT_BUF_FRAME;
}
Contributor

Right, !buf_pool_t::is_uncompressed_current(this) would almost work, except that it would also hold for ROW_FORMAT=COMPRESSED pages that lack an uncompressed page frame. For those blocks, we have frame==nullptr.

Comment on lines 1662 to +1668
/** broadcast when a batch completes; protected by flush_list_mutex */
pthread_cond_t done_flush_list;

/** The number of threads waiting for done_flush_list, must be set before
page cleaner wake up and reset after done_flush_list waiting is finished,
protected with flush_list_mutex */
size_t done_flush_list_waiters_count;
Contributor

This kind of a counter must be somehow embedded in done_flush_list itself.

It would be good for the comment to mention why we need this counter. The only place where we read the counter (instead of incrementing or decrementing it) is in buf_flush_LRU_list_batch() when we skip the flushing to the extended buffer pool, to reduce latency when other threads are waiting for the buf_flush_page_cleaner().

Could there be an alternative solution that would allow us to avoid adding such a counter? Would the flushing be frequent enough if we invoked it when the page cleaner is considered to be idle? That is, avoid the call buf_pool.page_cleaner_set_idle(true) in buf_flush_page_cleaner() and just keep invoking the extended-buffer-pool flushing once per second as long as it is considered useful?

I guess that we can't easily submit the same block to be written both to the persistent data file and to a location in the extended buffer pool, concurrently?

Contributor

I realized that we can approximate this counter with a Boolean flag. We only want to know whether anyone is waiting for done_flush_list. So, each waiter can set the flag, and each caller of pthread_cond_broadcast(&buf_pool.done_flush_list) can clear the flag. This flag could be embedded in page_cleaner_status. That is, buf_pool_t::LRU_FLUSH would be changed from 4 to 8, and the new flag FLUSH_LIST_WAIT (or a better name if you can think of one) would be 4.
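
As a sketch of that encoding (only FLUSH_LIST_WAIT is new; the neighbouring
flag names are recalled from memory and may not match the source exactly):

  /* Flag bits of buf_pool_t::page_cleaner_status */
  enum
  {
    PAGE_CLEANER_IDLE= 1,
    FLUSH_LIST_ACTIVE= 2,
    FLUSH_LIST_WAIT= 4,  /* set by done_flush_list waiters, cleared on broadcast */
    LRU_FLUSH= 8         /* previously 4 */
  };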

Comment on lines +1415 to +1416
/** External buffer pool file handler */
pfs_os_file_t ext_bp_file;
Contributor

Do we need any PERFORMANCE_SCHEMA instrumentation for this file, or could we use a plain os_file_t handle? We are using asynchronous access, right? Is it covered by PERFORMANCE_SCHEMA in any way?

Comment on lines -240 to +260
return off && len && node && (type & (PUNCH ^ WRITE_ASYNC))
? punch_hole(off, len)
: DB_SUCCESS;
return off && len && (type & (PUNCH ^ WRITE_ASYNC)) && node()
? punch_hole(off, len)
: DB_SUCCESS;
Contributor

It makes sense to move the more expensive node() call last, but I don’t see a reason to change the indentation of the subsequent lines.

Comment on lines +287 to +307
buf_page_t *bpage() const
{
return reinterpret_cast<buf_page_t *>(
reinterpret_cast<ptrdiff_t>(bpage_ptr) & ~ptrdiff_t(1));
};

bool ext_buf() const
{
return reinterpret_cast<ptrdiff_t>(bpage_ptr) & 1;
}

fil_node_t *node() const
{
ut_ad(!ext_buf());
return node_ptr;
}

ext_buf_page_t *ext_buf_page() const {
ut_ad(ext_buf());
return ext_buf_page_ptr;
};
Contributor

These are missing noexcept. I think that the 1 should be replaced with EXT_BUF_FRAME or augmented with static_assert(EXT_BUF_FRAME == 1, "").
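
For example, something like:

  bool ext_buf() const noexcept
  {
    static_assert(EXT_BUF_FRAME == 1, "tag bit must match EXT_BUF_FRAME");
    return reinterpret_cast<ptrdiff_t>(bpage_ptr) & 1;
  }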
