From 0b444ab13d83fec325104310fdbb53f00b48bfd1 Mon Sep 17 00:00:00 2001 From: Christian Brauner Date: Mon, 9 Mar 2026 14:11:48 +0100 Subject: [PATCH 1/3] wishlist: add empty mount namespace creation Signed-off-by: Christian Brauner --- README.md | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/README.md b/README.md index b33db6b..0e5d30f 100644 --- a/README.md +++ b/README.md @@ -16,6 +16,16 @@ associated problem space. ## In-Progress +### Create empty mount namespaces via `unshare(UNSHARE_EMPTY_MNTNS)` and `clone3(CLONE_EMPTY_MNTNS)` + +Now that we have support for `nullfs` it is trivial to allow the +creation of completely empty mount namespaces, i.e., mount namespaces +that only have the `nullfs` mount located at it's root. + +**Usecase:** This allows to isolate tasks in completely empty mount +namespaces. It also allows the caller to avoid copying its current mount +table which is useless in the majority of container workload cases. + ### Ability to put user xattrs on `S_IFSOCK` socket entrypoint inodes in the file system Currently, the kernel only allows extended attributes in the From 4945c516eceb169a7416d958c032081d1b374dfc Mon Sep 17 00:00:00 2001 From: Christian Brauner Date: Mon, 9 Mar 2026 14:17:29 +0100 Subject: [PATCH 2/3] wishlist: add nullfs entry Signed-off-by: Christian Brauner --- README.md | 28 ++++++++++++++++++++++++++++ 1 file changed, 28 insertions(+) diff --git a/README.md b/README.md index 0e5d30f..1ac5602 100644 --- a/README.md +++ b/README.md @@ -105,6 +105,34 @@ This creates a mount namespace where "wootwoot" has become the rootfs. The caller can `setns()` into this new mount namespace and assemble additional mounts without copying and destroying the entire parent mount table. +### Add immutable rootfs (`nullfs`) + +Currently `pivot_root()` doesn't work on the real rootfs because it +cannot be unmounted. Userspace has to do a recursive removal of the +initramfs contents manually before continuing the boot. + +Add an immutable rootfs called `nullfs` that serves as the parent mount +for anything that is actually useful such as the tmpfs or ramfs for +initramfs unpacking or the rootfs itself. The kernel mounts a +tmpfs/ramfs on top of it, unpacks the initramfs and fires up userspace +which mounts the rootfs and can then simply do: + +```c +chdir(rootfs); +pivot_root(".", "."); +umount2(".", MNT_DETACH); +``` + +This also means that the rootfs mount in unprivileged namespaces doesn't +need to become `MNT_LOCKED` anymore as it's guaranteed that the +immutable rootfs remains permanently empty so there cannot be anything +revealed by unmounting the covering mount. + +**Use-Case:** Simplifies the boot process by enabling `pivot_root()` to +work directly on the real rootfs. Removes the need for traditional +`switch_root` workarounds. In the future this also allows us to create +completely empty mount namespaces without risking to leak anything. + ### Query mount information via file descriptor with `statmount()` Extend `struct mnt_id_req` to accept a file descriptor and introduce From baf5a1766b5f48cb3ba7e308bfd0a1d10f120730 Mon Sep 17 00:00:00 2001 From: Christian Brauner Date: Mon, 9 Mar 2026 14:19:10 +0100 Subject: [PATCH 3/3] wishlist: Allow MOVE_MOUNT_BENEATH on the rootfs Signed-off-by: Christian Brauner --- README.md | 40 ++++++++++++++++++++++++++++++++++++++++ 1 file changed, 40 insertions(+) diff --git a/README.md b/README.md index 1ac5602..3bf5d93 100644 --- a/README.md +++ b/README.md @@ -133,6 +133,46 @@ work directly on the real rootfs. Removes the need for traditional `switch_root` workarounds. In the future this also allows us to create completely empty mount namespaces without risking to leak anything. +### Allow `MOVE_MOUNT_BENEATH` on the rootfs + +Allow `MOVE_MOUNT_BENEATH` to target the caller's rootfs, enabling +root-switching without `pivot_root(2)`. The traditional approach to +switching the rootfs involves `pivot_root(2)` or a `chroot_fs_refs()`-based +mechanism that atomically updates `fs->root` for all tasks sharing the +same `fs_struct`. This has consequences for `fork()`, `unshare(CLONE_FS)`, +and `setns()`. + +Instead, decompose root-switching into individually atomic, locally-scoped +steps: + +```c +fd_tree = open_tree(-EBADF, "/newroot", + OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC); +fchdir(fd_tree); +move_mount(fd_tree, "", AT_FDCWD, "/", + MOVE_MOUNT_BENEATH | MOVE_MOUNT_F_EMPTY_PATH); +chroot("."); +umount2(".", MNT_DETACH); +``` + +Since each step only modifies the caller's own state, the +`fork()`/`unshare()`/`setns()` races are eliminated by design. + +To make this work, `MNT_LOCKED` is transferred from the top mount to the +mount beneath. The new mount takes over the job of protecting the parent +mount from being revealed. This also makes it possible to safely modify +an inherited mount table after `unshare(CLONE_NEWUSER | CLONE_NEWNS)`: + +```sh +mount --beneath -t tmpfs tmpfs /proc +umount -l /proc +``` + +**Use-Case:** Containers created with `unshare(CLONE_NEWUSER | CLONE_NEWNS)` +can reshuffle an inherited mount table safely. `MOVE_MOUNT_BENEATH` on the +rootfs makes it possible to switch out the rootfs without the costly +`pivot_root(2)` and without cross-namespace vulnerabilities. + ### Query mount information via file descriptor with `statmount()` Extend `struct mnt_id_req` to accept a file descriptor and introduce