From 0d6355a39d0f23b62f25a46dd060401d9ce2ba61 Mon Sep 17 00:00:00 2001 From: Christian Brauner Date: Mon, 9 Mar 2026 14:49:58 +0100 Subject: [PATCH 1/6] website: split monolithic README into individual feature pages Split the single ~1200-line README.md into 68 individual feature pages organized by status: in-progress (7), wishlist (42), and completed (19). Each feature gets its own Markdown file with YAML front matter containing title, status, categories, and commit SHA (for completed items). Section index pages provide introductory text and navigation structure. A new landing page ties everything together with a three-column layout. This makes features individually linkable, reduces merge conflicts on PRs that touch different features, and enables sidebar navigation via the Hugo Book theme. Signed-off-by: Christian Brauner --- website/content/_index.md | 42 +++++++++++++++ website/content/completed/_index.md | 7 +++ ...liable-way-to-check-for-pid-namespacing.md | 28 ++++++++++ ...call-to-query-information-about-a-mount.md | 37 +++++++++++++ ...itional-identifiers-for-pidfds-in-statx.md | 18 +++++++ ...ng-idmapped-mounts-from-idmapped-mounts.md | 43 +++++++++++++++ ...ermine-the-parent-process-id-of-a-pidfd.md | 28 ++++++++++ ...ption-of-scm-rights-for-af-unix-sockets.md | 53 +++++++++++++++++++ .../completed/idmapped-mounts-for-tmpfs.md | 31 +++++++++++ ...statx-on-a-pidfd-return-additional-info.md | 28 ++++++++++ ...tory-instead-of-the-top-level-directory.md | 34 ++++++++++++ ...thout-rescanning-of-proc-self-mountinfo.md | 26 +++++++++ .../namespace-binfmt-misc-filesystem.md | 17 ++++++ ...-translate-a-pid-between-pid-namespaces.md | 17 ++++++ ...pace-to-have-an-idmapping-when-attached.md | 18 +++++++ .../scm-pidfd-auxiliary-socket-message.md | 21 ++++++++ .../security-hook-for-create-user-ns.md | 15 ++++++ .../completed/set-comm-field-before-exec.md | 35 ++++++++++++ .../support-idmapped-mounts-for-overlayfs.md | 15 ++++++ ...t-if-into-account-for-routing-decisions.md | 25 +++++++++ .../unmounting-of-obstructed-mounts.md | 24 +++++++++ website/content/in-progress/_index.md | 9 ++++ ...et-entrypoint-inodes-in-the-file-system.md | 41 ++++++++++++++ .../add-immutable-rootfs-nullfs.md | 37 +++++++++++++ .../allow-move-mount-beneath-on-the-rootfs.md | 48 +++++++++++++++++ ...mpty-mntns-and-clone3-clone-empty-mntns.md | 18 +++++++ ...custom-rootfs-via-open-tree-and-fsmount.md | 39 ++++++++++++++ ...tion-via-file-descriptor-with-statmount.md | 22 ++++++++ ...support-detached-mounts-with-pivot-root.md | 20 +++++++ website/content/wishlist/_index.md | 9 ++++ .../a-timeout-for-the-flock-syscall.md | 20 +++++++ .../ability-to-only-open-regular-files.md | 20 +++++++ ...ability-to-reopen-a-struct-block-device.md | 18 +++++++ ...nt-setattr-locked-flag-to-mount-setattr.md | 22 ++++++++ .../add-process-by-pidfd-to-a-cgroup.md | 26 +++++++++ .../content/wishlist/asynchronous-close.md | 29 ++++++++++ ...pty-path-support-for-openat-and-openat2.md | 18 +++++++ .../at-empty-path-support-for-unlinkat.md | 11 ++++ .../automatic-growing-of-btrfs-filesystems.md | 30 +++++++++++ ...-message-describing-the-sender-s-cgroup.md | 21 ++++++++ website/content/wishlist/blobfs.md | 22 ++++++++ .../clock-monotonic-network-timestamps.md | 21 ++++++++ ...lone-pidfd-autokill-semantics-for-pid-1.md | 37 +++++++++++++ ...mount-point-belongs-to-the-current-user.md | 27 ++++++++++ ...rd-to-allow-mknod-in-non-initial-userns.md | 28 ++++++++++ ...ses-watched-via-pidfd-from-waitid-p-all.md | 22 ++++++++ ...g-with-classic-synchronous-system-calls.md | 14 +++++ ...-mount-properties-ignoring-any-failures.md | 12 +++++ ...hing-to-all-new-namespaces-of-a-process.md | 18 +++++++ .../filtering-on-received-file-descriptors.md | 16 ++++++ .../immutable-layers-for-overlayfs.md | 18 +++++++ .../immutable-loopback-block-devices.md | 23 ++++++++ .../inotify-events-for-bsd-file-locks.md | 26 +++++++++ .../wishlist/ioctl-api-for-overlayfs.md | 23 ++++++++ ...ing-of-o-tmpfile-files-with-replacement.md | 21 ++++++++ .../make-quotas-work-with-user-namespaces.md | 18 +++++++ ...fferent-uids-gids-to-the-same-uids-gids.md | 17 ++++++ .../namespaced-loop-and-block-devices.md | 14 +++++ ...open-thread-group-leader-via-pidfd-open.md | 18 +++++++ ...ath-based-acl-management-in-an-lsm-hook.md | 17 ++++++ .../per-cgroup-limit-for-coredump-sizes.md | 18 +++++++ ...creation-and-opening-of-non-file-inodes.md | 31 +++++++++++ .../race-free-mounting-of-block-devices.md | 18 +++++++ .../read-only-propagation-of-mounts.md | 29 ++++++++++ .../reasonable-eof-on-sock-seqpacket.md | 17 ++++++ ...ling-of-selinux-dropping-scm-rights-fds.md | 19 +++++++ .../security-hook-for-mount-setattr.md | 13 +++++ ...yring-for-dm-verity-volume-verification.md | 17 ++++++ .../unlinking-via-two-file-descriptors.md | 22 ++++++++ .../wishlist/upgrade-masks-in-openat2.md | 13 +++++ ...l-handling-of-lsm-denials-on-scm-rights.md | 41 ++++++++++++++ website/content/wishlist/xattrs-for-pidfd.md | 16 ++++++ 72 files changed, 1686 insertions(+) create mode 100644 website/content/_index.md create mode 100644 website/content/completed/_index.md create mode 100644 website/content/completed/a-reliable-way-to-check-for-pid-namespacing.md create mode 100644 website/content/completed/a-system-call-to-query-information-about-a-mount.md create mode 100644 website/content/completed/additional-identifiers-for-pidfds-in-statx.md create mode 100644 website/content/completed/allow-creating-idmapped-mounts-from-idmapped-mounts.md create mode 100644 website/content/completed/api-to-determine-the-parent-process-id-of-a-pidfd.md create mode 100644 website/content/completed/disabling-reception-of-scm-rights-for-af-unix-sockets.md create mode 100644 website/content/completed/idmapped-mounts-for-tmpfs.md create mode 100644 website/content/completed/make-statx-on-a-pidfd-return-additional-info.md create mode 100644 website/content/completed/mount-a-subdirectory-instead-of-the-top-level-directory.md create mode 100644 website/content/completed/mount-notifications-without-rescanning-of-proc-self-mountinfo.md create mode 100644 website/content/completed/namespace-binfmt-misc-filesystem.md create mode 100644 website/content/completed/namespace-ioctl-to-translate-a-pid-between-pid-namespaces.md create mode 100644 website/content/completed/require-a-user-namespace-to-have-an-idmapping-when-attached.md create mode 100644 website/content/completed/scm-pidfd-auxiliary-socket-message.md create mode 100644 website/content/completed/security-hook-for-create-user-ns.md create mode 100644 website/content/completed/set-comm-field-before-exec.md create mode 100644 website/content/completed/support-idmapped-mounts-for-overlayfs.md create mode 100644 website/content/completed/take-ip-unicast-if-into-account-for-routing-decisions.md create mode 100644 website/content/completed/unmounting-of-obstructed-mounts.md create mode 100644 website/content/in-progress/_index.md create mode 100644 website/content/in-progress/ability-to-put-user-xattrs-on-s-ifsock-socket-entrypoint-inodes-in-the-file-system.md create mode 100644 website/content/in-progress/add-immutable-rootfs-nullfs.md create mode 100644 website/content/in-progress/allow-move-mount-beneath-on-the-rootfs.md create mode 100644 website/content/in-progress/create-empty-mount-namespaces-via-unshare-unshare-empty-mntns-and-clone3-clone-empty-mntns.md create mode 100644 website/content/in-progress/create-mount-namespace-with-custom-rootfs-via-open-tree-and-fsmount.md create mode 100644 website/content/in-progress/query-mount-information-via-file-descriptor-with-statmount.md create mode 100644 website/content/in-progress/support-detached-mounts-with-pivot-root.md create mode 100644 website/content/wishlist/_index.md create mode 100644 website/content/wishlist/a-timeout-for-the-flock-syscall.md create mode 100644 website/content/wishlist/ability-to-only-open-regular-files.md create mode 100644 website/content/wishlist/ability-to-reopen-a-struct-block-device.md create mode 100644 website/content/wishlist/add-mount-setattr-locked-flag-to-mount-setattr.md create mode 100644 website/content/wishlist/add-process-by-pidfd-to-a-cgroup.md create mode 100644 website/content/wishlist/asynchronous-close.md create mode 100644 website/content/wishlist/at-empty-path-support-for-openat-and-openat2.md create mode 100644 website/content/wishlist/at-empty-path-support-for-unlinkat.md create mode 100644 website/content/wishlist/automatic-growing-of-btrfs-filesystems.md create mode 100644 website/content/wishlist/auxiliary-socket-message-describing-the-sender-s-cgroup.md create mode 100644 website/content/wishlist/blobfs.md create mode 100644 website/content/wishlist/clock-monotonic-network-timestamps.md create mode 100644 website/content/wishlist/clone-pidfd-autokill-semantics-for-pid-1.md create mode 100644 website/content/wishlist/determining-if-a-mount-point-belongs-to-the-current-user.md create mode 100644 website/content/wishlist/device-cgroup-guard-to-allow-mknod-in-non-initial-userns.md create mode 100644 website/content/wishlist/excluding-processes-watched-via-pidfd-from-waitid-p-all.md create mode 100644 website/content/wishlist/extend-io-uring-with-classic-synchronous-system-calls.md create mode 100644 website/content/wishlist/extend-mount-setattr-to-allow-changing-mount-properties-ignoring-any-failures.md create mode 100644 website/content/wishlist/extend-setns-to-allow-attaching-to-all-new-namespaces-of-a-process.md create mode 100644 website/content/wishlist/filtering-on-received-file-descriptors.md create mode 100644 website/content/wishlist/immutable-layers-for-overlayfs.md create mode 100644 website/content/wishlist/immutable-loopback-block-devices.md create mode 100644 website/content/wishlist/inotify-events-for-bsd-file-locks.md create mode 100644 website/content/wishlist/ioctl-api-for-overlayfs.md create mode 100644 website/content/wishlist/linking-of-o-tmpfile-files-with-replacement.md create mode 100644 website/content/wishlist/make-quotas-work-with-user-namespaces.md create mode 100644 website/content/wishlist/map-different-uids-gids-to-the-same-uids-gids.md create mode 100644 website/content/wishlist/namespaced-loop-and-block-devices.md create mode 100644 website/content/wishlist/open-thread-group-leader-via-pidfd-open.md create mode 100644 website/content/wishlist/path-based-acl-management-in-an-lsm-hook.md create mode 100644 website/content/wishlist/per-cgroup-limit-for-coredump-sizes.md create mode 100644 website/content/wishlist/race-free-creation-and-opening-of-non-file-inodes.md create mode 100644 website/content/wishlist/race-free-mounting-of-block-devices.md create mode 100644 website/content/wishlist/read-only-propagation-of-mounts.md create mode 100644 website/content/wishlist/reasonable-eof-on-sock-seqpacket.md create mode 100644 website/content/wishlist/reasonable-handling-of-selinux-dropping-scm-rights-fds.md create mode 100644 website/content/wishlist/security-hook-for-mount-setattr.md create mode 100644 website/content/wishlist/specification-of-a-keyring-for-dm-verity-volume-verification.md create mode 100644 website/content/wishlist/unlinking-via-two-file-descriptors.md create mode 100644 website/content/wishlist/upgrade-masks-in-openat2.md create mode 100644 website/content/wishlist/useful-handling-of-lsm-denials-on-scm-rights.md create mode 100644 website/content/wishlist/xattrs-for-pidfd.md diff --git a/website/content/_index.md b/website/content/_index.md new file mode 100644 index 0000000..66891da --- /dev/null +++ b/website/content/_index.md @@ -0,0 +1,42 @@ +--- +title: "Kernel Feature Wishlist" +type: docs +--- + +# Kernel Feature Wishlist + +A curated collection of kernel feature ideas maintained by the +[UAPI Group](https://uapi-group.org/). The items on this list are strictly +ideas. It is especially important to not take the items on this list as being +implementation requests. Some of the ideas are rather rough and unrefined. +They serve as entry points for exploring the associated problem space. + +## Sections + +{{< columns >}} + +### [In Progress]({{< relref "/in-progress" >}}) + +Features currently being designed or implemented by contributors. + +<---> + +### [Wishlist]({{< relref "/wishlist" >}}) + +Ideas and proposals waiting for someone to pick them up. + +<---> + +### [Completed]({{< relref "/completed" >}}) + +Features that have been merged into the kernel. + +{{< /columns >}} + +## Contributing + +See the [contribution guidelines](https://github.com/uapi-group/kernel-features/blob/main/CONTRIBUTING.md) for how to propose new features, claim existing ones, or mark items as completed. + +**When implementing ideas on this list or ideas inspired by this list, +please point that out explicitly and clearly in the associated patches +and Cc `Christian Brauner `.** diff --git a/website/content/completed/_index.md b/website/content/completed/_index.md new file mode 100644 index 0000000..456b38f --- /dev/null +++ b/website/content/completed/_index.md @@ -0,0 +1,7 @@ +--- +title: "Completed" +weight: 3 +bookCollapseSection: true +--- + +Features that have been implemented and merged into the Linux kernel. diff --git a/website/content/completed/a-reliable-way-to-check-for-pid-namespacing.md b/website/content/completed/a-reliable-way-to-check-for-pid-namespacing.md new file mode 100644 index 0000000..655d8a2 --- /dev/null +++ b/website/content/completed/a-reliable-way-to-check-for-pid-namespacing.md @@ -0,0 +1,28 @@ +--- +title: "A reliable way to check for PID namespacing" +weight: 190 +status: completed +categories: + - filesystems + - namespaces + - processes +--- + +[x] A reliable (non-heuristic) way to detect from userspace if the +current process is running in a PID namespace that is not the main +PID namespace. PID namespaces are probably the primary type of +namespace that identify a container environment. While many +heuristics exist to determine generically whether one is executed +inside a container, it would be good to have a correct, +well-defined way to determine this. + +**🙇 The inode number of the root PID namespace is fixed (0xEFFFFFFC) +and now considered API. It can be used to distinguish the root PID +namespace from all others. 🙇** + +**Use-Case:** tools such as `systemd-detect-virt` exist to determine +container execution, but typically resolve to checking for +specific implementations. It would be much nicer and universally +applicable if such a check could be done generically. It would +probably suffice to provide an `ioctl()` call on the `pidns` file +descriptor that reveals this kind of information in some form. diff --git a/website/content/completed/a-system-call-to-query-information-about-a-mount.md b/website/content/completed/a-system-call-to-query-information-about-a-mount.md new file mode 100644 index 0000000..f800319 --- /dev/null +++ b/website/content/completed/a-system-call-to-query-information-about-a-mount.md @@ -0,0 +1,37 @@ +--- +title: "A system call to query information about a mount" +weight: 120 +status: completed +categories: + - block-devices + - mounts + - namespaces +commit: "46eae99ef733" +--- + +[x] Implement a mount-specific companion to `statx()` that puts at least the +following information into `struct mount_info`: + +**🙇 `46eae99ef73302f9fb3dddcd67c374b3dffe8fd6 ("add statmount(2) syscall")` 🙇** + +* mount flags: `MOUNT_ATTR_RDONLY`, ... +* time flags: `MOUNT_ATTR_RELATIME`, ... + Could probably be combined with mount flags. +* propagation setting: `MS_SHARED)`, ... +* peer group +* mnt id of the mount +* mnt id of the mount's parent +* owning userns + +There's a bit more advanced stuff systemd would really want but which +I think is misplaced in a mountinfo system call including: +* list of primary and auxiliary block device major/minor +* diskseq value of those device nodes (This is a new block device feature + we added that allows preventing device recycling issues when e.g. + removing usb devices very quickly and is needed for udev.) +* uuid/fsid +* feature flags (`O_TMPFILE`, `RENAME_EXCHANGE` supported etc.) + +**Use-Case:** low-level userspace tools have to interact with advanced +mount information constantly. This is currently costly and brittel because +they have to go and parse `/proc//mountinfo`. diff --git a/website/content/completed/additional-identifiers-for-pidfds-in-statx.md b/website/content/completed/additional-identifiers-for-pidfds-in-statx.md new file mode 100644 index 0000000..3541009 --- /dev/null +++ b/website/content/completed/additional-identifiers-for-pidfds-in-statx.md @@ -0,0 +1,18 @@ +--- +title: "Additional identifiers for pidfds in `statx()`" +weight: 150 +status: completed +categories: + - pidfd + - processes +commit: "cb12fd8e0dab" +--- + +[x] Make `statx()` on a pidfd return additional recognizable identifiers +in `.stx_ino`. + +**🙇 `cb12fd8e0dabb9a1c8aef55a6a41e2c255fcdf4b pidfd: add pidfs` 🙇** + +It would be fantastic if issuing statx() on any pidfd would return some +reasonably stable 64bit identifier for the process in `.stx_ino`. This would +be perfect to identify processes pinned by a pidfd, and compare them. diff --git a/website/content/completed/allow-creating-idmapped-mounts-from-idmapped-mounts.md b/website/content/completed/allow-creating-idmapped-mounts-from-idmapped-mounts.md new file mode 100644 index 0000000..43163a9 --- /dev/null +++ b/website/content/completed/allow-creating-idmapped-mounts-from-idmapped-mounts.md @@ -0,0 +1,43 @@ +--- +title: "Allow creating idmapped mounts from idmapped mounts" +weight: 50 +status: completed +categories: + - filesystems + - mounts + - namespaces + - security +commit: "c4a16820d901" +--- + +[x] Allow creating idmapped mounts from idmapped mounts + +Add a new `OPEN_TREE_CLEAR` flag to `open_tree()` that can only be +used in conjunction with `OPEN_TREE_CLONE`. When specified it will clear +all mount properties from that mount including the mount's idmapping. +Requires the caller to be `ns_capable(mntns->user_ns)`. If idmapped mounts +are encountered the caller must be `ns_capable(sb->user_ns, CAP_SYS_ADMIN)` +in the filesystems user namespace. + +Locked mount properties cannot be changed. A mount's idmapping becomes +locked if it propagates across user namespaces. + +This is useful to get a new, clear mount and also allows the caller to +create a new detached mount with an idmapping attached to the mount. Iow, +the caller may idmap the mount afterwards. + +**🙇 `c4a16820d901 ("fs: add open_tree_attr()")` 🙇** + +**Use-Case:** A user may already use an idmapped mount for their home +directory. And once a mount has been idmapped the idmapping cannot be +changed anymore. This allows for simple semantics and allows to avoid +lifetime complexity in order to account for scenarios where concurrent +readers or writers might still use a given user namespace while it is about +to be changed. +But this poses a problem when the user wants to attach an idmapping to +a mount that is already idmapped. The new flag allows to solve this +problem. A sufficiently privileged user such as a container manager can +create a user namespace for the container which expresses the desired +ownership. Then they can create a new detached mount without any prior +mount properties via OPEN_TREE_CLEAR and then attach the idmapping to this +mount. diff --git a/website/content/completed/api-to-determine-the-parent-process-id-of-a-pidfd.md b/website/content/completed/api-to-determine-the-parent-process-id-of-a-pidfd.md new file mode 100644 index 0000000..2c61dc7 --- /dev/null +++ b/website/content/completed/api-to-determine-the-parent-process-id-of-a-pidfd.md @@ -0,0 +1,28 @@ +--- +title: "API to determine the parent process ID of a pidfd" +weight: 20 +status: completed +categories: + - pidfd + - processes +commit: "cdda1f26e74b" +--- + +[x] API to determine the parent process ID of a pidfd + +An API to determine the parent process ID (ppid) of a pidfd would be +good. + +This information is relevant to code dealing with pidfds, since if +the ppid of a pidfd matches the process own pid it can call +`waitid()` on the process, if it doesn't it cannot and such a call +would fail. It would be very useful if this could be determined +easily before even calling that syscall. + +**🙇 `cdda1f26e74b ("pidfd: add ioctl to retrieve pid info")` 🙇** + +**Use-Case:** systemd manages a multitude of processes, most of which +are its own children, but many which are not. It would be great if +we could easily determine whether it is worth waiting for +`SIGCHLD`/`waitid()` on them or whether waiting for `POLLIN` on +them is the only way to get exit notification. diff --git a/website/content/completed/disabling-reception-of-scm-rights-for-af-unix-sockets.md b/website/content/completed/disabling-reception-of-scm-rights-for-af-unix-sockets.md new file mode 100644 index 0000000..57a5aca --- /dev/null +++ b/website/content/completed/disabling-reception-of-scm-rights-for-af-unix-sockets.md @@ -0,0 +1,53 @@ +--- +title: "Disabling reception of `SCM_RIGHTS` for `AF_UNIX` sockets" +weight: 180 +status: completed +categories: + - processes + - sockets +commit: "77cbe1a6d873" +--- + +[x] Ability to turn off `SCM_RIGHTS` reception for `AF_UNIX` +sockets. + +**🙇 `77cbe1a6d8730a07f99f9263c2d5f2304cf5e830 ("af_unix: Introduce SO_PASSRIGHTS")` 🙇** + +Right now reception of file descriptors is always on when +a process makes the mistake of invoking `recvmsg()` on such a +socket. This is problematic since `SCM_RIGHTS` installs file +descriptors in the recipient process' file descriptor +table. Getting rid of these file descriptors is not necessarily +easy, as they could refer to "slow-to-close" files (think: dirty +file descriptor referring to a file on an unresponsive NFS server, +or some device file descriptor), that might cause the recipient to +block for a longer time when it tries to them. Programs reading +from an `AF_UNIX` socket currently have three options: + +1. Never use `recvmsg()`, and stick to `read()`, `recv()` and + similar which do not install file descriptors in the recipients + file descriptor table. + +2. Ignore the problem, and simply `close()` the received file descriptors + it didn't expect, thus possibly locking up for a longer time. + +3. Fork off a thread that invokes `close()`, which mitigates the + risk of blocking, but still means a sender can cause resource + exhaustion in a recipient by flooding it with file descriptors, + as for each of them a thread needs to be spawned and a file + descriptor is taken while it is in the process of being closed. + +(Another option of course is to never talk `AF_UNIX` to peers that +are not trusted to not send unexpected file descriptors.) + +A simple knob that allows turning off `SCM_RIGHTS` right reception +would be useful to close this weakness, and would allow +`recvmsg()` to be called without risking file descriptors to be +installed in the file descriptor table, and thus risking a +blocking `close()` or a form of potential resource exhaustion. + +**Use-Case:** any program that uses `AF_UNIX` sockets and uses (or +would like to use) `recvmsg()` on it (which is useful to acquire +other metadata). Example: logging daemons that want to collect +timestamp or `SCM_CREDS` auxiliary data, or the D-Bus message +broker and suchlike. diff --git a/website/content/completed/idmapped-mounts-for-tmpfs.md b/website/content/completed/idmapped-mounts-for-tmpfs.md new file mode 100644 index 0000000..ab67b7e --- /dev/null +++ b/website/content/completed/idmapped-mounts-for-tmpfs.md @@ -0,0 +1,31 @@ +--- +title: "Idmapped mounts for tmpfs" +weight: 140 +status: completed +categories: + - filesystems + - mounts + - namespaces +commit: "7a80e5b8c6fa" +--- + +[x] Support idmapped mounts for tmpfs + +**🙇 `7a80e5b8c6fa ("shmem: support idmapped mounts for tmpfs")` 🙇** + +**Use-Case:** Runtimes such as Kubernetes use a lot of `tmpfs` mounts of +individual files or directories to expose information to containers/pods. +Instead of having to change ownership permanently allow them to use an +idmapped mount instead. + +@rata and @giuseppe brought this suggestion forward. For Kubernetes it is +sufficient to support idmapped mounts of `tmpfs` instances mounted in the +initial user namespace. However, in the future idmapped +mounts of `tmpfs` instances mounted in user namespaces should be supported. +Other container runtimes want to make use of this. The kernel is able to +support this since at least `5.17`. + +Things to remember are that `tmpfs` mounts can serve as lower- or upper +layers in `overlayfs` and care needs to be taken that this remains safe if +idmapped mounts of `tmpfs` instances mounted in user namespaces are +supported. diff --git a/website/content/completed/make-statx-on-a-pidfd-return-additional-info.md b/website/content/completed/make-statx-on-a-pidfd-return-additional-info.md new file mode 100644 index 0000000..fe169d9 --- /dev/null +++ b/website/content/completed/make-statx-on-a-pidfd-return-additional-info.md @@ -0,0 +1,28 @@ +--- +title: "Make statx() on a pidfd return additional info" +weight: 40 +status: completed +categories: + - pidfd + - processes +commit: "cb12fd8e0dab" +--- + +Make statx() on a pidfd return additional recognizable identifiers in +`.stx_btime`. + +**🙇 `cb12fd8e0dabb9a1c8aef55a6a41e2c255fcdf4b pidfd: add pidfs` 🙇** + +It would be fantastic if issuing statx() on any pidfd would return +the start time of the process in `.stx_btime` even after the process +died. + +These fields should in particular be queriable *after* the process +already exited and has been reaped, i.e. after its PID has already +been recycled. + +**Use-Case:** In systemd we maintain lists of processes in a hash +table. Right now, the key is the PID, but this is less than ideal +because of PID recycling. By being able to use the `.stx_btime` +and/or `.stx_ino` fields instead would be perfect to safely +identify, track and compare process even after they ceased to exist. diff --git a/website/content/completed/mount-a-subdirectory-instead-of-the-top-level-directory.md b/website/content/completed/mount-a-subdirectory-instead-of-the-top-level-directory.md new file mode 100644 index 0000000..b417977 --- /dev/null +++ b/website/content/completed/mount-a-subdirectory-instead-of-the-top-level-directory.md @@ -0,0 +1,34 @@ +--- +title: "Mount a subdirectory instead of the top-level directory" +weight: 80 +status: completed +categories: + - filesystems + - mounts + - namespaces +commit: "c5c12f871a30" +--- + +[x] Mount a subdirectory instead of the top-level directory + +Ability to mount a subdirectory of a regular file system instead of +the top-level directory. E.e. for a file system `/dev/sda1` which +contains a sub-directory `/foobar` mount `/foobar` without having +to mount its parent directory first. Consider something like this: + +``` +mount -t ext4 /dev/sda1 somedir/ -o subdir=/foobar +``` + +(This is of course already possible via some mount namespacing +shenanigans, but this requires namespacing to be available, and is +not precisely obvious to implement. Explicit kernel support at mount +time would be much preferable.) + +**🙇 `c5c12f871a30 ("fs: create detached mounts from detached mounts")` 🙇** + +**Use-Case:** `systemd-homed` currently mounts a sub-directory of +the per-user LUKS volume as the user's home directory (and not the +root directory of the per-user LUKS volume's file system!), and in +order to implement this invisibly from the host side requires a +complex mount namespace exercise. diff --git a/website/content/completed/mount-notifications-without-rescanning-of-proc-self-mountinfo.md b/website/content/completed/mount-notifications-without-rescanning-of-proc-self-mountinfo.md new file mode 100644 index 0000000..0fdc2c5 --- /dev/null +++ b/website/content/completed/mount-notifications-without-rescanning-of-proc-self-mountinfo.md @@ -0,0 +1,26 @@ +--- +title: "Mount notifications without rescanning of `/proc/self/mountinfo`" +weight: 70 +status: completed +categories: + - mounts +commit: "0f46d81f2bce" +--- + +[x] Mount notifications without rescanning of `/proc/self/mountinfo` + +Mount notifications that do not require continuous rescanning of +`/proc/self/mountinfo`. Currently, if a program wants to track +mounts established on the system it can receive `poll()`able +events via a file descriptor to `/proc/self/mountinfo`. When +receiving them it needs to rescan the file from the top and +compare it with the previous scan. This is both slow and +racy. It's slow on systems with a large number of mounts as the +cost for re-scanning the table has to be paid for every change to +the mount table. It's racy because quickly added and removed +mounts might not be noticed. + +**🙇 `0f46d81f2bce ("fanotify: notify on mount attach and detach")` 🙇** + +**Use-Case:** `systemd` tracks the mount table to integrate the mounts +into it own dependency management. diff --git a/website/content/completed/namespace-binfmt-misc-filesystem.md b/website/content/completed/namespace-binfmt-misc-filesystem.md new file mode 100644 index 0000000..39c2ef6 --- /dev/null +++ b/website/content/completed/namespace-binfmt-misc-filesystem.md @@ -0,0 +1,17 @@ +--- +title: "Namespace `binfmt_misc` filesystem" +weight: 160 +status: completed +categories: + - filesystems + - mounts + - namespaces +commit: "21ca59b365c0" +--- + +[x] Make the `binfmt_misc` filesystem namespaced. + +**🙇 `21ca59b365c0 ("binfmt_misc: enable sandboxed mounts")` 🙇** + +**Use-Case:** Allow containers and sandboxes to register their own binfmt +handlers. diff --git a/website/content/completed/namespace-ioctl-to-translate-a-pid-between-pid-namespaces.md b/website/content/completed/namespace-ioctl-to-translate-a-pid-between-pid-namespaces.md new file mode 100644 index 0000000..a5f31ca --- /dev/null +++ b/website/content/completed/namespace-ioctl-to-translate-a-pid-between-pid-namespaces.md @@ -0,0 +1,17 @@ +--- +title: "Namespace ioctl to translate a PID between PID namespaces" +weight: 10 +status: completed +categories: + - namespaces + - processes +commit: "ca567df74a28" +--- + +[x] Namespace ioctl to translate a PID between PID namespaces + +**🙇 `ca567df74a28a9fb368c6b2d93e864113f73f5c2 ("nsfs: add pid translation ioctls")` 🙇** + +**Use-Case:** This makes it possible to e.g., figure out what a given PID in +a PID namespace corresponds to in the caller's PID namespace. For example, to +figure out what the PID of PID 1 inside of a given PID namespace is. diff --git a/website/content/completed/require-a-user-namespace-to-have-an-idmapping-when-attached.md b/website/content/completed/require-a-user-namespace-to-have-an-idmapping-when-attached.md new file mode 100644 index 0000000..093a850 --- /dev/null +++ b/website/content/completed/require-a-user-namespace-to-have-an-idmapping-when-attached.md @@ -0,0 +1,18 @@ +--- +title: "Require a user namespace to have an idmapping when attached" +weight: 60 +status: completed +categories: + - mounts + - namespaces +commit: "dacfd001eaf2" +--- + +[x] Require a user namespace to have an idmapping when attached + +Enforce that the user namespace about to be attached to a mount must +have an idmapping written. + +**🙇 `dacfd001eaf2 ("fs/mnt_idmapping.c: Return -EINVAL when no map is written")` 🙇** + +**Use-Case:** Tighten the semantics. diff --git a/website/content/completed/scm-pidfd-auxiliary-socket-message.md b/website/content/completed/scm-pidfd-auxiliary-socket-message.md new file mode 100644 index 0000000..7878f02 --- /dev/null +++ b/website/content/completed/scm-pidfd-auxiliary-socket-message.md @@ -0,0 +1,21 @@ +--- +title: "`SCM_PIDFD` auxiliary socket message" +weight: 100 +status: completed +categories: + - pidfd + - processes + - sockets +commit: "5e2ff6704a27" +--- + +[x] `SCM_PIDFD` or similar auxiliary socket message, that is a modern +version of the `SCM_CREDS` message's `.pid` field, and provides a +`pidfd` file descriptor to the originating peer process. + +**🙇 `5e2ff6704a275be00 ("scm: add SO_PASSPIDFD and SCM_PIDFD)")` 🙇** + +**Use-Case:** security infrastructure (such as PolicyKit) can safely +reference clients this way without fearing PID +recycling. `systemd-journald` can acquire peer metadata this way in +a less racy fashion, in particular safe against PID recycling. diff --git a/website/content/completed/security-hook-for-create-user-ns.md b/website/content/completed/security-hook-for-create-user-ns.md new file mode 100644 index 0000000..acbb47d --- /dev/null +++ b/website/content/completed/security-hook-for-create-user-ns.md @@ -0,0 +1,15 @@ +--- +title: "Security hook for `create_user_ns()`" +weight: 130 +status: completed +categories: + - namespaces + - security +commit: "7cd4c5c2101c" +--- + +[x] (kAPI) Add security hook to `create_user_ns()`. + +**🙇 `7cd4c5c2101c ("security, lsm: Introduce security_create_user_ns()")` 🙇** + +**Use-Case:** Allow LSMs to monitor user namespace creation. diff --git a/website/content/completed/set-comm-field-before-exec.md b/website/content/completed/set-comm-field-before-exec.md new file mode 100644 index 0000000..8f1696a --- /dev/null +++ b/website/content/completed/set-comm-field-before-exec.md @@ -0,0 +1,35 @@ +--- +title: "Set `comm` field before `exec()`" +weight: 30 +status: completed +categories: + - processes +commit: "543841d18060" +--- + +[x] Set `comm` field before `exec()` + +There should be a way to control the process' `comm` field if +started via `fexecve()`/`execveat()`. + +Right now, when `fexecve()`/`execveat()` is used, the `comm` field +(i.e. `/proc/self/comm`) contains a name derived of the numeric fd, +which breaks `ps -C …` and various other tools. In particular when +the fd was opened with `O_CLOEXEC`, the number of the fd in the old +process is completely meaningless. + +The goal is add a way to tell `fexecve()`/`execveat()` what Name to use. + +Since `comm` is under user control anyway (via `PR_SET_NAME`), it +should be safe to also make it somehow configurable at fexecve() +time. + +See https://github.com/systemd/systemd/commit/35a926777e124ae8c2ac3cf46f44248b5e147294, +https://github.com/systemd/systemd/commit/8939eeae528ef9b9ad2a21995279b76d382d5c81. + +**🙇 `543841d18060 ("exec: fix up /proc/pid/comm in the execveat(AT_EMPTY_PATH) case")` 🙇** + +**Use-Case:** In systemd we generally would prefer using `fexecve()` +to safely and race-freely invoke processes, but the fact that `comm` +is useless after invoking a process that way makes the call +unfortunately hard to use for systemd. diff --git a/website/content/completed/support-idmapped-mounts-for-overlayfs.md b/website/content/completed/support-idmapped-mounts-for-overlayfs.md new file mode 100644 index 0000000..790c21e --- /dev/null +++ b/website/content/completed/support-idmapped-mounts-for-overlayfs.md @@ -0,0 +1,15 @@ +--- +title: "Support idmapped mounts for `overlayfs`" +weight: 170 +status: completed +categories: + - filesystems + - mounts +commit: "bc70682a497c" +--- + +[x] Support idmapped mounts for `overlayfs` + +**🙇 `bc70682a497c ("ovl: support idmapped layers")` 🙇** + +**Use-Case:** Allow containers to use `overlayfs` with idmapped mounts. diff --git a/website/content/completed/take-ip-unicast-if-into-account-for-routing-decisions.md b/website/content/completed/take-ip-unicast-if-into-account-for-routing-decisions.md new file mode 100644 index 0000000..d5e2413 --- /dev/null +++ b/website/content/completed/take-ip-unicast-if-into-account-for-routing-decisions.md @@ -0,0 +1,25 @@ +--- +title: "Take `IP_UNICAST_IF` into account for routing decisions" +weight: 110 +status: completed +categories: + - sockets +commit: "0e4d354762ce" +--- + +[x] `IP_UNICAST_IF` should be taken into account for routing decisions +at UDP `connect()` time (currently it isn't, only `SO_BINDTOINDEX` +is, but that does so much more than just that, and one often +doesn't want that) + +**🙇 `0e4d354762cefd3e16b4cff8988ff276e45effc4 ("net-next: Fix +IP_UNICAST_IF option behavior for connected sockets")` 🙇** + +**Use-Case:** DNS resolvers that associate DNS configuration with +specific network interfaces (example: `systemd-resolved`) typically +want to preferably route DNS traffic to the per-interface DNS +server via that interface, but not make further restrictions on the +origins or received replies, and all that without +privileges. `IP_UNICAST_IF` fulfills this role fine for TCP, but +for UDP it is not taken into account for the `connect()` routing +decision. diff --git a/website/content/completed/unmounting-of-obstructed-mounts.md b/website/content/completed/unmounting-of-obstructed-mounts.md new file mode 100644 index 0000000..a349679 --- /dev/null +++ b/website/content/completed/unmounting-of-obstructed-mounts.md @@ -0,0 +1,24 @@ +--- +title: "Unmounting of obstructed mounts" +weight: 90 +status: completed +categories: + - filesystems + - mounts +commit: "6ac392815628" +--- + +[x] ability to unmount obstructed mounts. (this means: you have a stack +of mounts on the very same inode, and you want to remove a mount in +the middle. right now, you can only remove the topmost mount.) + +**🙇 instead of the ability to unmount obstructured mounts we gained +the ability to mount beneath an existing mount, with mostly +equivalent outcome. `6ac392815628f317fcfdca1a39df00b9cc4ebc8b +("fs: allow to mount beneath top mount")` 🙇** + +**use-case:** this is useful for replacing mounts atomically, for +example for upgrading versioned disk images: first an old version +of the image is mounted. then a new version is mounted over the +existing mount point, and then the lower mount point is +removed. One such software would be `systemd-sysext`. diff --git a/website/content/in-progress/_index.md b/website/content/in-progress/_index.md new file mode 100644 index 0000000..4c98074 --- /dev/null +++ b/website/content/in-progress/_index.md @@ -0,0 +1,9 @@ +--- +title: "In Progress" +weight: 1 +bookCollapseSection: true +--- + +Features currently being designed or implemented by contributors. + +If you are working on one of these, please add your GitHub handle or email address so we can coordinate. diff --git a/website/content/in-progress/ability-to-put-user-xattrs-on-s-ifsock-socket-entrypoint-inodes-in-the-file-system.md b/website/content/in-progress/ability-to-put-user-xattrs-on-s-ifsock-socket-entrypoint-inodes-in-the-file-system.md new file mode 100644 index 0000000..8bdf205 --- /dev/null +++ b/website/content/in-progress/ability-to-put-user-xattrs-on-s-ifsock-socket-entrypoint-inodes-in-the-file-system.md @@ -0,0 +1,41 @@ +--- +title: "Ability to put user xattrs on `S_IFSOCK` socket entrypoint inodes in the file system" +weight: 20 +status: in-progress +categories: + - filesystems + - mounts + - namespaces + - sockets +--- + +Currently, the kernel only allows extended attributes in the +`user.*` namespace to be attached to directory and regular file +inodes. It would be tremendously useful to allow them to be +associated with socket inodes, too. + +**Use-Case:** There are two syslog RFCs in use today: RFC3164 and +RFC5424. `glibc`'s `syslog()` API generates events close to the +former, but there are programs which would like to generate the +latter instead (as it supports structured logging). The two formats +are not backwards compatible: a client sending RFC5424 messages to a +server only understanding RFC3164 will cause an ugly mess. On Linux +there's only a single `/dev/log` AF_UNIX/SOCK_DGRAM socket backing +`syslog()`, which is used in a one-way, fire-and-forget style. This +means that feature negotation is not really possible within the +protocol. Various tools bind mount the socket inode into `chroot()` +and container environments, hence it would be fantastic to associate +supported feature information directly with the inode (and thus +outside of the protocol) to make it easy for clients to determine +which features are spoken on a socket, in a way that survives bind +mounts. Implementation idea would be that syslog daemons +implementing RFC5425 could simply set an xattr `user.rfc5424` to `1` +(or something like that) on the socket inode, and clearly inform +clients in a natural and simple way that they'd be happy to parse +the newer format. Also see: +https://github.com/systemd/systemd/issues/19251 – This idea could +also be extended to other sockets and other protocols: by setting +some extended attribute on a socket inodes, services could advertise +which protocols they support on them. For example D-Bus sockets +could carry `user.dbus` set to `1`, and Varlink sockets +`user.varlink` set to `1` and so on. diff --git a/website/content/in-progress/add-immutable-rootfs-nullfs.md b/website/content/in-progress/add-immutable-rootfs-nullfs.md new file mode 100644 index 0000000..9ec9588 --- /dev/null +++ b/website/content/in-progress/add-immutable-rootfs-nullfs.md @@ -0,0 +1,37 @@ +--- +title: "Add immutable rootfs (`nullfs`)" +weight: 50 +status: in-progress +categories: + - filesystems + - mounts + - namespaces + - processes + - security +--- + +Currently `pivot_root()` doesn't work on the real rootfs because it +cannot be unmounted. Userspace has to do a recursive removal of the +initramfs contents manually before continuing the boot. + +Add an immutable rootfs called `nullfs` that serves as the parent mount +for anything that is actually useful such as the tmpfs or ramfs for +initramfs unpacking or the rootfs itself. The kernel mounts a +tmpfs/ramfs on top of it, unpacks the initramfs and fires up userspace +which mounts the rootfs and can then simply do: + +```c +chdir(rootfs); +pivot_root(".", "."); +umount2(".", MNT_DETACH); +``` + +This also means that the rootfs mount in unprivileged namespaces doesn't +need to become `MNT_LOCKED` anymore as it's guaranteed that the +immutable rootfs remains permanently empty so there cannot be anything +revealed by unmounting the covering mount. + +**Use-Case:** Simplifies the boot process by enabling `pivot_root()` to +work directly on the real rootfs. Removes the need for traditional +`switch_root` workarounds. In the future this also allows us to create +completely empty mount namespaces without risking to leak anything. diff --git a/website/content/in-progress/allow-move-mount-beneath-on-the-rootfs.md b/website/content/in-progress/allow-move-mount-beneath-on-the-rootfs.md new file mode 100644 index 0000000..6565eed --- /dev/null +++ b/website/content/in-progress/allow-move-mount-beneath-on-the-rootfs.md @@ -0,0 +1,48 @@ +--- +title: "Allow `MOVE_MOUNT_BENEATH` on the rootfs" +weight: 60 +status: in-progress +categories: + - filesystems + - mounts + - namespaces + - processes +--- + +Allow `MOVE_MOUNT_BENEATH` to target the caller's rootfs, enabling +root-switching without `pivot_root(2)`. The traditional approach to +switching the rootfs involves `pivot_root(2)` or a `chroot_fs_refs()`-based +mechanism that atomically updates `fs->root` for all tasks sharing the +same `fs_struct`. This has consequences for `fork()`, `unshare(CLONE_FS)`, +and `setns()`. + +Instead, decompose root-switching into individually atomic, locally-scoped +steps: + +```c +fd_tree = open_tree(-EBADF, "/newroot", + OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC); +fchdir(fd_tree); +move_mount(fd_tree, "", AT_FDCWD, "/", + MOVE_MOUNT_BENEATH | MOVE_MOUNT_F_EMPTY_PATH); +chroot("."); +umount2(".", MNT_DETACH); +``` + +Since each step only modifies the caller's own state, the +`fork()`/`unshare()`/`setns()` races are eliminated by design. + +To make this work, `MNT_LOCKED` is transferred from the top mount to the +mount beneath. The new mount takes over the job of protecting the parent +mount from being revealed. This also makes it possible to safely modify +an inherited mount table after `unshare(CLONE_NEWUSER | CLONE_NEWNS)`: + +```sh +mount --beneath -t tmpfs tmpfs /proc +umount -l /proc +``` + +**Use-Case:** Containers created with `unshare(CLONE_NEWUSER | CLONE_NEWNS)` +can reshuffle an inherited mount table safely. `MOVE_MOUNT_BENEATH` on the +rootfs makes it possible to switch out the rootfs without the costly +`pivot_root(2)` and without cross-namespace vulnerabilities. diff --git a/website/content/in-progress/create-empty-mount-namespaces-via-unshare-unshare-empty-mntns-and-clone3-clone-empty-mntns.md b/website/content/in-progress/create-empty-mount-namespaces-via-unshare-unshare-empty-mntns-and-clone3-clone-empty-mntns.md new file mode 100644 index 0000000..136da23 --- /dev/null +++ b/website/content/in-progress/create-empty-mount-namespaces-via-unshare-unshare-empty-mntns-and-clone3-clone-empty-mntns.md @@ -0,0 +1,18 @@ +--- +title: "Create empty mount namespaces via `unshare(UNSHARE_EMPTY_MNTNS)` and `clone3(CLONE_EMPTY_MNTNS)`" +weight: 10 +status: in-progress +categories: + - filesystems + - mounts + - namespaces + - processes +--- + +Now that we have support for `nullfs` it is trivial to allow the +creation of completely empty mount namespaces, i.e., mount namespaces +that only have the `nullfs` mount located at it's root. + +**Use-Case:** This allows to isolate tasks in completely empty mount +namespaces. It also allows the caller to avoid copying its current mount +table which is useless in the majority of container workload cases. diff --git a/website/content/in-progress/create-mount-namespace-with-custom-rootfs-via-open-tree-and-fsmount.md b/website/content/in-progress/create-mount-namespace-with-custom-rootfs-via-open-tree-and-fsmount.md new file mode 100644 index 0000000..39b182f --- /dev/null +++ b/website/content/in-progress/create-mount-namespace-with-custom-rootfs-via-open-tree-and-fsmount.md @@ -0,0 +1,39 @@ +--- +title: "Create mount namespace with custom rootfs via `open_tree()` and `fsmount()`" +weight: 40 +status: in-progress +categories: + - mounts + - namespaces + - processes +--- + +Add `OPEN_TREE_NAMESPACE` flag to `open_tree()` and `FSMOUNT_NAMESPACE` flag +to `fsmount()` that create a new mount namespace with the specified mount tree +as the rootfs mounted on top of a copy of the real rootfs. These return a +namespace file descriptor instead of a mount file descriptor. + +This allows `OPEN_TREE_NAMESPACE` to function as a combined +`unshare(CLONE_NEWNS)` and `pivot_root()`. + +When creating containers the setup usually involves using `CLONE_NEWNS` via +`clone3()` or `unshare()`. This copies the caller's complete mount namespace. +The runtime will also assemble a new rootfs and then use `pivot_root()` to +switch the old mount tree with the new rootfs. Afterward it will recursively +unmount the old mount tree thereby getting rid of all mounts. + +Copying all of these mounts only to get rid of them later is wasteful. With a +large mount table and a system where thousands of containers are spawned in +parallel this quickly becomes a bottleneck increasing contention on the +semaphore. + +**Use-Case:** Container runtimes can create an extremely minimal rootfs +directly: + +```c +fd_mntns = open_tree(-EBADF, "/var/lib/containers/wootwoot", OPEN_TREE_NAMESPACE); +``` + +This creates a mount namespace where "wootwoot" has become the rootfs. The +caller can `setns()` into this new mount namespace and assemble additional +mounts without copying and destroying the entire parent mount table. diff --git a/website/content/in-progress/query-mount-information-via-file-descriptor-with-statmount.md b/website/content/in-progress/query-mount-information-via-file-descriptor-with-statmount.md new file mode 100644 index 0000000..426e105 --- /dev/null +++ b/website/content/in-progress/query-mount-information-via-file-descriptor-with-statmount.md @@ -0,0 +1,22 @@ +--- +title: "Query mount information via file descriptor with `statmount()`" +weight: 70 +status: in-progress +categories: + - mounts + - namespaces +--- + +Extend `struct mnt_id_req` to accept a file descriptor and introduce +`STATMOUNT_BY_FD` flag. When a valid fd is provided and `STATMOUNT_BY_FD` +is set, `statmount()` returns mount info about the mount the fd is on. + +This works even for "unmounted" mounts (mounts that have been unmounted using +`umount2(mnt, MNT_DETACH)`), if you have access to a file descriptor on that +mount. These unmounted mounts will have no mountpoint and no valid mount +namespace, so `STATMOUNT_MNT_POINT` and `STATMOUNT_MNT_NS_ID` are unset in +`statmount.mask` for such mounts. + +**Use-Case:** Query mount information directly from a file descriptor without +needing the mount ID, which is particularly useful for detached or unmounted +mounts. diff --git a/website/content/in-progress/support-detached-mounts-with-pivot-root.md b/website/content/in-progress/support-detached-mounts-with-pivot-root.md new file mode 100644 index 0000000..8234003 --- /dev/null +++ b/website/content/in-progress/support-detached-mounts-with-pivot-root.md @@ -0,0 +1,20 @@ +--- +title: "Support detached mounts with `pivot_root()`" +weight: 30 +status: in-progress +categories: + - filesystems + - mounts +--- + +The new rootfs must currently refer to an attached mount. This restriction +seems unnecessary. We should allow the new rootfs to refer to a detached +mount. + +This will allow a service- or container manager to create a new rootfs as +a detached, private mount that isn't exposed anywhere in the filesystem and +then `pivot_root()` into it. + +Since `pivot_root()` only takes path arguments the new rootfs would need to +be passed via `/proc//fd/`. In the long run we should add a new +`pivot_root()` syscall operating on file descriptors instead of paths. diff --git a/website/content/wishlist/_index.md b/website/content/wishlist/_index.md new file mode 100644 index 0000000..42f588d --- /dev/null +++ b/website/content/wishlist/_index.md @@ -0,0 +1,9 @@ +--- +title: "Wishlist" +weight: 2 +bookCollapseSection: true +--- + +Ideas and proposals for kernel features. Pick one up and start working on it! + +To claim an item, open a PR moving it to the **In Progress** section and add your GitHub handle or email address. diff --git a/website/content/wishlist/a-timeout-for-the-flock-syscall.md b/website/content/wishlist/a-timeout-for-the-flock-syscall.md new file mode 100644 index 0000000..9dda284 --- /dev/null +++ b/website/content/wishlist/a-timeout-for-the-flock-syscall.md @@ -0,0 +1,20 @@ +--- +title: "A timeout for the `flock()` syscall" +weight: 150 +status: wishlist +categories: + - block-devices + - processes +--- + +A timeout for the `flock()` syscall. Faking the time-out in userspace +is nasty: most code does it with `alarm()` (or equivalent APIs), but +that's racy since on a heavily loaded system the timeout might trigger +before the `flock()` call is entered, in particular if short time-outs +shall be used. More accurate is to do the locking in a short-lived +child processed, but that's difficult already in C, and almost +impossible in languages that do not allow `fork()` without `execve()`. + +**Use-Case:** as mentioned above systemd-udev allows synchronizing +block device probing via flock(). Often userspace wants to wait +for that, but without risking to hang forever. diff --git a/website/content/wishlist/ability-to-only-open-regular-files.md b/website/content/wishlist/ability-to-only-open-regular-files.md new file mode 100644 index 0000000..8b06f41 --- /dev/null +++ b/website/content/wishlist/ability-to-only-open-regular-files.md @@ -0,0 +1,20 @@ +--- +title: "Ability to only open regular files" +weight: 60 +status: wishlist +categories: + - mounts +--- + +`O_REGULAR` (inspired by the existing `O_DIRECTORY` flag for +`open()`), which opens a file only if it is of type `S_IFREG`. + +**Use-Case:** this would be very useful to write secure programs +that want to avoid being tricked into opening device nodes with +special semantics while thinking they operate on regular +files. This is particularly relevant as many device nodes (or even +FIFOs) come with blocking I/O (or even blocking `open()`!) by +default, which is not expected from regular files backed by "fast" +disk I/O. Consider implementation of a naive web browser which is +pointed to `file://dev/zero`, not expecting an endless amount of +data to read. diff --git a/website/content/wishlist/ability-to-reopen-a-struct-block-device.md b/website/content/wishlist/ability-to-reopen-a-struct-block-device.md new file mode 100644 index 0000000..fa48b6c --- /dev/null +++ b/website/content/wishlist/ability-to-reopen-a-struct-block-device.md @@ -0,0 +1,18 @@ +--- +title: "Ability to reopen a `struct block_device`" +weight: 310 +status: wishlist +categories: + - filesystems + - mounts +--- + +Add ability to reopen a `struct block_device`. This would allow using +`blkdev_get_by_path()`/`blkdev_get_{part,whole}()` to claim a device +with `BLK_OPEN_READ` and later on reopen with +`BLK_OPEN_READ | BLK_OPEN_WRITE`. This in turn would opening block +devices at `fsconfig(FS_CONFIG_SET_*)` time and then at `fill_super()` +time we would be able to reopen in case the `!(fc->sb_flags & SB_RDONLY)`. +Overall this has the effect that we're able to open devices early +giving the user early errors when they set mount options rather than +very late when the superblock is created. diff --git a/website/content/wishlist/add-mount-setattr-locked-flag-to-mount-setattr.md b/website/content/wishlist/add-mount-setattr-locked-flag-to-mount-setattr.md new file mode 100644 index 0000000..5443701 --- /dev/null +++ b/website/content/wishlist/add-mount-setattr-locked-flag-to-mount-setattr.md @@ -0,0 +1,22 @@ +--- +title: "Add `MOUNT_SETATTR_LOCKED` flag to `mount_setattr()`" +weight: 190 +status: wishlist +categories: + - mounts + - namespaces + - processes + - security +--- + +Add a new `MOUNT_SETATTR_LOCKED` flag to `mount_setattr(..., ..., MOUNT_SETATTR_LOCKED, ..., ...)`. +The `MOUNT_SETATTR_LOCKED` flag allow a `ns_capable(mntns->user_ns, +CAP_SYS_ADMIN)` caller to lock all mount properties. The mount properties +cannot be changed anymore. + +**Use-Case:** allowing processes to lock mount properties even for +privileged processes. Locking mount properties would currently involve +having to have the mount namespace of the container be owned by an ancestor +user namespace. But this doesn't just lock a single mount or mount subtree +it locks all mounts in the mount namespace, i.e., the mount table cannot be +altered. diff --git a/website/content/wishlist/add-process-by-pidfd-to-a-cgroup.md b/website/content/wishlist/add-process-by-pidfd-to-a-cgroup.md new file mode 100644 index 0000000..1c99f11 --- /dev/null +++ b/website/content/wishlist/add-process-by-pidfd-to-a-cgroup.md @@ -0,0 +1,26 @@ +--- +title: "Add process by PIDFD to a cgroup" +weight: 370 +status: wishlist +categories: + - cgroups + - pidfd + - processes + - sockets +--- + +At the moment the canonical way to add a process to a cgroup is by +echoing its PID into the `cgroup.procs` attribute in the target +cgroupfs directory of the cgroup. This is safe as long as the +process doing so just forked off the process it wants to migrate and +hence can control that it hasn't been reaped yet, and hence +guarantees the PID is valid. This is racy however if "foreign" +processes shall be moved into the cgroup. + +**Use-Case:** In systemd, all user sessions are wrapped in scope +units which are backed by a cgroup. The session processes moved into +the scope unit are typically "foreign" processes, i.e. not children +of the service manager, hence doing the movement is subject to races +in case the process dies and its PID is quickly recycled. (This +assumes systemd can acquire a pidfd of the foreign process without +races, for example via `SCM_PIDFD` and `SO_PEERPIDFD` or similar.) diff --git a/website/content/wishlist/asynchronous-close.md b/website/content/wishlist/asynchronous-close.md new file mode 100644 index 0000000..c451f23 --- /dev/null +++ b/website/content/wishlist/asynchronous-close.md @@ -0,0 +1,29 @@ +--- +title: "Asynchronous `close()`" +weight: 120 +status: wishlist +categories: + - mounts + - processes + - sockets +--- + +An asynchronous or forced `close()`, that guarantees that +userspace doesn't have to risk blocking for longer periods of time +when trying to get rid of unwanted file descriptors, possibly +received via `recvmsg()` + `SCM_RIGHTS` (see above). Currently, +`close()` of various file descriptors (for example those referring +to slow storage, e.g. non-responding NFS servers and such) might +take arbitrary amounts of time, potentially into the minute range +and more. This makes it risky accepting file descriptors on +publicly accessible `AF_UNIX` sockets, the way like IPC brokers +(e.g. D-Bus) do it: if a rogue client keeps sending file +descriptors that because unexpected must be closed immediately it +might cause the receiving process to effectively crawl, when it is +busy closing them all. A special form of `close()` that simply +detaches a file descriptor from the file descriptor table without +blocking on IO in any form would be great to close this issue. + +**Use-Case:** any program that receives file descriptors via `AF_UNIX` +from untrusted clients would benefit from this. e.g. D-Bus +brokers. diff --git a/website/content/wishlist/at-empty-path-support-for-openat-and-openat2.md b/website/content/wishlist/at-empty-path-support-for-openat-and-openat2.md new file mode 100644 index 0000000..2048e2e --- /dev/null +++ b/website/content/wishlist/at-empty-path-support-for-openat-and-openat2.md @@ -0,0 +1,18 @@ +--- +title: "`AT_EMPTY_PATH` support for `openat()` and `openat2()`" +weight: 420 +status: wishlist +categories: + - processes +--- + +To get an operable version of an `O_PATH` file descriptors, it is +possible to use `openat(fd, ".", O_DIRECTORY)` for directories, but +other files currently require going through +`open("/proc//fd/")` which depends on a functioning `procfs`. + +FreeBSD already has `O_EMPTY_PATH` for `openat`, while `fstatat` and +similar functions have `AT_EMPTY_PATH`. + +**Use-Case:** When dealing with `O_PATH` file descriptors, allow +re-opening an operable version without the need of `procfs`. diff --git a/website/content/wishlist/at-empty-path-support-for-unlinkat.md b/website/content/wishlist/at-empty-path-support-for-unlinkat.md new file mode 100644 index 0000000..0e22b00 --- /dev/null +++ b/website/content/wishlist/at-empty-path-support-for-unlinkat.md @@ -0,0 +1,11 @@ +--- +title: "`AT_EMPTY_PATH` support for `unlinkat()`" +weight: 290 +status: wishlist +categories: + - processes +--- + +**Use-Case:** When dealing with files/directories, allow passing +around only a file descriptor without having to keep the path around +to be able to unlink the file/directory. diff --git a/website/content/wishlist/automatic-growing-of-btrfs-filesystems.md b/website/content/wishlist/automatic-growing-of-btrfs-filesystems.md new file mode 100644 index 0000000..9201579 --- /dev/null +++ b/website/content/wishlist/automatic-growing-of-btrfs-filesystems.md @@ -0,0 +1,30 @@ +--- +title: "Automatic growing of `btrfs` filesystems" +weight: 360 +status: wishlist +categories: + - block-devices + - filesystems +--- + +An *auto-grow* feature in `btrfs` would be excellent. + +If such a mode is enabled, `btrfs` would automatically grow a file +system up to the size of its backing block devices. Example: btrfs +is created with 200M in size on a block device 2G in size. Once the +file system is filled up fully, `btrfs` would automatically grow the +file system as need in the increments it needs, up to the 2G that +the backing block device is in size. + +**Use-Case:** This would allow creating minimal, compact file +systems: just create them small on a sparse block device, and copy +files into it, as needed, create subvolumes and whatever else is +desired. As long as only files are created and written (but not +modified) the resulting fs should be automatically minimal in size. +This would specifically be useful in `systemd-homed`, which +maintains per-user `btrfs` file systems backed by block +devices. Currently, `homed` grows the file systems manually on login +and then shrinks them again on logout, but this is less than ideal, +since btrfs places files all over the backing store, and thus the +shrinking will generate a lot of nonsensical IO that could be +reduced if the file system was always kept minimal in size anyway. diff --git a/website/content/wishlist/auxiliary-socket-message-describing-the-sender-s-cgroup.md b/website/content/wishlist/auxiliary-socket-message-describing-the-sender-s-cgroup.md new file mode 100644 index 0000000..a6800cf --- /dev/null +++ b/website/content/wishlist/auxiliary-socket-message-describing-the-sender-s-cgroup.md @@ -0,0 +1,21 @@ +--- +title: "Auxiliary socket message describing the sender's cgroup" +weight: 40 +status: wishlist +categories: + - cgroups + - sockets +--- + +`SCM_CGROUPID` or a similar auxiliary socket message, that allows +receivers to figure out which cgroup a sender is part of. + +**Use-Case:** `systemd-journald` picks up cgroup information from +logging clients, in order to augment log records and allow +filtering via this meta-information. In particular it derives +service identity from that (so that requests such as "Show me all +log messages of service X!" can be answered). This is currently +racy, since it uses `SCM_CREDS`' `.pid` field for this, which it then +used to load `/proc/$PID/cgroup`. In particular for programs that +log and immediately exit, the cgroup information frequently cannot +be acquired anymore by `systemd-journald`. diff --git a/website/content/wishlist/blobfs.md b/website/content/wishlist/blobfs.md new file mode 100644 index 0000000..810be6f --- /dev/null +++ b/website/content/wishlist/blobfs.md @@ -0,0 +1,22 @@ +--- +title: "blobfs" +weight: 260 +status: wishlist +categories: + - filesystems +--- + +[`blobfs`](https://fuchsia.dev/fuchsia-src/concepts/filesystems/blobfs) +for Linux. i.e. a minimalistic file system, that can store +authenticated (Verity) data files, that can be written once, and +not be modified after that, and provide stable handles (i.e. is +content-addressable) to them. + +**Use-Case:** This would deliver just about enough to place +trusted OS resources (binaries, kernels, initrds, fs trees, other +resources) in them, without having to trust the medium and IO +underneath. Should be simple enough to even implement in a boot +loader and similar, without making things vulnerable to rogue file +system image attacks. The OS and its payloads (apps, containers, +…) could then be composed from these resources, through means like +overlayfs, namespacing and more. diff --git a/website/content/wishlist/clock-monotonic-network-timestamps.md b/website/content/wishlist/clock-monotonic-network-timestamps.md new file mode 100644 index 0000000..12db0ec --- /dev/null +++ b/website/content/wishlist/clock-monotonic-network-timestamps.md @@ -0,0 +1,21 @@ +--- +title: "`CLOCK_MONOTONIC` network timestamps" +weight: 130 +status: wishlist +categories: + - sockets +--- + +Currently network timestamps are exclusively in `CLOCK_REALTIME`, even +though for many (most?) a monotonic clock would be much preferable, as +calculations become easier when one doesn't have to think about clock +jumps and similar. + +**Use-Case:** `systemd-journald` collects implicit timestamps via +`AF_UNIX` time-stamping, in `CLOCK_REALTIME`, even though for its +internal logic only monotonic timestamps are used, as log records +are searched via bisection in ordered tables, that require +strictly increasing timestamps. In particular during boot (where +`CLOCK_REALTIME` is often not available, stable or subject to +corrections) it would be good to have reliable, monotonic +timestamps on all log records. diff --git a/website/content/wishlist/clone-pidfd-autokill-semantics-for-pid-1.md b/website/content/wishlist/clone-pidfd-autokill-semantics-for-pid-1.md new file mode 100644 index 0000000..bdba0b0 --- /dev/null +++ b/website/content/wishlist/clone-pidfd-autokill-semantics-for-pid-1.md @@ -0,0 +1,37 @@ +--- +title: "`CLONE_PIDFD_AUTOKILL` semantics for PID 1" +weight: 20 +status: wishlist +categories: + - cgroups + - pidfd + - processes +--- + +Allow obtaining a `CLONE_PIDFD_AUTOKILL` pidfd for PID 1. Currently +PID 1 cannot hand off an autokill pidfd for itself. Allowing this would +make it possible to create system-death-traps where the lifetime of +PID 1 is tied to another process. PID 1 creates a `CLONE_PIDFD_AUTOKILL` +pidfd for itself, hands it off to another task, and closes its own copy. +If that other task exits, PID 1 is taken down. + +**Use-Case:** Tie the lifetime of PID 1 to a critical process such as a +software TPM or other security-sensitive daemon. This ensures the system +is brought down if the critical process dies, rather than continuing to +run in a potentially compromised state. + +**Considerations:** When PID 1 is spawned there is no mechanism to start +it with a pidfd right away. There are two possible approaches: + +1. Place a pidfd at file descriptor position 3 in PID 1's file descriptor + table before `exec()`, similar to how the coredump usermodehelper works. + After `exec()` PID 1 knows that it already has an autokill pidfd for + itself opened at fd 3. + +2. Allow opening an autokill pidfd via `pidfd_open()`. This would require + mutual exclusion with `CLONE_PIDFD_AUTOKILL`: if an autokill pidfd + already exists from `clone3()` then no new autokill pidfd can be + created via `pidfd_open()`. This guarantees clean semantics. + +Permission checking would have to be strict. It should probably only be +allowed for the current thread-group leader on itself. diff --git a/website/content/wishlist/determining-if-a-mount-point-belongs-to-the-current-user.md b/website/content/wishlist/determining-if-a-mount-point-belongs-to-the-current-user.md new file mode 100644 index 0000000..421a25a --- /dev/null +++ b/website/content/wishlist/determining-if-a-mount-point-belongs-to-the-current-user.md @@ -0,0 +1,27 @@ +--- +title: "Determining if a mount point belongs to the current user" +weight: 80 +status: wishlist +categories: + - mounts + - namespaces + - processes +--- + +Ability to determine if a mount point belongs to the current user +namespace, in order to check if there's a chance a process can +safely unmount it (as that only works for mounts owned by the same +user namespaces — or one further down the tree, but not any up the +tree). A simple, additional field in `/proc/self/mountinfo` +container the owning user namespace ID would probably already +suffice. + +**Use-Case:** the `systemd` system and service manager tries to unmount +all established mounts on shutdown. Inside of container +environments where specific mounts are established by the +container manager (and not the payload itself) this will +ultimately fail if user namespaces are enabled. In order to clean +up the shutdown logic it would be very good to be able to +determine whether a specific mount could even possibly be +unmounted or whether it's not worth the effort to include the +unmount in the system shutdown transaction. diff --git a/website/content/wishlist/device-cgroup-guard-to-allow-mknod-in-non-initial-userns.md b/website/content/wishlist/device-cgroup-guard-to-allow-mknod-in-non-initial-userns.md new file mode 100644 index 0000000..a940608 --- /dev/null +++ b/website/content/wishlist/device-cgroup-guard-to-allow-mknod-in-non-initial-userns.md @@ -0,0 +1,28 @@ +--- +title: "Device cgroup guard to allow `mknod()` in non-initial userns" +weight: 280 +status: wishlist +categories: + - cgroups + - filesystems + - mounts + - namespaces + - security +--- + +If a container manager restricts its unprivileged (user namespaced) +children by a device cgroup, it is not necessary to deny `mknod()` +anymore. Thus, user space applications may map devices on different +locations in the file system by using `mknod()` inside the container. + +**Use-Case:** A use case for this, which is applied by users of GyroidOS, +is to run `virsh` for VMs inside an unprivileged container. `virsh` or +libvirt creates device nodes, e.g., `/var/run/libvirt/qemu/11-fgfg.dev/null` +which currently fails in a non-initial userns, even if a cgroup device white +list with the corresponding major, minor of `/dev/null` exists. Thus, in +this case the usual bind mounts or pre populated device nodes under `/dev` +are not sufficient. + +An initial group internal RFC exists in +(https://github.com/quitschbo/linux/tree/devcg_guard_rfc). +See commit message for more implementation specific details. diff --git a/website/content/wishlist/excluding-processes-watched-via-pidfd-from-waitid-p-all.md b/website/content/wishlist/excluding-processes-watched-via-pidfd-from-waitid-p-all.md new file mode 100644 index 0000000..465ffaa --- /dev/null +++ b/website/content/wishlist/excluding-processes-watched-via-pidfd-from-waitid-p-all.md @@ -0,0 +1,22 @@ +--- +title: "Excluding processes watched via `pidfd` from `waitid(P_ALL, …)`" +weight: 110 +status: wishlist +categories: + - pidfd + - processes +--- + +**Use-Case:** various programs use `waitid(P_ALL, …)` to collect exit +information of exited child processes. In particular PID 1 and +processes using `PR_SET_CHILD_SUBREAPER` use this as they may +collect unexpected children that have been reparented from dying +sub-processes, and that need to be reaped in order to clean up the +PID space. Currently, these programs cannot easily mix waiting for +specific sub-processes via `pidfd` with waiting for the other +*unexpected* children via `waitid(P_ALL, …)` since the latter also +reaps (and thus invalidates) the pidfd-tracked +children. Specifically, the `systemd` service manager would like +to use `pidfd`s to remove PID recycling security issues, but +currently cannot as it also needs to generically wait for such +unexpected children. diff --git a/website/content/wishlist/extend-io-uring-with-classic-synchronous-system-calls.md b/website/content/wishlist/extend-io-uring-with-classic-synchronous-system-calls.md new file mode 100644 index 0000000..6437812 --- /dev/null +++ b/website/content/wishlist/extend-io-uring-with-classic-synchronous-system-calls.md @@ -0,0 +1,14 @@ +--- +title: "Extend `io_uring` with classic synchronous system calls" +weight: 240 +status: wishlist +categories: + - io-uring + - mounts + - namespaces +--- + +The `io_uring` subsystem is open to adding classic existing synchronous +system calls (e.g. `setns()` or `mount()` or other) to `io_uring`. +They also said they would support adding new functionality into +`io_uring` that is not exposed through system calls yet. diff --git a/website/content/wishlist/extend-mount-setattr-to-allow-changing-mount-properties-ignoring-any-failures.md b/website/content/wishlist/extend-mount-setattr-to-allow-changing-mount-properties-ignoring-any-failures.md new file mode 100644 index 0000000..5ad019f --- /dev/null +++ b/website/content/wishlist/extend-mount-setattr-to-allow-changing-mount-properties-ignoring-any-failures.md @@ -0,0 +1,12 @@ +--- +title: "Extend `mount_setattr()` to allow changing mount properties ignoring any failures" +weight: 160 +status: wishlist +categories: + - mounts +--- + +**Use-Case:** workloads that know that there are mounts in a mount tree +whose attributes cannot be changed by the caller don't want +`mount_settattr()` to fail on the first mount it failed to convert. Give +them a flag to request changes ignoring failures. diff --git a/website/content/wishlist/extend-setns-to-allow-attaching-to-all-new-namespaces-of-a-process.md b/website/content/wishlist/extend-setns-to-allow-attaching-to-all-new-namespaces-of-a-process.md new file mode 100644 index 0000000..d65f6f4 --- /dev/null +++ b/website/content/wishlist/extend-setns-to-allow-attaching-to-all-new-namespaces-of-a-process.md @@ -0,0 +1,18 @@ +--- +title: "Extend `setns()` to allow attaching to all new namespaces of a process" +weight: 200 +status: wishlist +categories: + - namespaces + - pidfd + - processes +--- + +Add an extension to `setns()` to allow attaching to all namespaces of +a process `SETNS_PIDFD_ALL` different from the caller's namespaces. +Currently specifying e.g., `CLONE_NEWUSER` fails if the caller is in the +same user namespace as the target process. This is very inconvenient. + +**Use-Case:** Make it trivial to attach to all namespaces of a process +without having to figure out whether the caller is already in the same +namespace or not. diff --git a/website/content/wishlist/filtering-on-received-file-descriptors.md b/website/content/wishlist/filtering-on-received-file-descriptors.md new file mode 100644 index 0000000..bf5c58b --- /dev/null +++ b/website/content/wishlist/filtering-on-received-file-descriptors.md @@ -0,0 +1,16 @@ +--- +title: "Filtering on received file descriptors" +weight: 100 +status: wishlist +categories: + - filesystems + - sockets +--- + +An alternative to the previous item could be if some form of filtering +could be enforced on the file descriptors suitable for enqueuing on +the `AF_UNIX` socket. i.e. allow filtering by superblock type or +similar, so that policies such as "only `memfd`s are OK to be +received" may be expressed. (BPF?). + +**Use-Case:** as above. diff --git a/website/content/wishlist/immutable-layers-for-overlayfs.md b/website/content/wishlist/immutable-layers-for-overlayfs.md new file mode 100644 index 0000000..70fcdc6 --- /dev/null +++ b/website/content/wishlist/immutable-layers-for-overlayfs.md @@ -0,0 +1,18 @@ +--- +title: "Immutable layers for `overlayfs`" +weight: 340 +status: wishlist +categories: + - filesystems +--- + +`overlayfs` should permit *immutable* layers, i.e. layers whose +non-directory inodes may not be overriden in an upper writable +layer. + +**Use-Case:** This would be useful when implementing `/etc/` as a +stack of overlayfs layers, each shipping configuration for a +different facet of the system, with a writable layer on the top for +local modifications. In such a scenario it would be useful to allow +the user to change any configuration it likes, except of the files +and other inodes shipped in the lower layers. diff --git a/website/content/wishlist/immutable-loopback-block-devices.md b/website/content/wishlist/immutable-loopback-block-devices.md new file mode 100644 index 0000000..53ab1e4 --- /dev/null +++ b/website/content/wishlist/immutable-loopback-block-devices.md @@ -0,0 +1,23 @@ +--- +title: "Immutable loopback block devices" +weight: 140 +status: wishlist +categories: + - block-devices + - filesystems + - mounts +--- + +Truly immutable loopback block devices. Right now setting up a +loopback block device in read-only mode, backed by a read-only +file (stored on a regular read/write file system), and then +mounting it with `ext4` also in `MS_RDONLY`mode *will* result in +changes to the file, quite unexpectedly 🤯. Ideally, if a loopback +block device is set up in read-only mode this should guarantee +that the backing file remains unmodified by it. + +**Use-Case:** disk image build tools that want to reproducibly and +verifiable build images must be able to rely that mounting them in +read-only mode does not alter the images in any way. In particular +when working in computer forensics one must be able to rely that +file systems that are analyzed remain unmodified by the analysis. diff --git a/website/content/wishlist/inotify-events-for-bsd-file-locks.md b/website/content/wishlist/inotify-events-for-bsd-file-locks.md new file mode 100644 index 0000000..f0e8634 --- /dev/null +++ b/website/content/wishlist/inotify-events-for-bsd-file-locks.md @@ -0,0 +1,26 @@ +--- +title: "inotify() events for BSD file locks" +weight: 30 +status: wishlist +categories: + - block-devices + - filesystems +--- + +BSD file locks (i.e. `flock()`, as opposed to POSIX `F_SETLK` and +friends are inode-focussed, hence would be great if one could get +asynchronous notification when they are released via inotify. + +**Use-Case:** udevd probes block devices whenever they pop up to +create /dev/disk/by-label/* and similar symlinks. Formatting tools +can temporarily block this behaviour by taking a BSD file lock on +the block device (as per https://systemd.io/BLOCK_DEVICE_LOCKING), +in order to make sure udevd doesn't probe file systems/partition +tables that are only partially initialized. Currently, udevd uses +inotify `IN_CLOSE_WRITE` notifications to detect whenever +applications close a block device after writing to it, and +automatically reprobes the device. This works reasonably OK given +that block devices are usually closed at the same time as their +BSD file lock is released, and vice versa. However, this is not +fully correct: what udevd actually should be watching is the locks +being released, not the devices being closed. diff --git a/website/content/wishlist/ioctl-api-for-overlayfs.md b/website/content/wishlist/ioctl-api-for-overlayfs.md new file mode 100644 index 0000000..f11a7c6 --- /dev/null +++ b/website/content/wishlist/ioctl-api-for-overlayfs.md @@ -0,0 +1,23 @@ +--- +title: "`ioctl()` API for `overlayfs`" +weight: 350 +status: wishlist +categories: + - block-devices + - filesystems +--- + +`overlayfs` should have an `ioctl()`-based API (or similar) for +querying information of the backing file systems/block devices + +**Use-Case:** In systemd in various areas we automatically find the +block device backing the root file system and other file systems +(Example: `systemd-gpt-auto-generator` or `bootctl` wull try to find +auxiliary file systems of the OS image by looking in the GPT +partition table the root file system is located in). While this +logic is good enough to find the backing block devices of some more +complex storage such as dm-crypt, dm-verity or btrfs, once +`overlayfs` is used as backing for the root file system this logic +does not work anymore. It would be great if there was an API to +simply query `overlayfs` for the superblock information +(i.e. `.st_dev`) of the backing layers. diff --git a/website/content/wishlist/linking-of-o-tmpfile-files-with-replacement.md b/website/content/wishlist/linking-of-o-tmpfile-files-with-replacement.md new file mode 100644 index 0000000..5b097a4 --- /dev/null +++ b/website/content/wishlist/linking-of-o-tmpfile-files-with-replacement.md @@ -0,0 +1,21 @@ +--- +title: "Linking of `O_TMPFILE` files with replacement" +weight: 50 +status: wishlist +categories: + - filesystems +--- + +Ability to link an `O_TMPFILE` file into a directory while *replacing* an +existing file. (Currently there's only the ability to link it in, if the +file name doesn't exist yet.) + +**Use-Case:** there are many programs (e.g. `systemd-hostnamed` +when updating `/etc/hostname`) that atomically want to update a +file, so that either the old or the new version is in place, but +never a partially updated one. The canonical way to do this is by +creating a temporary file with the new contents, and then renaming +it to the filename of the file to update, thus atomically replacing +it. Currently, the temporary file for this must be created with a +random name, `O_TMPFILE` cannot be used, since for these files +atomic-replace is not supported, currently. diff --git a/website/content/wishlist/make-quotas-work-with-user-namespaces.md b/website/content/wishlist/make-quotas-work-with-user-namespaces.md new file mode 100644 index 0000000..ce3945c --- /dev/null +++ b/website/content/wishlist/make-quotas-work-with-user-namespaces.md @@ -0,0 +1,18 @@ +--- +title: "Make quotas work with user namespaces" +weight: 180 +status: wishlist +categories: + - mounts + - namespaces + - security +--- + +The quota codepaths in the kernel currently broken and inconsistent +and most interesting operations are guarded behind +`capable(CAP_SYS_ADMIN)`, i.e., require `CAP_SYS_ADMIN` in the initial +user namespace. We should rework these codepaths to work with user +namespaces and then see whether we can make them work with idmapped +mounts. + +**Use-Case:** using quotas correctly in containers. diff --git a/website/content/wishlist/map-different-uids-gids-to-the-same-uids-gids.md b/website/content/wishlist/map-different-uids-gids-to-the-same-uids-gids.md new file mode 100644 index 0000000..3772ee0 --- /dev/null +++ b/website/content/wishlist/map-different-uids-gids-to-the-same-uids-gids.md @@ -0,0 +1,17 @@ +--- +title: "Map different uids/gids to the same uids/gids?" +weight: 250 +status: wishlist +categories: + - filesystems + - mounts +--- + +Explore the idea of mapping different uids/gids to the same uids/gids, i.e. +65534:1000:1 50000:1000:1. This will only work if the mount is read-only as +the kernel wouldn't know what uid/gid would need to be put to disk +otherwise (65534? 50000? the first one that is mapped?). + +**Use-Case:** Delegate multiple {g,u}ids to the same user. Merging +ownership similar to how overlayfs merges files. Bindfs +(https://bindfs.org/docs/bindfs.1.html#sect3) allows this concept too. diff --git a/website/content/wishlist/namespaced-loop-and-block-devices.md b/website/content/wishlist/namespaced-loop-and-block-devices.md new file mode 100644 index 0000000..2d70077 --- /dev/null +++ b/website/content/wishlist/namespaced-loop-and-block-devices.md @@ -0,0 +1,14 @@ +--- +title: "Namespaced loop and block devices" +weight: 270 +status: wishlist +categories: + - block-devices + - mounts + - namespaces +--- + +Namespace-able loop and block devices, usable inside user namespaces. + +**Use-Case:** Allow mounting images inside nspawn containers, and using +RootImage= and friends in the systemd user manager. diff --git a/website/content/wishlist/open-thread-group-leader-via-pidfd-open.md b/website/content/wishlist/open-thread-group-leader-via-pidfd-open.md new file mode 100644 index 0000000..011f765 --- /dev/null +++ b/website/content/wishlist/open-thread-group-leader-via-pidfd-open.md @@ -0,0 +1,18 @@ +--- +title: "Open thread-group leader via `pidfd_open()`" +weight: 380 +status: wishlist +categories: + - pidfd +--- + +Extend `pidfd_open()` to allow opening the thread-group leader based on the +PID of an individual thread. Currently we do support: + +1. `pidfd_open(1234, 0)` on a thread-group leader PID +2. `pidfd_open(1234, PIDFD_THREAD)` on a thread + +Add an option to go from individual thread to thread-group leader. + +**Use-Case:** Allow for a race free way to go from individual thread +to thread-group leader pidfd. diff --git a/website/content/wishlist/path-based-acl-management-in-an-lsm-hook.md b/website/content/wishlist/path-based-acl-management-in-an-lsm-hook.md new file mode 100644 index 0000000..0eb0f11 --- /dev/null +++ b/website/content/wishlist/path-based-acl-management-in-an-lsm-hook.md @@ -0,0 +1,17 @@ +--- +title: "Path-based ACL management in an LSM hook" +weight: 330 +status: wishlist +categories: + - filesystems + - mounts + - security +--- + +The LSM module API should have the ability to do path-based (not +just inode-based) ACL management. + +**Use-Case:** This would be useful in BPF-LSM modules such as +systemd's `mntfsd` which allows unprivileged file system mounts in +some cases, and which would like to restrict ACL handling based on +the superblock involved. diff --git a/website/content/wishlist/per-cgroup-limit-for-coredump-sizes.md b/website/content/wishlist/per-cgroup-limit-for-coredump-sizes.md new file mode 100644 index 0000000..f0e8ee3 --- /dev/null +++ b/website/content/wishlist/per-cgroup-limit-for-coredump-sizes.md @@ -0,0 +1,18 @@ +--- +title: "Per-cgroup limit for coredump sizes" +weight: 220 +status: wishlist +categories: + - cgroups + - processes +--- + +A per-cgroup knob for coredump sizes. Currently coredump size +control is strictly per process, and primarily under control of +the processes themselves. It would be good if we had a per-cgroup +knob instead, that is under control of the service manager. + +**Use-Case:** coredumps can be heavy to generate. For different +usecases it would be good to be able to opt-in or opt-out +dynamically from coredumps for specific services, at runtime +without restarting them. diff --git a/website/content/wishlist/race-free-creation-and-opening-of-non-file-inodes.md b/website/content/wishlist/race-free-creation-and-opening-of-non-file-inodes.md new file mode 100644 index 0000000..c076de4 --- /dev/null +++ b/website/content/wishlist/race-free-creation-and-opening-of-non-file-inodes.md @@ -0,0 +1,31 @@ +--- +title: "Race-free creation and opening of non-file inodes" +weight: 230 +status: wishlist +categories: + - filesystems + - sockets +--- + +A way to race-freely create an (non-file) inode and immediately +open it. For regular files we have open(O_CREAT) for creating a +new file inode, and returning a pinning fd to it. This is missing +for other inode types, such as directories, device nodes, +FIFOs. The lack of such functionality means that when populating a +directory tree there's always a race involved: the inodes first +need to be created, and then opened to adjust their +permissions/ownership/labels/timestamps/acls/xattrs/…, but in the +time window between the creation and the opening they might be +replaced by something else. Addressing this race without proper +APIs is possible (by immediately fstat()ing what was opened, to +verify that it has the right inode type), but difficult to get +right. Hence, mkdirat_fd() that creates a directory *and* returns +an O_DIRECTORY fd to it would be great. As would be mknodeat_fd() +that creates a device node, FIFO or (dead) socket and returns an +O_PATH fd to it. And of course symlinkat_fd() that creates a +symlink and returns an O_PATH fd to it. + +**Use-Case:** any program that creates/unpacks not just files, but +directories, device nodes, fifos, and wants to ensure that they +safely get the right attributes applied, even if other code might +simultaneously have access to the same directory tree. diff --git a/website/content/wishlist/race-free-mounting-of-block-devices.md b/website/content/wishlist/race-free-mounting-of-block-devices.md new file mode 100644 index 0000000..25f1116 --- /dev/null +++ b/website/content/wishlist/race-free-mounting-of-block-devices.md @@ -0,0 +1,18 @@ +--- +title: "Race-free mounting of block devices" +weight: 300 +status: wishlist +categories: + - block-devices + - filesystems + - mounts +--- + +Introduce a new struct to `fsconfig()` as an alternative to the +`source` property. The struct contains at least a pointer to a path, +possibly a device minor and major, and a diskseq number. The VFS can +expose a helper that filesystems can call and use the diskseq number +to verify that the block device they are intending to mount is indeed +the one they want to mount. + +**Use-Case:** Race-free mounting of block devices. diff --git a/website/content/wishlist/read-only-propagation-of-mounts.md b/website/content/wishlist/read-only-propagation-of-mounts.md new file mode 100644 index 0000000..255e315 --- /dev/null +++ b/website/content/wishlist/read-only-propagation-of-mounts.md @@ -0,0 +1,29 @@ +--- +title: "Read-only propagation of mounts" +weight: 90 +status: wishlist +categories: + - mounts + - namespaces +--- + +A way to mark mounts that receive mount propagation events from +elsewhere so that these propagated mounts are established +read-only implicitly. Right now, if a mount receives a mount +propagation event it will have the exact same `MS_RDONLY`, +`MS_NODEV`, … flags as it has where it originated. It would be +very useful if an `MS_RDONLY` could be ORed into the mount flags +automatically whenever propagated elsewhere. + +**Use-Case:** various mount namespace based sandboxes +(e.g. `systemd`'s `ProtectSystem=` option) mark large parts of the +host file hierarchy read-only via mounting it +`MS_RDONLY|MS_BIND|MS_REMOUNT`, but generally intend to leave the +file hierarchy besides that the way it is, and that includes they +typically still want to be able to receive mount events to +directories such as `/mnt/` and `/media/` in these sandboxed +environments. Right now, any such propagation then happens in +writable mode, even if the file hierarchy otherwise is almost +entirely read-only. To close this gap it would be great if such +propagated mounts could implicitly gain `MS_RDONLY` as they are +propagated. diff --git a/website/content/wishlist/reasonable-eof-on-sock-seqpacket.md b/website/content/wishlist/reasonable-eof-on-sock-seqpacket.md new file mode 100644 index 0000000..bb35f21 --- /dev/null +++ b/website/content/wishlist/reasonable-eof-on-sock-seqpacket.md @@ -0,0 +1,17 @@ +--- +title: "Reasonable EOF on SOCK_SEQPACKET" +weight: 400 +status: wishlist +categories: + - sockets +--- + +Zero size datagrams cannot be distinguished from EOF on +`SOCK_SEQPACKET`. Both will cause `recvmsg()` to return zero. + +Idea how to improve things: maybe define a new MSG_XYZ flag for this, +which causes either of the two cases result in some recognizable error +code returned rather than a 0. + +**Use-Case:** Any code that wants to use `SOCK_SEQPACKET` and cannot +effort disallowing zero sized datagrams from their protocol. diff --git a/website/content/wishlist/reasonable-handling-of-selinux-dropping-scm-rights-fds.md b/website/content/wishlist/reasonable-handling-of-selinux-dropping-scm-rights-fds.md new file mode 100644 index 0000000..0b224c8 --- /dev/null +++ b/website/content/wishlist/reasonable-handling-of-selinux-dropping-scm-rights-fds.md @@ -0,0 +1,19 @@ +--- +title: "Reasonable Handling of SELinux dropping SCM_RIGHTS fds" +weight: 410 +status: wishlist +categories: + - sockets +--- + +Currently, if SELinux refuses to let some file descriptor through, it +will just drop them from the `SCM_RIGHTS` array. That's a terrible +idea, since applications rely on the precise arrangement of the array +to know which fd is which. By dropping entries silently, these apps +will all break. + +Idea how to improve things: leave the elements in the array in place, +but return a marker instead (i.e. negative integer, maybe `-EPERM`) that +tells userspace that there was an fd, but it was not allowed through. + +**Use-Case:** Any code that wants to use `SCM_RIGHTS` properly. diff --git a/website/content/wishlist/security-hook-for-mount-setattr.md b/website/content/wishlist/security-hook-for-mount-setattr.md new file mode 100644 index 0000000..2d20523 --- /dev/null +++ b/website/content/wishlist/security-hook-for-mount-setattr.md @@ -0,0 +1,13 @@ +--- +title: "Security hook for `mount_setattr()`" +weight: 210 +status: wishlist +categories: + - mounts + - security +--- + +(kAPI) Add security hook to `mount_setattr()`. + +**Use-Case:** Allow LSMs to make decisions about what mount properties to +allow and what to deny. diff --git a/website/content/wishlist/specification-of-a-keyring-for-dm-verity-volume-verification.md b/website/content/wishlist/specification-of-a-keyring-for-dm-verity-volume-verification.md new file mode 100644 index 0000000..accdeaf --- /dev/null +++ b/website/content/wishlist/specification-of-a-keyring-for-dm-verity-volume-verification.md @@ -0,0 +1,17 @@ +--- +title: "Specification of a keyring for dm-verity volume verification" +weight: 320 +status: wishlist +categories: + - block-devices +--- + +When activating a dm-verity volume allow specifying keyring to +validate root hash signature against. + +**Use-Case:** In systemd, we'd like to authenticate Portable Service +images, system extension images, configuration images, container +images with different keys, as they typically originate from +different sources and it should not be possible to generate a +system extension with a key pair that is supposed to be good for +container images only. diff --git a/website/content/wishlist/unlinking-via-two-file-descriptors.md b/website/content/wishlist/unlinking-via-two-file-descriptors.md new file mode 100644 index 0000000..34f3845 --- /dev/null +++ b/website/content/wishlist/unlinking-via-two-file-descriptors.md @@ -0,0 +1,22 @@ +--- +title: "Unlinking via two file descriptors" +weight: 70 +status: wishlist +categories: + - filesystems + - processes +--- + +`unlinkat3(dir_fd, name, inode_fd)`: taking one file descriptor +for the directory to remove a file in, and another one referring +to the inode of the filename to remove. This call should only +succeed if the specified filename still refers to the specified +inode. + +**Use-Case:** code that operates on a well-know path that might be +shared by multiple programs that jointly manage it might want to +safely remove a filename under the guarantee it still refers to +the expected inode. As a specific example, consider lock files, +that should be cleaned up only if they still refer to the assumed +owner's instance, but leave the file in place if another process +already took over the filename. diff --git a/website/content/wishlist/upgrade-masks-in-openat2.md b/website/content/wishlist/upgrade-masks-in-openat2.md new file mode 100644 index 0000000..c024a02 --- /dev/null +++ b/website/content/wishlist/upgrade-masks-in-openat2.md @@ -0,0 +1,13 @@ +--- +title: "Upgrade masks in `openat2()`" +weight: 170 +status: wishlist +categories: + - security +--- + +Add upgrade masks to `openat2()`. Extend `struct open_how` to allow +restrict re-opening of file descriptors. + +**Use-Case:** block services or containers from re-opening/upgrading an +`O_PATH` file descriptor through e.g. `/proc//fd/ Date: Mon, 9 Mar 2026 14:50:08 +0100 Subject: [PATCH 2/6] website: update Hugo config for multi-page content structure Remove the [modules] mount that mapped the root README.md as the index page. Content now lives directly in website/content/ with proper directory hierarchy. Add category taxonomy support for filtering features by subsystem. Enable built-in search. Update the edit link template to point at the correct paths under website/. Refresh the archetype template with the new front matter fields (status, categories). Signed-off-by: Christian Brauner --- website/archetypes/default.md | 6 ++++-- website/config.toml | 20 +++++++++---------- website/layouts/partials/docs/links/edit.html | 2 +- 3 files changed, 15 insertions(+), 13 deletions(-) diff --git a/website/archetypes/default.md b/website/archetypes/default.md index 00e77bd..8d45989 100644 --- a/website/archetypes/default.md +++ b/website/archetypes/default.md @@ -1,6 +1,8 @@ --- title: "{{ replace .Name "-" " " | title }}" -date: {{ .Date }} -draft: true +weight: 10 +status: wishlist +categories: [] --- +**Use-Case:** diff --git a/website/config.toml b/website/config.toml index 8702d75..37c7a2a 100644 --- a/website/config.toml +++ b/website/config.toml @@ -1,10 +1,13 @@ baseURL = "https://uapi-group.org/kernel-features" languageCode = "en-us" -title = "UAPI Group Kernel Feature Ideas" +title = "UAPI Group Kernel Feature Wishlist" theme = "hugo-book" +[taxonomies] + category = "categories" + [[menu.before]] - name = "⬅️ Back to top" + name = "Back to top" url = "/.." weight = 1 @@ -14,17 +17,14 @@ theme = "hugo-book" weight = 11 [markup.goldmark.renderer] - unsafe = true # Allow HTML in md files + unsafe = true [params] BookPortableLinks = true BookRepo = 'https://github.com/uapi-group/kernel-features' BookCommitPath = 'commit' - BookEditLink = '{{ .Site.Params.BookRepo }}/edit/main/{{ .Path }}' - BookIndexPage = 'README.md' + BookEditLink = '{{ .Site.Params.BookRepo }}/edit/main/website/{{ .Path }}' + BookSection = '/' BookTheme = 'auto' - -[modules] - [[module.mounts]] - source = '../README.md' - target = 'content/_index.md' + BookSearch = true + BookDateFormat = "January 2, 2006" diff --git a/website/layouts/partials/docs/links/edit.html b/website/layouts/partials/docs/links/edit.html index 02e8501..8068105 100644 --- a/website/layouts/partials/docs/links/edit.html +++ b/website/layouts/partials/docs/links/edit.html @@ -1,5 +1,5 @@ {{- return (partial "docs/text/template" (dict "Template" .Site.Params.BookEditLink "Context" (dict "Site" .Site "Page" .Page -"Path" (strings.TrimPrefix hugo.WorkingDir ( replace .Page.File.Path "_index.md" .Site.Params.BookIndexPage)) +"Path" .Page.File.Path )) | urls.JoinPath) -}} From 7481065bf6e3259114e1d6eaedec1a9ff40a9f22 Mon Sep 17 00:00:00 2001 From: Christian Brauner Date: Mon, 9 Mar 2026 14:50:17 +0100 Subject: [PATCH 3/6] website: add status badges and category pills Add a content-before partial that renders a colored status badge (wishlist=blue, in-progress=yellow, completed=green) and clickable category pills at the top of each feature page. Completed items also show their commit SHA. Update _custom.scss with the badge, pill, and typography styles. Slightly widen the sidebar and TOC in _variables.scss. Signed-off-by: Christian Brauner --- website/assets/_custom.scss | 74 ++++++++++++++++++- website/assets/_variables.scss | 5 +- .../partials/docs/inject/content-before.html | 19 +++++ 3 files changed, 92 insertions(+), 6 deletions(-) create mode 100644 website/layouts/partials/docs/inject/content-before.html diff --git a/website/assets/_custom.scss b/website/assets/_custom.scss index 3859722..fdbda08 100644 --- a/website/assets/_custom.scss +++ b/website/assets/_custom.scss @@ -1,5 +1,73 @@ -nav>ul:last-of-type>li:last-child::before { - content: "⸻"; +// Status badges +.status-badge { display: inline-block; - margin: 0.5em 0; + padding: 0.15em 0.6em; + font-size: 0.8em; + font-weight: 600; + border-radius: 4px; + margin-bottom: 0.5em; + text-transform: uppercase; + letter-spacing: 0.03em; +} + +.status-in-progress { + background-color: #fff3cd; + color: #856404; +} + +.status-wishlist { + background-color: #d1ecf1; + color: #0c5460; +} + +.status-completed { + background-color: #d4edda; + color: #155724; +} + +// Category pills +.category-pills { + margin-bottom: 1em; +} + +.category-pill { + display: inline-block; + padding: 0.1em 0.5em; + font-size: 0.75em; + border-radius: 3px; + margin-right: 0.3em; + margin-bottom: 0.3em; + background-color: var(--gray-100); + color: var(--gray-500); + text-decoration: none; + + &:hover { + background-color: var(--gray-200); + text-decoration: none; + } +} + +// Separator before the last menu item +.book-menu nav > ul > li:last-child { + border-top: 1px solid var(--gray-200); + padding-top: 0.5em; + margin-top: 0.5em; +} + +// Typography refinements +.book-page { + line-height: 1.7; +} + +.markdown h1 { + margin-top: 0; +} + +// Commit reference styling +.commit-ref { + font-family: monospace; + font-size: 0.85em; + background-color: var(--gray-100); + padding: 0.15em 0.4em; + border-radius: 3px; } diff --git a/website/assets/_variables.scss b/website/assets/_variables.scss index bc8d840..ef64cfd 100644 --- a/website/assets/_variables.scss +++ b/website/assets/_variables.scss @@ -1,5 +1,4 @@ $body-min-width: 20rem; $container-max-width: 110rem; - -$menu-width: 15rem; -$toc-width: 15rem; +$menu-width: 16rem; +$toc-width: 16rem; diff --git a/website/layouts/partials/docs/inject/content-before.html b/website/layouts/partials/docs/inject/content-before.html new file mode 100644 index 0000000..2825159 --- /dev/null +++ b/website/layouts/partials/docs/inject/content-before.html @@ -0,0 +1,19 @@ +{{ with .Params.status }} + {{ $label := . }} + {{ if eq . "in-progress" }}{{ $label = "In Progress" }}{{ end }} + {{ if eq . "wishlist" }}{{ $label = "Wishlist" }}{{ end }} + {{ if eq . "completed" }}{{ $label = "Completed" }}{{ end }} + {{ $label }} +{{ end }} + +{{ with .Params.categories }} +
+ {{ range . }} + {{ . }} + {{ end }} +
+{{ end }} + +{{ with .Params.commit }} +

{{ . }}

+{{ end }} From 53f003ae73a8943cb5a1d0123b60689390e65f68 Mon Sep 17 00:00:00 2001 From: Christian Brauner Date: Mon, 9 Mar 2026 14:50:25 +0100 Subject: [PATCH 4/6] Replace monolithic README with project overview The root README.md is now a concise project overview pointing to the website, CONTRIBUTING.md, and local development instructions. The feature content has moved to individual pages under website/content/. Update the website README.md to reflect the new content layout. Signed-off-by: Christian Brauner --- README.md | 1209 +-------------------------------------------- website/README.md | 48 +- 2 files changed, 56 insertions(+), 1201 deletions(-) diff --git a/README.md b/README.md index ce45731..5c972cf 100644 --- a/README.md +++ b/README.md @@ -1,1199 +1,42 @@ -# Kernel Features 🤞 🎁 🙏 +# Kernel Feature Wishlist -This is a list of kernel features that would be useful to have. The items on -the list are strictly ideas. It is especially important to not take the items -on this list as being implementation requests. Some of the ideas on this list -are rather rough and unrefined. They serve as entry points for exploring the -associated problem space. +A curated collection of kernel feature ideas maintained by the +[UAPI Group](https://uapi-group.org/). Browse the full list at +**[uapi-group.org/kernel-features](https://uapi-group.org/kernel-features/)**. -* **When implementing ideas on this list or ideas inspired by this list - please point that out explicitly and clearly in the associated patches - and Cc `Christian Brauner `.** -**Usecase:** There are two syslog RFCs in use today: RFC3164 and -RFC5424. `glibc`'s `syslog()` API generates events close to the -former, but there are programs which would like to generate the -latter instead (as it supports structured logging). The two formats -are not backwards compatible: a client sending RFC5424 messages to a -server only understanding RFC3164 will cause an ugly mess. On Linux -there's only a single `/dev/log` AF_UNIX/SOCK_DGRAM socket backing -`syslog()`, which is used in a one-way, fire-and-forget style. This -means that feature negotation is not really possible within the -protocol. Various tools bind mount the socket inode into `chroot()` -and container environments, hence it would be fantastic to associate -supported feature information directly with the inode (and thus -outside of the protocol) to make it easy for clients to determine -which features are spoken on a socket, in a way that survives bind -mounts. Implementation idea would be that syslog daemons -implementing RFC5425 could simply set an xattr `user.rfc5424` to `1` -(or something like that) on the socket inode, and clearly inform -clients in a natural and simple way that they'd be happy to parse -the newer format. Also see: -https://github.com/systemd/systemd/issues/19251 – This idea could -also be extended to other sockets and other protocols: by setting -some extended attribute on a socket inodes, services could advertise -which protocols they support on them. For example D-Bus sockets -could carry `user.dbus` set to `1`, and Varlink sockets -`user.varlink` set to `1` and so on. - -### Support detached mounts with `pivot_root()` - -The new rootfs must currently refer to an attached mount. This restriction -seems unnecessary. We should allow the new rootfs to refer to a detached -mount. - -This will allow a service- or container manager to create a new rootfs as -a detached, private mount that isn't exposed anywhere in the filesystem and -then `pivot_root()` into it. - -Since `pivot_root()` only takes path arguments the new rootfs would need to -be passed via `/proc//fd/`. In the long run we should add a new -`pivot_root()` syscall operating on file descriptors instead of paths. - -### Create mount namespace with custom rootfs via `open_tree()` and `fsmount()` - -Add `OPEN_TREE_NAMESPACE` flag to `open_tree()` and `FSMOUNT_NAMESPACE` flag -to `fsmount()` that create a new mount namespace with the specified mount tree -as the rootfs mounted on top of a copy of the real rootfs. These return a -namespace file descriptor instead of a mount file descriptor. - -This allows `OPEN_TREE_NAMESPACE` to function as a combined -`unshare(CLONE_NEWNS)` and `pivot_root()`. - -When creating containers the setup usually involves using `CLONE_NEWNS` via -`clone3()` or `unshare()`. This copies the caller's complete mount namespace. -The runtime will also assemble a new rootfs and then use `pivot_root()` to -switch the old mount tree with the new rootfs. Afterward it will recursively -unmount the old mount tree thereby getting rid of all mounts. - -Copying all of these mounts only to get rid of them later is wasteful. With a -large mount table and a system where thousands of containers are spawned in -parallel this quickly becomes a bottleneck increasing contention on the -semaphore. - -**Use-Case:** Container runtimes can create an extremely minimal rootfs -directly: - -```c -fd_mntns = open_tree(-EBADF, "/var/lib/containers/wootwoot", OPEN_TREE_NAMESPACE); -``` - -This creates a mount namespace where "wootwoot" has become the rootfs. The -caller can `setns()` into this new mount namespace and assemble additional -mounts without copying and destroying the entire parent mount table. - -### Add immutable rootfs (`nullfs`) - -Currently `pivot_root()` doesn't work on the real rootfs because it -cannot be unmounted. Userspace has to do a recursive removal of the -initramfs contents manually before continuing the boot. - -Add an immutable rootfs called `nullfs` that serves as the parent mount -for anything that is actually useful such as the tmpfs or ramfs for -initramfs unpacking or the rootfs itself. The kernel mounts a -tmpfs/ramfs on top of it, unpacks the initramfs and fires up userspace -which mounts the rootfs and can then simply do: - -```c -chdir(rootfs); -pivot_root(".", "."); -umount2(".", MNT_DETACH); -``` - -This also means that the rootfs mount in unprivileged namespaces doesn't -need to become `MNT_LOCKED` anymore as it's guaranteed that the -immutable rootfs remains permanently empty so there cannot be anything -revealed by unmounting the covering mount. - -**Use-Case:** Simplifies the boot process by enabling `pivot_root()` to -work directly on the real rootfs. Removes the need for traditional -`switch_root` workarounds. In the future this also allows us to create -completely empty mount namespaces without risking to leak anything. - -### Allow `MOVE_MOUNT_BENEATH` on the rootfs - -Allow `MOVE_MOUNT_BENEATH` to target the caller's rootfs, enabling -root-switching without `pivot_root(2)`. The traditional approach to -switching the rootfs involves `pivot_root(2)` or a `chroot_fs_refs()`-based -mechanism that atomically updates `fs->root` for all tasks sharing the -same `fs_struct`. This has consequences for `fork()`, `unshare(CLONE_FS)`, -and `setns()`. - -Instead, decompose root-switching into individually atomic, locally-scoped -steps: - -```c -fd_tree = open_tree(-EBADF, "/newroot", - OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC); -fchdir(fd_tree); -move_mount(fd_tree, "", AT_FDCWD, "/", - MOVE_MOUNT_BENEATH | MOVE_MOUNT_F_EMPTY_PATH); -chroot("."); -umount2(".", MNT_DETACH); -``` - -Since each step only modifies the caller's own state, the -`fork()`/`unshare()`/`setns()` races are eliminated by design. - -To make this work, `MNT_LOCKED` is transferred from the top mount to the -mount beneath. The new mount takes over the job of protecting the parent -mount from being revealed. This also makes it possible to safely modify -an inherited mount table after `unshare(CLONE_NEWUSER | CLONE_NEWNS)`: +## Local development ```sh -mount --beneath -t tmpfs tmpfs /proc -umount -l /proc +git clone --recurse-submodules https://github.com/uapi-group/kernel-features.git +cd kernel-features/website +hugo server ``` -**Use-Case:** Containers created with `unshare(CLONE_NEWUSER | CLONE_NEWNS)` -can reshuffle an inherited mount table safely. `MOVE_MOUNT_BENEATH` on the -rootfs makes it possible to switch out the rootfs without the costly -`pivot_root(2)` and without cross-namespace vulnerabilities. - -### Query mount information via file descriptor with `statmount()` - -Extend `struct mnt_id_req` to accept a file descriptor and introduce -`STATMOUNT_BY_FD` flag. When a valid fd is provided and `STATMOUNT_BY_FD` -is set, `statmount()` returns mount info about the mount the fd is on. - -This works even for "unmounted" mounts (mounts that have been unmounted using -`umount2(mnt, MNT_DETACH)`), if you have access to a file descriptor on that -mount. These unmounted mounts will have no mountpoint and no valid mount -namespace, so `STATMOUNT_MNT_POINT` and `STATMOUNT_MNT_NS_ID` are unset in -`statmount.mask` for such mounts. - -**Use-Case:** Query mount information directly from a file descriptor without -needing the mount ID, which is particularly useful for detached or unmounted -mounts. - -### `AT_EMPTY_PATH` support `for unlinkat()` - -**Use-Case:** When dealing with files/directories, allow passing -around only a file descriptor without having to keep the path around -to be able to unlink the file/directory. - -### `AT_EMPTY_PATH` support for `openat()` and `openat2()` - -To get an operable version of an `O_PATH` file descriptors, it is -possible to use `openat(fd, ".", O_DIRECTORY)` for directories, but -other files currently require going through -`open("/proc//fd/")` which depends on a functioning `procfs`. - -FreeBSD already has `O_EMPTY_PATH` for `openat`, while `fstatat` and -similar functions have `AT_EMPTY_PATH`. - -**Use-Case:** When dealing with `O_PATH` file descriptors, allow -re-opening an operable version without the need of `procfs`. - ---- - -### TODO - -### xattrs for pidfd - -Since pidfds have been moved to a separate pidfs filesystem it is easy -to add support for xattrs on pidfds. That could be valuable to store -meta information along the pidfd. Storing an xattr should probably make -the pidfd automatically persistent, i.e., the reference for the dentry -is only put once the task is reaped. - -**Use-Case:** Store meta information alongside pidfds. - -### `CLONE_PIDFD_AUTOKILL` semantics for PID 1 - -Allow obtaining a `CLONE_PIDFD_AUTOKILL` pidfd for PID 1. Currently -PID 1 cannot hand off an autokill pidfd for itself. Allowing this would -make it possible to create system-death-traps where the lifetime of -PID 1 is tied to another process. PID 1 creates a `CLONE_PIDFD_AUTOKILL` -pidfd for itself, hands it off to another task, and closes its own copy. -If that other task exits, PID 1 is taken down. - -**Use-Case:** Tie the lifetime of PID 1 to a critical process such as a -software TPM or other security-sensitive daemon. This ensures the system -is brought down if the critical process dies, rather than continuing to -run in a potentially compromised state. - -**Considerations:** When PID 1 is spawned there is no mechanism to start -it with a pidfd right away. There are two possible approaches: - -1. Place a pidfd at file descriptor position 3 in PID 1's file descriptor - table before `exec()`, similar to how the coredump usermodehelper works. - After `exec()` PID 1 knows that it already has an autokill pidfd for - itself opened at fd 3. - -2. Allow opening an autokill pidfd via `pidfd_open()`. This would require - mutual exclusion with `CLONE_PIDFD_AUTOKILL`: if an autokill pidfd - already exists from `clone3()` then no new autokill pidfd can be - created via `pidfd_open()`. This guarantees clean semantics. - -Permission checking would have to be strict. It should probably only be -allowed for the current thread-group leader on itself. - -### inotify() events for BSD file locks - -BSD file locks (i.e. `flock()`, as opposed to POSIX `F_SETLK` and -friends are inode-focussed, hence would be great if one could get -asynchronous notification when they are released via inotify. - -**Use-Case:** udevd probes block devices whenever they pop up to -create /dev/disk/by-label/* and similar symlinks. Formatting tools -can temporarily block this behaviour by taking a BSD file lock on -the block device (as per https://systemd.io/BLOCK_DEVICE_LOCKING), -in order to make sure udevd doesn't probe file systems/partition -tables that are only partially initialized. Currently, udevd uses -inotify `IN_CLOSE_WRITE` notifications to detect whenever -applications close a block device after writing to it, and -automatically reprobes the device. This works reasonably OK given -that block devices are usually closed at the same time as their -BSD file lock is released, and vice versa. However, this is not -fully correct: what udevd actually should be watching is the locks -being released, not the devices being closed. - -### Auxiliary socket message describing the sender's cgroup - -`SCM_CGROUPID` or a similar auxiliary socket message, that allows -receivers to figure out which cgroup a sender is part of. - -**Use-Case:** `systemd-journald` picks up cgroup information from -logging clients, in order to augment log records and allow -filtering via this meta-information. In particular it derives -service identity from that (so that requests such as "Show me all -log messages of service X!" can be answered). This is currently -racy, since it uses `SCM_CREDS`' `.pid` field for this, which it then -used to load `/proc/$PID/cgroup`. In particular for programs that -log and immediately exit, the cgroup information frequently cannot -be acquired anymore by `systemd-journald`. - -### Linking of `O_TMPFILE` files with replacement - -Ability to link an `O_TMPFILE` file into a directory while *replacing* an -existing file. (Currently there's only the ability to link it in, if the -file name doesn't exist yet.) - -**Use-Case:** there are many programs (e.g. `systemd-hostnamed` -when updating `/etc/hostname`) that atomically want to update a -file, so that either the old or the new version is in place, but -never a partially updated one. The canonical way to do this is by -creating a temporary file with the new contents, and then renaming -it to the filename of the file to update, thus atomically replacing -it. Currently, the temporary file for this must be created with a -random name, `O_TMPFILE` cannot be used, since for these files -atomic-replace is not supported, currently. - -### Ability to only open regular files - -`O_REGULAR` (inspired by the existing `O_DIRECTORY` flag for -`open()`), which opens a file only if it is of type `S_IFREG`. - -**Use-Case:** this would be very useful to write secure programs -that want to avoid being tricked into opening device nodes with -special semantics while thinking they operate on regular -files. This is particularly relevant as many device nodes (or even -FIFOs) come with blocking I/O (or even blocking `open()`!) by -default, which is not expected from regular files backed by "fast" -disk I/O. Consider implementation of a naive web browser which is -pointed to `file://dev/zero`, not expecting an endless amount of -data to read. - -### Unlinking via two file descriptors - -`unlinkat3(dir_fd, name, inode_fd)`: taking one file descriptor -for the directory to remove a file in, and another one referring -to the inode of the filename to remove. This call should only -succeed if the specified filename still refers to the specified -inode. - -**Use-Case:** code that operates on a well-know path that might be -shared by multiple programs that jointly manage it might want to -safely remove a filename under the guarantee it still refers to -the expected inode. As a specific example, consider lock files, -that should be cleaned up only if they still refer to the assumed -owner's instance, but leave the file in place if another process -already took over the filename. - -### Determining if a mount point belongs to the current user - -Ability to determine if a mount point belongs to the current user -namespace, in order to check if there's a chance a process can -safely unmount it (as that only works for mounts owned by the same -user namespaces — or one further down the tree, but not any up the -tree). A simple, additional field in `/proc/self/mountinfo` -container the owning user namespace ID would probably already -suffice. - -**Use-Case:** the `systemd` system and service manager tries to unmount -all established mounts on shutdown. Inside of container -environments where specific mounts are established by the -container manager (and not the payload itself) this will -ultimately fail if user namespaces are enabled. In order to clean -up the shutdown logic it would be very good to be able to -determine whether a specific mount could even possibly be -unmounted or whether it's not worth the effort to include the -unmount in the system shutdown transaction. - -### Read-only propagation of mounts - -A way to mark mounts that receive mount propagation events from -elsewhere so that these propagated mounts are established -read-only implicitly. Right now, if a mount receives a mount -propagation event it will have the exact same `MS_RDONLY`, -`MS_NODEV`, … flags as it has where it originated. It would be -very useful if an `MS_RDONLY` could be ORed into the mount flags -automatically whenever propagated elsewhere. - -**Use-Case:** various mount namespace based sandboxes -(e.g. `systemd`'s `ProtectSystem=` option) mark large parts of the -host file hierarchy read-only via mounting it -`MS_RDONLY|MS_BIND|MS_REMOUNT`, but generally intend to leave the -file hierarchy besides that the way it is, and that includes they -typically still want to be able to receive mount events to -directories such as `/mnt/` and `/media/` in these sandboxed -environments. Right now, any such propagation then happens in -writable mode, even if the file hierarchy otherwise is almost -entirely read-only. To close this gap it would be great if such -propagated mounts could implicitly gain `MS_RDONLY` as they are -propagated. - -### Filtering on received file descriptors - -An alternative to the previous item could be if some form of filtering -could be enforced on the file descriptors suitable for enqueuing on -the `AF_UNIX` socket. i.e. allow filtering by superblock type or -similar, so that policies such as "only `memfd`s are OK to be -received" may be expressed. (BPF?). - -**Use-Case:** as above. - -### Excluding processes watched via `pidfd` from `waitid(P_ALL, …)` - -**Use-Case:** various programs use `waitid(P_ALL, …)` to collect exit -information of exited child processes. In particular PID 1 and -processes using `PR_SET_CHILD_SUBREAPER` use this as they may -collect unexpected children that have been reparented from dying -sub-processes, and that need to be reaped in order to clean up the -PID space. Currently, these programs cannot easily mix waiting for -specific sub-processes via `pidfd` with waiting for the other -*unexpected* children via `waitid(P_ALL, …)` since the latter also -reaps (and thus invalidates) the pidfd-tracked -children. Specifically, the `systemd` service manager would like -to use `pidfd`s to remove PID recycling security issues, but -currently cannot as it also needs to generically wait for such -unexpected children. - -### Asynchronous `close()` - -An asynchronous or forced `close()`, that guarantees that -userspace doesn't have to risk blocking for longer periods of time -when trying to get rid of unwanted file descriptors, possibly -received via `recvmsg()` + `SCM_RIGHTS` (see above). Currently, -`close()` of various file descriptors (for example those referring -to slow storage, e.g. non-responding NFS servers and such) might -take arbitrary amounts of time, potentially into the minute range -and more. This makes it risky accepting file descriptors on -publicly accessible `AF_UNIX` sockets, the way like IPC brokers -(e.g. D-Bus) do it: if a rogue client keeps sending file -descriptors that because unexpected must be closed immediately it -might cause the receiving process to effectively crawl, when it is -busy closing them all. A special form of `close()` that simply -detaches a file descriptor from the file descriptor table without -blocking on IO in any form would be great to close this issue. - -**Use-Case:** any program that receives file descriptors via `AF_UNIX` -from untrusted clients would benefit from this. e.g. D-Bus -brokers. - -### `CLOCK_MONOTONIC` network timestamps - -Currently network timestamps are exclusively in `CLOCK_REALTIME`, even -though for many (most?) a monotonic clock would be much preferable, as -calculations become easier when one doesn't have to think about clock -jumps and similar. - -**Use-Case:** `systemd-journald` collects implicit timestamps via -`AF_UNIX` time-stamping, in `CLOCK_REALTIME`, even though for its -internal logic only monotonic timestamps are used, as log records -are searched via bisection in ordered tables, that require -strictly increasing timestamps. In particular during boot (where -`CLOCK_REALTIME` is often not available, stable or subject to -corrections) it would be good to have reliable, monotonic -timestamps on all log records. - -### Immutable loopback block devices - -Truly immutable loopback block devices. Right now setting up a -loopback block device in read-only mode, backed by a read-only -file (stored on a regular read/write file system), and then -mounting it with `ext4` also in `MS_RDONLY`mode *will* result in -changes to the file, quite unexpectedly 🤯. Ideally, if a loopback -block device is set up in read-only mode this should guarantee -that the backing file remains unmodified by it. - -**Use-Case:** disk image build tools that want to reproducibly and -verifiable build images must be able to rely that mounting them in -read-only mode does not alter the images in any way. In particular -when working in computer forensics one must be able to rely that -file systems that are analyzed remain unmodified by the analysis. - -### A timeout for the `flock()` syscall - -A timeout for the `flock()` syscall. Faking the time-out in userspace -is nasty: most code does it with `alarm()` (or equivalent APIs), but -that's racy since on a heavily loaded system the timeout might trigger -before the `flock()` call is entered, in particular if short time-outs -shall be used. More accurate is to do the locking in a short-lived -child processed, but that's difficult already in C, and almost -impossible in languages that do not allow `fork()` without `execve()`. - -**Use-Case:** as mentioned above systemd-udev allows synchronizing -block device probing via flock(). Often userspace wants to wait -for that, but without risking to hang forever. - -### Extend `mount_setattr()` to allow changing mount properties ignoring any failures - -**Use-Case:** workloads that know that there are mounts in a mount tree -whose attributes cannot be changed by the caller don't want -`mount_settattr()` to fail on the first mount it failed to convert. Give -them a flag to request changes ignoring failures. - -### Upgrade masks in `openat2()` - -Add upgrade masks to `openat2()`. Extend `struct open_how` to allow -restrict re-opening of file descriptors. - -**Use-Case:** block services or containers from re-opening/upgrading an -`O_PATH` file descriptor through e.g. `/proc//fd/user_ns, -CAP_SYS_ADMIN)` caller to lock all mount properties. The mount properties -cannot be changed anymore. - -**Use-Case:** allowing processes to lock mount properties even for -privileged processes. Locking mount properties would currently involve -having to have the mount namespace of the container be owned by an ancestor -user namespace. But this doesn't just lock a single mount or mount subtree -it locks all mounts in the mount namespace, i.e., the mount table cannot be -altered. - -### Extend `setns()` to allow attaching to all new namespaces of a process - -Add an extension to `setns()` to allow attaching to all namespaces of -a process `SETNS_PIDFD_ALL` different from the caller's namespaces. -Currently specifying e.g., `CLONE_NEWUSER` fails if the caller is in the -same user namespace as the target process. This is very inconvenient. - -**Use-Case:** Make it trivial to attach to all namespaces of a process -without having to figure out whether the caller is already in the same -namespace or not. - -### Security hook for `mount_setattr()` - -(kAPI) Add security hook to `mount_setattr()`. - -**Use-Case:** Allow LSMs to make decisions about what mount properties to -allow and what to deny. - -### Per-cgroup limit for coredump sizes - -A per-cgroup knob for coredump sizes. Currently coredump size -control is strictly per process, and primarily under control of -the processes themselves. It would be good if we had a per-cgroup -knob instead, that is under control of the service manager. - -**Use-Case:** coredumps can be heavy to generate. For different -usecases it would be good to be able to opt-in or opt-out -dynamically from coredumps for specific services, at runtime -without restarting them. - -### Race-free creation and opening of non-file inodes - -A way to race-freely create an (non-file) inode and immediately -open it. For regular files we have open(O_CREAT) for creating a -new file inode, and returning a pinning fd to it. This is missing -for other inode types, such as directories, device nodes, -FIFOs. The lack of such functionality means that when populating a -directory tree there's always a race involved: the inodes first -need to be created, and then opened to adjust their -permissions/ownership/labels/timestamps/acls/xattrs/…, but in the -time window between the creation and the opening they might be -replaced by something else. Addressing this race without proper -APIs is possible (by immediately fstat()ing what was opened, to -verify that it has the right inode type), but difficult to get -right. Hence, mkdirat_fd() that creates a directory *and* returns -an O_DIRECTORY fd to it would be great. As would be mknodeat_fd() -that creates a device node, FIFO or (dead) socket and returns an -O_PATH fd to it. And of course symlinkat_fd() that creates a -symlink and returns an O_PATH fd to it. - -**Use-Case:** any program that creates/unpacks not just files, but -directories, device nodes, fifos, and wants to ensure that they -safely get the right attributes applied, even if other code might -simultaneously have access to the same directory tree. - -### Extend `io_uring` with classic synchronous system calls - -The `io_uring` subsystem is open to adding classic existing synchronous -system calls (e.g. `setns()` or `mount()` or other) to `io_uring`. -They also said they would support adding new functionality into -`io_uring` that is not exposed through system calls yet. - -### Map different uids/gids to the same uids/gids? - -Explore the idea of mapping different uids/gids to the same uids/gids, i.e. -65534:1000:1 50000:1000:1. This will only work if the mount is read-only as -the kernel wouldn't know what uid/gid would need to be put to disk -otherwise (65534? 50000? the first one that is mapped?). - -**Use-Case:** Delegate multiple {g,u}ids to the same user. Merging -ownership similar to how overlayfs merges files. Bindfs -(https://bindfs.org/docs/bindfs.1.html#sect3) allows this concept too. - -### blobfs - -[`blobfs`](https://fuchsia.dev/fuchsia-src/concepts/filesystems/blobfs) -for Linux. i.e. a minimalistic file system, that can store -authenticated (Verity) data files, that can be written once, and -not be modified after that, and provide stable handles (i.e. is -content-addressable) to them. - -**Use-Case:** This would deliver just about enough to place -trusted OS resources (binaries, kernels, initrds, fs trees, other -resources) in them, without having to trust the medium and IO -underneath. Should be simple enough to even implement in a boot -loader and similar, without making things vulnerable to rogue file -system image attacks. The OS and its payloads (apps, containers, -…) could then be composed from these resources, through means like -overlayfs, namespacing and more. - -### Namespaced loop and block devices - -Namespace-able loop and block devices, usable inside user namespaces. - -**Use-Case:** Allow mounting images inside nspawn containers, and using -RootImage= and friends in the systemd user manager. - -### Device cgroup guard to allow `mknod()` in non-initial userns - -If a container manager restricts its unprivileged (user namespaced) -children by a device cgroup, it is not necessary to deny `mknod()` -anymore. Thus, user space applications may map devices on different -locations in the file system by using `mknod()` inside the container. - -**Use-Case:** A use case for this, which is applied by users of GyroidOS, -is to run `virsh` for VMs inside an unprivileged container. `virsh` or -libvirt creates device nodes, e.g., `/var/run/libvirt/qemu/11-fgfg.dev/null` -which currently fails in a non-initial userns, even if a cgroup device white -list with the corresponding major, minor of `/dev/null` exists. Thus, in -this case the usual bind mounts or pre populated device nodes under `/dev` -are not sufficient. - -An initial group internal RFC exists in -(https://github.com/quitschbo/linux/tree/devcg_guard_rfc). -See commit message for more implementation specific details. - -### Race-free mounting of block devices - -Introduce a new struct to `fsconfig()` as an alternative to the -`source` property. The struct contains at least a pointer to a path, -possibly a device minor and major, and a diskseq number. The VFS can -expose a helper that filesystems can call and use the diskseq number -to verify that the block device they are intending to mount is indeed -the one they want to mount. - -**Use-Case:** Race-free mounting of block devices. - -### Ability to reopen a `struct block_device` - -Add ability to reopen a `struct block_device`. This would allow using -`blkdev_get_by_path()`/`blkdev_get_{part,whole}()` to claim a device -with `BLK_OPEN_READ` and later on reopen with -`BLK_OPEN_READ | BLK_OPEN_WRITE`. This in turn would opening block -devices at `fsconfig(FS_CONFIG_SET_*)` time and then at `fill_super()` -time we would be able to reopen in case the `!(fc->sb_flags & SB_RDONLY)`. -Overall this has the effect that we're able to open devices early -giving the user early errors when they set mount options rather than -very late when the superblock is created. - -### Specification of a keyring for dm-verity volume verification - -When activating a dm-verity volume allow specifying keyring to -validate root hash signature against. - -**Usecase:** In systemd, we'd like to authenticate Portable Service -images, system extension images, configuration images, container -images with different keys, as they typically originate from -different sources and it should not be possible to generate a -system extension with a key pair that is supposed to be good for -container images only. - -### Path-based ACL management in an LSM hook - -The LSM module API should have the ability to do path-based (not -just inode-based) ACL management. - -**Usecase:** This would be useful in BPF-LSM modules such as -systemd's `mntfsd` which allows unprivileged file system mounts in -some cases, and which would like to restrict ACL handling based on -the superblock involved. - -### Immutable layers for `overlayfs` - -`overlayfs` should permit *immutable* layers, i.e. layers whose -non-directory inodes may not be overriden in an upper writable -layer. - -**Usecase:** This would be useful when implementing `/etc/` as a -stack of overlayfs layers, each shipping configuration for a -different facet of the system, with a writable layer on the top for -local modifications. In such a scenario it would be useful to allow -the user to change any configuration it likes, except of the files -and other inodes shipped in the lower layers. - -### `ioctl()` API for `overlayfs` - -`overlayfs` should have an `ioctl()`-based API (or similar) for -querying information of the backing file systems/block devices - -**Usecase:** In systemd in various areas we automatically find the -block device backing the root file system and other file systems -(Example: `systemd-gpt-auto-generator` or `bootctl` wull try to find -auxiliary file systems of the OS image by looking in the GPT -partition table the root file system is located in). While this -logic is good enough to find the backing block devices of some more -complex storage such as dm-crypt, dm-verity or btrfs, once -`overlayfs` is used as backing for the root file system this logic -does not work anymore. It would be great if there was an API to -simply query `overlayfs` for the superblock information -(i.e. `.st_dev`) of the backing layers. - -### Automatic growing of `btrfs` filesystems - -An *auto-grow* feature in `btrfs` would be excellent. - -If such a mode is enabled, `btrfs` would automatically grow a file -system up to the size of its backing block devices. Example: btrfs -is created with 200M in size on a block device 2G in size. Once the -file system is filled up fully, `btrfs` would automatically grow the -file system as need in the increments it needs, up to the 2G that -the backing block device is in size. - -**Usecase:** This would allow creating minimal, compact file -systems: just create them small on a sparse block device, and copy -files into it, as needed, create subvolumes and whatever else is -desired. As long as only files are created and written (but not -modified) the resulting fs should be automatically minimal in size. -This would specifically be useful in `systemd-homed`, which -maintains per-user `btrfs` file systems backed by block -devices. Currently, `homed` grows the file systems manually on login -and then shrinks them again on logout, but this is less than ideal, -since btrfs places files all over the backing store, and thus the -shrinking will generate a lot of nonsensical IO that could be -reduced if the file system was always kept minimal in size anyway. - -### Add process by PIDFD to a cgroup - -At the moment the canonical way to add a process to a cgroup is by -echoing its PID into the `cgroup.procs` attribute in the target -cgroupfs directory of the cgroup. This is safe as long as the -process doing so just forked off the process it wants to migrate and -hence can control that it hasn't been reaped yet, and hence -guarantees the PID is valid. This is racy however if "foreign" -processes shall be moved into the cgroup. - -**Usecase:** In systemd, all user sessions are wrapped in scope -units which are backed by a cgroup. The session processes moved into -the scope unit are typically "foreign" processes, i.e. not children -of the service manager, hence doing the movement is subject to races -in case the process dies and its PID is quickly recycled. (This -assumes systemd can acquire a pidfd of the foreign process without -races, for example via `SCM_PIDFD` and `SO_PEERPIDFD` or similar.) - -### Open thread-group leader via `pidfd_open()` - -Extend `pidfd_open()` to allow opening the thread-group leader based on the -PID of an individual thread. Currently we do support: - -1. `pidfd_open(1234, 0)` on a thread-group leader PID -2. `pidfd_open(1234, PIDFD_THREAD)` on a thread - -Add an option to go from individual thread to thread-group leader. - -**Use-Case:** Allow for a race free way to go from individual thread -to thread-group leader pidfd. - -### Useful handling of LSM denials on SCM_RIGHTS - -Right now if some LSM such as SELinux denies an `AF_UNIX` socket peer -to receive an `SCM_RIGHTS` fd the `SCM_RIGHTS` fd array will be cut -short at that point, and `MSG_CTRUNC` is set on return of -`recvmsg()`. This is highly problematic behaviour, because it leaves -the receiver wondering what happened. As per man page `MSG_CTRUNC` is -supposed to indicate that the control buffer was sized too short, but -suddenly a permission error might result in the exact same flag being -set. Moreover, the receiver has no chance to determine how many fds -got originally sent and how many were suppressed. - -Ideas how to improve things: - -1. Maybe introduce a new flag `MSG_RIGHTS_DENIAL` or so which is set - on `recvmsg()` return, which tells us that fds where dropped from - the `SCM_RIGHTS` array because of an LSM error. This new flag could - be set in addition to `CMSG_CTRUNC`, for compatibility. - -2. Maybe, define a new flag `MSG_RIGHTS_FILTER` or so which when - passed to `recvmsg()` will ensure that the `SCM_RIGHTS` fd array is - always passed through in its full, original size. Entries for which - an LSM says no are suppressed, and replaced by a special value, for - example `-EPERM`. - -3. It would be good if the relevant man page would at least document - this pitfall, even if it right now cannot reasonably be handled. - -Ideally both ideas would be implemented, but of course, strictly -speaking the 2nd idea makes the 1st idea half-way redundant. - -**Use-Case:** Any code that uses `SCM_RIGHTS` generically (D-Bus and -so on) needs this, so that it can reasonably handle SELinux AVC errors -on received messages. - -### Reasonable EOF on SOCK_SEQPACKET - -Zero size datagrams cannot be distinguished from EOF on -`SOCK_SEQPACKET`. Both will cause `recvmsg()` to return zero. - -Idea how to improve things: maybe define a new MSG_XYZ flag for this, -which causes either of the two cases result in some recognizable error -code returned rather than a 0. - -**Use-Case:** Any code that wants to use `SOCK_SEQPACKET` and cannot -effort disallowing zero sized datagrams from their protocol. - -### Reasonable Handling of SELinux dropping SCM_RIGHTS fds - -Currently, if SELinux refuses to let some file descriptor through, it -will just drop them from the `SCM_RIGHTS` array. That's a terrible -idea, since applications rely on the precise arrangement of the array -to know which fd is which. By dropping entries silently, these apps -will all break. - -Idea how to improve things: leave the elements in the array in place, -but return a marker instead (i.e. negative integer, maybe `-EPERM`) that -tells userspace that there was an fd, but it was not allowed through. - -**Use-Case:** Any code that wants to use `SCM_RIGHTS` properly. - ---- - -## Finished Items - -### Namespace ioctl to translate a PID between PID namespaces - -[x] Namespace ioctl to translate a PID between PID namespaces - -**🙇 `ca567df74a28a9fb368c6b2d93e864113f73f5c2 ("nsfs: add pid translation ioctls")` 🙇** - -**Use-Case:** This makes it possible to e.g., figure out what a given PID in -a PID namespace corresponds to in the caller's PID namespace. For example, to -figure out what the PID of PID 1 inside of a given PID namespace is. - -### API to determine the parent process ID of a pidfd - -[x] API to determine the parent process ID of a pidfd - -An API to determine the parent process ID (ppid) of a pidfd would be -good. - -This information is relevant to code dealing with pidfds, since if -the ppid of a pidfd matches the process own pid it can call -`waitid()` on the process, if it doesn't it cannot and such a call -would fail. It would be very useful if this could be determined -easily before even calling that syscall. - -**🙇 `cdda1f26e74b ("pidfd: add ioctl to retrieve pid info")` 🙇** - -**Usecase:** systemd manages a multitude of processes, most of which -are its own children, but many which are not. It would be great if -we could easily determine whether it is worth waiting for -`SIGCHLD`/`waitid()` on them or whether waiting for `POLLIN` on -them is the only way to get exit notification. - -### Set `comm` field before `exec()` - -[x] Set `comm` field before `exec()` - -There should be a way to control the process' `comm` field if -started via `fexecve()`/`execveat()`. - -Right now, when `fexecve()`/`execveat()` is used, the `comm` field -(i.e. `/proc/self/comm`) contains a name derived of the numeric fd, -which breaks `ps -C …` and various other tools. In particular when -the fd was opened with `O_CLOEXEC`, the number of the fd in the old -process is completely meaningless. - -The goal is add a way to tell `fexecve()`/`execveat()` what Name to use. - -Since `comm` is under user control anyway (via `PR_SET_NAME`), it -should be safe to also make it somehow configurable at fexecve() -time. - -See https://github.com/systemd/systemd/commit/35a926777e124ae8c2ac3cf46f44248b5e147294, -https://github.com/systemd/systemd/commit/8939eeae528ef9b9ad2a21995279b76d382d5c81. - -**🙇 `543841d18060 ("exec: fix up /proc/pid/comm in the execveat(AT_EMPTY_PATH) case")` 🙇** - -**Usecase:** In systemd we generally would prefer using `fexecve()` -to safely and race-freely invoke processes, but the fact that `comm` -is useless after invoking a process that way makes the call -unfortunately hard to use for systemd. - -### Make statx() on a pidfd return additional info - -Make statx() on a pidfd return additional recognizable identifiers in -`.stx_btime`. - -**🙇 `cb12fd8e0dabb9a1c8aef55a6a41e2c255fcdf4b pidfd: add pidfs` 🙇** - -It would be fantastic if issuing statx() on any pidfd would return -the start time of the process in `.stx_btime` even after the process -died. - -These fields should in particular be queriable *after* the process -already exited and has been reaped, i.e. after its PID has already -been recycled. - -**Usecase:** In systemd we maintain lists of processes in a hash -table. Right now, the key is the PID, but this is less than ideal -because of PID recycling. By being able to use the `.stx_btime` -and/or `.stx_ino` fields instead would be perfect to safely -identify, track and compare process even after they ceased to exist. - -### Allow creating idmapped mounts from idmapped mounts - -[x] Allow creating idmapped mounts from idmapped mounts - -Add a new `OPEN_TREE_CLEAR` flag to `open_tree()` that can only be -used in conjunction with `OPEN_TREE_CLONE`. When specified it will clear -all mount properties from that mount including the mount's idmapping. -Requires the caller to be `ns_capable(mntns->user_ns)`. If idmapped mounts -are encountered the caller must be `ns_capable(sb->user_ns, CAP_SYS_ADMIN)` -in the filesystems user namespace. - -Locked mount properties cannot be changed. A mount's idmapping becomes -locked if it propagates across user namespaces. - -This is useful to get a new, clear mount and also allows the caller to -create a new detached mount with an idmapping attached to the mount. Iow, -the caller may idmap the mount afterwards. - -**🙇 `c4a16820d901 ("fs: add open_tree_attr()")` 🙇** - -**Use-Case:** A user may already use an idmapped mount for their home -directory. And once a mount has been idmapped the idmapping cannot be -changed anymore. This allows for simple semantics and allows to avoid -lifetime complexity in order to account for scenarios where concurrent -readers or writers might still use a given user namespace while it is about -to be changed. -But this poses a problem when the user wants to attach an idmapping to -a mount that is already idmapped. The new flag allows to solve this -problem. A sufficiently privileged user such as a container manager can -create a user namespace for the container which expresses the desired -ownership. Then they can create a new detached mount without any prior -mount properties via OPEN_TREE_CLEAR and then attach the idmapping to this -mount. - -### Require a user namespace to have an idmapping when attached - -[x] Require a user namespace to have an idmapping when attached - -Enforce that the user namespace about to be attached to a mount must -have an idmapping written. - -**🙇 `dacfd001eaf2 ("fs/mnt_idmapping.c: Return -EINVAL when no map is written")` 🙇** - -**Use-Case:** Tighten the semantics. - -### Mount notifications without rescanning of `/proc/self/mountinfo` - -[x] Mount notifications without rescanning of `/proc/self/mountinfo` - -Mount notifications that do not require continuous rescanning of -`/proc/self/mountinfo`. Currently, if a program wants to track -mounts established on the system it can receive `poll()`able -events via a file descriptor to `/proc/self/mountinfo`. When -receiving them it needs to rescan the file from the top and -compare it with the previous scan. This is both slow and -racy. It's slow on systems with a large number of mounts as the -cost for re-scanning the table has to be paid for every change to -the mount table. It's racy because quickly added and removed -mounts might not be noticed. - -**🙇 `0f46d81f2bce ("fanotify: notify on mount attach and detach")` 🙇** - -**Use-Case:** `systemd` tracks the mount table to integrate the mounts -into it own dependency management. - -### Mount a subdirectory instead of the top-level directory - -[x] Mount a subdirectory instead of the top-level directory - -Ability to mount a subdirectory of a regular file system instead of -the top-level directory. E.e. for a file system `/dev/sda1` which -contains a sub-directory `/foobar` mount `/foobar` without having -to mount its parent directory first. Consider something like this: - -``` -mount -t ext4 /dev/sda1 somedir/ -o subdir=/foobar -``` - -(This is of course already possible via some mount namespacing -shenanigans, but this requires namespacing to be available, and is -not precisely obvious to implement. Explicit kernel support at mount -time would be much preferable.) - -**🙇 `c5c12f871a30 ("fs: create detached mounts from detached mounts")` 🙇** - -**Use-Case:** `systemd-homed` currently mounts a sub-directory of -the per-user LUKS volume as the user's home directory (and not the -root directory of the per-user LUKS volume's file system!), and in -order to implement this invisibly from the host side requires a -complex mount namespace exercise. - -### Unmounting of obstructed mounts - -[x] ability to unmount obstructed mounts. (this means: you have a stack -of mounts on the very same inode, and you want to remove a mount in -the middle. right now, you can only remove the topmost mount.) - -**🙇 instead of the ability to unmount obstructured mounts we gained -the ability to mount beneath an existing mount, with mostly -equivalent outcome. `6ac392815628f317fcfdca1a39df00b9cc4ebc8b -("fs: allow to mount beneath top mount")` 🙇** - -**use-case:** this is useful for replacing mounts atomically, for -example for upgrading versioned disk images: first an old version -of the image is mounted. then a new version is mounted over the -existing mount point, and then the lower mount point is -removed. One such software would be `systemd-sysext`. - -### `SCM_PIDFD` auxiliary socket message - -[x] `SCM_PIDFD` or similar auxiliary socket message, that is a modern -version of the `SCM_CREDS` message's `.pid` field, and provides a -`pidfd` file descriptor to the originating peer process. - -**🙇 `5e2ff6704a275be00 ("scm: add SO_PASSPIDFD and SCM_PIDFD)")` 🙇** - -**Use-Case:** security infrastructure (such as PolicyKit) can safely -reference clients this way without fearing PID -recycling. `systemd-journald` can acquire peer metadata this way in -a less racy fashion, in particular safe against PID recycling. - -### Take `IP_UNICAST_IF` into account for routing decisions - -[x] `IP_UNICAST_IF` should be taken into account for routing decisions -at UDP `connect()` time (currently it isn't, only `SO_BINDTOINDEX` -is, but that does so much more than just that, and one often -doesn't want that) - -**🙇 `0e4d354762cefd3e16b4cff8988ff276e45effc4 ("net-next: Fix -IP_UNICAST_IF option behavior for connected sockets")` 🙇** - -**Use-Case:** DNS resolvers that associate DNS configuration with -specific network interfaces (example: `systemd-resolved`) typically -want to preferably route DNS traffic to the per-interface DNS -server via that interface, but not make further restrictions on the -origins or received replies, and all that without -privileges. `IP_UNICAST_IF` fulfills this role fine for TCP, but -for UDP it is not taken into account for the `connect()` routing -decision. - -### A system call to query information about a mount - -[x] Implement a mount-specific companion to `statx()` that puts at least the -following information into `struct mount_info`: - -**🙇 `46eae99ef73302f9fb3dddcd67c374b3dffe8fd6 ("add statmount(2) syscall")` 🙇** - -* mount flags: `MOUNT_ATTR_RDONLY`, ... -* time flags: `MOUNT_ATTR_RELATIME`, ... - Could probably be combined with mount flags. -* propagation setting: `MS_SHARED)`, ... -* peer group -* mnt id of the mount -* mnt id of the mount's parent -* owning userns - -There's a bit more advanced stuff systemd would really want but which -I think is misplaced in a mountinfo system call including: -* list of primary and auxiliary block device major/minor -* diskseq value of those device nodes (This is a new block device feature - we added that allows preventing device recycling issues when e.g. - removing usb devices very quickly and is needed for udev.) -* uuid/fsid -* feature flags (`O_TMPFILE`, `RENAME_EXCHANGE` supported etc.) - -**Use-Case:** low-level userspace tools have to interact with advanced -mount information constantly. This is currently costly and brittel because -they have to go and parse `/proc//mountinfo`. - -### Security hook for `create_user_ns()`. - -[x] (kAPI) Add security hook to `create_user_ns()`. - -**🙇 `7cd4c5c2101c ("security, lsm: Introduce security_create_user_ns()")` 🙇** - -**Use-Case:** Allow LSMs to monitor user namespace creation. - -### Idmapped mounts for tmpfs - -[x] Support idmapped mounts for tmpfs - -**🙇 `7a80e5b8c6fa ("shmem: support idmapped mounts for tmpfs")` 🙇** - -**Use-Case:** Runtimes such as Kubernetes use a lot of `tmpfs` mounts of -individual files or directories to expose information to containers/pods. -Instead of having to change ownership permanently allow them to use an -idmapped mount instead. - -@rata and @giuseppe brought this suggestion forward. For Kubernetes it is -sufficient to support idmapped mounts of `tmpfs` instances mounted in the -initial user namespace. However, in the future idmapped -mounts of `tmpfs` instances mounted in user namespaces should be supported. -Other container runtimes want to make use of this. The kernel is able to -support this since at least `5.17`. - -Things to remember are that `tmpfs` mounts can serve as lower- or upper -layers in `overlayfs` and care needs to be taken that this remains safe if -idmapped mounts of `tmpfs` instances mounted in user namespaces are -supported. - -### Additional identifiers for pidfds in `statx()` - -[x] Make `statx()` on a pidfd return additional recognizable identifiers -in `.stx_ino`. - -**🙇 `cb12fd8e0dabb9a1c8aef55a6a41e2c255fcdf4b pidfd: add pidfs` 🙇** - -It would be fantastic if issuing statx() on any pidfd would return some -reasonably stable 64bit identifier for the process in `.stx_ino`. This would -be perfect to identify processes pinned by a pidfd, and compare them. - -### Namespace `binfmt_misc` filesystem - -[x] Make the `binfmt_misc` filesystem namespaced. - -**🙇 `21ca59b365c0 ("binfmt_misc: enable sandboxed mounts")` 🙇** - -**Use-Case:** Allow containers and sandboxes to register their own binfmt -handlers. - -### Support idmapped mounts for `overlayfs` - -[x] Support idmapped mounts for `overlayfs` - -**🙇 `bc70682a497c ("ovl: support idmapped layers")` 🙇** - -**Use-Case:** Allow containers to use `overlayfs` with idmapped mounts. - -### Disabling reception of `SCM_RIGHTS` for `AF_UNIX` sockets - -[x] Ability to turn off `SCM_RIGHTS` reception for `AF_UNIX` -sockets. - -**🙇 `77cbe1a6d8730a07f99f9263c2d5f2304cf5e830 ("af_unix: Introduce SO_PASSRIGHTS")` 🙇** - -Right now reception of file descriptors is always on when -a process makes the mistake of invoking `recvmsg()` on such a -socket. This is problematic since `SCM_RIGHTS` installs file -descriptors in the recipient process' file descriptor -table. Getting rid of these file descriptors is not necessarily -easy, as they could refer to "slow-to-close" files (think: dirty -file descriptor referring to a file on an unresponsive NFS server, -or some device file descriptor), that might cause the recipient to -block for a longer time when it tries to them. Programs reading -from an `AF_UNIX` socket currently have three options: - -1. Never use `recvmsg()`, and stick to `read()`, `recv()` and - similar which do not install file descriptors in the recipients - file descriptor table. - -2. Ignore the problem, and simply `close()` the received file descriptors - it didn't expect, thus possibly locking up for a longer time. - -3. Fork off a thread that invokes `close()`, which mitigates the - risk of blocking, but still means a sender can cause resource - exhaustion in a recipient by flooding it with file descriptors, - as for each of them a thread needs to be spawned and a file - descriptor is taken while it is in the process of being closed. - -(Another option of course is to never talk `AF_UNIX` to peers that -are not trusted to not send unexpected file descriptors.) - -A simple knob that allows turning off `SCM_RIGHTS` right reception -would be useful to close this weakness, and would allow -`recvmsg()` to be called without risking file descriptors to be -installed in the file descriptor table, and thus risking a -blocking `close()` or a form of potential resource exhaustion. - -**Use-Case:** any program that uses `AF_UNIX` sockets and uses (or -would like to use) `recvmsg()` on it (which is useful to acquire -other metadata). Example: logging daemons that want to collect -timestamp or `SCM_CREDS` auxiliary data, or the D-Bus message -broker and suchlike. - -### A reliable way to check for PID namespacing - -[x] A reliable (non-heuristic) way to detect from userspace if the -current process is running in a PID namespace that is not the main -PID namespace. PID namespaces are probably the primary type of -namespace that identify a container environment. While many -heuristics exist to determine generically whether one is executed -inside a container, it would be good to have a correct, -well-defined way to determine this. +Requires [Hugo](https://gohugo.io/) (extended edition). -**🙇 The inode number of the root PID namespace is fixed (0xEFFFFFFC) -and now considered API. It can be used to distinguish the root PID -namespace from all others. 🙇** +## License -**Use-Case:** tools such as `systemd-detect-virt` exist to determine -container execution, but typically resolve to checking for -specific implementations. It would be much nicer and universally -applicable if such a check could be done generically. It would -probably suffice to provide an `ioctl()` call on the `pidns` file -descriptor that reveals this kind of information in some form. +[MIT](LICENSE) diff --git a/website/README.md b/website/README.md index a2c57c0..33eb354 100644 --- a/website/README.md +++ b/website/README.md @@ -1,29 +1,41 @@ -# Static website generation for UAPI group specifications +# Website -This repository uses Hugo for static HTML generation. -See https://gohugo.io/getting-started/quick-start/ for a brief intro. +This directory contains the Hugo-based static site for the kernel feature wishlist. -The website uses the [hugo-book](https://github.com/alex-shpak/hugo-book) theme; it is included in this repo as a git submodule. -After cloning this repo please run `git submodule init; git submodule update`. -If you check out a branch or tag, make sure the submodule is up to date by running `git submodule update`. +## Theme -## Website repo layout +The site uses the [hugo-book](https://github.com/alex-shpak/hugo-book) theme, +included as a git submodule. After cloning, run: -Content resides in the [content](content/) folder. -The top-level README.md is soft-linked to `content/_index.md` and serves as index page. +```sh +git submodule init && git submodule update +``` + +## Content layout -To add content, either put files into the `content/docs` folder. -Documents there are automatically added to the naviation menu on the left. +``` +content/ + _index.md Landing page + in-progress/ Features being actively worked on + wishlist/ Ideas and proposals + completed/ Features merged into the kernel +``` -## Making changes and testing +Each feature is a separate Markdown file with YAML front matter containing +`title`, `status`, `categories`, and optionally `commit` (for completed items). -You'll need [hugo installed](https://gohugo.io/getting-started/installing/) for rendering changes. +## Local development -First, make your edits. -Then, start hugo locally (in the repo's `website` directory)to review your changes: +From this directory: -```shell -$ hugo server --minify --disableFastRender +```sh +hugo server --minify --disableFastRender ``` -Review your changes at http://localhost:1313/kernel/ . +Review at http://localhost:1313/kernel-features/ . + +## Build + +```sh +hugo --minify -d ../public +``` From ed90aa58095cbc7bb71c7dad8b3bba99b1fb7b0b Mon Sep 17 00:00:00 2001 From: Christian Brauner Date: Mon, 9 Mar 2026 14:50:34 +0100 Subject: [PATCH 5/6] Add contribution guidelines and GitHub templates Add CONTRIBUTING.md with instructions for proposing features, claiming wishlist items, and marking items as completed. Include the list of recognized categories and local development setup. Add a pull request template with type checkboxes and a front matter checklist. Add a structured GitHub issue form for proposing new kernel feature ideas. Signed-off-by: Christian Brauner --- .github/ISSUE_TEMPLATE/feature-idea.yml | 53 +++++++++++++++++ .github/PULL_REQUEST_TEMPLATE.md | 17 ++++++ CONTRIBUTING.md | 77 +++++++++++++++++++++++++ 3 files changed, 147 insertions(+) create mode 100644 .github/ISSUE_TEMPLATE/feature-idea.yml create mode 100644 .github/PULL_REQUEST_TEMPLATE.md create mode 100644 CONTRIBUTING.md diff --git a/.github/ISSUE_TEMPLATE/feature-idea.yml b/.github/ISSUE_TEMPLATE/feature-idea.yml new file mode 100644 index 0000000..ca72c2a --- /dev/null +++ b/.github/ISSUE_TEMPLATE/feature-idea.yml @@ -0,0 +1,53 @@ +name: Feature Idea +description: Propose a new kernel feature idea +labels: ["wishlist"] +body: + - type: input + id: title + attributes: + label: Feature title + description: A short, descriptive title for the feature. + placeholder: "e.g., Support detached mounts with pivot_root()" + validations: + required: true + + - type: textarea + id: description + attributes: + label: Description + description: Describe the feature and how it would work. + validations: + required: true + + - type: textarea + id: usecase + attributes: + label: Use-Case + description: Why is this feature valuable? What problem does it solve? + validations: + required: true + + - type: dropdown + id: category + attributes: + label: Category + description: Which subsystem area does this relate to? + multiple: true + options: + - mounts + - pidfd + - namespaces + - filesystems + - sockets + - cgroups + - block-devices + - security + - io-uring + - processes + - other + + - type: textarea + id: links + attributes: + label: Related links + description: Links to mailing list threads, patches, documentation, etc. diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md new file mode 100644 index 0000000..51eb9c0 --- /dev/null +++ b/.github/PULL_REQUEST_TEMPLATE.md @@ -0,0 +1,17 @@ +## Feature + + + +## Type + + +- [ ] New wishlist item +- [ ] Moving item to in-progress +- [ ] Marking item as completed +- [ ] Other (describe below) + +## Checklist + +- [ ] Front matter is complete (`title`, `status`, `categories`) +- [ ] Use-case section is included +- [ ] Cc: Christian Brauner noted if implementing a feature from this list diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md new file mode 100644 index 0000000..c171fe3 --- /dev/null +++ b/CONTRIBUTING.md @@ -0,0 +1,77 @@ +# Contributing + +Thank you for your interest in the UAPI Group Kernel Feature Wishlist. This +document explains how to propose new features, claim existing ones, or mark +items as completed. + +## Proposing a new feature + +1. Create a new Markdown file in the appropriate section directory + under `website/content/`: + - `wishlist/` for new ideas + - `in-progress/` for features you are actively working on + +2. Use the following front matter template: + + ```yaml + --- + title: "Short descriptive title" + weight: 10 + status: wishlist + categories: + - mounts + - namespaces + --- + ``` + +3. Include a clear description of the feature and a **Use-Case** section + explaining why this would be valuable. + +4. Open a pull request. + +## Claiming an item + +To indicate you are working on a wishlist item: + +1. Move the file from `website/content/wishlist/` to + `website/content/in-progress/`. +2. Update the `status` field in the front matter to `in-progress`. +3. Open a pull request with your GitHub handle or email address noted. + +## Marking an item as completed + +When a feature has been merged into the kernel: + +1. Move the file from its current location to `website/content/completed/`. +2. Update the `status` field to `completed`. +3. Add a `commit` field to the front matter with the commit SHA. +4. Open a pull request. + +## Attribution + +**When implementing ideas on this list or ideas inspired by this list, +please point that out explicitly and clearly in the associated patches +and Cc `Christian Brauner `.** + +## Categories + +Use one or more of the following categories in front matter: + +- `mounts` — mount namespaces, pivot_root, move_mount, statmount +- `pidfd` — pidfd xattrs, CLONE_PIDFD, SCM_PIDFD +- `namespaces` — user namespaces, PID namespaces, mount namespaces +- `filesystems` — nullfs, blobfs, overlayfs, tmpfs, binfmt_misc +- `sockets` — AF_UNIX, SCM_RIGHTS, SCM_PIDFD +- `cgroups` — device cgroups, coredump limits +- `block-devices` — loop devices, dm-verity, diskseq +- `security` — LSM hooks, user namespace restrictions +- `io-uring` — io_uring extensions +- `processes` — pidfd, clone3, waitid, prctl + +## Local development + +```sh +git clone --recurse-submodules https://github.com/uapi-group/kernel-features.git +cd kernel-features/website +hugo server +``` From db8c153ff1ce3a8498c6ba14e4c4ac4abc0d898a Mon Sep 17 00:00:00 2001 From: Christian Brauner Date: Mon, 9 Mar 2026 14:50:43 +0100 Subject: [PATCH 6/6] ci: upgrade checkout action and add front matter validation Upgrade actions/checkout from v3 to v4. Add a validation step that checks all feature Markdown files have proper YAML front matter, catching malformed contributions before deploy. Signed-off-by: Christian Brauner --- .github/workflows/gh-pages.yml | 16 +++++++++++++++- 1 file changed, 15 insertions(+), 1 deletion(-) diff --git a/.github/workflows/gh-pages.yml b/.github/workflows/gh-pages.yml index 7c33ca7..d094f70 100644 --- a/.github/workflows/gh-pages.yml +++ b/.github/workflows/gh-pages.yml @@ -13,7 +13,7 @@ jobs: concurrency: group: ${{ github.workflow }}-${{ github.ref }} steps: - - uses: actions/checkout@v3 + - uses: actions/checkout@v4 with: submodules: true fetch-depth: 0 @@ -27,6 +27,20 @@ jobs: - name: Build run: cd website && hugo --minify -d ../public + - name: Validate front matter + run: | + errors=0 + for f in $(find website/content -name '*.md' ! -name '_index.md'); do + if ! head -1 "$f" | grep -q '^---$'; then + echo "ERROR: $f missing front matter" + errors=$((errors + 1)) + fi + done + if [ "$errors" -gt 0 ]; then + echo "$errors file(s) with missing front matter" + exit 1 + fi + - name: Deploy uses: peaceiris/actions-gh-pages@v3 if: ${{ github.ref == 'refs/heads/main' }}