176 changes: 176 additions & 0 deletions pocs/linux/kernelctf/CVE-2024-26921_lts_cos/docs/exploit.md
@@ -0,0 +1,176 @@
## Overview

Let's look at what we need to perform the attack.
Collaborator

Please add a bit of summary / "battle plan" type of paragraph. The reader would need the overview of what we're trying to do.

How about something like this:

Suggested change
Let's look at what we need to perform the attack.
To exploit this Use-After-Free, we must trigger the bug and overwrite the freed memory entirely within the synchronous execution path of ip_local_out(). The high-level strategy is:
1. Send a locally generated, fragmented IP packet so ip_send_skb() pushes its sk_buff (skb) into the netfilter hooks via ip_local_out().
2. Have ip_defrag() process the skb, dropping the final reference to its associated socket (skb->sk) and freeing the socket's memory.
3. Use a subsequent netfilter hook to immediately allocate over the freed skb->sk memory with our controlled payload.
4. Let a later netfilter hook dereference the forged skb->sk object to gain execution control.
Let's look at what we need to perform the attack.


### Socket to send the packet through

Different socket families have different handling of the routing and fragmentation issues.
We do not want to use upper layer protocols like TCP or UDP, because they perform their own fragmentation and we need to trigger fragmentation at the IP layer.

The second thing to consider is the kmalloc cache used to allocate struct sock. Most socket families have a dedicated cache, but some use a regular kmalloc(), which gives us a simple way to reallocate the freed object without performing a cross-cache attack.

And finally, some sockets use a SOCK_RCU_FREE flag which causes sk_destruct() to wait for an RCU grace period before freeing the sock object and this would also make exploitation much harder.
Collaborator

Suggested change
And finally, some sockets use a SOCK_RCU_FREE flag which causes sk_destruct() to wait for an RCU grace period before freeing the sock object and this would also make exploitation much harder.
And finally, some sockets use a SOCK_RCU_FREE flag which causes sk_destruct() to wait for an RCU grace period before freeing the sock object. Because the vulnerable "use" happens synchronously later in the exact same netfilter pipeline, an RCU delay means the memory wouldn't actually be returned to the allocator until after the kernel dereferences the dangling pointer. This would make exploitation much harder, as we would need to artificially stall the packet's execution to wait for the grace period.


The socket family that fulfills all those requirements is AF_PACKET (used for sending raw packets at layer 2).

This means we need to set our own layer 2 and layer 3 headers and choose an output device for the packet.
No routing will be done; the packet will go straight to the output queue of the selected device.
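
Since AF_PACKET leaves header construction to the sender, the layer 3 header has to be filled in by hand. The following is a minimal sketch of that step, not the code from exploit.c: `ipv4_checksum()` and `build_ipv4_header()` are hypothetical helper names, and the field values (TTL 64, no options, DF left clear so that the packet may be fragmented later) are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* RFC 791 header checksum: one's-complement sum of the header's 16-bit
 * big-endian words, computed while the checksum field is zero. */
uint16_t ipv4_checksum(const uint8_t *hdr, size_t len)
{
    uint32_t sum = 0;

    assert(len % 2 == 0);               /* header length is a multiple of 4 */
    for (size_t i = 0; i < len; i += 2)
        sum += ((uint32_t)hdr[i] << 8) | hdr[i + 1];
    while (sum >> 16)                   /* fold the carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Hypothetical helper: write a 20-byte IPv4 header (no options) into out.
 * Addresses are given in host order and stored in network order. */
void build_ipv4_header(uint8_t out[20], uint16_t total_len, uint8_t proto,
                       uint32_t saddr, uint32_t daddr)
{
    uint16_t csum;

    out[0] = 0x45;                      /* version 4, IHL 5 */
    out[1] = 0;                         /* TOS */
    out[2] = (uint8_t)(total_len >> 8);
    out[3] = (uint8_t)(total_len & 0xff);
    out[4] = out[5] = 0;                /* identification */
    out[6] = out[7] = 0;                /* flags and offset: DF clear */
    out[8] = 64;                        /* TTL (assumed value) */
    out[9] = proto;
    out[10] = out[11] = 0;              /* checksum placeholder */
    for (int i = 0; i < 4; i++) {
        out[12 + i] = (uint8_t)(saddr >> (24 - 8 * i));
        out[16 + i] = (uint8_t)(daddr >> (24 - 8 * i));
    }
    csum = ipv4_checksum(out, 20);
    out[10] = (uint8_t)(csum >> 8);
    out[11] = (uint8_t)(csum & 0xff);
}
```

A finished header can be sanity-checked by running `ipv4_checksum()` over all 20 bytes again: with the checksum field filled in, the result is 0.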

### Device driver to call ip_local_out()

Because we send our packets at layer 2, ip_send_skb() won't be called and we need to find another way to trigger ip_local_out().
Collaborator

Suggested change
Because we send our packets at layer 2, ip_send_skb() won't be called and we need to find another way to trigger ip_local_out().
Because `AF_PACKET` injects frames directly at Layer 2 (skipping the Layer 3 IP output stack entirely), functions like `ip_send_skb()` are bypassed. If sent through a standard interface, our packet would never hit the netfilter hooks. We need an alternative way to force the kernel to pass our crafted `skb` into `ip_local_out()`.
The IPvlan driver provides the perfect mechanism. When processing outbound IPv4 traffic, the driver manually sets up the routing and directly invokes `ip_local_out()`:

Fortunately, it is used by the IPvlan driver:
```
static int ipvlan_process_v4_outbound(struct sk_buff *skb)
{
...
	skb_dst_set(skb, &rt->dst);

	memset(IPCB(skb), 0, sizeof(*IPCB(skb)));

	err = ip_local_out(net, skb->sk, skb);
...
```

So our packets will be sent out of the IPvlan interface.
IPvlan needs a master ethernet device and we used the veth interface for that.


### A way to close the socket fd before the ip_defrag() call

When our packet reaches ip_defrag(), the socket won't be freed if it is still referenced by the open file descriptor.
Collaborator

Suggested change
When our packet reaches ip_defrag(), the socket won't be freed if it is still referenced by the open file descriptor.
When our `skb` reaches `ip_defrag()`, the function drops a reference to the socket (`skb->sk`). However, the socket memory is only freed if this drops the reference count (`sk_refcnt`) to zero. If the user-space file descriptor (fd) is still open, it holds an active reference, preventing the allocation from being freed.

We can call close() only after sendmsg() returns. The syscall returns after the packet is enqueued to the output device, so we could try to win a race and close the fd in time, but there is a simpler way.

sch_plug queuing discipline can be used to stop the packets from being dequeued from a network device until a command to "unplug" is received through the netlink API.
Collaborator

Suggested change
sch_plug queuing discipline can be used to stop the packets from being dequeued from a network device until a command to "unplug" is received through the netlink API.
To achieve deterministic execution, we use the `sch_plug` queuing discipline. `sch_plug` can be attached to a network device to pause its egress queue, holding outbound packets until an explicit "unplug" command is received via the Netlink API. This allows us to cleanly suspend the packet's journey right before it enters the vulnerable `ip_local_out()` path.


So the steps of our exploit are:
1. "plug" the ipvlan interface
2. Send a packet
3. Close the socket
4. "unplug" the ipvlan interface

These are actually all the steps needed to exploit the vulnerability, if we exclude the setup needed beforehand.

### Network tools

The exploit needs external iptables and ip (from iproute2 package) binaries to set up rules and network interfaces.
These tools are not available in the current kernelCTF root image, so the tar archive with binaries and supporting libraries is attached to the exploit binary as a custom ELF section and extracted using objcopy during execution.

## Triggering the IPv4 fragmentation

Collaborator

So we introduced veth in the "Device driver" section, how is it connected with ipv0 and ipv1?

What are ipv0 and ipv1? How do we set up those interfaces?

The obvious idea is to set the MTU on the outgoing interface (ipv1) to a low value, but then our send() will just return a "Message too long" error.
Collaborator

Suggested change
The obvious idea is to set the MTU on the outgoing interface (ipv1) to a low value, but then our send() will just return a "Message too long" error.
The most straightforward approach to force fragmentation would be to set a small Maximum Transmission Unit (MTU) on our initial outgoing interface. However, if we attempt to send a packet larger than the interface's MTU, the `sendmsg()` syscall checks the routing table synchronously and immediately returns an `EMSGSIZE` ("Message too long") error. The packet is dropped before it ever reaches the Netfilter hooks.

Instead, we must reroute our packet to another interface with a low MTU (ipv0). This is done using a DNAT rule.
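
To make the effect of the low MTU concrete, here is a small sketch of how an IP payload is split once it is rerouted to a device whose MTU is too small. It follows the general IPv4 rule that every fragment except the last carries a multiple of 8 payload bytes. The function name `frag_count()` is hypothetical and the MTU values used with it are arbitrary examples, not the ones configured by the exploit.

```c
#include <assert.h>
#include <stddef.h>

/* Number of IPv4 fragments needed to carry payload_len bytes of IP payload
 * over a link with the given MTU, for a header of ihl bytes.
 * Every fragment but the last carries a multiple of 8 bytes, because the
 * fragment offset field counts 8-byte units. */
size_t frag_count(size_t payload_len, size_t mtu, size_t ihl)
{
    size_t room, left = payload_len, n = 0;

    assert(mtu >= ihl + 8);             /* room for at least one 8-byte unit */
    room = mtu - ihl;                   /* payload space per fragment */
    while (left > 0) {
        /* A non-final fragment is rounded down to a multiple of 8. */
        size_t len = left > room ? (room & ~(size_t)7) : left;

        left -= len;
        n++;
    }
    return n;
}
```

For example, with a 20-byte header, 1100 bytes of payload fit in one packet on a 1500-byte MTU link but are split into two fragments (552 and 548 bytes) on a 576-byte MTU link.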
Collaborator

So should we set a high MTU on ipv1?

Does this give a bit more context?

To bypass this early sanity check, we must successfully inject a large packet into an interface with a sufficiently high MTU, and then force the kernel to dynamically reroute it *after* the `skb` has entered the Netfilter pipeline.


Collaborator

Show ip and iptables commands used in exploit for the setup of necessary interfaces and rules.

## Triggering ip_defrag()

Because we already have DNAT rules, the conntrack defrag hooks are installed and ip_defrag() will be called for each of our fragments, triggering the release of the sock object at the last fragment.
Collaborator

Does this explain the situation correctly?

Suggested change
Because we already have DNAT rules, the conntrack defrag hooks are installed and ip_defrag() will be called for each of our fragments, triggering the release of the sock object at the last fragment.
When we configured the `DNAT` rule in the previous step to force routing, the kernel automatically enabled the `conntrack` module to manage the NAT translations. Because `conntrack` relies on transport-layer headers (which are only present in the first fragment) to track connection states, it cannot process fragmented packets.
To resolve this, Netfilter automatically registers the `ipv4_conntrack_defrag` hook to intercept and reassemble IP fragments *before* they reach the NAT or `conntrack` hooks.
Therefore, as our dynamically fragmented packets traverse the `OUTPUT` chain, they are caught by this defrag hook. The hook calls `ip_defrag()` to process each fragment. When the final fragment is processed, `ip_defrag()` erroneously drops the final reference to the originating socket (`skb->sk`), freeing the socket object directly into the `kmalloc` cache while the packet continues down the Netfilter pipeline.


## Reallocating the victim object

To replace the victim object all we have to do is allocate from the kmalloc-2k cache on the same CPU.
Collaborator

Which structure do we consider the "victim object"? My guess is struct sock, but please specify.

This must be done before all the hooks finish, so there is no way to make the allocation from user space.
However, we can use whatever netfilter modules we want. There are a lot of them and some are bound to make new allocations.
This line of thinking leads us to the TEE target:
> The TEE target will clone a packet and redirect this clone to another machine on the local network segment.

Cloning a packet sounds great, as it involves copying the data we passed to the send() function.
There is a problem, though. Our packet's data needs to be larger than 1024 bytes to be allocated from kmalloc-2k, and an skb stores packets that large using a fragment list. When TEE clones the skb, pskb_copy() is called and only space for the head is allocated from the regular kmalloc; the rest is zero-copied by cloning the fraglist.
Collaborator

What is "fraglist"? What is "space for the head"? What are the structures used?


Fortunately, some netfilter modules need to look at the whole packet data in one piece (e.g. to search for patterns) instead of dealing with skb fragments.

One such example is a conntrack SIP helper. It calls skb_linearize() which transforms a fragmented skb to linear one, which involves allocating buffer for all the data using kmalloc and copying it there, which finally gives us a way to allocate from kmalloc-2k and overwrite the victim sock object with our data.
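
The size-class reasoning can be sketched as follows. This is an illustration under stated assumptions, not a statement about the exact allocation made by skb_linearize(): the table lists the general-purpose kmalloc size classes of a typical x86_64 SLUB build, and `overhead` is a hypothetical stand-in for whatever the kernel adds on top of the packet data (alignment, extra tail room, struct skb_shared_info), whose real value depends on the kernel version and configuration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: general-purpose kmalloc size classes on a typical x86_64
 * SLUB build. Other configurations may differ. */
static const size_t kmalloc_classes[] = {
    8, 16, 32, 64, 96, 128, 192, 256, 512, 1024, 2048, 4096, 8192
};

/* Smallest listed size class that can hold size bytes, or 0 if the
 * request is larger than every listed class. */
size_t kmalloc_class(size_t size)
{
    size_t n = sizeof(kmalloc_classes) / sizeof(kmalloc_classes[0]);

    assert(size > 0);
    for (size_t i = 0; i < n; i++)
        if (size <= kmalloc_classes[i])
            return kmalloc_classes[i];
    return 0;
}

/* Hypothetical check: does a linear buffer for data_len bytes of packet
 * data plus a fixed per-buffer overhead land in kmalloc-2k? */
int lands_in_kmalloc_2k(size_t data_len, size_t overhead)
{
    return kmalloc_class(data_len + overhead) == 2048;
}
```

Under these assumptions, any total between 1025 and 2048 bytes is served from kmalloc-2k, so the packet length has to be chosen with the overhead in mind: a payload that is too large pushes the request into kmalloc-4k.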
Collaborator

Suggested change
One such example is a conntrack SIP helper. It calls skb_linearize() which transforms a fragmented skb to linear one, which involves allocating buffer for all the data using kmalloc and copying it there, which finally gives us a way to allocate from kmalloc-2k and overwrite the victim sock object with our data.
One such example is a conntrack SIP helper. Because SIP is a text-based protocol, the helper must search the packet payload for specific string patterns. To do this safely, it cannot deal with fragmented memory pages. Instead, it calls `skb_linearize()`, which transforms a fragmented skb into a linear one: it allocates a buffer for all the data using kmalloc and copies the data there. This finally gives us a way to allocate from kmalloc-2k and overwrite the victim sock object with our data.

Collaborator

How exactly does skb_linearize() guarantee a kmalloc-2k hit?

You mentioned the packet should be larger than 1024 bytes. Could you elaborate on the math here, and on how we ensure that the newly linearized buffer lands squarely in kmalloc-2k?


To summarize, by combining the TEE and SIP conntrack helper we are able to overwrite the victim sock object that will be used by the netfilter hooks.
Collaborator

How exactly are the TEE target and SIP helper chained together? Please explicitly state the pipeline/iptables rule sequence.


## Getting RIP control

Controlling the struct sock object may seem like an instant win at first, but we soon discover that netfilter hooks rarely use the socket context and never call function pointers from that object.

The solution is the ip_route_me_harder() function which is called in the mangle table if some IPv4 parameters like src/dst address, TOS or mark change after mangle rules are executed:
Collaborator

What specific iptables mangle rule do you use in the exploit?


```
static unsigned int
ipt_mangle_out(void *priv, struct sk_buff *skb, const struct nf_hook_state *state)
{
...
	/* Save things which could affect route */
	mark = skb->mark;
	iph = ip_hdr(skb);
	saddr = iph->saddr;
	daddr = iph->daddr;
	tos = iph->tos;

	ret = ipt_do_table(priv, skb, state);
	/* Reroute for ANY change. */
	if (ret != NF_DROP && ret != NF_STOLEN) {
		iph = ip_hdr(skb);

		if (iph->saddr != saddr ||
		    iph->daddr != daddr ||
		    skb->mark != mark ||
		    iph->tos != tos) {
			err = ip_route_me_harder(state->net, state->sk, skb, RTN_UNSPEC);
...
```

state->sk here is the pointer to our sock object.

ip_route_me_harder() calls xfrm_lookup() which examines sk->sk_policy and if the policy matches the current connection it eventually calls dst_alloc().
Collaborator

What is the exact function call chain from xfrm_lookup() to dst_alloc()?

dst_alloc() calls the gc function pointer of the netns_xfrm.dst_ops struct and the netns_xfrm comes from the xfrm policy which is under our control.

So if we are able to craft a valid struct xfrm_policy that matches our connection, we will be able to get RIP control.

This policy is prepared in prepare_policy().
Collaborator

Please explain in detail in the writeup which policy you prepare here and how.

The fake object for the sock itself is simple - we just need to set the sk_policy pointer and sk_mark value.
Collaborator

How does sk->sk_mark play into this? Please elaborate


The policy object takes a lot of space and has pointers to other objects like netns_xfrm, so we used the [direct mapping storage technique](../../CVE-2024-26923_lts_cos/docs/novel-techniques.md) to place it at a known address in the kernel address space.

## Pivot to ROP

When the gc pointer is called in dst_alloc(), the RDI register contains a pointer to dst_ops, which is part of our fake netns_xfrm object.

The following gadgets were used to pivot to the ROP chain placed at dst_ops + 0x10 (our gc pointer is at dst_ops + 0x08).

```
mov r8,QWORD PTR [rdi+0xc8]
mov eax,0x1
test r8,r8
je ffffffff82185d21
mov rsi,rdi
mov rcx,r14
mov rdi,rbp
mov rdx,r15
call ffffffff82427a60 <__x86_indirect_thunk_r8>
```

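This copies RDI to RSI and calls the pointer stored at [rdi+0xc8], which lies inside our fake object and points to the next gadget: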

```
push rsi
jmp qword ptr [rsi + 0x39]
```

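and finally, the pushed dst_ops pointer is popped into RSP. The two following pops skip the first two qwords of dst_ops (the second one holds the gc pointer), so ret starts executing the chain at dst_ops + 0x10: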

```
pop rsp
pop rbp
pop rbx
ret
```

## Second pivot

To get more room for our ROP chain we move to a second location in the direct mapping using a simple pop rsp ; ret gadget.

## Privilege escalation

Our ROP is executed from the ksoftirqd context, so we can't do a traditional commit_creds() to modify the current process's privileges.
Collaborator

Since the ROP execution happens in ksoftirqd, it would be great to explicitly tie this back to the add_qdisc_plug unplug command in the text. A simple sentence explaining that unplugging the queue defers the packet processing to the softirq context perfectly bridges the gap.


We could try locating our exploit process and changing its privileges, but we decided to go with a different approach - we patch the kernel creating a backdoor that will grant root privileges to any process that executes a given syscall.

We chose the rarely used kexec_file_load() syscall and overwrote its code with our get_root function, which does all the traditional privilege escalation and namespace escape work: commit_creds(init_cred), switch_task_namespaces(pid, init_nsproxy), etc.

This function also returns a special value (0x777) that our user space code can use to detect if the system was already compromised.
Collaborator

The text mentions that get_root returns 0x777 for user-space to check. However, looking at the inline assembly for get_root and the syscall invocation in main(), this logic seems to have been removed from the code. You should probably delete this sentence from the writeup so readers aren't looking for code that isn't there.


Patching the kernel function is done in rop_patch_kernel_code(): it calls set_memory_rw() on the destination memory and uses copy_user_generic() to write the new code there.
Collaborator

The writeup mentions using copy_user_generic(), but your code actually stages the payload via xattrs into the direct mapping and uses memcpy().

65 changes: 65 additions & 0 deletions pocs/linux/kernelctf/CVE-2024-26921_lts_cos/docs/vulnerability.md
@@ -0,0 +1,65 @@
## Requirements to trigger the vulnerability

- CAP_NET_ADMIN in a namespace is required
- Kernel configuration: CONFIG_INET
- User namespaces required: Yes

## Commit which introduced the vulnerability

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7026b1ddb6b8d4e6ee33dc2bd06c0ca8746fa7ab

## Commit which fixed the vulnerability

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=18685451fc4e546fc0e718580d32df3c0e5c8272

## Affected kernel versions

Introduced in 4.1. Fixed in 6.6.25, 5.10.226 and other stable trees.

## Affected component, subsystem

net/ipv4

## Description

ip_local_out() is the function responsible for sending locally generated IPv4 packets.
It calls the NF_INET_LOCAL_OUT netfilter hooks and eventually dst_output().

The usual call to ip_local_out() looks like this:
```
int ip_send_skb(struct net *net, struct sk_buff *skb)
{
	int err;

	err = ip_local_out(net, skb->sk, skb);
	if (err) {
		if (err > 0)
			err = net_xmit_errno(err);
		if (err)
			IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
	}

	return err;
}
```

A pointer to the socket associated with the skb is passed as an argument to ip_local_out() and then to all the netfilter hooks:

```
int __ip_local_out(struct net *net, struct sock *sk, struct sk_buff *skb)
{
...
	return nf_hook(NFPROTO_IPV4, NF_INET_LOCAL_OUT,
		       net, sk, skb, NULL, skb_dst(skb)->dev,
		       dst_output);

}
```

The skb holds a reference to a socket. Under normal conditions, the skb is released only after its output path is finished or once the skb is received by the upper layers of the input stack (in scenarios where the outgoing packet is routed back to a local interface).
This ensures the associated socket is valid while the netfilter hooks are executing.

ip_defrag() is most often called in the input path and it calls skb_orphan()/kfree_skb() on the fragment skb, assuming it is no longer needed.
However, ip_defrag() can also be called in the output path by the netfilter conntrack hook ipv4_conntrack_defrag().

If that happens, the skb will be released, and if it holds the last reference to the socket, the socket will be released as well, causing a use-after-free when the next hooks are called and in ip_finish_output().
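
The lifetime problem can be modelled in a few lines. This is a deliberately simplified toy, not kernel code: it collapses the kernel's socket accounting into a single counter, and all `toy_` names are hypothetical. It only shows why the pointer handed to the hooks dangles when the skb held the last reference.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_sock { int refcnt; bool freed; };
struct toy_skb  { struct toy_sock *sk; };

/* Drop one reference; the socket is "freed" when the count hits zero. */
void toy_sock_put(struct toy_sock *sk)
{
    assert(sk->refcnt > 0);
    if (--sk->refcnt == 0)
        sk->freed = true;
}

/* Models the orphaning done on the defrag path: the skb gives up its
 * reference to the socket. */
void toy_skb_orphan(struct toy_skb *skb)
{
    if (skb->sk) {
        toy_sock_put(skb->sk);
        skb->sk = NULL;
    }
}

/* Models the output path: the socket pointer is captured once and handed
 * to every hook. Returns true if a hook running after the defrag hook
 * would see a freed socket. */
bool toy_later_hook_sees_freed_sock(struct toy_skb *skb)
{
    struct toy_sock *hook_sk = skb->sk; /* argument passed to the hooks */

    toy_skb_orphan(skb);                /* defrag hook runs first */
    return hook_sk->freed;              /* what the next hook would touch */
}
```

With the file descriptor still open the counter starts at 2 and the socket survives. After close() the skb holds the only reference, and the later hook receives a pointer to freed memory.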
@@ -0,0 +1,10 @@
INCLUDES = -I/usr/include/libnl3
LIBS = -L. -pthread -lnl-cli-3 -lnl-route-3 -lnl-3 -ldl
CFLAGS = -fomit-frame-pointer -static -fcf-protection=none

exploit: exploit.c kernelver_16919.450.26.h
	gcc -o $@ exploit.c $(INCLUDES) $(CFLAGS) $(LIBS)
	objcopy --add-section tools=tools.tar.gz $@

prerequisites:
	sudo apt-get install libnl-cli-3-dev libnl-route-3-dev
Binary file not shown.