kernelCTF: add CVE-2024-26921_lts_cos #310
lambdasprocket wants to merge 1 commit into google:master
| "push %r12\n" | ||
| "push %rbx\n" | ||
| "push %rbp\n" | ||
| "lea -0x1838f1(%rip), %r15\n" |
Magic number offset used in inline assembly. Add a comment explaining what the offset -0x1838f1 computes.
Check the 'Name and/or comment numeric constants' section of the style guide.
| if (!action) | ||
| rtnl_qdisc_plug_set_limit(qdisc, 0x100000); |
Magic number used for qdisc plug limit. Use a named constant or add a comment explaining the limit.
See the 'Name and/or comment numeric constants' section of the style guide.
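For instance, a minimal sketch of the named-constant approach. `PLUG_LIMIT_BYTES` and `plug_limit_for_action()` are hypothetical names; only the value `0x100000` and the `if (!action)` condition come from the patch, and reading the limit as a byte count is an assumption the author should confirm:

```c
#include <stdint.h>

/* Hypothetical name. 0x100000 is the value from the patch; if the limit is
 * counted in bytes (an assumption), this is 1 MiB of queued packets. */
#define PLUG_LIMIT_BYTES 0x100000u

/* Hypothetical helper mirroring the `if (!action)` branch: return the limit
 * to apply for the given action, or 0 when no limit update is needed. */
static uint32_t plug_limit_for_action(int action)
{
    return action == 0 ? PLUG_LIMIT_BYTES : 0;
}
```

With something like this, the call site reads `rtnl_qdisc_plug_set_limit(qdisc, PLUG_LIMIT_BYTES);` and the rationale lives next to the definition.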
| add_qdisc_plug(0x10000, 0, 0); | ||
| add_qdisc_plug(0x10000, 0, 3); |
Successive calls to the same modification function without explanation. Add a comment explaining why two sequential actions are required on the Qdisc.
See the 'Explain duplicated lines' section of the style guide.
| iph->saddr = inet_addr("10.77.77.1"); | ||
| iph->daddr = inet_addr("10.6.0.1"); | ||
| iph->id = 0; | ||
| iph->tos = 0x99; |
Magic number 0x99 used for IP TOS. Add a comment or use a define for the TOS value.
Check the 'Name and/or comment numeric constants' section of the style guide.
| unsigned int xattr_fd_idx = 0; | ||
| char fname[512]; | ||
| g_payload_location = g_page_offset_base + 0x50000020; |
Magic number offset 0x50000020 used for payload target calculation. Document why this offset is chosen (e.g., physmap spray reliability).
See the 'Name and/or comment numeric constants' section of the style guide.
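A minimal sketch of what that could look like. `PAYLOAD_DIRECT_MAP_OFFSET` and `payload_location()` are hypothetical names, and the comment text is a placeholder for the author's actual reasoning; only the value `0x50000020` comes from the patch:

```c
#include <stdint.h>

/* Hypothetical name. Offset into the direct mapping where the sprayed
 * payload is expected to land; document here why this value was chosen
 * (for example, how reliably the spray reaches it). */
#define PAYLOAD_DIRECT_MAP_OFFSET 0x50000020ULL

/* Compute the expected payload address from the leaked base address. */
static uint64_t payload_location(uint64_t page_offset_base)
{
    return page_offset_base + PAYLOAD_DIRECT_MAP_OFFSET;
}
```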
| if (pid) { | ||
| set_cpu(0); | ||
| int sock = send_packet(); | ||
| sleep(10000); |
Uncommented sleep() call for 10000 seconds. Explain the purpose of this long wait, or use a synchronization primitive instead.
See the 'Sleeping & waiting' section of the style guide.
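If the intent is simply to keep this process parked until the other side is done, a pipe works as a synchronization primitive. The sketch below is an assumption about that intent, and `wait_for_peer()` and `signal_peer()` are hypothetical helper names, not functions from the patch:

```c
#include <errno.h>
#include <unistd.h>

/* Block until the peer writes one byte to the pipe.
 * Returns 0 when a byte arrives, -1 on EOF or error. */
static int wait_for_peer(int read_fd)
{
    char token;
    ssize_t n;

    do {
        n = read(read_fd, &token, 1);
    } while (n < 0 && errno == EINTR);

    return n == 1 ? 0 : -1;
}

/* Wake the waiting side by writing a single byte. */
static int signal_peer(int write_fd)
{
    char token = 1;

    return write(write_fd, &token, 1) == 1 ? 0 : -1;
}
```

The waiting side calls `wait_for_peer()` where `sleep(10000)` is today, and the other side calls `signal_peer()` once the expected state has been reached. Unlike a fixed sleep, the wait ends as soon as the condition holds and never expires early.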
| sleep(10000); | ||
| } | ||
| sleep(1); |
Uncommented sleep() call. Add a comment explaining what state change is expected during this 1 second.
See the 'Sleeping & waiting' section of the style guide.
| asm volatile( | ||
| "movq 0x820(%%r13), %%r14\n" | ||
| "movq $0x10, (%%r14)\n" |
Magic number used for assigning a struct pointer/offset. Add a comment explaining what field is being overwritten and what $0x10 means.
See the 'Name and/or comment numeric constants' section of the style guide.
| ); | ||
| asm volatile( | ||
| "movq 0x780(%%r13), %%r14\n" |
Hardcoded struct offset 0x780 used to access nsproxy. Document the magic offset with a comment explaining it accesses nsproxy within task_struct.
Check the 'Name and/or comment numeric constants' section of the style guide.
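One option is to name the offset once and paste it into the asm template with a stringification macro. This is a sketch only: `TASK_STRUCT_NSPROXY_OFFSET`, `STR()` and `load_nsproxy_insn` are hypothetical names, and the value is specific to the kernel build being targeted:

```c
#include <string.h>

/* Hypothetical name. Offset of the nsproxy pointer inside task_struct for
 * the targeted kernel build; it must be re-derived for any other build. */
#define TASK_STRUCT_NSPROXY_OFFSET 0x780

/* Two-level stringification so the macro is expanded before being quoted. */
#define STR_(x) #x
#define STR(x) STR_(x)

/* Template line for an extended asm statement (hence the doubled '%'). */
static const char load_nsproxy_insn[] =
    "movq " STR(TASK_STRUCT_NSPROXY_OFFSET) "(%%r13), %%r14\n";
```

The resulting string is identical to the hand-written instruction, so the generated code does not change. The same pattern would cover the `0x820` accesses.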
| ); | ||
| asm volatile( | ||
| "movq 0x820(%%r13), %%r14\n" |
Magic number offset (0x820) used in assembly for accessing a structure field. Add a comment explaining what field the offset refers to.
See the 'Name and/or comment numeric constants' section of the style guide.
| Second thing to consider is the kmalloc cache used to allocate struct sock. Most socket families have a dedicated cache, but some use a regular kmalloc(), giving us a simple way to reallocate the freed object without performing a cross-cache attack. | ||
| And finally, some sockets use a SOCK_RCU_FREE flag which causes sk_destruct() to wait for an RCU grace period before freeing the sock object and this would also make exploitation much harder. |
| And finally, some sockets use a SOCK_RCU_FREE flag which causes sk_destruct() to wait for an RCU grace period before freeing the sock object and this would also make exploitation much harder. | |
| And finally, some sockets use a SOCK_RCU_FREE flag which causes sk_destruct() to wait for an RCU grace period before freeing the sock object. Because the vulnerable "use" happens synchronously later in the exact same netfilter pipeline, an RCU delay means the memory wouldn't actually be returned to the allocator until after the kernel dereferences the dangling pointer. This would make exploitation much harder, as we would need to artificially stall the packet's execution to wait for the grace period. |
| @@ -0,0 +1,176 @@ | |||
| ## Overview | |||
| Let's look at what we need to perform the attack. | |||
Please add a short summary or "battle plan" paragraph. The reader needs an overview of what we're trying to do.
How about something like this:
| Let's look at what we need to perform the attack. | |
| To exploit this Use-After-Free, we must trigger the bug and overwrite the freed memory entirely within the synchronous execution path of ip_local_out(). The high-level strategy is: | |
| 1. Send a locally generated, fragmented IP packet so ip_send_skb() pushes its sk_buff (skb) into the netfilter hooks via ip_local_out(). | |
| 2. Have ip_defrag() process the skb, dropping the final reference to its associated socket (skb->sk) and freeing the socket's memory. | |
| 3. Use a subsequent netfilter hook to immediately allocate over the freed skb->sk memory with our controlled payload. | |
| 4. Let a later netfilter hook dereference the forged skb->sk object to gain execution control. | |
| Let's look at what we need to perform the attack. |
| ### Device driver to call ip_local_out() | ||
| Because we send our packets at layer 2, ip_send_skb() won't be called and we need to find another way to trigger ip_local_out(). |
| Because we send our packets at layer 2, ip_send_skb() won't be called and we need to find another way to trigger ip_local_out(). | |
| Because `AF_PACKET` injects frames directly at Layer 2 (skipping the Layer 3 IP output stack entirely), functions like `ip_send_skb()` are bypassed. If sent through a standard interface, our packet would never hit the netfilter hooks. We need an alternative way to force the kernel to pass our crafted `skb` into `ip_local_out()`. | |
| The IPvlan driver provides the perfect mechanism. When processing outbound IPv4 traffic, the driver manually sets up the routing and directly invokes `ip_local_out()`: |
| ### A way to close the socket fd before the ip_defrag() call | ||
| When our packet reaches ip_defrag(), the socket won't be freed if it is still referenced by the open file descriptor. |
| When our packet reaches ip_defrag(), the socket won't be freed if it is still referenced by the open file descriptor. | |
| When our `skb` reaches `ip_defrag()`, the function drops a reference to the socket (`skb->sk`). However, the socket memory is only freed if this drops the reference count (`sk_refcnt`) to zero. If the user-space file descriptor (fd) is still open, it holds an active reference, preventing the allocation from being freed. |
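To make the reference-counting argument concrete, here is a toy model. It is not kernel code: `struct toy_sock` and its helpers are hypothetical, and the kernel's real accounting is collapsed into a single counter for illustration:

```c
#include <stdbool.h>

/* Toy stand-in for a socket; not the kernel's struct sock. */
struct toy_sock {
    int refcnt;   /* number of holders (open fd, in-flight skb, ...) */
    bool freed;   /* set once the last holder is gone */
};

/* Take an additional reference. */
static void toy_sock_hold(struct toy_sock *sk)
{
    sk->refcnt++;
}

/* Drop one reference; the object is "freed" only when the count hits zero. */
static void toy_sock_put(struct toy_sock *sk)
{
    if (--sk->refcnt == 0)
        sk->freed = true;
}
```

With two holders (the open fd and the in-flight packet), a single put leaves the object alive. Only when the fd's reference is already gone does the put on the packet's side release the memory, which is why the fd has to be closed before `ip_defrag()` runs.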
| When our packet reaches ip_defrag(), the socket won't be freed if it is still referenced by the open file descriptor. | ||
| We can call close() only after sendmsg() returns. The syscall returns after the packet is enqueued to the output device, so we might be able to try a race condition to close the fd in time, but there is a simpler way. | ||
| sch_plug queuing discipline can be used to stop the packets from being dequeued from a network device until a command to "unplug" is received through the netlink API. |
| sch_plug queuing discipline can be used to stop the packets from being dequeued from a network device until a command to "unplug" is received through the netlink API. | |
| To achieve deterministic execution, we use the `sch_plug` queuing discipline. `sch_plug` can be attached to a network device to pause its egress queue, holding outbound packets until an explicit "unplug" command is received via the Netlink API. This allows us to cleanly suspend the packet's journey right before it enters the vulnerable `ip_local_out()` path. |
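It may also help readers to see the option payload that `sch_plug` consumes. The sketch below only builds the `struct tc_plug_qopt` from `<linux/pkt_sched.h>`; `make_plug_opt()` is a hypothetical helper, and wrapping the struct in `TCA_OPTIONS` and sending it over rtnetlink is left out:

```c
#include <linux/pkt_sched.h>

/* Build the option payload for sch_plug.
 * action is one of:
 *   TCQ_PLUG_BUFFER             - insert a plug, hold subsequent packets
 *   TCQ_PLUG_RELEASE_ONE        - release packets up to the first plug
 *   TCQ_PLUG_RELEASE_INDEFINITE - unplug and let everything through
 *   TCQ_PLUG_LIMIT              - change the queue limit
 * limit is only meaningful for TCQ_PLUG_LIMIT. */
static struct tc_plug_qopt make_plug_opt(int action, __u32 limit)
{
    struct tc_plug_qopt opt = { .action = action, .limit = limit };

    return opt;
}
```

Using the `TCQ_PLUG_*` names in the exploit instead of bare integers would also make the plug and unplug steps self-documenting.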
| So if we are able to craft a valid struct xfrm_policy that matches our connection, we will be able to get RIP control. | ||
| This policy is prepared in the prepare_policy(). | ||
| The fake object for the sock itself is simple - we just need to set the sk_policy pointer and sk_mark value. |
How does sk->sk_mark play into this? Please elaborate.
| So if we are able to craft a valid struct xfrm_policy that matches our connection, we will be able to get RIP control. | ||
| This policy is prepared in the prepare_policy(). |
Please explain in detail in the writeup which policy you prepare here and how you prepare it.
| ## Privilege escalation | ||
| Our ROP is executed from the ksoftirqd context, so we can't do a traditional commit_creds() to modify the current process's privileges. |
Since the ROP execution happens in ksoftirqd, please tie this back explicitly to the add_qdisc_plug unplug command in the text. A single sentence explaining that unplugging the queue defers the packet processing to the softirq context would bridge the gap.
| We chose a rarely used kexec_file_load() syscall and overwrote its code with our get_root function that does all traditional privileges escalation/namespace escape stuff: commit_creds(init_cred), switch_task_namespaces(pid, init_nsproxy) etc. | ||
| This function also returns a special value (0x777) that our user space code can use to detect if the system was already compromised. |
The text mentions that get_root returns 0x777 for user-space to check. However, looking at the inline assembly for get_root and the syscall invocation in main(), this logic seems to have been removed from the code. You should probably delete this sentence from the writeup so readers aren't looking for code that isn't there.
| This function also returns a special value (0x777) that our user space code can use to detect if the system was already compromised. | ||
| Patching the kernel function is done in rop_patch_kernel_code() - it calls set_memory_rw() on destination memory and uses copy_user_generic() to write new code there. |
The writeup mentions using copy_user_generic(), but your code actually stages the payload via xattrs into the direct mapping and uses memcpy().